Automatic Construction Of Robust Spherical Harmonic Subspaces

Patrick Snape    Yannis Panagakis    Stefanos Zafeiriou
Imperial College London
{p.snape,i.panagakis,s.zafeiriou}@imperial.ac.uk
Abstract
In this paper we propose a method to automatically recover a class-specific low-dimensional spherical harmonic basis from a set of in-the-wild facial images. We combine existing techniques for uncalibrated photometric stereo and low-rank matrix decompositions in order to robustly recover a combined model of shape and identity. We build this basis without aid from a 3D model and show how it can be combined with recent efficient sparse facial feature localisation techniques to recover dense 3D facial shape. Unlike previous works in the area, our method is very efficient and is an order of magnitude faster to train, taking only a few minutes to build a model with over 2000 images. Furthermore, it can be used for real-time recovery of facial shape.
1. Introduction

The recovery of 3D shape from images represents an ill-posed and challenging problem. In its most difficult form, this involves recovering a representation of shape for an object from a single image, under arbitrary illumination. However, for any given image, there are an infinite number of shape, illumination and reflectance inputs that can reproduce the image [1]. Therefore, shape recovery is commonly performed by relaxing the problem through prior information or by adding constraints. The most impressive results have been achieved by restricting the problem space to a single class of objects such as faces. For example, Blanz and Vetter's 3D morphable model (3DMM) [7] is one of the most well-known shape recovery techniques and concentrates on the recovery of facial shape. 3DMMs constrain their reconstruction capabilities to lying within the span of a linear combination of faces. This allows for the synthesis of a large range of novel faces. However, the major drawback of 3DMMs is their complexity of construction. Morphable models require a set of high quality 3D meshes and associated textures. Currently, collecting these meshes is a time consuming
Figure 1: An example reconstruction. Given the input image (a), our algorithm can robustly recover dense 3D shape (b) using only images.
and expensive process involving specialised hardware and manual guidance. Once the meshes have been collected, they must be placed into correspondence, which is a complex research issue in its own right.
In this paper, we look to borrow from ideas seen within the photometric stereo literature in order to recover shape from objects under unconstrained settings using only a set of images. Typically, these types of unconstrained photo collections are called "in-the-wild". We seek to construct our models in an automatic manner, without manual feature point placement or careful selection of the input images.

In particular, we seek to recover the shape of the object by exploiting the similarity within the object class. In the case of faces, there are millions of available images that can be utilised to build in-the-wild models. However, recovering shape from these images is incredibly challenging, as they have been captured in completely unconstrained conditions. No knowledge of the lighting conditions, the facial location or the camera geometric properties is provided with the images. To address these problems, we propose to recover a class-specific spherical harmonic (SH) basis that exploits the low-rank structure of faces [5, 16]. Spherical harmonics
are ideal for this purpose as they can be approximated by a low-dimensional linear subspace [5, 38]. By using the first-order SH, 87.5% of the low-frequency component of the lighting is approximated. The first-order SH can then be used to recover 3D shape, as their discrete approximation directly incorporates the normals of the object. These normals can be integrated to provide a dense 3D surface [14].
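The integration step referenced here can be illustrated with the classic FFT-based method of Frankot and Chellappa [14]. The snippet below is a minimal sketch of that projection (our own simplification, assuming per-sample gradient fields and periodic boundaries, without the padding or windowing a careful implementation would use), not the authors' code:

```python
import numpy as np

def frankot_chellappa(p, q):
    """Integrate gradient fields p = dz/dx, q = dz/dy (per-sample,
    x along columns, y along rows) into a depth map via the
    FFT-based projection of Frankot and Chellappa."""
    rows, cols = p.shape
    # Angular frequencies per sample (zero frequency at index [0, 0]).
    wx = np.fft.fftfreq(cols) * 2.0 * np.pi
    wy = np.fft.fftfreq(rows) * 2.0 * np.pi
    u, v = np.meshgrid(wx, wy)
    P = np.fft.fft2(p)
    Q = np.fft.fft2(q)
    denom = u ** 2 + v ** 2
    denom[0, 0] = 1.0  # avoid division by zero at the DC term
    Z = (-1j * u * P - 1j * v * Q) / denom
    Z[0, 0] = 0.0      # depth is only recovered up to a constant
    return np.real(np.fft.ifft2(Z))
```

Since the surface is only defined up to an additive constant, any comparison against ground truth should be made after removing the mean.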
Since we seek to recover a SH subspace, we require correspondence between our input images. This is achieved by locating a set of sparse features on the faces and then warping them into a single common reference frame. This method of achieving correspondence is powerful, as recent facial feature localisation techniques have incredibly low overhead [18, 39] and thus cause training to be efficient. The secondary benefit of this coarse alignment is that our basis can be coupled with existing facial alignment techniques such as Active Appearance Models (AAMs) [13, 34] in order to provide an appearance basis. We show that our recovered SH basis can be robustly learnt from automatically aligned, in-the-wild images. The basis can be used both to recover dense shape of generic faces and as a person-specific appearance prior within AAM-type algorithms.
Summarising, our contributions are:

1. We show the advantage of using a coarser alignment than optical flow for model construction. In particular, our training time for 2330 images from the HELEN dataset [25] is approximately 12 minutes. We strongly believe that leveraging large numbers of images is important for building expressive models, and thus training time is an important consideration.

2. A formal mathematical framework for performing efficient class-specific uncalibrated photometric stereo using low-rank and sparsity constraints.

3. We show how our model can be coupled with existing facial alignment algorithms in order to provide low-frequency dense shape for in-the-wild images.
2. Related Work

In the literature, there are many techniques that attempt to recover 3D facial shape from single images [7, 47, 30, 31, 28, 20, 44]. The most influential of these works was the 3D Morphable Model (3DMM) proposed in [7]. The 3DMM can produce very realistic reconstructions but has the disadvantage of a complex model construction and fitting process. This reliance on accurate 3D meshes means that 3DMMs often suffer from an inability to recover complex facial attributes such as expression. Expression in dense 3D models has been addressed in the area of blendshapes [11, 10, 48]; however, these blendshapes are still complex to create as they require hundreds of meshes of individuals under varying expressions.

More general techniques for shape recovery, such as the work of Barron et al. [3], do not perform well for inherently non-Lambertian objects such as faces. However, shape-from-shading (SFS) has been shown to recover accurate facial shape by assuming a prior on the shape of faces [44, 20, 30, 29, 31, 17, 21]. In contrast to our proposal, SFS techniques rely on recovery of shape from a single image, whereas we consider large collections of images.

The most relevant techniques to this paper involve recovering shape from a collection of images under varying illumination. Typically, this involves solving some form of uncalibrated photometric stereo problem [4, 36, 35]. However, traditional uncalibrated photometric stereo techniques still assume that the images provided have been captured by a photometric stereo system under explicit directed lighting. The relaxation of the uncalibrated photometric stereo problem to a class of objects further increases the ambiguity inherent within the problem. Specifically, it is now necessary to separate the SH lighting from the identity of the individuals. This problem has been approached for both shape recovery and facial recognition purposes [27, 26, 30, 29, 51]. Lee et al. [27, 26] recover facial shape by separating illumination from identity in a manner that is similar to 3DMMs. Minsik et al. [30, 29] separate the appearance and identity via a low-rank tensor decomposition that provides a very efficient reconstruction methodology. However, both Lee et al. and Minsik et al. still rely on previously built dense 3D models to perform their decomposition.
Recently, Kemelmacher-Shlizerman [19] proposed a method for building morphable models from images of faces downloaded from the Internet. This work shares similarities with ours in that it attempts to build a subspace that explicitly separates shape and appearance. However, in [19] they do not investigate a robust decomposition, but instead rely on a time-consuming optical-flow-based registration process [22] to remove outliers from the images. Although this methodology allows for expression transfer, it does not allow the recovered shapes to be used within existing facial alignment techniques such as Active Appearance Models (AAMs). In contrast, our use of efficient facial alignment techniques to acquire correspondence substantially reduces our training time. It also allows our recovered basis to be coupled with the alignment techniques for simultaneous facial landmark localisation and dense surface recovery. However, the coarse geometric alignment we employ is more sensitive to corruptions such as occlusions and extreme facial pose. For this reason, we employ a low-rank constraint [9, 37, 41, 12, 49, 32] to help remove these high-frequency errors whilst maintaining the low-frequency lighting variations. Although we share a similar optimisation framework with other robust principal component analysis problems such as [9, 37, 49, 32], we are the first to propose a low-rank decomposition that recovers a subspace of spherical harmonics.
3. Problem Formulation

In this section we describe how a spherical harmonic (SH) basis can be recovered using uncalibrated photometric stereo (PS) techniques. We then describe how this problem generalises to a multi-person dataset and how a representation of shape can be recovered per image. Finally, we discuss the importance of achieving correspondence between the images in an efficient and scalable manner.
3.1. Spherical Harmonic Bases

The Lambertian reflectance model states that matte materials reflect light uniformly in all directions. This simple image formation model assumes that the intensity of light reflecting from a surface is a function of the shape of the surface and a linear combination of point light sources. More formally, given an image I(x, y), the intensity at a given pixel (x, y) of a convex Lambertian surface illuminated by a single light can be expressed as

    I(x, y) = ρ(x, y) l^T n(x, y),   (1)

where ρ(x, y) is the albedo at the pixel and represents surface reflectivity, l is the vector denoting the single point light source illuminating the object, and n(x, y) is the surface normal at the pixel.
If we now consider a collection of directional light sources placed at infinity, the lighting intensity at a given pixel can be expressed as a non-negative function on the unit sphere using a sum of spherical harmonics. Formally,

    I(x, y) = Σ_{n=0}^{∞} Σ_{m=−n}^{n} α_n ℓ_nm ρ(x, y) Y_nm(n(x, y)),   (2)

where α_n = π, 2π/3, π/4, . . ., ℓ_nm are the coefficients of the harmonic expansion of the lighting and Y_nm(n(x, y)) are the surface SH functions evaluated at the surface normal, n(x, y). As n → ∞, the coefficients tend to zero, and thus the SH can be accurately represented by the lower-order harmonics. In [15], it was shown that the first-order SH function is guaranteed to represent at least 87.5% of the reflectance, and was experimentally verified to recover up to 95% in the case of faces. The first-order SH expansion is also directly related to the object's surface normals:

    Y(n(x, y)) = ρ(x, y) [1, n_x(x, y), n_y(x, y), n_z(x, y)]^T,   (3)

where n_i(x, y) denotes the ith component of the normal vector. This is a particularly useful result, as recovering the first-order SH means directly recovering a representation of shape for an object.
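Equation (3) is simple to realise in code. The following minimal sketch (function names are our own, not from the paper) builds the d × 4 first-order SH basis from known unit normals and albedo, and renders an image under first-order SH lighting as in Eqs. (1) and (3):

```python
import numpy as np

def first_order_sh_basis(normals, albedo):
    """Build the d x 4 first-order SH basis of Eq. (3): each row is
    albedo * [1, nx, ny, nz].  `normals` is d x 3 (unit normals),
    `albedo` is a length-d vector."""
    d = normals.shape[0]
    Y = np.empty((d, 4))
    Y[:, 0] = 1.0       # zeroth-order (constant) component
    Y[:, 1:] = normals  # first-order components are the normals
    return albedo[:, None] * Y

def render(Y, l):
    """Render a Lambertian image under first-order SH lighting
    coefficients l (length 4)."""
    return Y @ l
```

This makes the key observation of the section concrete: recovering the columns of Y is equivalent to recovering a (scaled) representation of the surface normals.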
3.2. Uncalibrated Photometric Stereo

Classical photometric stereo (PS) seeks to recover the normals of a convex object given a number of images under different lighting with known directions. Traditionally, the following decomposition is performed:

    X = N L̃,   (4)

where X ∈ R^{d×n} is the matrix of observations, in which each of the n columns represents a vectorised image of the object with d total pixels, N ∈ R^{d×3} contains the normal at every pixel and L̃ ∈ R^{3×n} is the matrix of lighting vectors per image. Assuming accurate light vectors and no shadowing artifacts, this problem is trivially solved as a linear least squares problem. Photometric stereo has been shown to provide accurate facial reconstructions despite faces not being truly Lambertian objects. For example, there are many publicly available facial PS datasets, such as the Photoface Database [50] and the Yale B [16] dataset.

If the lighting vectors are inaccurate or unknown, then PS is said to be uncalibrated. In [4], Basri et al. showed that X can be decomposed via a rank-constrained singular value decomposition (SVD) to recover the SH bases and the lighting coefficients in the uncalibrated setting. First-order SH are recovered by a rank-4 SVD and are accurate up to a 4 × 4 generalised Lorentz transformation. By enforcing constraints such as the integrability constraint [14], the first-order SH, and thus the normals, can be recovered up to a generalised bas-relief (GBR) ambiguity. Formally, uncalibrated PS looks to recover

    X = BL,   (5)

where X is as before, B ∈ R^{d×4} contains the first-order SH basis images and L ∈ R^{4×n} is the matrix of lighting coefficients. As previously mentioned, the solution to this problem is found by performing an SVD, X = UΣV^T, with B = U√Σ and L = √Σ V^T. Uncalibrated PS is useful as it is not always possible to recover accurate lighting estimations for every image.
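The rank-4 SVD factorisation of Eq. (5) can be sketched as follows. This is a minimal illustration of the decomposition described above; note that, as the text explains, the result is only defined up to a 4 × 4 ambiguity, which this sketch does not resolve:

```python
import numpy as np

def uncalibrated_ps(X, rank=4):
    """Rank-constrained SVD factorisation X ~ B L of Eq. (5),
    following Basri et al. [4]: B = U sqrt(S), L = sqrt(S) V^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    root_s = np.sqrt(s[:rank])
    B = U[:, :rank] * root_s           # d x 4 SH basis images
    L = root_s[:, None] * Vt[:rank]    # 4 x n lighting coefficients
    return B, L
```

Given noise-free rank-4 data the product B @ L reproduces X exactly; with real images it gives the best rank-4 approximation in the least-squares sense.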
3.3. Class Specific Uncalibrated Photometric Stereo

A generalisation of the uncalibrated PS problem for a specific class involves recovering a joint basis of appearance and illumination. In the case of SH for faces, this means attempting to separate the identity of the individual from their surface normals. This problem is a classic example of a bilinear decomposition problem and has been previously studied for use in 3D surface recovery [51, 30, 28, 27, 19]. In the case of SH, we seek to recover a low-dimensional linear subspace that can recover normals for multiple individuals. This subspace implies that a face can be accurately reconstructed using a linear combination of basis shapes. This assumption is commonly employed in algorithms such as the 3DMM and AAMs. Assuming that we want to recover k such components for our shape subspace, and that we are using the first-order SH, we will recover a d × 4k basis matrix that allows us to recover 3D facial shape for multiple individuals. Formally,

    X = B(L ∗ C),   (6)

where B ∈ R^{d×4k} is the linear basis, L ∈ R^{4×n} is the matrix of first-order SH lighting coefficients, C ∈ R^{k×n} is the matrix of shape coefficients and (· ∗ ·) denotes the Khatri-Rao product [23]. In fact, this is the exact decomposition problem solved by Kemelmacher-Shlizerman in [19], where they denote the combined coefficients matrix as P = L ∗ C. This was partially recognised by Zhou et al. [51]; however, they recover the lighting and shape coefficients separately by iteratively solving for each in an alternating fashion. Zhou et al. also do not provide any examples of the quality of the shape estimate that they recover.

Minsik et al. [30, 28] also attempt this decomposition by posing the problem in the form of a tensor. The decomposition can then be solved by applying a multilinear SVD. However, multilinear SVD requires a tensor representation and thus these techniques require prior data to recover results. A tensor representation is useful, however, for illustrating how to recover the d × 4 first-order SH for an individual, given their coefficients vector c_i ∈ R^{k×1}. We reshape the basis matrix B as a tensor which we denote S ∈ R^{d×k×4}. The tensor product along the second mode, S ×₂ c_i, recovers the person-specific shape of the ith column of X. To recover B from S, we perform matricisation of S along the first mode, denoted S_(1), to yield S_(1) = B ∈ R^{d×4k}.

The problem given in (6) can now be solved within an optimisation framework, which we examine in detail in the next section.
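The Khatri-Rao product in Eq. (6) is simply a column-wise Kronecker product. A minimal sketch (our own helper, assuming the column ordering where each column is kron(l_i, c_i); other conventions differ only in a fixed permutation of the rows of B):

```python
import numpy as np

def khatri_rao(L, C):
    """Column-wise Kronecker (Khatri-Rao) product: column i of the
    result is kron(l_i, c_i).  L is 4 x n and C is k x n, so the
    result is 4k x n, matching the d x 4k basis B of Eq. (6)."""
    n = L.shape[1]
    return np.vstack([np.kron(L[:, i], C[:, i]) for i in range(n)]).T
```

With this helper, the model of Eq. (6) reads X = B @ khatri_rao(L, C) for B of size d × 4k.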
3.4. Robust Construction Of Spherical Harmonic Bases

Inspired by recent advances in robust low-rank subspace recovery [9], we seek to modify Equation 6 to include new constraints that impose robustness. As mentioned previously, faces can be accurately reconstructed by a linear combination of faces taken from a low-dimensional basis. Therefore, we propose to decompose the image matrix into a low-rank part (A) capturing the low-frequency shape information and a sparse part (E) accounting for gross but sparse noise such as partial occlusions and pixel corruptions. To promote low rank and sparsity, the nuclear norm (denoted by ‖·‖∗) and the ℓ1-norm (denoted by ‖·‖1) are employed, respectively. Formally, we propose to solve the following non-convex optimisation problem:

    argmin_{A,E,B,L,C}  ‖A‖∗ + λ‖E‖1 + (µ/2)‖A − B(L ∗ C)‖²_F
    subject to  X = A + E,  B^T B = I.   (7)

Although the above problem is non-convex, an accurate solution can be obtained by employing the Alternating Directions Method (ADM) [6]. That is, we minimise the following augmented Lagrangian function:

    L(A, E, B, C, L, Y) = ‖A‖∗ + λ‖E‖1 + (µ/2)‖A − B(L ∗ C)‖²_F
                          + tr(Y^T (X − A − E)) + (µ/2)‖X − A − E‖²_F,   (8)

subject to B^T B = I.
Let t denote the iteration index. Given A[t], E[t], B[t], C[t], L[t], Y[t] and µ[t], the iteration of ADM for Equation 7 reads:

    A[t+1] = argmin_A L(A, E[t], B[t], C[t], L[t], Y[t])
           = argmin_A ‖A‖∗ + (µ[t]/2)( ‖A − B[t](L[t] ∗ C[t])‖²_F
             + ‖X − A − E[t] + Y[t]/µ[t]‖²_F ),   (9)

    E[t+1] = argmin_E λ‖E‖1 + (µ[t]/2)‖X − A[t+1] − E + Y[t]/µ[t]‖²_F,   (10)

    B[t+1] = argmin_{B : B^T B = I} (µ[t]/2)‖A[t+1] − B(L[t] ∗ C[t])‖²_F,   (11)

    [L[t+1], C[t+1]] = argmin_{L,C} (µ[t]/2)‖A[t+1] − B[t+1](L ∗ C)‖²_F.   (12)
Subproblem (9) admits a closed-form solution, given by the singular value thresholding (SVT) [8] operator as:

    A[t+1] = D_{µ[t]⁻¹}[ M[t] − A[t] + X − E[t] + Y[t]/µ[t] ],   (13)

where M[t] = B[t](L[t] ∗ C[t]) is introduced for brevity, and the SVT is defined as D_τ(Q) = U S_τ(Σ) V^T for any matrix Q with SVD Q = UΣV^T. Subproblem (10) has a unique solution that is obtained via the elementwise shrinkage operator [9], defined as S_τ[q] = sgn(q) max(|q| − τ, 0). Therefore, the solution of (10) is

    E[t+1] = S_{λµ[t]⁻¹}[ X − A[t+1] + Y[t]/µ[t] ].   (14)
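The two proximal operators in Eqs. (13) and (14) can be sketched directly (function names are ours; a minimal illustration, not the authors' implementation):

```python
import numpy as np

def svt(Q, tau):
    """Singular value thresholding D_tau [8]: shrink the singular
    values of Q by tau, the nuclear-norm proximal step of Eq. (13)."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def shrink(Q, tau):
    """Elementwise soft thresholding S_tau, the l1 proximal step of
    Eq. (14): sgn(q) * max(|q| - tau, 0)."""
    return np.sign(Q) * np.maximum(np.abs(Q) - tau, 0.0)
```

Both operators are applied once per ADM iteration, with thresholds 1/µ[t] and λ/µ[t] respectively.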
Subproblem (11) is a reduced rank Procrustes rotation problem [52]. Its solution is given by B[t+1] = UV^T, with

    A[t+1](L[t] ∗ C[t])^T = UΣV^T   (15)

being the SVD of A[t+1](L[t] ∗ C[t])^T. Furthermore, due to the unitary invariance of the Frobenius norm, Equation 12 becomes

    argmin_{L,C} ‖B[t+1]^T A[t+1] − L ∗ C‖²_F.   (16)
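The Procrustes update of Eqs. (11) and (15) reduces to a single SVD. A minimal sketch under the same notation (P stands for the Khatri-Rao product L ∗ C; the function name is ours):

```python
import numpy as np

def procrustes_b_update(A, P):
    """Solve min_B ||A - B P||_F subject to B^T B = I: a reduced
    rank Procrustes rotation [52].  The solution is B = U V^T,
    where U, V come from the SVD of A P^T as in Eq. (15)."""
    U, _, Vt = np.linalg.svd(A @ P.T, full_matrices=False)
    return U @ Vt
```

The returned B always has orthonormal columns, which is exactly the constraint imposed in Eq. (7).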
Subproblem (12), in the form (16), is a least-squares factorisation of a Khatri-Rao product [40], which is solved as follows. Let Q = B[t+1]^T A[t+1], L = L[t], and C = C[t], and let q_i, l_i, and c_i be the ith columns of the matrices Q, L, and C, respectively. Clearly q_i = l_i ⊗ c_i, where ⊗ denotes the Kronecker product. For each column of Q: reshape q_i into a matrix Q̃_i ∈ R^{k×4} such that vec(Q̃_i) = q_i. Then Q̃_i = c_i l_i^T is a rank-one matrix. Compute the SVD of Q̃_i as Q̃_i = U_i Σ_i V_i^T. The best rank-one approximation of Q̃_i is obtained by truncating the SVD as c_i = √σ₁ u_i and l_i = √σ₁ v_i, where u_i and v_i are the first column vectors of U_i and V_i, respectively, and σ₁ is the largest singular value. The ADM for solving (7) is outlined in Algorithm 1.
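The column-wise rank-one factorisation just described can be sketched as follows (a minimal illustration; the reshape assumes the column-major vec() convention, under which q_i = kron(l_i, c_i), and is subject to the usual per-column sign ambiguity, which cancels in the product):

```python
import numpy as np

def factorise_khatri_rao(Q, k, m=4):
    """Recover L (m x n) and C (k x n) from Q ~ L * C (Khatri-Rao)
    column by column: reshape each column q_i = kron(l_i, c_i) into
    the rank-one matrix c_i l_i^T and truncate its SVD."""
    n = Q.shape[1]
    L = np.empty((m, n))
    C = np.empty((k, n))
    for i in range(n):
        # Row-major reshape to (m, k) gives l_i c_i^T; transpose to
        # obtain the k x m matrix c_i l_i^T used in the text.
        Qi = Q[:, i].reshape(m, k).T
        U, s, Vt = np.linalg.svd(Qi, full_matrices=False)
        C[:, i] = np.sqrt(s[0]) * U[:, 0]
        L[:, i] = np.sqrt(s[0]) * Vt[0]
    return L, C
```

Because each column is treated independently, this step is trivially parallelisable across the n images.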
It is important to note that there are inherent ambiguities in this decomposition, both from the SVD used to recover B and in the Khatri-Rao factorisation used to recover L and C. In particular, we are most concerned about how they may affect the recovered normals before we integrate them to recover depth. In order to resolve these ambiguities, we take the simplest possible approach: we recover the ambiguity matrix from a template set of normals provided by a known mean face.
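One possible reading of this template-based disambiguation is a simple least-squares fit of a correction matrix between the recovered basis and a template SH basis built from the mean face. The function below is a hypothetical sketch of that idea, not the authors' exact procedure:

```python
import numpy as np

def resolve_ambiguity(B_rec, B_template):
    """Estimate the 4 x 4 ambiguity matrix G such that B_rec @ G
    best matches a template first-order SH basis (d x 4), e.g. one
    computed from the normals of a known mean face, in the
    least-squares sense.  Hypothetical sketch only."""
    G, *_ = np.linalg.lstsq(B_rec, B_template, rcond=None)
    return G
```

Applying the recovered G to B_rec (and its inverse to the coefficients) leaves the product B(L ∗ C) unchanged while bringing the normals into the template's frame.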
Algorithm 1 Solving (7) by the ADM method.

Input: Data matrix X ∈ R^{d×n} and parameter λ.
Output: Matrices A, E, B, C, L.

1:  Initialise: A[0] = 0, E[0] = 0, B[0] = 0, C[0] = 0, L[0] = 0, Y[0] = 0, µ[0] = 10⁻⁶, ρ = 1.1, ε = 10⁻⁸
2:  while not converged do
3:    Fix E[t], B[t], C[t], L[t] and update A[t+1] by
        A[t+1] = D_{µ[t]⁻¹}[ B[t](L[t] ∗ C[t]) − A[t] + X − E[t] + Y[t]/µ[t] ]   (17)
4:    Fix A[t+1], B[t], C[t], L[t] and update E[t+1] by
        E[t+1] = S_{λµ[t]⁻¹}[ X − A[t+1] + Y[t]/µ[t] ]   (18)
5:    Update B[t+1] by first performing the SVD:
        A[t+1](L[t] ∗ C[t])^T = UΣV^T,  B[t+1] = UV^T   (19)
6:    Update [L[t+1], C[t+1]] via a least-squares Khatri-Rao factorisation, as described in Section 3.4
7:    Update the Lagrange multipliers by
        Y[t+1] = Y[t] + µ[t](X − A[t+1] − E[t+1])   (20)
8:    Update µ[t+1] = min(ρµ[t], 10⁶)
9:    Check the convergence conditions
        ‖X − A[t+1] − E[t+1]‖∞ < ε,  ‖A[t+1] − B[t+1](L[t+1] ∗ C[t+1])‖∞ < ε   (21)
10:   t ← t + 1
11: end while
3.5. Efficient Pixelwise Correspondence

In contrast to the related work of Kemelmacher-Shlizerman [19], we achieve pixelwise correspondence between our images by using existing, efficient sparse facial alignment algorithms. This has two distinct advantages. Firstly, recent facial alignment algorithms such as those by Ren et al. [39] and Kazemi et al. [18] can produce a very accurate set of sparse facial features in the order of a single millisecond. In contrast, the optical flow method cited in [19] takes multiple seconds even for a small image. This means that our training time is drastically reduced in comparison to [19]. Ideally, our technique would be able to scale to the magnitude of thousands of images, whereas the alignment of [19] would quickly become infeasible as the number of images increases. In fact, the optical flow step is run multiple times, as the collection flow algorithm [22] is used, which involves an iterative algorithm of rank-4 decompositions and repeated optical flow. Secondly, the use of a direct alignment to a single reference frame
Figure 2: Example of the low rank effect on warped pose. (a) initial input image; (b) input image after warping; (c) warped image after the low rank constraint; (d) recovered depth from (c).
enables the usage of our basis in existing appearance-based facial alignment algorithms such as AAMs. This means that our basis can be used to reconstruct dense 3D shape of faces directly from an existing AAM fitting, provided the reference space of the AAM and our subspace are the same.
However, it is important to note that there are two potential drawbacks to our alignment technique. Firstly, the alignment is based on a piecewise affine warping and is thus much coarser than the optical flow technique used in [19]. This is particularly amplified when larger poses are present in the input images. However, this is partly why the low-rank component of our algorithm is so important. As Figure 2 shows, the robust decomposition of the basis helps correct these large global errors so that the shape subspace can be successfully recovered. Secondly, our technique does not contain a number of sub-clusters that can be used to warp expression onto our model. However, by using a large number of images that contain expression, we directly include expression within our subspace. In [19], the recovered subspace will necessarily be devoid of expression, as the global reference shape is neutral. This means that the subspace recovered by [19] will not be able to recover expressive 3D shape using efficient facial alignment algorithms.
4. Experiments

In this section we provide a number of experiments that emphasise the increase in robustness of our reconstructions. We also show a new application for this type of model that involves improving the fitting results of an AAM using our constructed SH basis. Choosing the number of components, k, to recover is an important problem that was not properly addressed by Kemelmacher-Shlizerman in [19]. In these experiments we attempt to recover as many components as possible in order to strike a balance between cleanly reconstructed normals and identity. However, there is a trade-off when choosing the value of k. In particular,
Figure 4: Person-specific model fitting for Tom Hanks. Columns: initial fit, final fit, recovered depth. Images of Tom Hanks coarsely aligned by a facial alignment method. Our algorithm improves the facial alignment and simultaneously recovers depth. Images shown are from a YouTube video of Tom Hanks.
Figure 5: Our subspace used for SFS. Normals learnt automatically from the SH subspace of HELEN vs. normals from the clean data of ICT-3DRFE. (b, e) the clean data; (c, f) the proposed subspace.
if the value of k is too large, then the decomposition is unable to separate the identity and shape, and the subspace of shape no longer represents valid normals. This is one of the primary advantages of our robust decomposition, as it allows the value of k to be larger given the reduced rank of the images. However, a potential disadvantage of our proposed method is the sensitivity of the algorithm to the parameter λ, which must be tuned for every dataset. It is also important to stress that our main goal is to recover the low-frequency shape information to provide plausible 3D facial surfaces under challenging conditions. However, in Section 4.3, we
Figure 3: Comparison with the blind decomposition of [19]. Columns: Input, Proposed, [19], Proposed, [19]. Images from the HELEN [25] dataset.
show that our recovered subspace can be used in existing high-frequency recovery algorithms such as SFS.

The area of 3D facial surface recovery lacks any form of formal quantitative benchmark. The quantitative benchmark presented in [19] is performed on depth data recovered from photometric stereo. This is not ground truth depth data, as error is introduced during integration, and a more accurate evaluation would be the angular error of the recovered normals. However, in the presence of cast shadows, even the normals of photometric stereo are biased. For this reason, given the lack of a standard and fair quantitative evaluation, we focus on qualitative results in this paper.
Specifically, we performed the following experiments: (1) We built our subspace using the HELEN [25] dataset. We directly compare against the blind decomposition proposed in [19] and show particularly challenging images from the dataset. This experiment highlights the difficulty of constructing subspaces from a large set of in-the-wild images. (2) We show that the robust subspace learnt in (1) can be used within the shape-from-shading (SFS) framework of Smith et al. [44]. By recovering the normals from every image of HELEN, we can perform a secondary principal component analysis (PCA) on the normals in order to directly embed them within Smith's algorithm. In this experiment, we compare against a clean dataset of normals acquired from the ICT-3DRFE [46] database. (3) We show how our subspace can be combined with an existing facial alignment algorithm, namely project-out AAMs [34]. Our subspace can be used both as the appearance basis for the AAM and also as a methodology for recovering dense 3D shape.
In the following section we describe the construction of the bases and explain what processing was performed on each dataset.
4.1. Constructing The Robust Bases

The process of building the robust SH basis was the same for all datasets involved. Facial annotations consisting of 68 points were recovered through various methods for each dataset. In the case of the HELEN database, the manual annotations provided by the IBUG group were used [42, 43]; in the case of the Yale B, Photoface and ICT-3DRFE databases, manual annotations were used; and the in-the-wild images and video of Tom Hanks were automatically annotated by the one millisecond facial alignment method of [18] provided by the Dlib project [24].

These annotations were then warped via a piecewise affine transformation to a mean reference shape that was built from all the faces, training and testing, of the LFPW facial annotations provided by IBUG. This provided the dense correspondence required for performing matrix decompositions. To construct our SH bases, we ran the algorithm described in Section 3.4 on the warped images. In order to provide the example reconstructions, the reconstructed images were warped back into their original shapes and then integrated using the method of Frankot and Chellappa [14].

Table 1 gives examples of the training time taken for the in-the-wild Tom Hanks images and the HELEN dataset. It is important to note that part of the reason the training time is much lower for the Tom Hanks images is that they have an inherently lower rank than the HELEN images, as they are all of the same individual. This greatly affects the convergence time and thus the timings do not scale linearly.
Data            W1   Train   W2   Total
HELEN (2330)     8     730   25     763
T. Hanks (274)   1      21    4      26

Table 1: Training times. Mean training times in seconds over 10 runs, rounded to the nearest second. 'W1' denotes warping to the LFPW reference frame of 150 × 150 pixels, 'W2' denotes warping back to the original images, and 'Train' denotes the total training time of our method described in Section 3.4. The original images were larger than the reference, hence the increase from 'W1' to 'W2'. Timings were recorded on an Intel Xeon E5-1650 3.20GHz with 32GB of RAM.
4.2. Comparison Using HELEN

In this set of experiments we wished to convey two results: (1) that we are capable of quickly constructing our basis on a large number of in-the-wild images, and (2) that our robust formulation of the problem gives superior performance to the blind decomposition used by [19]. In this experiment, k = 200 and the total number of components was thus 4k = 800. Figure 3 shows the results from this experiment. As we can clearly see, on challenging images the blind decomposition is unable to separate the appearance from the illumination and thus the recovered normals are unable to recover accurate shape.
4.3. Using The Subspace In SFS

The SFS technique of Smith et al. [44] relies on a PCA basis constructed from normals of a single class of object. It then seeks to recover the high-frequency normal information directly from the texture. In order to create the PCA basis required by [44], we recovered spherical harmonics for every image in the dataset using the proposed algorithm. We then computed Kernel-PCA [45] on the normals recovered from the HELEN images and supplied them to [44]. The lighting vector is also an input to the algorithm, and we recover it by solving a least squares problem with the known normals.
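This lighting estimate is a single linear solve against the known SH basis images; a minimal sketch (the function name is ours):

```python
import numpy as np

def recover_lighting(image, Y):
    """Recover first-order SH lighting coefficients for one image by
    least squares against known SH basis images Y (d x 4), i.e.
    solve min_l ||image - Y l||_2."""
    l, *_ = np.linalg.lstsq(Y, image, rcond=None)
    return l
```

Because Y has only four columns, the solve is effectively instantaneous even for images with many pixels.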
In order to provide a comparison for our reconstruction, we created a clean normal subspace using the data from the ICT-3DRFE [46] database. This database is primarily used for image relighting purposes; however, it provides a very accurate set of normals of faces under a wide range of expressions. The results of this experiment are shown in Figure 5. Although our subspace did not provide reconstructions that are as visually accurate as the subspace from ICT-3DRFE, it was still able to successfully recover a plausible representation of the high-frequency shading information.
4.4. Automatic Alignment
In this experiment we used the Active TemplateModel (ATM)
provided by the Menpo project [2] inorder to perform a project-out
type algorithm to alignimages of Tom Hanks. This model is similar
to theLucas-Kanade [33] method but uses a point distribu-tion model
(PDM) in order to perform non-rigid align-ment between the images.
In particular, the templateimage is fixed during optimisation of
the PDM, and weuse our subspace to provide a texture representing
anapproximation of the diffuse component of the image.This is
essentially identical to the procedure performedwithin a
project-out AAM.
We used a person specific SH subspace that was built on images of
Tom Hanks that were downloaded automatically from the Internet. In
this case, the images were automatically aligned using the DLib
implementation of [18]. For this experiment, k = 30 and thus the
total number of components is 4k = 120. We downloaded 200 frames from
a YouTube video of Tom Hanks1
and attempted to automatically align them using our subspace and
the ATM. The ATM was initialised using another fitting of [18],
which was then iteratively improved. At each global iteration, we
recovered a new set of diffuse textures for each frame and then
performed a refitting of every frame. This caused the images to
align over a sequence of iterations. We performed 10 such
iterations. Figure 4 shows two example frames where the alignment
was improved and dense shape was also recovered.
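The diffuse texture recovered at each iteration is, in essence, an orthogonal projection of the warped image onto the 4k-component subspace, as in project-out AAMs. A toy numpy sketch, using a random orthonormal matrix as a stand-in for the learned person-specific SH subspace:

```python
import numpy as np

rng = np.random.default_rng(1)
n_pixels, k = 500, 30

# Hypothetical stand-in for the learned SH subspace: an orthonormal
# basis with 4k = 120 components (columns).
U = np.linalg.qr(rng.normal(size=(n_pixels, 4 * k)))[0]

# A vectorised image in the reference frame (synthetic here).
image = rng.normal(size=n_pixels)

# Diffuse approximation: orthogonal projection onto the subspace.
diffuse = U @ (U.T @ image)

# The residual (image - diffuse) is orthogonal to the subspace,
# which is what the project-out formulation relies on.
residual = image - diffuse
```

The residual carries the appearance variation the subspace cannot explain, so fitting against the diffuse texture discounts illumination effects captured by the SH components.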
5. Conclusion
We have proposed a robust method for automatically constructing
generalised spherical harmonic subspaces. In particular, we have
shown that by using a common reference frame as defined in
algorithms such as AAMs, we can efficiently build models that have
applications in shape recovery and facial alignment.
1 https://www.youtube.com/watch?v=nFvASiMTDz0 from 3:43
Acknowledgements
Patrick Snape is funded by a DTA from Imperial College London and
by a Qualcomm Innovation Fellowship. Yannis Panagakis is funded by
the ERC under the FP7 Marie Curie Intra-European Fellowship.
Stefanos Zafeiriou is partially supported by the EPSRC project
EP/J017787/1 (4D-FAB).
References
[1] E. H. Adelson and A. P. Pentland. The perception of shading and reflectance. Perception as Bayesian Inference, pages 409–423, 1996.
[2] J. Alabort-i-Medina, E. Antonakos, J. Booth, P. Snape, and S. Zafeiriou. Menpo: A comprehensive platform for parametric image alignment and visual deformable models. In Proceedings of the ACM International Conference on Multimedia, MM '14, pages 679–682, New York, NY, USA, 2014. ACM.
[3] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. IEEE T-PAMI, 2015.
[4] R. Basri, D. Jacobs, and I. Kemelmacher. Photometric stereo with general, unknown lighting. IJCV, 72(3):239–257, 2006.
[5] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE T-PAMI, 25(2):218–233, 2003.
[6] D. P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. 1982.
[7] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH, pages 187–194, 1999.
[8] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
[9] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):1–37, 2011.
[10] C. Cao, Y. Weng, S. Lin, and K. Zhou. 3D shape regression for real-time facial animation. TOG, 32(4):1, 2013.
[11] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. FaceWarehouse: A 3D facial expression database for visual computing. TVCG, 20(3):413–425, 2014.
[12] X. Cheng, S. Sridharan, J. Saragih, and S. Lucey. Rank minimization across appearance and shape for AAM ensemble fitting. In ICCV, pages 577–584. IEEE, 2013.
[13] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE T-PAMI, 23(6):681–685, 2001.
[14] R. T. Frankot and R. Chellappa. A method for enforcing integrability in shape from shading algorithms. IEEE T-PAMI, 10(4):439–451, 1988.
[15] D. Frolova, D. Simakov, and R. Basri. Accuracy of spherical harmonic approximations for images of Lambertian objects under far and near lighting. In T. Pajdla and J. Matas, editors, ECCV, volume 3021 of Lecture Notes in Computer Science, pages 574–587. Springer Berlin Heidelberg, 2004.
[16] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE T-PAMI, 23(6):643–660, 2001.
[17] T. Hassner. Viewing real-world faces in 3D. In ICCV, pages 3607–3614. IEEE, 2013.
[18] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In CVPR, pages 1867–1874. IEEE, 2014.
[19] I. Kemelmacher-Shlizerman. Internet based morphable model. In ICCV, pages 3256–3263. IEEE, 2013.
[20] I. Kemelmacher-Shlizerman and R. Basri. 3D face reconstruction from a single image using a single reference face shape. IEEE T-PAMI, 33(2):394–405, 2011.
[21] I. Kemelmacher-Shlizerman and S. M. Seitz. Face reconstruction in the wild. In CVPR, pages 1746–1753. IEEE, 2011.
[22] I. Kemelmacher-Shlizerman and S. M. Seitz. Collection flow. In CVPR, pages 1792–1799. IEEE, 2012.
[23] C. G. Khatri and C. R. Rao. Solutions to some functional equations and their applications to characterization of probability distributions. Sankhyā: The Indian Journal of Statistics, 1968.
[24] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[25] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang. Interactive facial feature localization. In Image Analysis and Processing, pages 679–692. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[26] J. Lee, R. Machiraju, B. Moghaddam, and H. Pfister. Estimation of 3D faces and illumination from single photographs using a bilinear illumination model. In Eurographics Symposium on Rendering. Eurographics Association, 2005.
[27] J. Lee, B. Moghaddam, H. Pfister, and R. Machiraju. A bilinear illumination model for robust face recognition. In ICCV, pages 1177–1184. IEEE, 2005.
[28] M. Lee and C.-H. Choi. Fast facial shape recovery from a single image with general, unknown lighting by using tensor representation. Pattern Recognition, 44(7):1487–1496, 2011.
[29] M. Lee and C.-H. Choi. A robust real-time algorithm for facial shape recovery from a single image containing cast shadow under general, unknown lighting. Pattern Recognition, 46(1):38–44, 2013.
[30] M. Lee and C.-H. Choi. Real-time facial shape recovery from a single image under general, unknown lighting by rank relaxation. CVIU, 120:59–69, 2014.
[31] Z. Lei, Q. Bai, R. He, and S. Z. Li. Face shape recovery from a single image using CCA mapping between tensor spaces. In CVPR, pages 1–7, 2008.
[32] F. Lu, Y. Matsushita, I. Sato, T. Okabe, and Y. Sato. Uncalibrated photometric stereo for unknown isotropic reflectances. In CVPR, pages 1490–1497. IEEE, 2013.
[33] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence, 1981.
[34] I. Matthews and S. Baker. Active appearance models revisited. IJCV, 60(2):135–164, 2004.
[35] T. Papadhimitri and P. Favaro. A new perspective on uncalibrated photometric stereo. In CVPR, pages 1474–1481. IEEE, 2013.
[36] T. Papadhimitri and P. Favaro. A closed-form, consistent and robust solution to uncalibrated photometric stereo via local diffuse reflectance maxima. IJCV, 107(2):139–154, 2014.
[37] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. IEEE T-PAMI, 34(11):2233–2246, 2012.
[38] R. Ramamoorthi and P. Hanrahan. On the relationship between radiance and irradiance: Determining the illumination from images of a convex Lambertian object. JOSA, 18(10):2448–2459, 2001.
[39] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In CVPR, pages 1685–1692. IEEE, 2014.
[40] F. Roemer and M. Haardt. Tensor-based channel estimation and iterative refinements for two-way relaying with multiple antennas and spatial reuse. IEEE Transactions on Signal Processing, 58(11):5720–5735, 2010.
[41] C. Sagonas, Y. Panagakis, S. Zafeiriou, and M. Pantic. RAPS: Robust and efficient automatic construction of person-specific deformable models. In CVPR, pages 1789–1796. IEEE, 2014.
[42] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In ICCV Workshops, pages 397–403. IEEE, 2013.
[43] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. A semi-automatic methodology for facial landmark annotation. In CVPR Workshops, pages 896–903. IEEE, 2013.
[44] W. A. P. Smith and E. R. Hancock. Recovering facial shape using a statistical model of surface normal direction. IEEE T-PAMI, 28(12):1914–1930, 2006.
[45] P. Snape and S. Zafeiriou. Kernel-PCA analysis of surface normals for shape-from-shading. In CVPR, pages 1059–1066. IEEE, 2014.
[46] G. Stratou, A. Ghosh, P. Debevec, and L.-P. Morency. Exploring the effect of illumination on automatic expression recognition using the ICT-3DRFE database. Image and Vision Computing, 30(10):728–737, 2012.
[47] Y. Wang, L. Zhang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, and D. Samaras. Face relighting from a single image under arbitrary unknown lighting conditions. IEEE T-PAMI, 31(11):1968–1984, 2009.
[48] T. Weise, S. Bouaziz, H. Li, and M. Pauly. Realtime performance-based facial animation. In SIGGRAPH, pages 77:1–77:10, New York, NY, USA, 2011. ACM.
[49] L. Wu, A. Ganesh, B. Shi, Y. Matsushita, Y. Wang, and Y. Ma. Robust photometric stereo via low-rank matrix completion and recovery. In R. Kimmel, R. Klette, and A. Sugimoto, editors, ACCV, volume 6494 of Lecture Notes in Computer Science, pages 703–717. Springer Berlin Heidelberg, 2011.
[50] S. Zafeiriou, G. A. Atkinson, M. F. Hansen, W. A. P. Smith, V. Argyriou, M. Petrou, M. L. Smith, and L. N. Smith. Face recognition and verification using photometric stereo: The photoface database and a comprehensive evaluation. IEEE Transactions on Information Forensics and Security, 8(1):121–135, 2013.
[51] S. Zhou, G. Aggarwal, R. Chellappa, and D. Jacobs. Appearance characterization of linear Lambertian objects, generalized photometric stereo, and illumination-invariant face recognition. IEEE T-PAMI, 29(2):230–245, 2007.
[52] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.