Automatic Construction Of Robust Spherical Harmonic Subspaces · 2015. 5. 26. · Patrick Snape Yannis Panagakis Stefanos Zafeiriou Imperial College London...

  • Automatic Construction Of Robust Spherical Harmonic Subspaces

    Patrick Snape, Yannis Panagakis, Stefanos Zafeiriou
    Imperial College London

    {p.snape,i.panagakis,s.zafeiriou}@imperial.ac.uk

    Abstract

    In this paper we propose a method to automatically recover a class specific low dimensional spherical harmonic basis from a set of in-the-wild facial images. We combine existing techniques for uncalibrated photometric stereo and low rank matrix decompositions in order to robustly recover a combined model of shape and identity. We build this basis without aid from a 3D model and show how it can be combined with recent efficient sparse facial feature localisation techniques to recover dense 3D facial shape. Unlike previous works in the area, our method is very efficient and is an order of magnitude faster to train, taking only a few minutes to build a model with over 2000 images. Furthermore, it can be used for real-time recovery of facial shape.

    1. Introduction

    The recovery of 3D shape from images represents an ill-posed and challenging problem. In its most difficult form, this involves recovering a representation of shape for an object from a single image, under arbitrary illumination. However, for any given image, there are an infinite number of shape, illumination and reflectance inputs that can reproduce the image [1]. Therefore, shape recovery is commonly made tractable by introducing prior information or by adding constraints. The most impressive results have been achieved by restricting the problem space to a single class of objects such as faces. For example, Blanz and Vetter's 3D morphable model (3DMM) [7] is one of the most well-known shape recovery techniques and concentrates on the recovery of facial shape. 3DMMs constrain their reconstructions to lie within the span of a linear combination of faces. This allows for the synthesis of a large range of novel faces. However, the major drawback of 3DMMs is their complexity of construction. Morphable models require a set of high quality 3D meshes and associated textures. Currently, collecting these meshes is a time consuming and expensive process involving specialised hardware and manual guidance. Once the meshes have been collected, they must be placed into correspondence, which is a complex research issue in its own right.

    Figure 1: An example reconstruction. Given the input image (a) our algorithm can robustly recover dense 3D shape using only images.

    In this paper, we look to borrow from ideas seen within the photometric stereo literature in order to recover shape from objects under unconstrained settings using only a set of images. Typically, these types of unconstrained photo collections are called "in-the-wild". We seek to construct our models in an automatic manner, without manual feature point placement or careful selection of the input images.

    In particular, we seek to recover the shape of the object by exploiting the similarity within the object class. In the case of faces, there are millions of available images that can be utilised to build in-the-wild models. However, recovering shape from these images is incredibly challenging, as they have been captured in completely unconstrained conditions. No knowledge of the lighting conditions, the facial location or the geometric properties of the camera is provided with the images. To address these problems, we propose to recover a class specific spherical harmonic (SH) basis that exploits the low-rank structure of faces [5, 16]. Spherical harmonics are ideal for this purpose as they can be approximated by a low-dimensional linear subspace [5, 38]. By using the first order SH, 87.5% of the low-frequency component of the lighting is approximated. The first order SH can then be used to recover 3D shape, as their discrete approximation directly incorporates the normals of the object. These normals can be integrated to provide a dense 3D surface [14].

    Since we seek to recover an SH subspace, we require correspondence between our input images. This is achieved by locating a set of sparse features on the faces and then warping them into a single common reference frame. This method of achieving correspondence is powerful, as recent facial feature localisation techniques have incredibly low overhead [18, 39] and thus make training efficient. The secondary benefit of this coarse alignment is that our basis can be coupled with existing facial alignment methods such as Active Appearance Models (AAM) [13, 34] in order to provide an appearance basis. We show that our recovered SH basis can be robustly learnt from automatically aligned, in-the-wild images. The basis can be used both to recover dense shape of generic faces and as a person specific appearance prior within AAM type algorithms.

    Summarising, our contributions are:

    1. We show the advantage of using a coarser alignment than optical flow for model construction. In particular, our training time for 2330 images from the HELEN dataset [25] is approximately 12 minutes. We strongly believe that leveraging large numbers of images is important to build expressive models and thus training time is an important consideration.

    2. A formal mathematical framework for performing efficient class specific uncalibrated photometric stereo using low-rank and sparsity constraints.

    3. We show how our model can be coupled with existing facial alignment algorithms in order to provide low frequency dense shape for in-the-wild images.

    2. Related Work

    In the literature, there are many techniques that attempt to recover 3D facial shape from single images [7, 47, 30, 31, 28, 20, 44]. The most influential of these works was the 3D Morphable Model (3DMM) proposed in [7]. The 3DMM can produce very realistic reconstructions but has the disadvantage of having a complex model construction and fitting process. This reliance on accurate 3D meshes means that 3DMMs often suffer from an inability to recover complex facial attributes such as expression. Expression in dense 3D models has been addressed in the area of blendshapes [11, 10, 48], however these blendshapes are still complex to create as they require hundreds of meshes of individuals under varying expressions.

    More general techniques for shape recovery such as the work of Barron et al. [3] do not perform well for inherently non-Lambertian objects such as faces. However, shape-from-shading (SFS) has been shown to recover accurate facial shape by assuming a prior on the shape of faces [44, 20, 30, 29, 31, 17, 21]. In contrast to our proposal, SFS techniques rely on recovery of shape from a single image, whereas we consider large collections of images.

    The most relevant techniques to this paper involve recovering shape from a collection of images under varying illumination. Typically, this involves solving some form of uncalibrated photometric stereo problem [4, 36, 35]. However, traditional uncalibrated photometric stereo techniques still assume that the images provided have been captured by a photometric stereo system under explicit directed lighting. The relaxation of the uncalibrated photometric stereo problem to a class of objects further increases the ambiguity inherent within the problem. Specifically, it is now necessary to separate the SH lighting from the identity of the individuals. This problem has been approached for both shape recovery and facial recognition purposes [27, 26, 30, 29, 51]. Lee et al. [27, 26] recover facial shape by separating illumination from identity in a manner that is similar to 3DMMs. Minsik et al. [30, 29] separate the appearance and identity via a low rank tensor decomposition that provides a very efficient reconstruction methodology. However, both Lee et al. and Minsik et al. still rely on previously built dense 3D models to perform their decomposition.

    Recently, Kemelmacher-Shlizerman [19] proposed a method for building morphable models from images of faces downloaded from the Internet. This work shares similarities with ours in that it attempts to build a subspace that explicitly separates shape and appearance. However, in [19] they do not investigate a robust decomposition, but instead rely on a time consuming optical flow [22] based registration process to remove outliers from the images. Although this methodology allows for expression transfer, it does not allow the recovered shapes to be used within existing facial alignment techniques such as Active Appearance Models (AAMs). In contrast, our use of efficient facial alignment techniques to acquire correspondence substantially reduces our training time. It also allows our recovered basis to be coupled with the alignment techniques for simultaneous facial landmark localisation and dense surface recovery. However, the coarse geometric alignment we employ is more sensitive to corruptions such as occlusions and extreme facial pose. For this reason, we employ a low rank constraint [9, 37, 41, 12, 49, 32] to help remove these high frequency errors whilst maintaining the low frequency lighting variations. Although we share a similar optimisation framework to other robust principal component analysis problems such as [9, 37, 49, 32], we are the first to propose a low-rank decomposition that recovers a subspace of spherical harmonics.

    3. Problem Formulation

    In this section we describe how a spherical harmonic (SH) basis can be recovered using uncalibrated photometric stereo (PS) techniques. We then describe how this problem generalises to a multi-person dataset and how a representation of shape can be recovered per image. Finally, we discuss the importance of achieving correspondence between the images in an efficient and scalable manner.

    3.1. Spherical Harmonic Bases

    The Lambertian reflectance model states that matte materials reflect light uniformly in all directions. This simple image formation model assumes that the intensity of light reflecting from a surface is a function of the shape of the surface and a linear combination of point light sources. More formally, given an image $I(x, y)$, the intensity at a given pixel $(x, y)$ of a convex Lambertian surface illuminated by a single light can be expressed as

    $$ I(x, y) = \rho(x, y)\, \mathbf{l}^T \mathbf{n}(x, y), \tag{1} $$

    where $\rho(x, y)$ is the albedo at the pixel and represents surface reflectivity, $\mathbf{l}$ is the vector denoting the single point light source illuminating the object and $\mathbf{n}(x, y)$ is the surface normal at the pixel.
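    As a small illustration of Equation 1, the following sketch renders a Lambertian image from an albedo map, a normal map and one light vector. The function name `render_lambertian` and the array layout are assumptions made for this example, and attached shadows (negative dot products) are deliberately left unclamped so that the code matches the equation exactly.

```python
import numpy as np

def render_lambertian(albedo, normals, light):
    """Evaluate I(x, y) = rho(x, y) * l^T n(x, y) at every pixel.

    albedo:  (H, W) array of surface reflectivity.
    normals: (H, W, 3) array of unit surface normals.
    light:   (3,) vector whose direction is the light direction and
             whose length is the light intensity.
    """
    # normals @ light contracts the last axis, giving l^T n per pixel.
    return albedo * (normals @ light)

# Toy example: a flat patch facing the camera, lit head-on.
flat_normals = np.zeros((4, 4, 3))
flat_normals[..., 2] = 1.0
flat_albedo = np.full((4, 4), 0.5)
image = render_lambertian(flat_albedo, flat_normals, np.array([0.0, 0.0, 2.0]))
```

    Because the model is linear in the light vector, doubling the length of `light` doubles every pixel intensity.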

    If we now consider a collection of directional light sources placed at infinity, the lighting intensity at a given pixel can be expressed as a non-negative function of the unit sphere using a sum of spherical harmonics. Formally,

    $$ I(x, y) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} \alpha_n\, \ell_{nm}\, \rho(x, y)\, Y_{nm}(\mathbf{n}(x, y)), \tag{2} $$

    where $\alpha_n = \pi, 2\pi/3, \pi/4, \ldots$, $\ell_{nm}$ are the coefficients of the harmonic expansion of the lighting and $Y_{nm}(\mathbf{n}(x, y))$ are the surface SH functions evaluated at the surface normal, $\mathbf{n}(x, y)$. As $n \to \infty$, the coefficients tend to zero, and thus the SH can be accurately represented by the lower order harmonics. In [15], it was shown that the first order SH function is guaranteed to represent at least 87.5% of the reflectance and experimentally verified to recover up to 95% in the case of faces. The first order SH expansion is also directly related to the object's surface normals:

    $$ \mathbf{Y}(\mathbf{n}(x, y)) = \rho(x, y)\, [1, \mathbf{n}_x(x, y), \mathbf{n}_y(x, y), \mathbf{n}_z(x, y)]^T, \tag{3} $$

    where $\mathbf{n}_i(x, y)$ denotes the $i$th component of the normal vector. This is a particularly useful result as recovering the first order SH means directly recovering a representation of shape for an object.
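    To make the link between Equation 3 and shape concrete, the sketch below builds the first order SH images from albedo and normals and then inverts the mapping. Both function names are hypothetical, pixels are assumed to be vectorised into rows, and the albedo is assumed to be strictly positive so the division is safe.

```python
import numpy as np

def first_order_sh(albedo, normals):
    """Stack rho * [1, n_x, n_y, n_z] for every pixel.

    albedo:  (d,) array, one value per vectorised pixel.
    normals: (d, 3) array of unit normals.
    Returns a (d, 4) array of first order SH basis images.
    """
    ones = np.ones((normals.shape[0], 1))
    return albedo[:, None] * np.hstack([ones, normals])

def albedo_and_normals_from_sh(sh, eps=1e-12):
    """Invert first_order_sh: column 0 is the albedo, columns 1..3 are rho * n."""
    albedo = sh[:, 0]
    normals = sh[:, 1:] / np.maximum(albedo, eps)[:, None]
    return albedo, normals
```

    In practice a recovered basis is only defined up to an ambiguity, so this inversion applies after that ambiguity has been resolved.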

    3.2. Uncalibrated Photometric Stereo

    Classical photometric stereo (PS) seeks to recover the normals of a convex object given a number of images captured under different lighting with known directions. Traditionally, the following decomposition is performed

    $$ \mathbf{X} = \mathbf{N}\tilde{\mathbf{L}}, \tag{4} $$

    where $\mathbf{X} \in \mathbb{R}^{d \times n}$ is the matrix of observations, and each of the $n$ columns represents a vectorised image of the object with $d$ total pixels, $\mathbf{N} \in \mathbb{R}^{d \times 3}$ contains the normal at every pixel and $\tilde{\mathbf{L}} \in \mathbb{R}^{3 \times n}$ is the matrix of lighting vectors per image. Assuming accurate light vectors and no shadowing artifacts, this problem is trivially solved as a linear least squares problem. Photometric stereo has been shown to provide accurate facial reconstructions despite faces not representing true Lambertian objects. For example, there are many publicly available facial PS datasets such as the Photoface Database [50] and the Yale B [16] dataset.
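    The least squares solution of Equation 4 can be sketched in a few lines. The function name `calibrated_ps` is an assumption for this example, and the rows of the recovered matrix are read here as albedo-scaled normals, which is an interpretation rather than something Equation 4 states explicitly.

```python
import numpy as np

def calibrated_ps(X, L_tilde):
    """Solve X = N @ L_tilde for N in the least squares sense.

    X:       (d, n) matrix, one vectorised image per column.
    L_tilde: (3, n) matrix, one known light vector per column.
    Returns N of shape (d, 3).
    """
    # Transposing gives L_tilde.T @ N.T = X.T, the form lstsq expects.
    N_t, *_ = np.linalg.lstsq(L_tilde.T, X.T, rcond=None)
    return N_t.T
```

    At least three non-coplanar light directions are needed for the system to have a unique solution.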

    If the lighting vectors are inaccurate or unknown, then PS is said to be uncalibrated. In [4], Basri et al. showed that $\mathbf{X}$ can be decomposed via a rank constrained singular value decomposition (SVD) to recover the SH bases and the lighting coefficients in the uncalibrated setting. First order SH are recovered by a rank 4 SVD and are accurate up to a $4 \times 4$ generalised Lorentz transformation. By enforcing constraints such as the integrability constraint [14], the first order SH, and thus the normals, can be recovered up to a generalised bas-relief (GBR) ambiguity. Formally, uncalibrated PS looks to recover

    $$ \mathbf{X} = \mathbf{B}\mathbf{L}, \tag{5} $$

    where $\mathbf{X}$ is as before, $\mathbf{B} \in \mathbb{R}^{d \times 4}$ contains the first order SH basis images and $\mathbf{L} \in \mathbb{R}^{4 \times n}$ is the matrix of lighting coefficients. As previously mentioned, the solution to this problem is found by performing an SVD, $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$, and setting $\mathbf{B} = \mathbf{U}\sqrt{\boldsymbol{\Sigma}}$ and $\mathbf{L} = \sqrt{\boldsymbol{\Sigma}}\mathbf{V}^T$. Uncalibrated PS is useful as it is not always possible to recover accurate lighting estimations for every image.
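    The rank constrained SVD behind Equation 5 can be sketched as follows. The function name `uncalibrated_ps` is hypothetical, and the returned factors are only defined up to an invertible 4 x 4 transformation, so they will not in general equal the factors that generated the data.

```python
import numpy as np

def uncalibrated_ps(X, rank=4):
    """Factor X ~ B @ L with B = U sqrt(S) and L = sqrt(S) V^T.

    X: (d, n) matrix of vectorised images.
    Returns B of shape (d, rank) and L of shape (rank, n).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    root = np.sqrt(s[:rank])
    B = U[:, :rank] * root            # scale each kept column of U
    L = root[:, None] * Vt[:rank]     # scale each kept row of V^T
    return B, L
```

    When the data matrix has rank at most 4 the product of the two factors reproduces it exactly; otherwise it is the best rank 4 approximation in the Frobenius norm.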

    3.3. Class Specific Uncalibrated Photometric Stereo

    A generalisation of the uncalibrated PS problem for a specific class involves recovering a joint basis of appearance and illumination. In the case of SH for faces, this means attempting to separate the identity of the individual from their surface normals. This problem is a classic example of a bilinear decomposition problem and has been previously studied for use in 3D surface recovery [51, 30, 28, 27, 19]. In the case of SH, we seek to recover a low dimensional linear subspace that can recover normals for multiple individuals. This subspace implies that a face can be accurately reconstructed using a linear combination of basis shapes. This assumption is commonly employed in algorithms such as the 3DMM and AAMs. Assuming that we want to recover $k$ such components for our shape subspace, and that we are using the first order SH, we will recover a $d \times 4k$ basis matrix that allows us to recover 3D facial shape for multiple individuals. Formally,

    $$ \mathbf{X} = \mathbf{B}(\mathbf{L} * \mathbf{C}), \tag{6} $$

    where $\mathbf{B} \in \mathbb{R}^{d \times 4k}$ is the linear basis, $\mathbf{L} \in \mathbb{R}^{4 \times n}$ is the matrix of first order SH lighting coefficients, $\mathbf{C} \in \mathbb{R}^{k \times n}$ is the matrix of shape coefficients and $(\cdot * \cdot)$ denotes the Khatri-Rao product [23]. In fact, this is the exact decomposition problem solved by Kemelmacher-Shlizerman in [19], where they denote the combined coefficients matrix as $\mathbf{P} = \mathbf{L} * \mathbf{C}$. This was partially recognised by Zhou et al. [51], however they recover the lighting and shape coefficients separately by iteratively solving for each in an alternating fashion. Zhou et al. also do not provide any examples of the quality of the shape estimate that they recover.
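    The Khatri-Rao product in Equation 6 is a column-wise Kronecker product, which the sketch below implements with NumPy broadcasting. The function name `khatri_rao` is chosen for this example and the dimensions in the usage note are arbitrary.

```python
import numpy as np

def khatri_rao(L, C):
    """Column-wise Kronecker product.

    L: (r, n) matrix, C: (k, n) matrix.
    Returns an (r * k, n) matrix whose ith column is kron(L[:, i], C[:, i]).
    """
    r, n = L.shape
    k, n_c = C.shape
    if n != n_c:
        raise ValueError("L and C must have the same number of columns")
    # Entry (j, m, i) of the broadcast product is L[j, i] * C[m, i];
    # flattening (j, m) in row-major order gives index j * k + m.
    return (L[:, None, :] * C[None, :, :]).reshape(r * k, n)
```

    With a basis of shape (d, 4k), lighting coefficients of shape (4, n) and shape coefficients of shape (k, n), the product `B @ khatri_rao(L, C)` has shape (d, n), one reconstructed image per column.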

    Minsik et al. [30, 28] also attempt this decomposition by posing the problem in the form of a tensor. The decomposition can then be solved by applying a multilinear SVD. However, multilinear SVD requires a tensor representation and thus these techniques require prior data to recover results. A tensor representation is useful, however, for illustrating how to recover the $d \times 4$ first order SH for an individual, given their coefficients vector $\mathbf{c}_i \in \mathbb{R}^{k \times 1}$. We reshape the basis matrix $\mathbf{B}$ as a tensor which we denote $\mathcal{S} \in \mathbb{R}^{d \times k \times 4}$. The tensor product along the second mode, $\mathcal{S} \times_2 \mathbf{c}_i$, recovers the person specific shape of the $i$th column of $\mathbf{X}$. To recover $\mathbf{B}$ from $\mathcal{S}$, we perform matricisation of $\mathcal{S}$ along the first mode, denoted $\mathbf{S}_{(1)}$, to yield $\mathbf{S}_{(1)} = \mathbf{B} \in \mathbb{R}^{d \times 4k}$.
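    The reshaping between the basis matrix and its tensor form can be sketched as below. The column ordering is an assumption: column c * k + m of the basis is taken to hold shape component m of SH channel c, which is the ordering implied by a column-wise Kronecker product of a lighting vector with a shape vector. All three function names are hypothetical.

```python
import numpy as np

def basis_to_tensor(B, k):
    """Reshape a (d, 4k) basis into a (d, k, 4) tensor S with
    S[:, m, c] = B[:, c * k + m] (assumed column ordering)."""
    d = B.shape[0]
    return B.reshape(d, 4, k).transpose(0, 2, 1)

def tensor_to_basis(S):
    """Mode-1 matricisation: the inverse of basis_to_tensor."""
    d, k, r = S.shape
    return S.transpose(0, 2, 1).reshape(d, r * k)

def person_specific_sh(S, c):
    """Mode-2 product of S with a shape coefficient vector c of length k.
    Returns the (d, 4) first order SH images for that individual."""
    return np.einsum("dkc,k->dc", S, c)
```

    Under this ordering, multiplying the person specific SH images by a lighting vector gives the same image as multiplying the full basis by the Kronecker product of the lighting and shape vectors.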

    The problem given in (6) can now be solved within an optimisation framework, which we examine in detail in the next section.

    3.4. Robust Construction Of Spherical Harmonic Bases

    Inspired by recent advances in robust low-rank subspace recovery [9], we seek to modify Equation 6 to include new constraints that impose robustness. As mentioned previously, faces can be accurately reconstructed by a linear combination of faces taken from a low-dimensional basis. Therefore, we propose to decompose the image matrix into a low-rank part ($\mathbf{A}$) capturing the low frequency shape information and a sparse part ($\mathbf{E}$) accounting for gross but sparse noise such as partial occlusions and pixel corruptions. To promote low-rank and sparsity, the nuclear norm (denoted by $\|\cdot\|_*$) and the $\ell_1$-norm (denoted by $\|\cdot\|_1$) are employed, respectively. Formally, we propose to solve the following non-convex optimisation problem:

    $$ \arg\min_{\mathbf{A}, \mathbf{E}, \mathbf{B}, \mathbf{L}, \mathbf{C}} \|\mathbf{A}\|_* + \lambda\|\mathbf{E}\|_1 + \frac{\mu}{2}\|\mathbf{A} - \mathbf{B}(\mathbf{L} * \mathbf{C})\|_F^2 \quad \text{subject to } \mathbf{X} = \mathbf{A} + \mathbf{E},\ \mathbf{B}^T\mathbf{B} = \mathbf{I}. \tag{7} $$

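    Although the above problem is non-convex, an accurate solution can be obtained by employing the Alternating Directions Method (ADM) [6]. That is, we minimise the following augmented Lagrangian function:

    $$ \mathcal{L}(\mathbf{A}, \mathbf{E}, \mathbf{B}, \mathbf{C}, \mathbf{L}, \mathbf{Y}) = \|\mathbf{A}\|_* + \lambda\|\mathbf{E}\|_1 + \frac{\mu}{2}\|\mathbf{A} - \mathbf{B}(\mathbf{L} * \mathbf{C})\|_F^2 + \operatorname{tr}\left(\mathbf{Y}^T(\mathbf{X} - \mathbf{A} - \mathbf{E})\right) + \frac{\mu}{2}\|\mathbf{X} - \mathbf{A} - \mathbf{E}\|_F^2, \tag{8} $$

    subject to $\mathbf{B}^T\mathbf{B} = \mathbf{I}$. Let $t$ denote the iteration index. Given $\mathbf{A}_{[t]}$, $\mathbf{E}_{[t]}$, $\mathbf{B}_{[t]}$, $\mathbf{C}_{[t]}$, $\mathbf{L}_{[t]}$, $\mathbf{Y}_{[t]}$ and $\mu_{[t]}$, the iteration of ADM for Equation 7 reads: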

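    $$ \mathbf{A}_{[t+1]} = \arg\min_{\mathbf{A}_{[t]}} \mathcal{L}(\mathbf{A}_{[t]}, \mathbf{E}_{[t]}, \mathbf{B}_{[t]}, \mathbf{C}_{[t]}, \mathbf{L}_{[t]}, \mathbf{Y}_{[t]}) = \arg\min_{\mathbf{A}_{[t]}} \|\mathbf{A}_{[t]}\|_* + \frac{\mu_{[t]}}{2}\left( \|\mathbf{A}_{[t]} - \mathbf{B}_{[t]}(\mathbf{L}_{[t]} * \mathbf{C}_{[t]})\|_F^2 + \left\|\mathbf{X} - \mathbf{A}_{[t]} - \mathbf{E}_{[t]} + \frac{\mathbf{Y}_{[t]}}{\mu_{[t]}}\right\|_F^2 \right), \tag{9} $$

    $$ \mathbf{E}_{[t+1]} = \arg\min_{\mathbf{E}_{[t]}} \lambda\|\mathbf{E}_{[t]}\|_1 + \frac{\mu_{[t]}}{2}\left\|\mathbf{X} - \mathbf{A}_{[t+1]} - \mathbf{E}_{[t]} + \frac{\mathbf{Y}_{[t]}}{\mu_{[t]}}\right\|_F^2, \tag{10} $$

    $$ \mathbf{B}_{[t+1]} = \arg\min_{\mathbf{B}_{[t]}^T\mathbf{B}_{[t]} = \mathbf{I}} \frac{\mu_{[t]}}{2}\|\mathbf{A}_{[t+1]} - \mathbf{B}_{[t]}(\mathbf{L}_{[t]} * \mathbf{C}_{[t]})\|_F^2, \tag{11} $$

    $$ \left[\mathbf{L}_{[t+1]}, \mathbf{C}_{[t+1]}\right] = \arg\min_{\mathbf{L}_{[t]}, \mathbf{C}_{[t]}} \frac{\mu_{[t]}}{2}\|\mathbf{A}_{[t+1]} - \mathbf{B}_{[t+1]}(\mathbf{L}_{[t]} * \mathbf{C}_{[t]})\|_F^2. \tag{12} $$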

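    Subproblem (9) admits a closed-form solution, given by the singular value thresholding (SVT) [8] operator as:

    $$ \mathbf{A}_{[t+1]} = \mathcal{D}_{\mu_{[t]}^{-1}}\left[\mathbf{M}_{[t]} - \mathbf{A}_{[t]} + \mathbf{X} - \mathbf{E}_{[t]} + \frac{\mathbf{Y}_{[t]}}{\mu_{[t]}}\right], \tag{13} $$

    where $\mathbf{M}_{[t]} = \mathbf{B}_{[t]}(\mathbf{L}_{[t]} * \mathbf{C}_{[t]})$ is introduced for brevity of the equation and the SVT is defined as $\mathcal{D}_\tau(\mathbf{Q}) = \mathbf{U}\mathcal{S}_\tau[\mathbf{S}]\mathbf{V}^T$ for any matrix $\mathbf{Q}$ with SVD $\mathbf{Q} = \mathbf{U}\mathbf{S}\mathbf{V}^T$. Subproblem (10) has a unique solution that is obtained via the elementwise shrinkage operator [9]. The shrinkage operator is defined as $\mathcal{S}_\tau[q] = \operatorname{sgn}(q)\max(|q| - \tau, 0)$. Therefore, the solution of (10) is

    $$ \mathbf{E}_{[t+1]} = \mathcal{S}_{\lambda\mu_{[t]}^{-1}}\left[\mathbf{X} - \mathbf{A}_{[t+1]} + \frac{\mathbf{Y}_{[t]}}{\mu_{[t]}}\right]. \tag{14} $$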
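    The two proximal operators used above, elementwise shrinkage and singular value thresholding, can be sketched directly from their definitions. The function names `shrink` and `svt` are chosen for this example.

```python
import numpy as np

def shrink(Q, tau):
    """Elementwise shrinkage: sgn(q) * max(|q| - tau, 0)."""
    return np.sign(Q) * np.maximum(np.abs(Q) - tau, 0.0)

def svt(Q, tau):
    """Singular value thresholding: shrink the singular values of Q by tau
    and rebuild the matrix from the thresholded spectrum."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt
```

    Shrinkage sets small entries exactly to zero, which promotes sparsity, while thresholding the singular values sets small ones to zero and so lowers the rank.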

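    Subproblem (11) is a reduced rank Procrustes Rotation problem [52]. Its solution is given by $\mathbf{B}_{[t+1]} = \mathbf{U}\mathbf{V}^T$ with

    $$ \mathbf{A}_{[t+1]}(\mathbf{L}_{[t]} * \mathbf{C}_{[t]})^T = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T, \tag{15} $$

    being the SVD of $\mathbf{A}_{[t+1]}(\mathbf{L}_{[t]} * \mathbf{C}_{[t]})^T$. Due to the unitary invariance of the Frobenius norm, Equation 12 becomes

    $$ \arg\min_{\mathbf{L}_{[t]}, \mathbf{C}_{[t]}} \|\mathbf{B}_{[t+1]}^T\mathbf{A}_{[t+1]} - \mathbf{L}_{[t]} * \mathbf{C}_{[t]}\|_F^2. \tag{16} $$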

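    Subproblem (12, 16) is a least squares factorisation of a Khatri-Rao product [40], which is solved as follows. Let $\mathbf{Q} = \mathbf{B}_{[t+1]}^T\mathbf{A}_{[t+1]}$, $\mathbf{L} = \mathbf{L}_{[t]}$, and $\mathbf{C} = \mathbf{C}_{[t]}$. Furthermore, let $\mathbf{q}_i$, $\mathbf{l}_i$, and $\mathbf{c}_i$ be the $i$th columns of matrices $\mathbf{Q}$, $\mathbf{L}$, and $\mathbf{C}$, respectively. Clearly $\mathbf{q}_i = \mathbf{l}_i \otimes \mathbf{c}_i$, where $\otimes$ denotes the Kronecker product. For each column of $\mathbf{Q}$: reshape $\mathbf{q}_i$ into a matrix $\tilde{\mathbf{Q}}_i \in \mathbb{R}^{k \times 4}$ such that $\operatorname{vec}(\tilde{\mathbf{Q}}_i) = \mathbf{q}_i$. Obviously, $\tilde{\mathbf{Q}}_i = \mathbf{c}_i \mathbf{l}_i^T$ is a rank-one matrix. Compute the SVD of $\tilde{\mathbf{Q}}_i$ as $\tilde{\mathbf{Q}}_i = \mathbf{U}_i\boldsymbol{\Sigma}_i\mathbf{V}_i^T$. The best rank-one approximation of $\tilde{\mathbf{Q}}_i$ is obtained by truncating the SVD as $\mathbf{c}_i = \sqrt{\sigma_1}\,\mathbf{u}_i$ and $\mathbf{l}_i = \sqrt{\sigma_1}\,\mathbf{v}_i$, where $\mathbf{u}_i$ and $\mathbf{v}_i$ are the first column vectors of $\mathbf{U}_i$ and $\mathbf{V}_i$, respectively, and $\sigma_1$ is the largest singular value. The ADM for solving (7) is outlined in Algorithm 1.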
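    The basis update of subproblem (11) and the Khatri-Rao factorisation of subproblem (12, 16) can be sketched as two small functions. The names `procrustes_update` and `khatri_rao_factorise` are hypothetical, and the code assumes the column convention in which each column of the Khatri-Rao product is the Kronecker product of a lighting vector with a shape vector, so that the column-major reshape of a column into a k x r matrix is the outer product of the shape and lighting vectors.

```python
import numpy as np

def procrustes_update(A, M):
    """Minimise ||A - B M||_F over B with orthonormal columns.

    A: (d, n) low-rank data, M: (r, n) combined coefficients, d >= r.
    The minimiser is U V^T from the thin SVD of A M^T.
    """
    U, _, Vt = np.linalg.svd(A @ M.T, full_matrices=False)
    return U @ Vt

def khatri_rao_factorise(Q, r, k):
    """Least squares factorisation of Q (r*k, n) into L (r, n) and C (k, n),
    using one rank-one SVD per column."""
    n = Q.shape[1]
    L = np.empty((r, n))
    C = np.empty((k, n))
    for i in range(n):
        # Column-major reshape so that vec(Q_i) = q_i and Q_i ~ c_i l_i^T.
        Q_i = Q[:, i].reshape(k, r, order="F")
        U, s, Vt = np.linalg.svd(Q_i, full_matrices=False)
        root = np.sqrt(s[0])
        C[:, i] = root * U[:, 0]   # left singular vector gives c_i
        L[:, i] = root * Vt[0]     # right singular vector gives l_i
    return L, C
```

    Each recovered pair of columns is only defined up to a reciprocal scaling and a joint sign flip, so it is the Kronecker product of the two columns, not the columns themselves, that should be compared against the input.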

    It is important to note that there are inherent ambiguities in this decomposition, both from the SVD to recover $\mathbf{B}$ and in the Khatri-Rao factorisation to recover $\mathbf{L}$ and $\mathbf{C}$. In particular, we are most concerned about how they may affect the recovered normals before we integrate them to recover depth. In order to resolve these ambiguities, we take the simplest possible approach: we recover the ambiguity matrix from a template set of normals provided by a known mean face.

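    Algorithm 1 Solving (7) by the ADM method.

    Input: Data matrix $\mathbf{X} \in \mathbb{R}^{d \times n}$ and parameter $\lambda$.
    Output: Matrices $\mathbf{A}$, $\mathbf{E}$, $\mathbf{B}$, $\mathbf{C}$, $\mathbf{L}$.

    1: Initialise: $\mathbf{A}_{[0]} = \mathbf{0}$, $\mathbf{E}_{[0]} = \mathbf{0}$, $\mathbf{B}_{[0]} = \mathbf{0}$, $\mathbf{C}_{[0]} = \mathbf{0}$, $\mathbf{L}_{[0]} = \mathbf{0}$, $\mathbf{Y}_{[0]} = \mathbf{0}$, $\mu_{[0]} = 10^{-6}$, $\rho = 1.1$, $\epsilon = 10^{-8}$
    2: while not converged do
    3:   Fix $\mathbf{E}_{[t]}$, $\mathbf{B}_{[t]}$, $\mathbf{C}_{[t]}$, $\mathbf{L}_{[t]}$ and update $\mathbf{A}_{[t+1]}$ by

         $$ \mathbf{A}_{[t+1]} = \mathcal{D}_{\mu_{[t]}^{-1}}\left[\mathbf{B}_{[t]}(\mathbf{L}_{[t]} * \mathbf{C}_{[t]}) - \mathbf{A}_{[t]} + \mathbf{X} - \mathbf{E}_{[t]} + \frac{\mathbf{Y}_{[t]}}{\mu_{[t]}}\right] \tag{17} $$

    4:   Fix $\mathbf{A}_{[t+1]}$, $\mathbf{B}_{[t]}$, $\mathbf{C}_{[t]}$, $\mathbf{L}_{[t]}$ and update $\mathbf{E}_{[t+1]}$ by

         $$ \mathbf{E}_{[t+1]} = \mathcal{S}_{\lambda\mu_{[t]}^{-1}}\left[\mathbf{X} - \mathbf{A}_{[t+1]} + \frac{\mathbf{Y}_{[t]}}{\mu_{[t]}}\right] \tag{18} $$

    5:   Update $\mathbf{B}_{[t+1]}$ by first performing the SVD on:

         $$ \mathbf{A}_{[t+1]}(\mathbf{L}_{[t]} * \mathbf{C}_{[t]})^T = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T, \quad \mathbf{B}_{[t+1]} = \mathbf{U}\mathbf{V}^T \tag{19} $$

    6:   Update $[\mathbf{L}_{[t+1]}, \mathbf{C}_{[t+1]}]$ via a least squares Khatri-Rao factorisation, as described in Section 3.4
    7:   Update Lagrange multipliers by

         $$ \mathbf{Y}_{[t+1]} = \mathbf{Y}_{[t]} + \mu_{[t]}\left(\mathbf{X} - \mathbf{A}_{[t+1]} - \mathbf{E}_{[t+1]}\right) \tag{20} $$

    8:   Update $\mu_{[t+1]}$ by $\mu_{[t+1]} = \min(\rho\mu_{[t]}, 10^{6})$
    9:   Check convergence condition

         $$ \|\mathbf{X} - \mathbf{A}_{[t+1]} - \mathbf{E}_{[t+1]}\|_\infty < \epsilon, \quad \|\mathbf{A}_{[t+1]} - \mathbf{B}_{[t+1]}(\mathbf{L}_{[t+1]} * \mathbf{C}_{[t+1]})\|_\infty < \epsilon \tag{21} $$

    10:  $t \leftarrow t + 1$
    11: end while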

    3.5. Efficient Pixelwise Correspondence

    In contrast to the related work of Kemelmacher-Shlizerman [19], we achieve pixelwise correspondence between our images by using existing, efficient sparse facial alignment algorithms. This has two distinct advantages. Firstly, recent facial alignment algorithms such as those by Ren et al. [39] and Kazemi et al. [18] can produce a very accurate set of sparse facial features in the order of a single millisecond. In contrast, the optical flow method cited in [19] takes multiple seconds even for a small image. This means that our training time is drastically reduced in comparison to [19]. Ideally, our technique would be able to scale to the magnitude of thousands of images, whereas the alignment of [19] would quickly become infeasible as the number of images increases. In fact, the optical flow step is run multiple times, as [19] uses the collection flow algorithm [22], which involves an iterative algorithm of rank 4 decompositions and repeated optical flow. Secondly, the use of a direct alignment to a single reference frame enables the usage of our basis in existing appearance based facial alignment algorithms such as AAMs. This means that our basis can be used to reconstruct dense 3D shape of faces directly from an existing AAM fitting, provided the reference space of the AAM and our subspace is the same.

    Figure 2: Example of the low rank effect on warped pose. (a) initial input image (b) input image after warping (c) warped image after the low rank constraint (d) recovered depth from (c).

    However, it is important to note that there are two potential drawbacks to our alignment technique. Firstly, the alignment is based on a Piecewise Affine warping and is thus much coarser than the optical flow technique used in [19]. This is particularly amplified when larger poses are present in the input images. However, this is partly why the low rank component of our algorithm is so important. As Figure 2 shows, the robust decomposition of the basis helps correct these large global errors so that the shape subspace can be successfully recovered. Secondly, our technique does not contain a number of sub-clusters that can be used to warp expression onto our model. However, by using a large number of images that contain expression we directly include expression within our subspace. In [19], the recovered subspace will necessarily be devoid of expression as the global reference shape is neutral. This means that the subspace recovered by [19] will not be able to recover expressive 3D shape using efficient facial alignment algorithms.

    4. Experiments

    In this section we provide a number of experiments that emphasise the increase in robustness of our reconstructions. We also show a new application of this type of model that involves improving the fitting results of an AAM using our constructed SH basis. Choosing the number of components, $k$, to recover is an important problem that was not properly addressed by Kemelmacher-Shlizerman in [19]. In these experiments we attempt to recover as many components as possible in order to strike a balance between cleanly reconstructed normals and identity. However, there is a trade-off when choosing the value of $k$. In particular, if the value of $k$ is too large, then the decomposition is unable to separate the identity and shape and the subspace of shape no longer represents valid normals. This is one of the primary advantages of our robust decomposition, as it allows the value of $k$ to be larger given the reduced rank of the images. However, a potential disadvantage of our proposed method is the sensitivity of the algorithm to the parameter $\lambda$, which must be tuned for every dataset. It is also important to stress that our main goal is to recover the low frequency shape information to provide plausible 3D facial surfaces under challenging conditions. However, in Section 4.3, we show that our recovered subspace can be used in existing high frequency recovery algorithms such as SFS.

    Figure 4: Person specific model fitting for Tom Hanks. Columns: Initial, Final, Recovered Depth. Images of Tom Hanks coarsely aligned by a facial alignment method. Our algorithm improves the facial alignment and simultaneously recovers depth. Images shown are from a YouTube video of Tom Hanks.

    Figure 5: Our subspace used for SFS. Normals learnt automatically from the SH subspace of HELEN vs normals from the clean data of ICT-3DRFE. (b, e) the clean data (c, f) proposed subspace.

    Figure 3: Comparison with the blind decomposition of [19]. Columns: Input, Proposed, [19], Proposed, [19]. Images from the HELEN [25] dataset.

    The area of 3D facial surface recovery is lacking any form of formal quantitative benchmark. The quantitative benchmark presented in [19] is performed on depth data recovered from photometric stereo. This is not ground truth depth data, as error is introduced during integration, and a more accurate evaluation would be the angular error of the recovered normals. However, in the presence of cast shadows, even the normals of photometric stereo are biased. For this reason, namely the lack of a standard and fair quantitative evaluation, we focus on qualitative results in this paper.

    Specifically, we performed the following experiments: (1) We built our subspace using the HELEN [25] dataset. We directly compare against the blind decomposition proposed in [19] and show particularly challenging images from the dataset. This experiment highlights the difficulty in constructing subspaces from a large set of in-the-wild images. (2) We show that the robust subspace learnt in (1) can be used within the shape-from-shading (SFS) framework of Smith et al. [44]. By recovering the normals from every image of HELEN, we can perform a secondary principal component analysis (PCA) on the normals in order to directly embed them within Smith's algorithm. In this experiment, we compare against a clean dataset of normals acquired from the ICT-3DRFE [46] database. (3) We show how our subspace can be combined with an existing facial alignment algorithm, namely project-out AAMs [34]. Our subspace can be used both as the appearance basis for the AAM and also as a methodology for recovering dense 3D shape.

    In the following section we describe the construction of the bases and explain what processing was performed on each dataset.

    4.1. Constructing The Robust Bases

    The process of building the robust SH basis was the same for all datasets involved. Facial annotations consisting of 68 points were recovered through various methods for each dataset. For the HELEN database, the manual annotations provided by the IBUG group were used [42, 43]. For the Yale B, Photoface and ICT-3DRFE databases, manual annotations were used. The in-the-wild images and video of Tom Hanks were automatically annotated by the one millisecond facial alignment method of [18], as provided by the Dlib project [24].

    The images were then warped, via a piecewise affine transformation driven by these annotations, to a mean reference shape that was built from all the faces, training and testing, of the LFPW facial annotations provided by IBUG. This provided the dense correspondence required for performing matrix decompositions. To construct our SH bases, we performed the algorithm as described in Section 3.4 on the warped images. In order to provide the example reconstructions, the reconstructed images were warped back into their original shapes and then integrated using the method of Frankot and Chellappa [14].

    Table 1 gives examples of the training time taken for the in-the-wild Tom Hanks images and the HELEN dataset. It is important to note that part of the reason the training time is much lower for the Tom Hanks images is that they have an inherently lower rank than the HELEN images, as they are all of the same individual. This greatly affects the convergence time and thus the timings do not scale linearly.

    Data             W1   Train   W2   Tot
    HELEN (2330)      8     730   25   763
    T. Hanks (274)    1      21    4    26

    Table 1: Training Times. Mean training times in seconds over 10 runs, rounded to the nearest second. ’W1’ denotes warping to the LFPW reference frame of (150 × 150) pixels, ’W2’ denotes warping back to the original images, ’Train’ denotes the total training time of our method described in Section 3.4 and ’Tot’ is the sum of the three. Original images were larger than the reference, hence the increase from ’W1’ to ’W2’. Timings were recorded on an Intel Xeon E5-1650 3.20GHz with 32GB of RAM.
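    The integration step of Section 4.1 can be illustrated with a minimal NumPy sketch of the Fourier-domain least-squares integrator in the spirit of Frankot and Chellappa [14]. It is not the code used for the experiments: the function names, the periodic boundary assumption implied by the FFT and the small `eps` guard on the z-component are assumptions made for illustration.

```python
import numpy as np

def normals_to_gradients(normals, eps=1e-6):
    """Convert an (H, W, 3) normal map to gradients p = dz/dx, q = dz/dy."""
    nz = normals[..., 2]
    # Hypothetical guard: avoid dividing by a vanishing z-component.
    nz = np.where(np.abs(nz) < eps, eps, nz)
    return -normals[..., 0] / nz, -normals[..., 1] / nz

def frankot_chellappa(p, q):
    """Integrate gradient fields p, q (each H x W) into a depth map.

    The result is the least-squares integrable surface under periodic
    boundary conditions, recovered up to an additive constant.
    """
    h, w = p.shape
    # Angular frequencies along x (columns) and y (rows).
    wx = 2.0 * np.pi * np.fft.fftfreq(w)
    wy = 2.0 * np.pi * np.fft.fftfreq(h)
    u, v = np.meshgrid(wx, wy)
    P = np.fft.fft2(p)
    Q = np.fft.fft2(q)
    denom = u ** 2 + v ** 2
    denom[0, 0] = 1.0  # avoid division by zero at the DC term
    Z = (-1j * u * P - 1j * v * Q) / denom
    Z[0, 0] = 0.0  # the mean depth is unobservable from gradients
    return np.real(np.fft.ifft2(Z))
```

    Given a recovered normal map `n`, a depth map would be obtained as `frankot_chellappa(*normals_to_gradients(n))`. Because the FFT treats the image as periodic, practical use on face crops may need padding or masking, which this sketch omits.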

    4.2. Comparison Using HELEN

    In this set of experiments we wished to convey two results: (1) that we are capable of quickly constructing our basis on a large number of in-the-wild images, (2) that our robust formulation of the problem gives superior performance to the blind decomposition used by [19]. In this experiment, k = 200 and the total number of components was thus 4k = 800. Figure 3 shows the results from this experiment. As we can clearly see, on challenging images the blind decomposition is unable to separate the appearance from the illumination and thus the recovered normals do not yield accurate shape.

    4.3. Using The Subspace In SFS

    The SFS technique of Smith et al. [44] relies on a PCA basis constructed from normals of a single class of object. It then seeks to recover the high frequency normal information directly from the texture. In order to create the PCA required by [44], we recovered spherical harmonics for every image in the dataset using the proposed algorithm. We then computed Kernel-PCA [45] on the normals recovered from the HELEN images and supplied them to [44]. The lighting vector is also an input to the algorithm and we recover it by solving a least squares problem with the known normals.
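    The least squares recovery of the lighting vector can be sketched as follows. This is a hedged illustration rather than the exact formulation used in the experiments: it assumes a first-order spherical harmonic model with four coefficients per pixel (a constant term plus the three normal components, in keeping with the 4k component count quoted above), and the function name and optional albedo argument are hypothetical.

```python
import numpy as np

def estimate_lighting(image, normals, albedo=None):
    """Least-squares estimate of four first-order SH lighting coefficients.

    image:   (H, W) grey-level intensities
    normals: (H, W, 3) unit surface normals
    albedo:  optional (H, W) albedo map; assumed to be 1 everywhere if omitted
    """
    n = normals.reshape(-1, 3)
    rho = np.ones(len(n)) if albedo is None else albedo.reshape(-1)
    # Each row of B is rho * [1, nx, ny, nz] for one pixel.
    B = rho[:, None] * np.column_stack([np.ones(len(n)), n])
    # Solve min_l || B l - image ||_2.
    light, *_ = np.linalg.lstsq(B, image.reshape(-1), rcond=None)
    return light
```

    With the lighting fixed, the rendered image for any candidate normal map is simply `B @ light` reshaped to the image size, which is the quantity an SFS method compares against the observed intensities.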

    In order to provide a comparison for our reconstruction, we created a clean normal subspace using the data from the ICT-3DRFE [46] database. This database is primarily used for image relighting purposes; however, it provides a very accurate set of normals of faces under a wide range of expressions. The results of this experiment are shown in Figure 5. Although our subspace did not provide reconstructions that are as visually accurate as those of the subspace from ICT-3DRFE, it was still able to successfully recover a plausible representation of the high frequency shading information.

    4.4. Automatic Alignment

    In this experiment we used the Active Template Model (ATM) provided by the Menpo project [2] in order to perform a project-out type algorithm to align images of Tom Hanks. This model is similar to the Lucas-Kanade [33] method but uses a point distribution model (PDM) in order to perform non-rigid alignment between the images. In particular, the template image is fixed during optimisation of the PDM, and we use our subspace to provide a texture representing an approximation of the diffuse component of the image. This is essentially identical to the procedure performed within a project-out AAM.
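    The relationship to a project-out AAM can be illustrated with a small sketch. It is not Menpo's ATM implementation: the function names are hypothetical, images are treated as flattened vectors in the reference frame, and the subspace is assumed to have been orthonormalised (for example with a QR decomposition), which the SH basis itself need not be.

```python
import numpy as np

def diffuse_template(warped, mean, basis):
    """Approximate the diffuse texture by projecting onto the subspace.

    warped: (d,) image sampled in the reference frame
    mean:   (d,) mean texture
    basis:  (d, m) matrix with orthonormal columns (an assumption)
    """
    coeffs = basis.T @ (warped - mean)
    return mean + basis @ coeffs

def project_out_residual(warped, mean, basis):
    """Residual with the span of `basis` removed, as in project-out fitting."""
    r = warped - mean
    return r - basis @ (basis.T @ r)
```

    Under these assumptions, `warped - diffuse_template(warped, mean, basis)` equals `project_out_residual(warped, mean, basis)`, so fitting against a template reconstructed from the subspace penalises only the part of the image that the subspace cannot explain.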

    We used a person specific SH subspace that was built on images of Tom Hanks that were downloaded automatically from the Internet. In this case, the images were automatically aligned using the Dlib implementation of [18]. For this experiment, k = 30 and thus the total number of components was 4k = 120. We downloaded 200 frames from a YouTube video of Tom Hanks¹ and attempted to automatically align them using our subspace and the ATM. The ATM was initialised using another fitting of [18], which was then iteratively improved. At each global iteration, we recovered a new set of diffuse textures for each frame and then performed a refitting of every frame. This caused the images to align over a sequence of iterations. We performed 10 such iterations. Figure 4 shows two example frames where the alignment was improved and dense shape was also recovered.

    5. Conclusion

    We have proposed a robust method for automatically constructing generalised spherical harmonic subspaces. In particular, we have shown that by using a common reference frame as defined in algorithms such as AAMs, we can efficiently build models that have applications in shape recovery and facial alignment.

    ¹ https://www.youtube.com/watch?v=nFvASiMTDz0 from 3:43

    Acknowledgements

    Patrick Snape is funded by a DTA from Imperial College London and by a Qualcomm Innovation Fellowship. Yannis Panagakis is funded by the ERC under the FP7 Marie Curie Intra-European Fellowship. Stefanos Zafeiriou is partially supported by the EPSRC project EP/J017787/1 (4D-FAB).

    References

    [1] E. H. Adelson and A. P. Pentland. The perception of shading and reflectance. Perception as Bayesian inference, pages 409–423, 1996.

    [2] J. Alabort-i-Medina, E. Antonakos, J. Booth, P. Snape, and S. Zafeiriou. Menpo: A comprehensive platform for parametric image alignment and visual deformable models. In Proceedings of the ACM International Conference on Multimedia, MM ’14, pages 679–682, New York, NY, USA, 2014. ACM.

    [3] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. IEEE T-PAMI, 2015.

    [4] R. Basri, D. Jacobs, and I. Kemelmacher. Photometric stereo with general, unknown lighting. IJCV, 72(3):239–257, 2006.

    [5] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE T-PAMI, 25(2):218–233, 2003.

    [6] D. P. Bertsekas. Constrained optimization and lagrange multiplier methods. 1982.

    [7] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In SIGGRAPH, pages 187–194, 1999.

    [8] J.-F. Cai, E. J. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

    [9] E. J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):1–37, 2011.

    [10] C. Cao, Y. Weng, S. Lin, and K. Zhou. 3d shape regression for real-time facial animation. TOG, 32(4):1, 2013.

    [11] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. Facewarehouse: A 3d facial expression database for visual computing. TVCG, 20(3):413–425, 2014.

    [12] X. Cheng, S. Sridharan, J. Saragih, and S. Lucey. Rank minimization across appearance and shape for aam ensemble fitting. In ICCV, pages 577–584. IEEE, 2013.

    [13] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE T-PAMI, 23(6):681–685, 2001.

    [14] R. T. Frankot and R. Chellappa. A method for enforcing integrability in shape from shading algorithms. IEEE T-PAMI, 10(4):439–451, 1988.

    [15] D. Frolova, D. Simakov, and R. Basri. Accuracy of spherical harmonic approximations for images of lambertian objects under far and near lighting. In T. Pajdla and J. Matas, editors, ECCV, volume 3021 of Lecture Notes in Computer Science, pages 574–587. Springer Berlin Heidelberg, 2004.

    [16] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE T-PAMI, 23(6):643–660, 2001.

    [17] T. Hassner. Viewing real-world faces in 3d. In ICCV, pages 3607–3614. IEEE, 2013.

    [18] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In CVPR, pages 1867–1874. IEEE, 2014.

    [19] I. Kemelmacher-Shlizerman. Internet based morphable model. In ICCV, pages 3256–3263. IEEE, 2013.

    [20] I. Kemelmacher-Shlizerman and R. Basri. 3d face reconstruction from a single image using a single reference face shape. IEEE T-PAMI, 33(2):394–405, 2011.

    [21] I. Kemelmacher-Shlizerman and S. M. Seitz. Face reconstruction in the wild. In CVPR, pages 1746–1753. IEEE, 2011.

    [22] I. Kemelmacher-Shlizerman and S. M. Seitz. Collection flow. In CVPR, pages 1792–1799. IEEE, 2012.

    [23] C. G. Khatri and C. R. Rao. Solutions to some functional equations and their applications to characterization of probability distributions. Sankhya: The Indian Journal of Statistics, 1968.

    [24] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.

    [25] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang. Interactive facial feature localization. In Image Analysis and Processing, pages 679–692. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.

    [26] J. Lee, R. Machiraju, B. Moghaddam, and H. Pfister. Estimation of 3d faces and illumination from single photographs using a bilinear illumination model. In Eurographics Symposium on Rendering. Eurographics Association, 2005.

    [27] J. Lee, B. Moghaddam, H. Pfister, and R. Machiraju. A bilinear illumination model for robust face recognition. In ICCV, pages 1177–1184. IEEE, 2005.

    [28] M. Lee and C.-H. Choi. Fast facial shape recovery from a single image with general, unknown lighting by using tensor representation. Pattern Recognition, 44(7):1487–1496, 2011.

    [29] M. Lee and C.-H. Choi. A robust real-time algorithm for facial shape recovery from a single image containing cast shadow under general, unknown lighting. Pattern Recognition, 46(1):38–44, 2013.

    [30] M. Lee and C.-H. Choi. Real-time facial shape recovery from a single image under general, unknown lighting by rank relaxation. CVIU, 120:59–69, 2014.

    [31] Z. Lei, Q. Bai, R. He, and S. Z. Li. Face shape recovery from a single image using cca mapping between tensor spaces. In CVPR, pages 1–7, 2008.

    [32] F. Lu, Y. Matsushita, I. Sato, T. Okabe, and Y. Sato. Uncalibrated photometric stereo for unknown isotropic reflectances. In CVPR, pages 1490–1497. IEEE, 2013.

    [33] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th international joint conference on Artificial intelligence, 1981.

    [34] I. Matthews and S. Baker. Active appearance models revisited. IJCV, 60(2):135–164, 2004.

    [35] T. Papadhimitri and P. Favaro. A new perspective on uncalibrated photometric stereo. In CVPR, pages 1474–1481. IEEE, 2013.

    [36] T. Papadhimitri and P. Favaro. A closed-form, consistent and robust solution to uncalibrated photometric stereo via local diffuse reflectance maxima. IJCV, 107(2):139–154, 2014.

    [37] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma. Rasl: Robust alignment by sparse and low-rank decomposition for linearly correlated images. IEEE T-PAMI, 34(11):2233–2246, 2012.

    [38] R. Ramamoorthi and P. Hanrahan. On the relationship between radiance and irradiance: determining the illumination from images of a convex lambertian object. JOSA, 18(10):2448–2459, 2001.

    [39] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In CVPR, pages 1685–1692. IEEE, 2014.

    [40] F. Roemer and M. Haardt. Tensor-based channel estimation and iterative refinements for two-way relaying with multiple antennas and spatial reuse. IEEE Transactions On Signal Processing, 58(11):5720–5735, 2010.

    [41] C. Sagonas, Y. Panagakis, S. Zafeiriou, and M. Pantic. Raps: Robust and efficient automatic construction of person-specific deformable models. In CVPR, pages 1789–1796. IEEE, 2014.

    [42] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In ICCV Workshops, pages 397–403. IEEE.

    [43] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. A semi-automatic methodology for facial landmark annotation. In CVPR Workshops, pages 896–903. IEEE, 2013.

    [44] W. A. P. Smith and E. R. Hancock. Recovering facial shape using a statistical model of surface normal direction. IEEE T-PAMI, 28(12):1914–1930, 2006.

    [45] P. Snape and S. Zafeiriou. Kernel-pca analysis of surface normals for shape-from-shading. In CVPR, pages 1059–1066. IEEE, 2014.

    [46] G. Stratou, A. Ghosh, P. Debevec, and L.-P. Morency. Exploring the effect of illumination on automatic expression recognition using the ict-3drfe database. Image and Vision Computing, 30(10):728–737, 2012.

    [47] Y. Wang, L. Zhang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, and D. Samaras. Face relighting from a single image under arbitrary unknown lighting conditions. IEEE T-PAMI, 31(11):1968–1984, 2009.

    [48] T. Weise, S. Bouaziz, H. Li, and M. Pauly. Realtime performance-based facial animation. In SIGGRAPH, pages 77:1–77:10, New York, NY, USA, 2011. ACM.

    [49] L. Wu, A. Ganesh, B. Shi, Y. Matsushita, Y. Wang, and Y. Ma. Robust photometric stereo via low-rank matrix completion and recovery. In R. Kimmel, R. Klette, and A. Sugimoto, editors, ACCV, volume 6494 of Lecture Notes in Computer Science, pages 703–717. Springer Berlin Heidelberg, 2011.

    [50] S. Zafeiriou, G. A. Atkinson, M. F. Hansen, W. A. P. Smith, V. Argyriou, M. Petrou, M. L. Smith, and L. N. Smith. Face recognition and verification using photometric stereo: The photoface database and a comprehensive evaluation. IEEE Information Forensics and Security, 8(1):121–135, 2013.

    [51] S. Zhou, G. Aggarwal, R. Chellappa, and D. Jacobs. Appearance characterization of linear lambertian objects, generalized photometric stereo, and illumination-invariant face recognition. IEEE T-PAMI, 29(2):230–245, 2007.

    [52] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.