Towards Unsupervised Whole-Object Segmentation: Combining Automated Matting with Boundary Detection
Andrew N. Stein∗  Thomas S. Stepleton  Martial Hebert
The Robotics Institute, Carnegie Mellon University
Pittsburgh, Pennsylvania 15213
[email protected]
Abstract
We propose a novel step toward the unsupervised segmentation of whole objects by combining "hints" of partial scene segmentation offered by multiple soft, binary mattes. These mattes are implied by a set of hypothesized object boundary fragments in the scene. Rather than trying to find or define a single "best" segmentation, we generate multiple segmentations of an image. This reflects contemporary methods for unsupervised object discovery from groups of images, and it allows us to define intuitive evaluation metrics for our sets of segmentations based on the accurate and parsimonious delineation of scene objects. Our proposed approach builds on recent advances in spectral clustering, image matting, and boundary detection. It is demonstrated qualitatively and quantitatively on a dataset of scenes and is suitable for current work in unsupervised object discovery without top-down knowledge.
1. Introduction
It is well known that general scene segmentation is an ill-posed problem whose "correct" solution is largely dependent on application, if not completely subjective. Objective evaluation of segmentations is itself the subject of significant research (see [31] for a recent review). Here we consider instead the more specific problem of whole object segmentation; i.e., our goal is to accurately and concisely segment the foreground objects or "things" without necessarily worrying about the background or "stuff" [1], and without the use of top-down object knowledge. As we will explain, we use hypothesized boundary fragments to suggest partial segmentation "hints" to achieve this goal. Once the objects of interest are defined (which admittedly could itself involve some subjectivity), it becomes somewhat easier to define natural and intuitive measures of segmentation quality on a per-object basis.
∗Partial support provided by National Science Foundation (NSF) Grant IIS-0713406.
[Figure 1: Input Image → High-Probability Boundary Fragments → Segmentation "Hints" → Pairwise Affinities → Object Segmentation]
Figure 1. Starting with an image and hypothesized boundary fragments (only highest-probability fragments shown for clarity, though many more are used), we generate a large set of segmentation "hints" using automated matting. Combining information from those mattes into an affinity matrix, we can then generate a segmentation for each of the foreground objects in the scene.
Furthermore, segmentation results are intimately tied to the selection of the parameter controlling the granularity of the segmentation (e.g., the number of segments/clusters, or a bandwidth for kernel-based methods). Instead of seeking a single perfect segmentation, the integration of multiple segmentations obtained over a range of parameter settings has become common [9, 12, 17, 21, 29]. In such approaches, segmentation is treated as a mid-level processing step rather than an end goal. This reduces the pressure to obtain — or even define — the one "best" result, while also side-stepping the problem of parameter selection.
We are motivated by approaches such as [21], in which Russell et al. showed that it is possible to discover objects in an unsupervised fashion from a collection of images by relying on multiple segmentations. Systems using segmentation in this way, effectively as a proposal mechanism, should benefit from an underlying segmentation method which accurately and frequently delineates whole objects. Thus we will present a novel segmentation strategy designed to outperform existing methods in terms of two inter-related and intuitive measures of object segmentation quality. First, we take into account accuracy in terms of pixel-wise, per-object agreement between segments. Second, we also consider conciseness in terms of the number of segments needed to capture an object.
Following the classical Gestalt view as well as Marr's theories, we recognize that a key cue in differentiating objects from each other and their background lies in the discovery of their boundaries. Therefore we use hypothesized fragments of boundaries as input for our segmentations. Certainly there exists substantial prior work in directly exploiting boundary reasoning for object/scene segmentation and figure-ground determination, e.g. [18] and references therein. Recently, however, Stein et al. [26] have argued for the importance of detecting physical boundaries due to object pose and 3D shape, rather than relying on more typical, purely appearance-based edges, which may arise due to other phenomena such as lighting or surface markings. They have demonstrated improved boundary detection using cues derived from a combination of appearance and motion, where the latter helps incorporate local occlusion information due to parallax from a dynamic scene/camera. We employ this method for generating the necessary boundary information for our approach.
Boundaries not only indicate the spatial extent of objects; they also suggest natural regions to use in modeling those objects' appearances. The pixels on either side of a boundary provide evidence which, though only local and imperfect, can offer a "hint" of the correct segmentation by indicating discriminative appearance information.
In practice, we use the output offered by recent α-matting approaches as an approximate "classification" which realizes these segmentation hints. And though the individual mattes only suggest binary discrimination (usually "foreground" vs. "background"), we can nevertheless segment arbitrary numbers of objects in the scene by utilizing the collective evidence offered by a large set of mattes.
Recently, there has been a surge of interest and impressive results in interactive matting [2, 4, 6, 11, 22, 28]. Using only a sparse set of user-specified constraints, usually in the form of "scribbles" or a "tri-map", these methods produce a soft foreground/background matte of the entire image. As initially demonstrated in [3], such methods can potentially be automated by replacing user-specified "scribbles" with constraints indicated by local occlusion information.
In the proposed approach, we use each hypothesized boundary fragment to provide the matting constraints. Based on the combination of the set of mattes suggested by all of the fragments, we then derive an affinity measure which is suitable for use with any subsequent clustering algorithm, e.g. Normalized Cuts (NCuts) [24]. Through this combination of individually-weaker sources of information (i.e. fragmented boundaries and impoverished, binary mattes), we relieve the pressure to extract the one "perfect" matte or the "true" set of extended boundary contours, and yet we are still able to segment multiple objects in the scene accurately and concisely. Though beyond the scope of this work, it is our hope that such improved segmentation strategies will benefit methods which rely on multiple segmentations to generate, for example, object models from unlabeled data.
2. Segmentation “Hints” via Multiple Mattes
As discussed in Section 1, we will use image matting to produce segmentation "hints" for generating affinities useful in segmentation via clustering. After first providing an overview of matting, we will explain how we use boundary fragments to imply multiple mattes for estimating these affinities.
2.1. α-Matting
In a standard matting approach, one assumes that each observed pixel in an image, I(x, y), is explained by the convex combination of two unknown colors, F and B. A soft weight, α, controls this combination:

I(x, y) = α(x, y) F(x, y) + (1 − α(x, y)) B(x, y)    (1)
We will use this model as a proxy for a classifier in our work. In this formulation, then, pixels with an α-value near one are likely part of the F "class", while those with an α-value near zero are likely part of the B "class". Values near 0.5 indicate mixed pixels whose "membership" may be considered unknown. Typically, F and B correspond to notions of "foreground" and "background", but we will explain in the next section that these semantic assignments are not necessary for our approach.
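As a minimal numerical illustration of the compositing model in (1) — the color and α values here are hypothetical, not part of the paper's pipeline:

```python
import numpy as np

# Two unknown "colors" (here RGB) at a single pixel, and a soft alpha weight.
F = np.array([0.9, 0.1, 0.1])   # "foreground" color
B = np.array([0.1, 0.1, 0.9])   # "background" color

for alpha in (1.0, 0.5, 0.0):
    I = alpha * F + (1.0 - alpha) * B   # Eq. (1) at one pixel
    print(alpha, I)
```

With α = 1 the observed pixel equals F, with α = 0 it equals B, and α = 0.5 yields a mixed pixel whose class membership is genuinely ambiguous.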
Simultaneously solving for F, B, and α in (1) is of course not feasible. In general, a user specifies a small set of pixels, often referred to as "scribbles" or a "tri-map", which are then constrained to belong to one class or the other. These hard constraints are then combined with assumptions about α (e.g., smoothness) to find a solution at the unspecified pixels [2, 4, 6, 11, 22, 28]. We have adopted the approach proposed by Levin et al. [11], which offers excellent results via a closed-form solution for α based on reasonable assumptions about the local distributions of color in natural images.
Note that methods also exist for constrained "hard" segmentations, e.g. [5] — potentially with some soft matting at the boundaries enforced in post-processing [20] — but as we will see, using fully-soft α-mattes allows us to exploit the uncertainty of mixed pixels (i.e. those with α values near 0.5) rather than arbitrarily (and erroneously) assigning them to one class or the other. In fact, we follow the conventional wisdom of avoiding early commitment throughout our approach. Hard thresholds or grouping decisions are avoided in favor of maintaining soft weights until the final segmentation procedure. In addition to retaining the maximum amount of information for the entire process, this methodology also avoids the many extra parameters required for typical ad hoc decisions or thresholds.
2.2. Multiple Mattes → Affinities
In an automated matting approach, the goal is to provide a good set of constraints without requiring human intervention. Since object boundaries separate two different objects by definition, they are natural indicators of potential constraint locations: F on one side and B on the other. In [3], T-junctions were used to suggest sparse constraints in a similar manner. The benefit of matting techniques here is their ability to propagate throughout the image the appearance information offered by such local, sparse constraints. The middle of Figure 1 depicts a sampling of mattes generated from the differing constraints (or "scribbles") implied by various boundary fragments.
A remaining problem is that the approach described thus far only offers a binary (albeit soft) decision about a pixel's membership: it must belong to either F or B, which are usually understood to represent foreground and background. How then can we deal with multi-layered scenes?
We recognize that the actual class membership of a particular pixel, as indicated by its α value in a single matte, is rather meaningless in isolation. What we wish to capture, however, is that locations with similar α values across many different mattes (whether both zero or one) are more likely to belong to the same object, while locations with consistently different values are more likely to be part of different objects. While existing methods, such as Intervening Contours [7, 8, 10], may attempt to perform this type of reasoning over short and medium ranges within the image using standard edge maps, our use of mattes simultaneously factors in the boundaries themselves as well as the global appearance discrimination they imply. Furthermore, matte values near 0.5 carry a notion of uncertainty about the relationship between locations.
Using each of the N_F potential boundary fragments in the scene to generate an image-wide matte yields an N_F-length vector, v_i, of α-values at each pixel i. If we scale the α-values to be between +1 and −1 (instead of 0 and 1), such that zero now represents "don't know", then the agreement, or affinity, between two pixels i and j can be written in terms of the normalized correlation between their two scaled matting vectors:

A_ij = (v_i^T W v_j) / (|v_i| |v_j|),    (2)

where W is a diagonal matrix of weights corresponding to the confidence of each fragment actually being an object boundary. Thus mattes derived from fragments less likely to be object boundaries will not significantly affect the final affinity. Note that this combines all hypothesized fragments in a soft manner, avoiding the need to choose some ideal subset, e.g., via thresholding. Figure 2 provides an example schematic describing the overall process of employing boundary fragments to suggest mattes, which in turn generate an affinity matrix.
The value of A_ij will be maximized when the matting vectors for a pair of pixels usually put them in the same class. When a pair's vectors usually put the two pixels in opposite classes, the affinity will be minimized. The normalization effectively discounts the "don't know" (near-zero) entries which arise from mattes that do not provide strong evidence for one or both pixels of the pair.
We have now defined a novel matting-based affinity measure which can be used with any off-the-shelf clustering technique. In addition to incorporating boundary knowledge and global appearance reasoning, a unique and noteworthy aspect of our affinities is that they are defined based on feature vectors in a space whose dimension is a function of the image content — i.e. the number of detected fragments, N_F — rather than an arbitrarily pre-defined feature space of fixed dimension.
3. Detecting Boundary Fragments
In the previous section, we constructed mattes based on hypothesized boundary fragments. We will now explain the source of these hypotheses. While one could use a purely appearance-based edge detector, such as Pb [13], as a proxy for suggesting locations of object boundaries in a scene, this approach could also return many edges corresponding to non-boundaries, such as surface markings, yielding extra erroneous and misleading mattes. While it would be naïve to expect or require perfect boundary hypotheses from any method, we still wish to maximize the fraction that do indeed correspond to true object boundaries.
Recently, Stein et al. demonstrated improved detection of object/occlusion boundaries by incorporating local, instantaneous motion estimates when short video clips are available [26, 27]. Their method first over-segments a scene into a few hundred "super-pixels" [19] using a watershed-based approach. All super-pixel boundaries are used as potential object boundary fragments (where each fragment begins and ends at the intersection of three or more super-pixels). Using learned classifiers and inference on a graphical model, they estimate the probability that each fragment is an object boundary based on appearance and motion cues extracted along the fragment and from within each of the neighboring super-pixels. The resulting boundary probabilities provide the weights for W in (2). An example input image and its boundary fragments (shown in differing colors) can be found on the left side of Figure 1. For clarity, only the high-probability fragments are displayed, though we emphasize that all are used.
For each fragment, we can now generate an image-wide matte according to [11] as described in Section 2. The super-pixels on either side of a fragment naturally designate spatial support for the required labeling constraints. Since super-pixels generally capture fairly homogeneous regions, however, we have found it better to expand the constraint set for a fragment by using a triplet of super-pixels formed by also incorporating the super-pixels of the two neighboring fragments most likely to be boundaries. This process is illustrated in Figure 3. Note also that our approach avoids any need to choose the "right" boundary fragments (e.g. by thresholding), nor does it attempt to extract extended contours using morphological techniques or an ad hoc chaining procedure, both of which are quite brittle in practice. Instead we consider only the individual short fragments, with an average length of 18 pixels in our experiments.

[Figure 2: Boundary Fragments → Matting Constraints; Segmentation "Hints" → Feature Vectors; Pairwise Comparisons → Affinity Matrix]
Figure 2. Each potential boundary fragment found in the original image (three shown here) implies a set of constraints (F and B) used to generate a matte. Vectors of matting results at each location in the image (v_i and v_j) can be compared in a pairwise fashion to generate an affinity matrix, A, suitable for clustering/segmentation.
We employ the technique proposed in [26, 27] in order to improve our performance by offering better boundary detection and, in turn, better mattes. As mentioned above, that approach utilizes instantaneous motion estimates near boundaries. We are not, however, incorporating motion estimates (e.g. optical/normal flow) directly into our segmentation affinities [23, 32], nor is our approach fundamentally tied to the use of motion.¹
[Figure 3: super-pixel fragments with boundary probabilities (Pr = 0.1 to 0.8) and their F/B constraint regions]
Figure 3. The F and B constraint sets for a given fragment are initialized to its two neighboring super-pixels (left). To improve the mattes, these constraint sets are expanded to also include the super-pixels of the fragment's immediate neighbors with the highest probability of also being object boundaries (right).
4. Image Segmentation by NCuts
Given our pairwise affinity matrix A, which defines a fully-connected, weighted graph over the elements in our image, we can use spectral clustering according to the Normalized Cut criterion to produce an image segmentation with K segments [24].² To obtain a set of multiple segmentations, we can simply vary this parameter K.
¹Separate experiments using Pb alone to supply the weights in W yielded reasonable but slightly worse results than those presented in Section 6. This suggests that both the better boundaries from [26, 27] and the matting-based affinities presented here are useful.
²Certainly other techniques exist, but NCuts is one popular technique that also offers mature, publicly-available segmentation methods, facilitating the comparisons in Section 6. Our work is not exclusively tied to NCuts.
Using the boundary-detection approach described above, we not only obtain a better set of hypothesized boundary fragments and probabilities of those fragments corresponding to physical object boundaries (i.e. for use in W from (2)), but we can also use the computed super-pixels as our basic image elements instead of relying on individual pixels. In addition to offering improved, data-driven spatial support, this drastically reduces the computational burden of constructing A_ij, making it possible to compute all pairwise affinities between super-pixels instead of having to sample pixelwise affinities very sparsely.
The benefit here is more than reduced computation, however, particularly when using the popular NCuts technique. Other authors have noted that the non-intuitive segmentations typical of NCuts stem from an inability to fully populate the affinity matrix and/or the topology of a simple, four-connected, pixel-based graph [30]. By computing affinities between all pairs of super-pixels, we somewhat alleviate both of these problems while also improving speed.
We will not review here all the details of spectral clustering via NCuts given a matrix of affinities; see [14] for specifics of the method we follow. We compare the segmentations obtained using NCuts on our matting-based affinities to two popular approaches from the literature, each of which also uses NCuts and offers an implementation online. First is the recent multiscale approach of Cour et al. [7], in which affinities are based on Intervening Contours [10]. Second, we compare to the Berkeley Segmentation Engine (BSE) [8, 13], which relies on a combined set of patch-based and gradient-based cues (including Intervening Contours). Note that different methods exist for the final step of discretizing the eigenvectors of the graph Laplacian. While the BSE uses k-means (as do we, see [14]), the multiscale NCuts method attempts to find an optimal rotation of the normalized eigenspace [34]. Furthermore, both methods compute a sparsely-sampled, pixelwise affinity matrix.
5. Evaluating Object Segmentations
In this section, we will present an intuitive approach for determining whether one set of segmentations is "better" than another generated using a different method. For the eventual goal of object segmentation and discovery, we propose that the best set of segmentations will contain at least one result for each object in the scene in which that object can be extracted accurately and with as few segments as possible. Ideally, each object would be represented by a single segment which is perfectly consistent with the shape of that object's ground truth segmentation. We leave the problem of identifying the object(s) from sets of multiple segmentations to future work and methods such as [21].
Thus, we define two measures to capture these concepts: consistency and efficiency. For consistency, we adopt a common metric for comparing segmentations, which conveniently lies on the interval [0, 1] and captures the degree to which two sets of pixels, R and G, agree based on their intersection and union:

c(R, G) = |R ∩ G| / |R ∪ G|.    (3)

Here, R = {A, B, C, ...} ⊆ S is a subset of segments from a given (over-)segmentation S, and G is the ground truth object segmentation.
At one extreme, if our segmentation were to suggest a single segment for each pixel in the image, we could always reconstruct the object perfectly by selecting those segments (or in this case, pixels) which corresponded exactly to the ground-truth object's segmentation. But this nearly-perfect consistency would come at the cost of an unacceptably high number of constituent segments, as indicated in the rightmost example in Figure 4. At the opposite extreme, our segmentation could offer a single segment which covers the entire object, as shown in the leftmost example. In this case, we would achieve the desired minimum of one segment to capture the whole object, but with very poor consistency.
[Figure 4: examples requiring 1, 3, 6, and 250+ segments as segmentation consistency increases]
Figure 4. The tradeoff between segmentation consistency (or accuracy) and efficiency (or the number of segments required to achieve that consistency). As desired consistency increases, so too does the number of segments required to achieve that consistency.
Thus we see that there exists a fundamental tradeoff between consistency and the number of segments needed to cover the object, or efficiency.³ Therefore, when evaluating the quality of a scene's (over-)segmentation S, we must take into account both measures. We define the efficiency as the size of the minimal subset of segments, R, required to achieve a specified desired consistency, c_d, according to the ground truth object segmentation, G:

e_cd(S, G) = min |R|, such that c(R, G) ≥ c_d.    (4)

³It may be helpful to think of these measures and their tradeoff as being roughly analogous to the common notions of precision and recall, e.g., in object recognition.
Note that with this definition, a lower value of e(S, G) implies a more efficient (or parsimonious) segmentation. We can now specify c_d and compute the corresponding e_cd, which is equivalent to asking, "what is the minimum number of segments required to achieve the desired consistency for each object in this scene?" By asking this question for a variety of consistency levels, we can evaluate the quality of a particular method and compare it to other methods' performance at equivalent operating points.
Note that we can avoid a combinatorial search over all subsets R possibly required to achieve a particular c_d by considering only those segments which overlap the ground truth object and by imposing a practical limit on the number of segments we are willing to consider (selected in order of their individual consistency measures) [27].
Referring once again to Figure 4, the middle two examples indicate that we can achieve a reasonable level of consistency with only three segments; if we raise the desired consistency a bit higher (perhaps in order to capture the top of the mug), it will require us to use a different segmentation from our set which covers the mug with six segments. In general, the best method would be capable of providing at least one segmentation which yields a desired high level of consistency with the minimum degree of over-segmentation of any particular object.
6. Experiments
For each of a set of test scenes, we have labeled the ground truth segmentation for foreground objects of interest, which we roughly defined as those objects for which the majority of their boundaries are visible. Since we employ [26]'s method for boundary information, we also use their online dataset of 30 scenes. From those scenes, we have labeled ground truth segmentations for 50 foreground objects on which we will evaluate our approach.
We generate a set of segmentations for each scene by varying K between 2 and 20 while using either our matting-based approach, multiscale NCuts [7], or the BSE approach [8, 13]. Recall that the two latter methods compute affinities in a pixelwise fashion. To verify that any improvement offered by our approach is not solely due to our use of super-pixels, we also implemented a baseline approach which computes affinities from pairwise comparisons of L*a*b* color distributions within each super-pixel, using the χ²-distance.
For our proposed approach, we use each of the individual boundary fragments from [26] to suggest constraints for mattes as described in Section 2.2. In practice, we ignore those fragments with an extremely low probability (< 0.01) of being boundaries, since the resulting mattes in those cases would have almost no effect on computed affinities anyway. From an initial set of 350-1000, this yields approximately 90-350 fragments (and mattes) per image for computing pairwise, super-pixel affinities according to (2).
We first selected a set of ten desired consistency levels, from 0.5 to 0.95. Then for each labeled object in a scene and for each segmentation method, we pose the question, "what is the minimum number of segments required to achieve each desired consistency in segmenting this object?" We can then graph and compare the methods' best-case efficiencies as a function of the desired consistencies.
A typical graph is provided at the top of Figure 5, in which bars of different colors correspond to different segmentation methods. Each group of bars corresponds to a certain consistency level, and the height of the bars indicates the minimum efficiency (i.e. number of segments required) to achieve that consistency. Thus, lower bars are better. Bars extend to the top of the graph when a method could not achieve a desired consistency with any number of segments. Thus we see that our approach is able to achieve similar consistency with fewer segments — until c_d reaches 0.85, at which point all methods fail on this image. Not surprisingly, the relatively simplistic appearance model of the color-only baseline tends to over-segment objects the most.
We can also examine the actual segmentations produced by each method at corresponding desired consistency levels, as shown at the bottom of Figure 5. For the input image and selected ground truth object shown, we provide for each method the segmentations which use the minimum number of segments and achieved at least the desired consistency indicated to the left of the row. Also shown are the super-pixels used for our method and the color-distribution approach, along with a high-probability subset of the boundary fragments used in the proposed approach. (We display only a subset of the fragments actually used, simply for clarity.) Below each segmentation are the actual consistencies and efficiencies (i.e. number of segments) obtained. Note how our method achieves comparable consistencies with fewer segments — even when other methods may not be able to achieve that consistency at all. More results are provided in Figures 6-7 and in the supplemental material.
Not surprisingly, when the desired consistency is low, any method is usually capable of capturing an object with few segments. But as the desired consistency increases, it becomes more difficult, or even impossible, to find a small number of segments to recover that object so accurately. Finally, as the desired consistency becomes too high, all methods begin to fail. We find, however, that for many objects our matting-based approach tends to maintain better efficiency (i.e. decreased over-segmentation) into a higher-consistency regime than the other methods.
To capture this more quantitatively over all objects, we can compute the difference between the number of segments our method requires at each consistency level and the number required by the other methods. We would like to see that our method often requires significantly fewer segments to achieve the same consistency. Certainly, there are some "easier" objects for which the choice of segmentation method may not matter, so we would also expect to find that our method regularly performs only as well as other methods. But we also must ensure that we do better much more often than worse. Furthermore, we expect the potential benefit of our approach to be most obvious within a reasonable range of desired consistency: all methods will tend to perform equally well at low consistencies, and all methods will tend to fail equally often at very high consistencies.

[Figure 5: bar graph "Segments Required to Achieve Specified Consistency" (desired consistency 0.5-0.95 vs. # segments required; methods: Color Distribution Affinity, Multiscale NCuts, Berkeley Segmentation Engine (BSE), Proposed Matting-Based Affinity), plus example segmentations with their c and e_c values]
Figure 5. Top: A typical graph of segmentation efficiency vs. consistency for a set of desired consistency levels. Our method is able to achieve similar object segmentation consistency with fewer segments. Bottom: The corresponding input data (first row) and the resulting segmentations at two consistency levels (2nd, 3rd rows), as indicated by c_d = {0.70, 0.80} to the left of each row. For clarity, only high-probability boundary fragments used by our approach are shown, using a different color for each fragment. For visualization, the individual segments corresponding to the object, i.e. those used to compute the c and e values displayed below each segmentation, have been colored with a red-yellow colormap, while background segments are colored blue-green.
Figure 8 offers evidence that our approach does indeed outperform the other methods as desired. As expected, we perform just as well (or poorly) as other methods for many of the objects, particularly at very low or very high desired consistency. But we require several fewer segments per object in a significant number of cases. Furthermore, our approach rarely hurts; we do not often require more segments than the other methods.
[Figure 6: bar graph of segments required vs. desired consistency, plus example segmentations with their c and e_c values for each method]
Figure 6. Another example showing our method's ability to achieve comparable consistency with fewer segments.
7. Discussion & Conclusion
Debate continues over whether high-level reasoning, e.g. object recognition, should precede figure-ground perception [15], or if purely bottom-up, local reasoning can account for this process [16, 25]. Our work takes the more bottom-up perspective, since knowledge of specific objects is not required (unlike [18]), but our use of matting incorporates more powerful, semi-global reasoning as compared to purely-local methods. Our experiments indicate that by maintaining "soft" reasoning throughout, and by combining multiple, individually-imperfect sources of information in the form of fragmented boundary hypotheses and oft-uncertain mattes, our novel method for addressing object segmentation yields promising results for use in subsequent work on unsupervised object discovery or scene understanding. While here we have evaluated our affinities in isolation, as compared to existing methods, it is certainly possible that combining multiple types of affinities would offer further improvement.
Using boundaries and mattes as described simultaneously implies the grouping and separation of super-pixels on the same or opposite sides of a given boundary, respectively. We performed preliminary experiments with the technique described in [33] to separately incorporate such "attraction" and "repulsion" evidence via spectral clustering, rather than simply treating the two as equivalent sources of information with opposite signs. This often
[Figure: plot of "Segments Required to Achieve Specified Consistency" (x-axis: Desired Object Segmentation Consistency, 0.5-0.95; y-axis: # Segments Required, 0-10 or "Not Possible") for the Color Distribution Affinity, Multiscale NCuts, BSE, and the Proposed Matting-Based Affinity; example panels show the input image, ground-truth object, super-pixels, and high-probability boundary fragments, with per-method consistency c and segment count e_c at c_d >= 0.65 and c_d >= 0.80.]

Figure 7. Another example like those in Figures 5-6. Note only one of the three objects in the scene is shown here.
yielded very good results, but it was fickle: when it failed, it often failed completely. Future research in this direction is certainly warranted, however.
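The signed treatment of attraction and repulsion can be sketched as follows, in the spirit of the formulation in [33] (a minimal illustration over hypothetical attraction and repulsion matrices A and R among super-pixels, not the paper's implementation; the key difference from a plain signed affinity is that the degree normalization accumulates both attraction and repulsion magnitudes):

```python
import numpy as np

def attraction_repulsion_embedding(A, R, k=2):
    """Spectral embedding from separate attraction (A) and repulsion (R)
    evidence. Uses the signed affinity W = A - R, but normalizes by the
    total unsigned connection strength d = sum(A) + sum(R), so strong
    repulsion contributes to a node's degree rather than canceling out."""
    W = A - R
    d = A.sum(axis=1) + R.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = D_inv_sqrt @ W @ D_inv_sqrt   # normalized signed affinity
    vals, vecs = np.linalg.eigh(L)
    # keep the eigenvectors with the largest eigenvalues as the embedding
    return vecs[:, np.argsort(vals)[::-1][:k]]
```

On a toy graph with attraction within two groups of super-pixels and repulsion across them, the sign of the leading eigenvector recovers the two-way split; with k > 1, the rows of the embedding can be clustered (e.g. by k-means) for multiple groups.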
References
[1] E. H. Adelson and J. R. Bergen. The plenoptic function and the elements of early vision, chapter 1. The MIT Press, 1991.
[2] N. E. Apostoloff and A. W. Fitzgibbon. Bayesian video matting using learnt image priors. In CVPR, 2004.
[3] N. E. Apostoloff and A. W. Fitzgibbon. Automatic video segmentation using spatiotemporal T-junctions. In BMVC, 2006.
[4] X. Bai and G. Sapiro. A geodesic framework for fast interactive image and video segmentation and matting. In ICCV, 2007.
[5] Y. Boykov and M. P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In ICCV, 2001.
[6] Y. Chuang, B. Curless, D. Salesin, and R. Szeliski. A Bayesian approach to digital matting. In CVPR, 2001.
[7] T. Cour, F. Bénézit, and J. Shi. Spectral segmentation with multiscale graph decomposition. In CVPR, 2005.
[8] C. Fowlkes, D. Martin, and J. Malik. Learning affinity functions for image segmentation: Combining patch-based and gradient-based approaches. In CVPR, 2003.
[9] D. Hoiem, A. A. Efros, and M. Hebert. Recovering surface layout from an image. IJCV, 75(1), October 2007.
[10] T. Leung and J. Malik. Contour continuity in region based image segmentation. In ECCV, June 1998.
[Figure: three histograms ("Performance of "Matting: Simple NCuts" vs. Other Approaches") of the relative number of segments required by the other methods (Color Distributions, Multiscale NCuts, BSE) as compared to ours, at desired consistencies of 0.60, 0.70, and 0.80. Inset summaries: at 0.60, we require an average of 2.5 fewer segments for 27% of the objects and 1.6 more for 12%; at 0.70, 3.4 fewer for 42% and 1.1 more for 11%; at 0.80, 4.0 fewer for 58% and 1.7 more for 7%.]

          Improvement        Degradation
c_d       ∆e      % Obj.     −∆e     % Obj.
0.50      2.0     21%        1.3     3%
0.55      2.3     26%        1.4     6%
0.60      2.5     27%        1.6     12%
0.65      2.7     34%        1.2     10%
0.70      3.4     42%        1.1     11%
0.75      3.8     47%        1.9     11%
0.80      4.0     58%        1.7     7%
0.85      3.6     39%        2.1     12%
0.90      4.2     30%        3.5     9%
0.95      4.7     11%        3.4     5%

For each desired consistency level c_d, we indicate the percentage of objects for which we offer improvement over the other methods and how many fewer segments we require on average (∆e); higher values are better for both. Similarly, for cases where we do worse, we indicate how many more segments we require (−∆e) and how often; lower values are better here.

Figure 8. Overall Performance. At left are histograms of the relative number of segments required by our approach as compared to the other methods. The height of the bars corresponds to the fraction of the total number of objects for which we achieve the specified relative efficiency on the x-axis. Thus, bars at zero, in the center of the graph, correspond to cases where we perform just as well as the other approaches. Bars to the right (left) correspond to cases where we perform better (resp., worse), using fewer (resp., more) segments than the competition. The table and the insets on the graphs provide summary statistics at the specified consistency levels.
[11] A. Levin, D. Lischinski, and Y. Weiss. A closed form solution to natural image matting. In CVPR, 2006.
[12] T. Malisiewicz and A. A. Efros. Improving spatial support for objects via multiple segmentations. In BMVC, 2007.
[13] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 26(5):530–549, May 2004.
[14] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.
[15] M. A. Peterson and B. S. Gibson. Must figure-ground organization precede object recognition? An assumption in peril. Psychological Science, 5(5):253–259, September 1994.
[16] F. T. Qiu and R. von der Heydt. Figure and ground in the visual cortex: V2 combines stereoscopic cues with Gestalt rules. Neuron, 47:155–166, July 2005.
[17] A. Rabinovich, A. Vedaldi, and S. Belongie. Does image segmentation improve object categorization? Technical Report CS2007-0908, University of California San Diego, 2007.
[18] X. Ren, C. C. Fowlkes, and J. Malik. Figure/ground assignment in natural images. In ECCV, 2006.
[19] X. Ren and J. Malik. Learning a classification model for segmentation. In ICCV, volume 1, pages 10–17, 2003.
[20] C. Rother, V. Kolmogorov, and A. Blake. "GrabCut": Interactive foreground extraction using iterated graph cuts. SIGGRAPH, 23(3):309–314, 2004.
[21] B. C. Russell, A. A. Efros, J. Sivic, W. T. Freeman, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, 2006.
[22] M. Ruzon and C. Tomasi. Alpha estimation in natural images. In CVPR, 2000.
[23] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In ICCV, pages 1154–1160, 1998.
[24] J. Shi and J. Malik. Normalized cuts and image segmentation. PAMI, 22(8):888–905, August 2000.
[25] L. Spillmann and W. H. Ehrenstein. Gestalt factors in the visual neurosciences. In L. M. Chalupa and J. S. Werner, editors, The Visual Neurosciences, pages 1573–1589. MIT Press, Cambridge, MA, November 2003.
[26] A. Stein, D. Hoiem, and M. Hebert. Learning to find object boundaries using motion cues. In ICCV, 2007.
[27] A. N. Stein. Occlusion Boundaries: Low-Level Processing to High-Level Reasoning. Doctoral dissertation, The Robotics Institute, Carnegie Mellon University, February 2008.
[28] J. Sun, J. Jia, C.-K. Tang, and H.-Y. Shum. Poisson matting. SIGGRAPH, 23(3):315–321, 2004.
[29] J. Tang and P. Lewis. Using multiple segmentations for image auto-annotation. In ACM International Conference on Image and Video Retrieval, 2007.
[30] D. A. Tolliver and G. L. Miller. Graph partitioning by spectral rounding: Applications in image segmentation and clustering. In CVPR, 2006.
[31] R. Unnikrishnan, C. Pantofaru, and M. Hebert. Toward objective evaluation of image segmentation algorithms. PAMI, 29(6):929–944, June 2007.
[32] J. Xiao and M. Shah. Accurate motion layer segmentation and matting. In CVPR, 2005.
[33] S. Yu and J. Shi. Segmentation with pairwise attraction and repulsion. In ICCV, July 2001.
[34] S. X. Yu and J. Shi. Multiclass spectral clustering. In ICCV, 2003.