Multimed Tools Appl
DOI 10.1007/s11042-012-1079-z
Concurrent photo sequence organization
Liliana Lo Presti · Marco La Cascia
© Springer Science+Business Media, LLC 2012
Abstract Personal photo album organization is a highly demanding domain where advanced tools are required to manage large photo collections. In contrast to many previous works, which try to solve the problem of organizing a single user's photo sequence, we present a new technique to address the concurrent photo sequence organization problem, that is, the problem of organizing multiple photo sequences taken during the same event. Given a set of sequences acquired at the same place during the same temporal window by several users using different cameras, our framework is intended to capture the evolution of the event and to group photos based on temporal proximity and visual content. The method automatically organizes the reference sequence in a tree capturing the event structure. This structure is then used to align the remaining photo sequences to the reference one. We tested our approach on the publicly available Gallagher dataset and on a new dataset we collected; this new dataset is composed of four photo sequences taken by four users at a public event. Results demonstrate the effectiveness of our method.
Keywords Digital library · Personal photo album · Concurrent photos · Co-organization · Content analysis · Hidden Markov Model
This research has been conducted while Dr. Lo Presti was a post-doctoral researcher at the University of Palermo.

L. Lo Presti (B)
Computer Science Department, Boston University, Boston, MA, USA
e-mail: [email protected]

M. La Cascia
DICGIM, University of Palermo, Palermo, Italy
e-mail: [email protected]
1 Introduction
Nowadays cameras are commonplace, and large photo collections are being accumulated that need to be properly organized. Much of the research in this domain focuses on the problem of managing personal photo collections acquired by a single user, taking advantage of information such as who is in the photo, and when and where each photo has been acquired [5, 10, 19, 28]. Photo organization based on the identities of faces detected in the collection is often performed using clustering methods and asking the user to tag the representative faces or to select from a list of probable labels, as in Choi et al. [4] and Gallagher and Chen [9]. Recently, the problem has also been modeled as a data association problem in Lo Presti et al. [20] and Zhang et al. [29].
Organization of photo collections can take advantage of contextual information, such as photo timestamps and/or geo-reference information coming, for example, from GPS-equipped devices. Some methods perform visual content-based photo organization [13, 17]. In this case, however, where the photo has been collected does not refer to geo-reference data but to the particular situation in which the photo was acquired, for example outdoor or indoor, the park, the sea, etc.
In this paper, we focus on concurrent photo sequence organization, that is, the problem of organizing multiple photo sequences taken at the same place during the same temporal window by several users with different cameras. This problem has received little attention from the scientific community, but tools based on the ideas we present could be very useful in practice.
In fact, the purpose of taking and managing photos is changing; people want to capture important moments of their lives and share such moments with others. Indeed, photo sharing has recently been one of the most popular applications. The number of photo-sharing applications on smart phones is growing fast, and there is strong demand for them. Moreover, social networks have gained much popularity in recent years and are often used to share photos among friends. In particular, when a group of “friends” has been involved in a social event, they would probably like to share all the photos taken at the event; in this case, a proper organization of these photos is required. Tools for organizing several photo sequences of the same event are currently missing both in mobile photo-sharing applications and in social networks like Facebook [7] or Google+ [11]. Nonetheless, concurrent photo sequence sharing is a very interesting scenario that enables the use of collective knowledge for photo collection analysis and management [25]. This is the main motivation for our work: providing a framework to co-organize photos taken concurrently at a social event and to help users browse them within their social group.
Events are the key concepts often used to organize photos in albums. In this paper, we focus on the event structure of the shared sequences; we assume that, as the photos are taken during the same event, it is likely that all the sequences have a similar event structure. Therefore, in our framework the temporal and visual structure of a chosen sequence is learned and then transferred to the remaining unprocessed sequences.
We consider the case when the photo sequences are not fully overlapping and the cameras are not temporally synchronized. In such a scenario, absolute temporal information is not very helpful to organize the sequences; only the time difference between successive photos can reliably be used. Instead, other information related to the visual content can be effectively used to find correspondences among sequences.
There are several reasons why the problem of concurrent sequence organization is difficult. First, the set of possible points of view from which photos are taken is very large. Second, users can focus on particular aspects of the scene based on personal preferences: two users acquiring photos at the same moment and place may focus on completely different aspects/objects, taking photos that do not look similar. Third, the scene to acquire is hard to represent, and there is little evidence about what needs to be measured to obtain a good description of each photo.
Other challenges should be considered as well, for example illumination changes and different sensor characteristics that can affect the quality of the photos. Most important, each user can have his/her own preferences while setting up the camera (flash on/off, zoom, micro-utility...). All these factors make the problem of concurrent sequence organization very difficult and open new interesting research directions to investigate in the future.
The simplest approach to solving the sequence co-organization problem would be to consider all the photos together and use a clustering technique to find groups of photos that look similar. However, such an organization could perform poorly because of the high variance of the photo content, and it would not consider at all the temporal order of the photos along the sequence, which instead is an important clue. Each photo sequence can be seen as a story with a temporal structure. All these stories are partially overlapping, and our goal is to detect which parts of these stories overlap and to properly integrate them to get a unique and more complete story.
In our framework, one of the sequences (the longest one or the one selected by the user) is set as reference, and the remaining sequences are organized accordingly. Our contribution is twofold: we present a method to organize a sequence of photos hierarchically to capture the structure of the event, considering time information and visual content; then, the temporal structure and evolution of the event is captured by a Hidden Markov Model (HMM) that serves to classify photos belonging to the remaining sequences.
The organization of the paper is as follows: in Section 2, we present previous works on photo collection organization, while in Section 3 we define the problem. In Section 4 we present our method to organize the reference sequence in a tree. In Section 5 we describe the probabilistic framework we used to determine the alignment among concurrent photo sequences. Finally, in Sections 6 and 7, we describe datasets and experimental results, and discuss conclusions and future directions.
2 Related works
In recent years, many works focusing on photo collection management have been proposed, enabling users to share, browse, and search their own photos. In Gong and Jain [10], many aspects of photo segmentation are pointed out, for example the possibility of using contextual information to organize the collections, and of using where and when photos have been taken to browse them easily. In particular, the authors suggest that EXchangeable Image File (EXIF) data and/or an a priori known event model can provide useful insights to understand and represent the structure of a photo stream.
Many approaches have been based on partitive clustering techniques for classifying images into a default number of groups. However, such techniques, derived from classic image retrieval studies, do not yield suitable results when applied to personal photo collections. In Ardizzone et al. [1], mean-shift clustering is used to automatically organize image data focusing on faces, background, and time information. Data organization does not need any human intervention, since image features are automatically extracted and clustering parameters are automatically determined according to an entropy-based figure of merit. However, the event structure to which photos refer, which could be useful to the user for browsing the collection, does not completely emerge because no temporal analysis is performed on the sequence.
Some methods focus on organizing photos based on who is in the picture. In Lo Presti et al. [20], a data association problem is set up to group faces belonging to the same identity in order to ease the user's tagging task. A probabilistic framework is used to find correspondences based on online-learned face and clothing models estimated for each identity. In Lo Presti et al. [21], the same method has been extended to also consider time information.
In Zhang et al. [29], users are allowed to multi-select a group of photos and assign a name/tag to the person appearing in them. The method attempts to propagate names from photo level to face level, i.e., to infer the correspondence between name and face. However, whilst the user's tagging effort is minimized, the user still has to manually identify the group of photos where a person appears. Moreover, in some cases the method is not able to disambiguate between persons in the photos (i.e., when some persons always appear together in the set of photos).
In Li et al. [17], photo collections are organized based on image content. Color histograms of faces, clothing, and background are used as image content features. A similarity matrix for the photos in the collection is computed according to temporal and content features; then, hierarchical clustering is used to group similar photos. The contrast context histogram (CCH) technique is used to properly summarize each cluster.
All these methods, whilst effective for easing photo collection browsing, do not implement any capability to really understand both the photo sequence structure and its time evolution. In this paper, given a set of sequences taken at the same event, we propose to transfer learning about the temporal and visual structure of a chosen sequence to the remaining unprocessed ones, considering that it is likely all the sequences share the same event structure. To the best of our knowledge, this problem has been faced only in Jang et al. [14]. They cluster concurrent photos by first selecting a preferred sequence and then estimating, via temporal analysis, the basis clusters that will be used as a reference model for the remaining sequences. Then, based on the user's preferences, photos are iteratively clustered considering temporal and visual information until the user's preferences are satisfied. Photos were grouped by a hierarchical clustering method combining time and visual similarity.
On the contrary, our work uses clustering techniques only to organize a reference photo sequence, while all the other sequences are processed by means of the estimated temporal model. We used a Hidden Markov Model to explicitly represent the temporal model of the reference sequence and to infer correspondences with the photos belonging to the other sequences. In this way, it is not required that cameras are time-synchronized. In contrast to the work in Jang et al. [14], where the main goal is to satisfy the user's preferences, we propose a general unsupervised strategy to estimate the structure and dynamics of an event, and we show how to transfer such information to the other sequences. From this point of view, our method is closer to event understanding than to photo clustering.
3 Problem definition
Let us consider a photo sequence Sr chosen as reference sequence, and a set C of N sequences C = {S1, S2, ..., SN} all related to the same event, that is, all the sequences were collected during the same temporal window and at the same place by different users. Concurrent photo sequence organization aims to group photos from different sequences into clusters with meaningful “visual semantics”, that is, with similar content. For example, if the sequences are related to a birthday party, meaningful moments could be when the candles on the cake are blown out or when presents are opened. If the event is a wedding, possible meaningful clusters would regard the moment when the bride and groom enter the church, or when they come out, and so on.
In this paper, we propose to organize the photos of a reference sequence in a tree to represent the event structure; then, we model the dynamics occurring during the event itself by means of a Markov chain and use such a dynamic model to classify the pictures belonging to the other sequences. Of course, several methods may be used to infer the tree structure or model the event dynamics. Here, for the sake of demonstrating how our idea may be applied, we present specific implementations for modeling the event structure and dynamics, but we believe that other techniques may be adopted instead.
In our implementation, the reference sequence Sr is organized in a tree whose nodes, at each level, represent clusters of photos computed considering some peculiar characteristic. Nodes at the first level are computed considering temporal information coming from photo timestamps, in order to group photos acquired “near in time”. In the following, we call a node at this level a situation.
Photos within each situation are then clustered based on visual similarities to obtain a more accurate photo organization. We use hue and saturation values to represent the visual content, trying in this way to account for illumination changes. To group similar photos, we used a clustering technique that does not require any a priori information about the collection. In this way, nodes at the second level of the tree represent groups of photos that look similar because of both their color distribution and their temporal proximity. In the following, we refer to nodes at this level as content-clusters. Of course, other features could be used to represent content information, and many other levels could be added to the tree based on the property that needs to be highlighted. For example, the objects in each photo or the faces that are detected could be used to generate a new level in the tree.
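For illustration only, the two-level structure just described could be laid out as follows; this is a minimal sketch whose class and field names are our own, not taken from the paper:

```python
# Illustrative sketch of the two-level event tree: situations (temporal
# clusters) at the first level, content-clusters at the second. Names are
# ours, not the paper's.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ContentCluster:
    photo_ids: List[int]                      # photos with similar content
    mean: Optional[List[float]] = None        # descriptor mean of the cluster

@dataclass
class Situation:
    photo_ids: List[int]                      # photos taken "near in time"
    clusters: List[ContentCluster] = field(default_factory=list)

@dataclass
class EventTree:
    situations: List[Situation] = field(default_factory=list)  # time-ordered
```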
The tree computed for the sequence Sr is used to classify the photos belonging to the remaining sequences in the set C. As we do not impose that sequences are time-synchronized, we cannot use temporal information to perform the alignment; we instead need to classify the photos based on their visual content. However, in this context, the temporal order of photos is meaningful. If the event has a temporal structure (as happens for a birthday party or a wedding), then this structure could emerge completely or partially from each sequence and can be used to find an alignment between them. For this reason, we used a first-order Hidden Markov Model (HMM) to represent the structure of the event with respect to the reference photo sequence, and to infer how the remaining sequences can match the same temporal order. In practice, photos taken in a certain sequence Si are treated as observations in the model while, as we will explain later, the content-clusters detected for the reference sequence Sr are used to define the possible states that the hidden variables can assume. Photos in the sequence to classify are considered in time order and, while performing inference, the transition matrix of the Markov chain is computed in such a way that the temporal sequence adapts to the temporal structure of the reference sequence. The consecutiveness of the situations discovered in sequence Sr must hold also for the other sequences. Sequences in the set C are processed individually. In Fig. 1, we present a scenario with three photo sequences. Victor's photos are used as reference sequence and are organized in a tree. The leaf nodes are used to classify and organize Alice's and Bob's photos. While performing such classification, the temporal order estimated for Victor's photos is used to organize the remaining sequences. However, some photos may remain unclassified (see the Null links in Bob's sequence) because their content can largely differ from that of Victor's photos.
Fig. 1 Schema of the proposed framework: the reference sequence is organized in a tree considering temporal proximity and visual content; the tree is used to classify the photos belonging to the remaining sequences. (Panels: reference sequence organization; sequence co-organization; Alice's photos; Bob's photos; Null links)
4 Hierarchical sequence organization
Given a photo sequence Sr, our goal is to organize it considering when and where each photo has been acquired, by using context and content. Our method organizes photos in a tree where each node represents a cluster of photos with similar features. Figure 2 shows part of the tree we obtained for a sequence of our dataset.
This kind of organization is somewhat related to the problem of automatic video segmentation, where three steps are generally performed. The first step is shot boundary detection (SBD), that is, the task of identifying similar consecutive frames. The second step is keyframe selection, which extracts one or more frames to represent the shot. Finally, scene segmentation groups together related shots [26]. A shot groups together frames taken in a certain temporal window and, therefore, it resembles a situation. However, shot segmentation is performed considering visual properties of consecutive frames; in our problem, instead, situations are found considering only time information. Keyframes are conceptually similar to our content-clusters. In video segmentation, visually dissimilar keyframes are selected to represent a shot. In our problem, a set of similar photos in the same situation is grouped and used to estimate a probabilistic model representing their “appearance”. Several of these models are then used to represent the content of a situation.
Fig. 2 The image shows some nodes of the estimated tree for the reference sequence. Photos within the ellipse have been taken in the same situation; they are then grouped based on visual similarity
4.1 Time segmentation
Given a time-ordered photo sequence, our goal is to isolate all those pictures that were acquired within the same temporal window, that is, in the same situation. A possible solution to find these situations would be to use clustering methods to find all the photos with a similar timestamp. Such methods generally require setting a priori the number of clusters (e.g., k-means) or a similarity threshold (e.g., hierarchical clustering). In the latter case, the problem is challenging because the threshold affects the grain at which clusters are computed, and a proper threshold is difficult to set when the sequence has been acquired in a short temporal window. The risk is that of detecting too many short situations (in the worst case, every photo can be considered a cluster) or too long temporal windows mixing different “situations”.
Instead of directly using the timestamp value to group photos, we take a different approach, similar in spirit to the one presented in Cooper et al. [6]. Temporal segmentation is performed considering the difference $\delta_k$ between a photo timestamp $T_p(k)$ and the next one $T_p(k+1)$, computed as follows:

$$\delta_k = T_p(k+1) - T_p(k) \qquad (1)$$

where $k$ indexes the photos along the time-ordered sequence. The set of computed distances may comprise intra-situation distances $\delta_k^s$ and inter-situation distances $\delta_k^i$.
Figure 3 shows the plot of these differences along the ordered timestamps for a sequence of 200 photos taken during a period of almost 10 h. Peaks correspond to large time differences and can be considered as the start of possible new situations. In practice, the lower the time difference between two subsequent photos, the more probable it is that the two photos belong to the same situation. The temporal segmentation problem then reduces to detecting the “meaningful peaks”. If the number of situations along the sequence, say K, were known a priori, then it would be sufficient to take the highest K − 1 peaks to segment the sequence into situations. However, in general such information is not known a priori, and a threshold to detect the peaks is required (Fig. 4).
Fig. 3 Plot of the timestamp differences of consecutive photos along the time-ordered sequence. Peaks show where a new situation starts. (x-axis: photo indexes, 0–200; y-axis: timestamp differences in seconds, 0–9000)
Fig. 4 Situations detected for the first 15 photos of a photo sequence. The time is referred to the first photo in the sequence. Each situation is generally composed of a different number of photos showing a large variance of the visual content
In contrast to Cooper et al. [6], where a multi-scale analysis is applied to determine the peaks, we model the probability density function of the distances. We assume the distances between two successive photos follow two Gaussians, depending on whether or not the photos belong to the same situation. To automatically determine a suitable threshold and separate intra-situation from inter-situation distances, we train a mixture of two Gaussians by maximizing the likelihood of the data within the Expectation-Maximization framework.
To limit the effect of outliers, our technique uses a robust estimator to learn the parameters, as described in Li [18]. During training, we consider the minimum of a default threshold and the distance between the Gaussian mean and the training sample. We set this default threshold equal to the number of seconds in a day; this choice is reasonable for general personal photo collections. The Gaussian Mixture Model (GMM) estimation provides a clustering of the samples into two classes (intra-situation distances $\{\delta_k^s\}$ and inter-situation distances $\{\delta_k^i\}$). The inter-situation distances permit detecting the beginning of a new situation in the timestamp sequence. Any threshold providing the same clustering found through the GMM can be chosen in the interval $[\max_k\{\delta_k^s\}, \min_j\{\delta_j^i\}]$. Therefore, we set the threshold T as:

$$T = \frac{1}{2}\left(\max_k\{\delta_k^s\} + \min_j\{\delta_j^i\}\right) \qquad (2)$$
To find situations at the first level of the tree, the sequence of timestamp differences is sequentially analyzed. Every time a difference is greater than the threshold T, a new situation is discovered and used as a node of the tree.

The grain of the temporal segmentation depends on the threshold T and, if a finer segmentation is required, it is possible to decrease the value of this threshold. However, the automatically computed threshold permits the method to adapt to the temporal duration of the collection. As the data are unidimensional, the Expectation-Maximization procedure used to train the GMM is fast and efficient.
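To make the procedure concrete, the following is a minimal sketch of the segmentation step under stated assumptions: timestamps are given as POSIX seconds, scikit-learn's standard EM replaces the robust estimator of Li [18], and the function name is ours.

```python
# Minimal sketch of the temporal segmentation of Section 4.1 (Eqs. 1-2).
# Assumptions: timestamps in POSIX seconds; standard EM instead of the
# robust estimator of Li [18].
import numpy as np
from sklearn.mixture import GaussianMixture

def segment_situations(timestamps):
    t = np.sort(np.asarray(timestamps, dtype=float))
    deltas = np.diff(t)                               # Eq. (1)
    gmm = GaussianMixture(n_components=2).fit(deltas.reshape(-1, 1))
    labels = gmm.predict(deltas.reshape(-1, 1))
    intra = int(np.argmin(gmm.means_.ravel()))        # smaller-mean component
    d_intra = deltas[labels == intra]
    d_inter = deltas[labels != intra]
    if len(d_inter) == 0:                             # degenerate single-mode case
        return [np.arange(len(t))], float("inf")
    T = 0.5 * (d_intra.max() + d_inter.min())         # Eq. (2)
    boundaries = np.flatnonzero(deltas > T) + 1       # a gap > T opens a situation
    return np.split(np.arange(len(t)), boundaries), T
```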
Figure 5 shows the estimated GMM for the example sequence. As the plot shows, the two modes are well separable. The most peaked one corresponds to the mode of intra-situation distances, while the other one corresponds to the inter-situation distances. It is possible to note that the latter mode is quite flat; this is due to the large variance of the inter-situation distance mode. The plot also shows the position of the estimated threshold which, for this photo sequence, is 266 s and provides 18 different situations. In Fig. 4 we show the situations detected for the first 15 photos.
Algorithm 1 summarizes the main steps of our technique to organize photos in situations without any a priori information about the collection. The function train_GMM simply trains a mixture of two Gaussians and returns the two sets δs and δi of intra-situation and inter-situation distances.
4.2 Content-based organization
4.2.1 Content representation
As Fig. 4 shows, the visual content of the photos can change largely within the same situation, and different clusters could emerge considering the photo similarities. Representing the content of the photos with the goal of establishing matches among them is a difficult task because of the large changes of point of view, different camera settings, illumination changes, and so on. In some works [15], matches between images are established by extracting interest points in the images, for instance SIFT [22] and SURF [2], and then finding correspondences. However, when these methods are adopted, it is generally assumed a priori that the content of the two images is overlapping (the images have some content in common). In our case, this knowledge is not given a priori.
Fig. 5 Plot of the mixture of two Gaussians representing the intra-situation and inter-situation distances for the sequence of 200 photos. The threshold T is automatically computed by (2). The range of distances has been limited to [0, 900] to enhance the visualization. (x-axis: distances; labeled features: inter-situation distance mode, intra-situation distance mode, threshold T)
In other works, interest points are used in a bag-of-words paradigm [8]. In this case, learning a vocabulary of words to describe the image content introduces some loss of information due to quantization or binarization of the descriptors. Moreover, such an approach requires a suitable training set to learn the vocabulary, and is generally used to learn class models in a supervised way; in our problem, no supervision is available a priori.
To build the second level of the tree representing the reference sequence, the visual content of each photo is represented by means of the distribution of hue and saturation in the HSI space within maximally stable extremal regions (MSER) [23]. For each image, MSERs are detected. Pixels inside each MSER are closed under perspective transformation of image coordinates and under monotonic transformation of image intensities. Indeed, MSERs are computed considering the connected components detected for a certain range of possible thresholds of the gray-level image. As a result, these regions are invariant to affine transformations of the image intensities, and are stable, as they are computed in correspondence with those extremal regions that remain unchanged over a range of thresholds. In order to discard too-fine details and represent content in meaningful areas, we filter out too-small and too-large regions, and we use an incremental adaptive clustering method (as in Leow and Li [16]) to filter regions with similar position.
In our implementation, the MSER regions were computed by means of the code its authors made publicly available.¹ We used the default parameters for computing the regions, and then we analyzed and filtered the detected regions. We retain only the regions whose area is between 1% and 35% of the total image size. In the agglomerative clustering, we grouped the regions whose centroid Euclidean distance is lower than 10. These parameters have been tuned manually to increase the precision in retrieval on a subset of images.
¹ The code may be found at http://www.featurespace.org/.
We then represent the content by the global joint histogram of hue and saturation values within the detected MSER regions. We adopted a uniform quantization of the hue and saturation space with 8 bins per channel. The joint histogram is then normalized and transformed into a vector. This content representation implicitly assigns a greater weight to those pixels that fall in overlapping MSERs, as they are probably more meaningful for the content representation of the whole image.
Considering only hue and saturation makes the representation more robust to illumination changes. The advantage of adopting a histogram representation is that it does not depend on the structural and spatial image composition. In our context this is particularly important because our final goal is to compare photos taken from very different points of view without trying to estimate the real scene structure. Of course, it would be possible to use other affine region detectors; a very useful comparison among state-of-the-art approaches is reported in [24].
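As an illustration, a minimal sketch of this descriptor using OpenCV follows; the 1%–35% area filter and the 8-bin quantization come from the text, while the function name and the exact hue/saturation ranges are our assumptions.

```python
# Hedged sketch of the MSER-based hue/saturation descriptor of Section 4.2.1.
# Overlapping MSERs weigh their pixels more, matching the text's description.
import cv2
import numpy as np

def mser_hs_descriptor(bgr_image, bins=8):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    regions, _ = cv2.MSER_create().detectRegions(gray)
    h, w = gray.shape
    lo, hi = 0.01 * h * w, 0.35 * h * w          # discard too-small/large regions
    mask = np.zeros((h, w), dtype=np.int32)
    for pts in regions:                          # pts is an array of (x, y) pixels
        if lo <= len(pts) <= hi:
            mask[pts[:, 1], pts[:, 0]] += 1      # overlaps add weight
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    hue, sat = hsv[..., 0].ravel(), hsv[..., 1].ravel()
    weights = mask.ravel().astype(float)
    hist, _, _ = np.histogram2d(hue, sat, bins=bins,
                                range=[[0, 180], [0, 256]], weights=weights)
    hist = hist.ravel()                          # flatten to a feature vector
    return hist / hist.sum() if hist.sum() > 0 else hist
```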
4.2.2 Mean-shift clustering
We use the MSER-based descriptor to cluster photos within the same situation into groups with similar visual content. It is not possible to know a priori how many content-clusters have to be computed within the same situation; therefore, methods such as k-means cannot be adopted. We instead adopt the unsupervised mean-shift algorithm proposed in Wu and Yang [27].
Mean-shift is a gradient-descent-based algorithm used to automatically compute the modes of the probability distribution of the samples. Initially, all the points in the sample set are candidate modes of the distribution. Then, by means of an iterative procedure, these points are moved towards the true modes of the distribution. Mean-shift clustering suffers from the problem of determining how fast these points must be moved. In Wu and Yang [27], the authors propose to use the p-th order Epanechnikov kernel and the sample variance of the distances among samples as the size of the spatial window for the kernel. The method's performance depends on the order p of the kernel, which represents the stabilization parameter. They propose a technique to automatically determine the best value of p. In practice, they define a function to represent the shape of the estimated density distribution given p; then they evaluate this function for a range of values of p. The best value of p is the one for which this shape remains unchanged, and it can be evaluated based on the correlation of the shape function evaluated for each sample and for pairs of consecutive values of p. The selected value of p will be the one for which the correlation tends to 1 (for more details see Wu and Yang [27]).
In Fig. 6, each curve represents the plot of the correlation computed for the MSER-based descriptors of the photos within the same situation, for different values of p. In our experiments, the correlation did not always reach the value 1 (this case is mentioned in Wu and Yang [27]). Therefore, we empirically set a threshold for the correlation values and chose p so that the correlation value was greater than this threshold (in our experiments we used 0.995). A simplified sketch of this clustering step is given below.
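Since the p-th order Epanechnikov variant of Wu and Yang [27] is not available in common libraries, the sketch below uses standard mean-shift from scikit-learn as a stand-in; it approximates this step and is not the authors' implementation.

```python
# Stand-in sketch for the mean-shift step of Section 4.2.2: standard
# mean-shift (flat kernel) over the MSER-based descriptors of one situation,
# instead of the Wu-Yang variant with automatically selected order p.
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

def content_clusters(descriptors):
    X = np.asarray(descriptors)
    bw = estimate_bandwidth(X) or 0.1            # fallback for tiny situations
    ms = MeanShift(bandwidth=bw).fit(X)
    return ms.labels_, ms.cluster_centers_       # one center per content-cluster
```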
4.2.3 Content-based clustering
After running mean-shift clustering, samples have been shifted toward the distribution modes.

Fig. 6 Plot of the correlation ρ computed for the shape function for different values of the kernel order p. Each curve was computed for a different segment

As suggested in Wu and Yang [27], to determine the final clusters we use an agglomerative clustering method that groups together all the points returned by the mean-shift procedure whose distance is lower than a threshold τ. In our experiments, we set τ = 0.001; this parameter does not affect the performance of the whole system, as mean-shift provides well-separated clusters. The points are sequentially analyzed; every time the distance of a point from the already discovered cluster centroids is larger than τ, a new cluster is added. Each cluster is represented as a multivariate normal probability density with mean μ, computed by averaging all the samples in the cluster, and covariance Σ, set constant and diagonal, that is Σ = diag(σ²) with σ = 0.1. In case the cluster is composed of only a single photo, the corresponding MSER-based descriptor is used as the mean.
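A minimal sketch of this sequential grouping follows; the function name is ours, and τ = 0.001 as in the experiments.

```python
# Minimal sketch of the final grouping of Section 4.2.3: points shifted by
# mean-shift are scanned in order and assigned to the nearest existing
# centroid within tau, otherwise they open a new cluster.
import numpy as np

def agglomerate(shifted_samples, tau=1e-3):
    centroids, labels = [], []
    for x in shifted_samples:
        d = [np.linalg.norm(x - c) for c in centroids]
        if d and min(d) <= tau:
            labels.append(int(np.argmin(d)))     # join an existing cluster
        else:
            centroids.append(np.asarray(x))      # discover a new cluster
            labels.append(len(centroids) - 1)
    return labels, centroids
```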
In Fig. 2, the content-based clustering level shows how photos are grouped considering the MSER-based descriptors and the mean-shift clustering method. In practice, photos representing similar points of view of the scene are grouped together, making a visual semantic emerge from the set of photos.
Algorithm 2 summarizes the main steps needed to obtain content-clusters within each situation. The function extract_MSER detects the MSERs of a certain image s within the situation S. Then, the regions are filtered to remove large or small regions and, by using an agglomerative clustering, to remove strongly overlapping regions. Then, meanshift_clustering is used to compute the modes of the distribution. The function agglomerative_clustering detects the content-clusters by analyzing the points stored in shifted_samples.
5 Sequence co-organization
An event has an inherent temporality and can be modeled as a process that unfolds in time. In practice, we consider photos taken at a certain event as observations of a particular state of the event process. We make the assumption that an event can be modeled as a Markov process, that is, a time-varying process for which the Markov property holds; in such a process, the next state depends only on the present state. In particular, we adopt a first-order Markov chain. In our formulation, we hierarchically organize the reference sequence, obtaining a set of content-clusters. We use each of these clusters to model a state of the event process. The order of the situations to which the content-clusters refer permits defining the transitions among states for the whole process.
We adopt a Hidden Markov Model (HMM) [3] to model the dynamics of the process across time. In our framework, each photo to classify is an observation; therefore, the sequence of photos corresponds to the sequence of observations. Given such a sequence, our goal is to infer the sequence of states that generated the observations. As a state is a content-cluster in the event tree, inferring the hidden states corresponds to classifying each photo and associating it with the corresponding content-cluster. The transition probability model of the HMM therefore permits transferring the event time structure from the reference sequence to the sequence to classify.
Figure 7 shows the graphical model associated with a first-order discrete-time Markov model, where the probability at t + 1 depends only on the state inferred at time t. Here, t represents the index of the observation (photo) to classify. The state at any round t is denoted ct. In our formulation, each ct represents the index of a content-cluster, that is, one of the leaf nodes in the tree computed for the reference sequence Sr, as described in Section 4. Each photo in the sequence to classify is an observation Ot, and our framework infers the corresponding content-cluster in the reference sequence.
Fig. 7 Graphical model for an HMM. At round i, the variable Ci represents the hidden discrete state, while Oi represents the observation
The temporal model for the whole event is described by the transition probabilities P(ct+1|ct), while the observation model P(ot|ct) is defined in terms of the probability of observing a certain photo given the content-cluster.

Given the reference sequence, the temporal structure of the event is represented by the sequence of situations at the first level of the tree, estimated as described in Section 4. However, we can establish correspondences between sequences considering only visual features, if we assume that temporal information is uncertain or missing for the sequence to classify. Based on how the event tree has been estimated, each situation may be represented by several content-clusters. Thus, transition probabilities must be computed guaranteeing that the process properly unfolds. In practice, from each content-cluster it is possible to transit only to the content-clusters belonging to the same situation or to content-clusters of situations forward in time. Such a transition probability matrix models the dynamics of the event itself. We note that event dynamics have never been modeled or considered in previous works on photo organization.
Let us indicate with $C_k$ the set of content-clusters that refer to the same situation k. At round t + 1, it is possible to transit from state $c_t \in C_k$ to state $c_{t+1} \in H$, with $H = \bigcup_{\tau \ge k} C_\tau$. To enforce this constraint, we set the transition probability as follows:

$$p(c_{t+1} \mid c_t) = \begin{cases} \dfrac{1}{\#H} & \text{if } c_{t+1} \in H \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
where #(·) computes the number of content-clusters in H. In other words, H represents the set of content-clusters considering all the situations from the k-th one onward. Equation 3 ensures that the choice of the content-cluster at time t + 1 is uniform over all the content-clusters reachable from the content-cluster at time t.
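A minimal sketch of this transition model follows; `situation_of` is our name for the mapping from content-cluster index to situation index.

```python
# Minimal sketch of Eq. (3): from a content-cluster of situation k,
# transitions are uniform over the clusters of situation k and of all later
# situations (the set H).
import numpy as np

def transition_matrix(situation_of):
    n = len(situation_of)
    A = np.zeros((n, n))
    for c in range(n):
        H = [c2 for c2 in range(n) if situation_of[c2] >= situation_of[c]]
        A[c, H] = 1.0 / len(H)                # uniform probability over H
    return A
```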
The probability of an observation given the state at round t has been modeled by a multivariate normal distribution so that:

$$p(o_t \mid c_t) = \mathcal{N}(o_t \mid \mu_{c_t}, \Sigma_{c_t}) \qquad (4)$$

where $o_t$ is the MSER-based descriptor computed as described in Section 4 for the t-th photo in the sequence to classify, while $\mu_{c_t}$ and $\Sigma_{c_t}$ are the parameters representing the $c_t$-th content-cluster in the reference sequence.
We already pointed out that, in some cases, photos in concurrent sequences can largely differ, and it is possible that some photos do not match any particular state of the reference sequence. In this case, instead of forcing the method to associate each photo with a content-cluster in the reference sequence, it is possible to introduce an additional state, which we call the “Null” state. Every time an observation matches the Null state, that photo is not classified. We set the probability of an observation given the Null state empirically to a constant value; however, other models could be used for this probability. Our goal here is to demonstrate that some of the pictures cannot be properly classified by the event tree. In practice, other methods may be adopted to handle information from the unclassified photos in order to refine the event tree and add new nodes.
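For illustration only, one way to graft such a Null state onto the transition model of Eq. (3) is sketched below. The paper does not specify the Null state's transition probabilities, so the uniform reachability used here, as well as the constant emission value, are our assumptions.

```python
# Illustration only: add a Null state to the transition matrix of Eq. (3).
# Assumption: Null is reachable from (and can return to) every state; its
# emission density is the empirically chosen constant `null_emission`
# (the value below is illustrative, not from the paper).
import numpy as np

def add_null_state(A, null_emission=1e-4):
    n = A.shape[0]
    A2 = np.zeros((n + 1, n + 1))
    A2[:n, :n] = A
    A2[:, n] = 1.0                            # any state may enter Null
    A2[n, :] = 1.0                            # Null may re-enter any state
    A2 /= A2.sum(axis=1, keepdims=True)       # renormalize each row
    return A2, null_emission
```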
Given a sequence to classify, the inference of the content-clusters corresponding to each photo is performed by maximizing the likelihood L of states and observations. The likelihood can be computed as:

$$L = p(c_1)\, p(o_1 \mid c_1) \prod_{i=2}^{N} p(o_i \mid c_i)\, p(c_i \mid c_{i-1}) \qquad (5)$$

where N is the number of photos to classify. We consider that all the content-clusters have the same initial probability. To maximize the likelihood, we used the Viterbi algorithm [3], which consists in using a dynamic programming method to maximize the joint probability of the assignment of photos to leaf nodes. We summarize the algorithm by means of the pseudo-code in Algorithm 3; a compact sketch of this decoding step is shown below.
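The sketch works in log space for numerical stability; the isotropic covariance σ = 0.1 follows Section 4.2.3, while the function and variable names are ours.

```python
# Hedged sketch of the Viterbi decoding of Section 5: maximize Eq. (5) with
# Gaussian emissions (Eq. 4) and uniform initial probabilities.
import numpy as np
from scipy.stats import multivariate_normal

def viterbi(descriptors, means, A, sigma=0.1):
    n_states, n_obs = len(means), len(descriptors)
    logB = np.array([[multivariate_normal.logpdf(o, m, sigma**2)
                      for o in descriptors] for m in means])   # emission log-probs
    logA = np.log(A + 1e-300)                                  # avoid log(0)
    delta = np.full((n_obs, n_states), -np.inf)
    psi = np.zeros((n_obs, n_states), dtype=int)
    delta[0] = -np.log(n_states) + logB[:, 0]    # uniform initial probability
    for t in range(1, n_obs):
        scores = delta[t - 1][:, None] + logA    # all predecessor transitions
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, t]
    path = [int(delta[-1].argmax())]             # backtrack the best path
    for t in range(n_obs - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]                            # content-cluster per photo
```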
6 Experimental results
6.1 Dataset
We tested our framework on a dataset collected at a public event, which we will make freely and publicly available. This dataset is composed of four different sequences taken at the same place by four different users using different cameras. The non-professional photographers who collected the dataset did not know the purpose for which the dataset was being collected and took as many photos as they wanted. Table 1 summarizes the characteristics of each photo sequence. The sequences were taken at a public conference. It is possible to recognize three different sub-events: before arriving at the conference (A), during the conference (B), and after the conference, at the cocktail party (C). To determine the ground truth for content-clustering, we asked four users to manually annotate the dataset.
Table 1 Dataset for photo sequence co-organization

Seq.   | No. photos | Before conf. (A) | At the conf. (B) | After conf. (C) | Clusters ann. 1 | Clusters ann. 2 | Clusters ann. 3 | Clusters ann. 4
User 1 | 200 | 15 | 167 | 18 | 19 | 21 | 15 | 24
User 2 | 56  | 4  | 52  | 0  | 8  | 8  | 6  | 6
User 3 | 30  | 0  | 24  | 6  | 11 | 9  | 8  | 10
User 4 | 70  | 19 | 37  | 14 | 37 | 31 | 26 | 36
We merged all the sequences and split them into the three main groups described above (to make it easier for the annotators to handle so many photos). Each annotator assigned the same label to all the photos that he/she felt were similar, without considering the temporal relations among photos. None of the annotators was at the conference, and timestamps were not available to them during the annotation. Each annotator used different personal criteria, thus they provided different partitions of the photos. In practice, for each sequence we have four different ground truths, each one provided by a different annotator. In the table we report, for each sequence, the number of different classes found by each of the four annotators. It is worth stressing that such classes have been found by the annotators on the merged sequences, that is, considering all 356 photos together. For all the sequences together, Annotator 1 found 49 clusters, Annotator 2 found 44 clusters, Annotator 3 found 32 clusters and, finally, Annotator 4 found 52 clusters. When splitting the ground truth by sequence, the numbers of annotated clusters per annotator and per sequence are those reported in Table 1. Figure 8 reports some pictures extracted from each of the 4 sequences and shows that the photos' content largely differs from sequence to sequence, making their organization more difficult.
Fig. 8 Images sampled from the sequences. In some cases, the
content largely differs
As an additional dataset for evaluating the hierarchical organization of the reference sequence, we used the Gallagher dataset [9], a public dataset of 589 photos collected during almost six consecutive months by a single user. We asked a user to manually classify each photo in the Gallagher dataset, grouping all the photos that look similar (we will also make this annotation available). Photos were presented to the annotator all together, and the annotator, who did not know the temporal information associated with each photo, used only visual information to group the pictures.
6.2 Performance evaluation of the reference sequence
organization
Evaluation of the performance has been carried out by computing the accuracy of the partitions provided by our method versus each of the ground-truth annotations. When evaluating the performance of the reference sequence organization, for each cluster we find the dominant label coming from the true annotation and count how many photos have a label equal to the dominant one. In this way, the number of correct matches for each detected content-cluster is the number of photos in the intersection between the content-cluster itself and the most overlapping annotated cluster. We computed the ratio between the correct matches across the clusters and the total number of photos in the sequence, that is:

$$\text{accuracy} = \frac{\sum_{\text{content-clusters } C} (\text{number of correct matches in cluster } C)}{\text{number of photos in the sequence}} \qquad (6)$$
We tested the hierarchical sequence organization on the Gallagher dataset and on each of the four sequences collected at the conference. As for the four sequences we have the ground truth from four different annotators, we compute the accuracy per annotator independently. A sketch of this accuracy measure follows.
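This is a minimal sketch of the dominant-label scoring of Eq. (6); the function name and the parallel-list input format are our choices.

```python
# Minimal sketch of Eq. (6): each detected cluster is scored by its dominant
# ground-truth label. `predicted` and `ground_truth` are parallel lists with
# one entry per photo.
from collections import Counter

def clustering_accuracy(predicted, ground_truth):
    clusters = {}
    for cid, label in zip(predicted, ground_truth):
        clusters.setdefault(cid, []).append(label)
    correct = sum(max(Counter(labels).values())   # dominant-label matches
                  for labels in clusters.values())
    return correct / len(predicted)
```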
6.2.1 Performance evaluation of the temporal segmentation
We performed experiments on all the available sequences to test the temporal segmentation method. On each sequence, the automatically computed threshold permits the method to adapt to the temporal duration of the sequence. Figure 9 shows the plot of the differences along the ordered timestamp sequence of the Gallagher dataset [9]. Peaks correspond to large time differences and can be considered as the start of possible new situations. Our method estimates a threshold of almost 9.5 h, providing 117 different temporal segments for the whole collection. We observed that each situation represents photos taken within the same day. For a collection acquired over such a long temporal window this seems indeed a reasonable result. The average duration of the situations was almost 2 h. In case a finer segmentation is required, it is possible to easily extend the method by asking the user to manually set a threshold. It is also possible to apply our method recursively on each situation until a minimum duration for each situation is reached. However, in this work, we applied our method only once, to build the first level of the reference sequence organization. Figure 10 shows an example of the time segmentation we obtained for the Gallagher dataset where, within the same temporal event, more than one content-cluster can emerge, as the photos are visually dissimilar.
Fig. 9 Plot of the timestamp differences of consecutive photos along the time-ordered sequence. Peaks show where a new situation starts. (x-axis: photo indexes, 0–600; y-axis: timestamp differences in hours, 0–400)
In Table 2, we present the results we obtained on our 4 concurrent photo sequences. For each sequence we report the number of photos, the duration of the temporal window in which the photo sequence was acquired, the automatically estimated threshold, the number of situations detected, and the average duration of each situation.

Fig. 10 Situations detected for the first 15 photos of a photo sequence. The time is referred to the first photo in the sequence. Each situation is generally composed of a different number of photos showing a large variance of the visual content
Table 2 Temporal segmentation on the 4 concurrent photo sequences

Sequence | Total no. of photos | Duration of time window (hours) | Estimated threshold (min) | No. of situations | Avg. duration (min)
User 1 | 200 | 10  | 4.4  | 18 | 4.3
User 2 | 56  | 8.5 | 17   | 5  | 24.2
User 3 | 30  | 4   | 15.7 | 6  | 7
User 4 | 70  | 9   | 10   | 8  | 15
In the first sequence, composed of a greater number of photos, more situations were found. For the other sequences, where fewer photos were taken, a comparable number of situations was found.
For comparison purposes, we tested iPhoto v. 8.1.2 [12] to automatically organize pictures based on time information. In iPhoto, this is achieved by using the “automatic event detection” tool. We noted that this tool groups together all the photos taken during the same day. Therefore, for each of the four sequences, the tool discovered only one event, as the pictures were taken on the same day. When considering all the photos together, iPhoto discovered 2 events: indeed, the cameras were not synchronized, and one of them was set to the wrong day, originating a new event. Such an event organization, therefore, does not estimate the real temporal event structure as our method, instead, attempts to do. On the Gallagher dataset, we found that iPhoto computes exactly the same events (situations in our case) as our method does, where each detected event corresponds to the set of pictures taken on the same day.
6.2.2 Performance evaluation of the hierarchical sequence
organization
After applying the temporal segmentation to the Gallagher dataset, we organized the photos in each situation based on content, as described in Section 4.2. Within each situation, only a few content-clusters (generally between 1 and 4) emerge. The measured accuracy of the content-based organization within the automatically detected situations is around 83%. Figure 11 shows the hierarchical organization of the first 15 photos of the Gallagher dataset.
We also measured the performance of the hierarchical organization on each of the 4 concurrent sequences. In Table 3, we report the results we obtained considering the 4 different annotations. Lower performance corresponds, in general, to finer partitions provided by the annotators.
Whilst the sequences used for testing (the Gallagher dataset and the 4 concurrent photo sequences) have different characteristics, the performance is comparable and, on average, independently of the annotator, the average accuracy of the hierarchical organization is 82.44%. It is worth noting that, as the annotators did not consider the temporal relations among photos while annotating them, the actual accuracy could be higher. Indeed, by considering also the time, annotators could provide more accurate partitions of the data. This is difficult to realize in practice: first, the annotators would have to be aware of the temporal structure of a sequence; moreover, it is more complex to ask them to provide annotations without conditioning their judgements. For this reason we asked the annotators to limit their attention to the content.
Fig. 11 Hierarchical Organization of the first 15 photos of the
Gallagher dataset
6.3 Performance evaluation of the concurrent photo sequence
organization
To measure the performance of our concurrent photo sequence organization, we used the HMM to classify each photo in the sequence and assign as label the index of the associated content-cluster in the reference tree. Then, comparing the partitions computed by our method to the available ground truth, we computed the accuracy for each classified sequence by (6).
6.3.1 Simulations
To test the probabilistic framework used for concurrent photo sequence organization, we generated a sequence of different content-clusters by sampling a mean and a covariance matrix for each of them. Then, by using these reference content-clusters, we generated a photo sequence by randomly choosing a content-cluster and enforcing the smoothness constraint in time (that is, the generated photo sequence has the same temporal structure as the reference one).
Table 3 Accuracy of the hierarchical organization on each of the 4 concurrent photo sequences

Sequence | Annotator 1 (%) | Annotator 2 (%) | Annotator 3 (%) | Annotator 4 (%) | Average (%)
User 1 | 81    | 85.5  | 88.5 | 83    | 84.5
User 2 | 87.5  | 77    | 96.4 | 85.7  | 86.5
User 3 | 83.33 | 80    | 90   | 84.33 | 84.41
User 4 | 75.71 | 81.43 | 72.9 | 77.14 | 76.79
Table 4 Accuracy of the concurrent photo sequence organization over the tree computed for the sequence “User 1” – no Null state

Sequence | Annotator 1 (%) | Annotator 2 (%) | Annotator 3 (%) | Annotator 4 (%)
User 1 - Tree | 81 | 85.5 | 88.5 | 83
User 2 | 78.6 | 76.8 | 87.5 | 76.8
User 3 | 63.3 | 66.7 | 83.3 | 60
User 4 | 65.7 | 65.7 | 60 | 65.7
Average acc. for the whole set | 72.15 | 73.9 | 79.8 | 71.4
We then used our HMM and the Viterbi algorithm to infer, for the generated photo sequence, the corresponding content-clusters in the reference sequence. This is not a trivial experiment, because it permits testing the model we imposed for the state transition probabilities. We randomly generated 100 pairs “reference sequence—sequence to classify”. When generating the event structure, we generated 20 content-clusters. For each of these clusters we “tossed a coin” and decided whether the content-cluster belongs to a new situation or not. We used this information to generate a random number of photos for each content-cluster. On average, the generated photo sequences were composed of 195 photos. We obtained an average accuracy of 95.80%.
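A sketch of this generator follows, reusing the transition-model sketch from Section 5. The numeric choices (20 content-clusters, coin-tossed situation boundaries, roughly 195 photos) follow the text, while the descriptor dimensionality and noise level are our assumptions.

```python
# Hedged sketch of the simulation of Section 6.3.1: sample a synthetic event
# structure, then generate an observation sequence that respects the temporal
# smoothness constraint via the Eq. (3) transition model.
import numpy as np

rng = np.random.default_rng(0)
n_clusters, dim = 20, 64                                     # dim is an assumption
situation_of = np.cumsum(rng.random(n_clusters) < 0.5)       # coin-tossed boundaries
means = rng.normal(size=(n_clusters, dim))                   # sampled cluster means
A = transition_matrix(situation_of)                          # Eq. (3) model (Sec. 5)

states, s = [], 0
for _ in range(195):                                         # ~195 photos on average
    states.append(s)
    s = rng.choice(n_clusters, p=A[s])                       # forward-only walk
observations = [rng.normal(means[s], 0.1) for s in states]   # noisy descriptors
```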
6.3.2 Using the longest sequence as reference
We used the longest sequence (which we call “User 1”) as the reference sequence. We performed experiments both with and without the Null state. In the latter case (without the Null state), we forced the method to classify the photos even when the probability of the match is very low, that is, when the match was probably unreliable. Results of the organization of the remaining sequences considering the different annotations are reported in Table 4, when the Null state has not been used, and in Table 5, when the Null state is used. In the latter case, some photos may remain unclassified. In particular, 5.36%, 16.67%, and 27.14% of the photos were not classified for the sequences “User 2”, “User 3”, and “User 4”, respectively.
Table 5 Accuracy of the concurrent photo sequence organization over the tree computed for the sequence “User 1” – with Null state

Sequence | Annotator 1 (%) | Annotator 2 (%) | Annotator 3 (%) | Annotator 4 (%)
User 1 - Tree | 81 | 85.5 | 88.5 | 83
User 2 | 80.3 | 78.6 | 91.1 | 78.6
User 3 | 60.67 | 70 | 86.67 | 60
User 4 | 65.7 | 64.3 | 64.3 | 65.7
Average acc. for the whole set | 71.92 | 74.6 | 82.64 | 71.8
Table 6 Accuracy of the concurrent photo sequence organization over the tree computed for the sequence “User 3” – no Null state

Sequence | Annotator 1 (%) | Annotator 2 (%) | Annotator 3 (%) | Annotator 4 (%)
User 3 - Tree | 83.33 | 80 | 90 | 84.33
User 1 | 60.5 | 66.5 | 69.5 | 53.5
User 2 | 64.3 | 57.1 | 85.7 | 64.2
User 4 | 44.3 | 47.1 | 45.7 | 44.3
Average acc. for the whole set | 63.10 | 62.10 | 72.72 | 61.58
6.3.3 Using the shortest sequence as reference
We then performed experiments using the shortest sequence (which we call “User 3”) as reference. It is worth noting that “User 3” does not have any photo of the sub-event A (see Table 1), and of course this affects the performance of the method, because pictures of the remaining sequences cannot be properly classified. Results of the organization of the remaining sequences considering the different annotations are reported in Table 6, when the Null state has not been used, and in Table 7, when the Null state is used instead. In the latter case, some photos may remain unclassified. In particular, 6%, 0%, and 38.57% of the photos were not classified for the sequences “User 1”, “User 2”, and “User 4”, respectively.
6.3.4 Discussion and comparison to baseline method
As expected, comparing the tables it is possible to see how the choice of the reference sequence affects the accuracy of the organization. When using the sequence “User 1”, accuracy is higher. This is due to the fact that this sequence is the most complete and overlaps with all the remaining sequences. When using the sequence “User 3”, with a limited number of pictures and a limited overlap with the other sequences, the accuracy decreases. It is evident that the more information is available, the more reliable the event structure learning is.

Adding the Null state, and considering the sequence “User 1” as reference, accuracy slightly improves. The improvement is much higher when using sequence “User 3” as reference, at the cost of a higher number of unclassified photos. Such an unclassification rate makes more evident how sequence “User 3” and sequence “User 4” have a limited overlap.
Table 7 Accuracy of the concurrent photo sequence organization over the tree computed for the sequence “User 3” – with Null state

Sequence | Annotator 1 (%) | Annotator 2 (%) | Annotator 3 (%) | Annotator 4 (%)
User 3 - Tree | 83.33 | 80 | 90 | 84.33
User 1 | 67 | 72.5 | 76 | 60
User 2 | 64.3 | 57.1 | 85.7 | 64.2
User 4 | 72.9 | 78.6 | 71.4 | 74.3
Average acc. for the whole set | 71.88 | 72.05 | 80.77 | 69.96
The last rows of Tables 4, 5, 6, and 7 report the average accuracy considering all 4 sequences. The accuracies for the sequences organized in a tree are those reported in Table 3.
As a baseline method for comparison purposes, we consider the content-based organization performed on the whole set of photos with no distinction among sequences. We applied mean-shift clustering over the MSER-based descriptors. We do not consider temporal information that, in our hypothesis, is unreliable; we stress that in practical situations this information can be missing, or cameras may not be synchronized (as happens in our dataset). The accuracy of the baseline method has been measured as 57.3% when using “Annotator 1”, 63.2% when using “Annotator 2”, 72.5% when using “Annotator 3”, and 54.2% when using “Annotator 4”. Comparing these values with those reported in the previous tables, it is possible to say that our method generally outperforms the baseline one.
7 Conclusions and future works
In this paper we face with the problem of organizing multiple
photo sequencestaken from different users with different cameras at
the same place and during thesame temporal window. Our framework
takes one of the sequence as reference andorganizes it
hierarchically in a tree to capture the event structure; then it
uses thistree to organize the photos of the other sequences. Our
method does not require thatcamera timestamps are synchronized, but
it requires the reference sequence to havea timestamp embedded. The
reference sequence has been organized hierarchicallybased on both
temporal and visual information. Leaves of the tree are the
content-clusters containing photos acquired in the same temporal
window and with similarcontent. To perform the co-organization, we
used an HMM to represent the temporaldynamics of the event and used
the Viterbi method to infer the sequence ofcontent-clusters to
which each photo to classify belongs to. Our experiments showour
technique is effective in co-organizing photo sequences and it may
providesemantically meaningful clusters.
However, our method presents some limitations, as it is unable to identify parallel sub-events. For example, during the same temporal window several events may occur in parallel, and each photographer may focus on only one of them. In this case, our method can provide only the content-clusters for the events that are represented in the reference photo sequence. In our implementation, we do not refine the tree computed for the reference sequence. Techniques to perform such refinement iteratively, using the other sequences, remain a topic for future investigation and may help to handle the sub-event case. For example, the unclassified pictures in each sequence may be used to introduce into the event tree new content-clusters able to represent the missing parallel sub-events. In this case, suitable strategies have to be defined to reason about the time information of the added sub-events. In future works we will consider the adoption of a GMM to model the content within each situation; each Gaussian component would represent a particular content cluster (see the sketch below). We will also focus on finding new techniques to co-organize the set of photo sequences without selecting a reference.
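As a rough illustration of that direction (hypothetical, since this is future work and not part of our implementation), the photos falling within one temporal situation could be modeled with a Gaussian mixture, one component per content cluster:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical descriptors of the photos falling within one temporal situation.
rng = np.random.default_rng(0)
situation_descriptors = rng.random((40, 64))

# One Gaussian component per putative content cluster inside the situation;
# a diagonal covariance keeps the parameter count low for small situations.
gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0)
component = gmm.fit_predict(situation_descriptors)
# gmm.predict_proba(...) would give soft cluster memberships instead.
```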
When several subsequent situations in the event tree have very similar content structures, our method may be unable to recover the correct alignment between the reference sequence and the sequence to classify, because the correspondences are found considering only visual similarities. Therefore, in future works we will also explore how stricter temporal constraints can be enforced when defining the time structure of the event, by adding new dependencies to the dynamic Bayesian network. Indeed, by also considering the dynamics of the sequence to classify, it could be possible to find a finer alignment between the two photo sequences.
Finally, we believe photo sequence co-organization is still a largely unexplored problem that opens many potential future directions to investigate, such as estimating the differences among the sensors of different cameras while co-organizing the sequences, and then using this information to improve multiple photo-sequence co-organization.
Acknowledgements We thank all the anonymous reviewers and the associate editor whose insightful comments and very constructive reviews led to significant improvements of the manuscript.
References
1. Ardizzone E, La Cascia M, Vella F (2008) Mean shift clustering for personal photo album organization. In: International Conference on Image Processing (ICIP). IEEE, San Diego, CA, pp 85–88, 12–15 Oct 2008
2. Bay H, Tuytelaars T, Van Gool L (2006) SURF: speeded up robust features. In: Proc of European Conf on Computer Vision (ECCV). Springer, Graz, Austria, pp 404–417, 7–13 May 2006
3. Bishop C (2006) Pattern recognition and machine learning, vol 4. Springer, New York
4. Choi J, Yang S, Ro Y, Plataniotis K (2008) Face annotation for personal photos using context-assisted face recognition. In: Proc of int conf on Multimedia Information Retrieval (MIR). ACM, Vancouver, Canada, pp 44–51, 30–31 Oct 2008
5. Chu WT, Lee YL, Yu JY (2009) Using context information and local feature points in face clustering for consumer photos. In: Proc of Int Conf on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, Taipei, Taiwan, pp 1141–1144, 19–24 Apr 2009
6. Cooper M, Foote J, Girgensohn A, Wilcox L (2005) Temporal event clustering for digital photo collections. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 1(3):269–288
7. Facebook (2004) http://www.facebook.com
8. Fei-Fei L, Perona P (2005) A Bayesian hierarchical model for learning natural scene categories. In: Proc of int conf on Computer Vision and Pattern Recognition (CVPR), vol 2. IEEE, San Diego, CA, pp 524–531, 20–26 June 2005
9. Gallagher A, Chen T (2008) Clothing cosegmentation for recognizing people. In: Proc of Computer Vision and Pattern Recognition (CVPR). IEEE, Anchorage, Alaska, 23–28 June 2008
10. Gong B, Jain R (2008) Hierarchical photo stream segmentation using context. In: Proceedings of IS&T/SPIE, vol 6820. SPIE, San Francisco, CA, p 682003
11. Google+ (2011) http://plus.google.com/
12. iPhoto (2009) http://www.apple.com/ilife/iphoto
13. Jaimes A, Benitez A, Chang S, Loui A (2002) Discovering recurrent visual semantics in consumer photographs. In: Proc of Int Conf on Image Processing (ICIP), vol 3. IEEE, Rochester, New York, pp 528–531, 22–25 Sept 2002
14. Jang C, Yoon T, Cho H (2010) Digital photo classification methodology for groups of photographers. Multimed Tools Appl 50(3):441–463
15. Jiang H, Yu S (2009) Linear solution to scale and rotation invariant object matching. In: Proc of int conf on Computer Vision and Pattern Recognition (CVPR). IEEE, Miami, FL, pp 2474–2481, 20–25 June 2009
16. Leow W, Li R (2004) The analysis and applications of adaptive-binning color histograms. Comput Vis Image Underst (CVIU) 94(1–3):67–91
17. Li C, Chiu C, Huang C, Chen C, Chien L (2006) Image content clustering and summarization for photo collections. In: Proc of Int Conf on Multimedia and Expo (ICME). IEEE, Toronto, Canada, pp 1033–1036, 9–12 July 2006
18. Li SZ (2005) Markov random field modeling in computer vision. Springer-Verlag
19. Lin D, Kapoor A, Hua G, Baker S (2010) Joint people, event, and location recognition in personal photo collections using cross-domain context. In: Proc of Eur Conf on Computer Vision (ECCV). Springer, Crete, Greece, pp 243–256, 5–11 Sept 2010
20. Lo Presti L, Morana M, La Cascia M (2010) A data association algorithm for people re-identification in photo sequences. In: Int Symposium on Multimedia (ISM). IEEE, Taichung, Taiwan, pp 318–323, 13–15 Dec 2010
21. Lo Presti L, Morana M, La Cascia M (2011) A data association approach to detect and organize people in personal photo collections. Multimed Tools Appl 1–32. doi:10.1007/s11042-011-0839-5
22. Lowe D (1999) Object recognition from local scale-invariant features. In: Proc of Int Conference on Computer Vision (ICCV), vol 2. IEEE, Kerkyra, Greece, pp 1150–1157, 20–27 Sept 1999
23. Matas J, Chum O, Urban M, Pajdla T (2004) Robust wide-baseline stereo from maximally stable extremal regions. Image Vis Comput 22(10):761–767
24. Mikolajczyk K, Tuytelaars T, Schmid C, Zisserman A, Matas J, Schaffalitzky F, Kadir T, Van Gool L (2005) A comparison of affine region detectors. Int J Comput Vis 65(1):43–72
25. Sandhaus P, Boll S (2011) Semantic analysis and retrieval in personal and social photo collections. Multimed Tools Appl 51(1):5–33
26. Tavanapong W, Zhou J (2004) Shot clustering techniques for story browsing. Trans Multimed 6(4):517–527
27. Wu K, Yang M (2007) Mean shift-based clustering. Pattern Recognition 40:3035–3052
28. Zhang L, Chen L, Li M, Zhang H (2003) Automated annotation of human faces in family albums. In: Proc of conf on multimedia (MM). ACM, Berkeley, CA, pp 355–358, 2–8 Nov 2003
29. Zhang L, Hu Y, Li M, Ma W, Zhang H (2004) Efficient propagation for face annotation in family albums. In: Proc of conf on multimedia (MM). ACM, New York, NY, pp 716–723, 10–16 Oct 2004
Liliana Lo Presti received her Master Degree and PhD in computer engineering from the University of Palermo, Italy, in 2006 and 2010, respectively. During 2008–2009, she was a visiting researcher at the Image and Video Computing Research Group at Boston University. In 2010–2011, she was a post-doctoral researcher in the Computer Engineering Department at the University of Palermo. Currently, she is a post-doctoral researcher in the Computer Science Department at Boston University. Her research interests include computer vision and machine learning with applications to distributed video-surveillance systems, multimedia, and information filtering and retrieval.
Marco La Cascia received his MSEE in electrical engineering and PhD from the University of Palermo, Italy, in 1994 and 1998, respectively. During 1996–1999, he was a research associate in the Image and Video Computing Research Group at Boston University. After that, he worked as a senior software engineer in the computer telephony group at Offnet S.p.A., Rome, Italy. He joined the University of Palermo as an assistant professor at the end of 2000 and is currently an associate professor at the same university. His research interests include low- and mid-level computer vision, image and video database retrieval, and vision-based video surveillance. Dr. La Cascia has co-authored more than 50 refereed journal and conference papers.