p. 1 Multimedia Information Retrieval
Dec 25, 2015
p. 2
Problem
On the Web and in local DBs, a large amount of information is not textual:
audio (speech, music, …), images, video, …
How can we efficiently retrieve multimedia information?
p. 3
Application examples
Web indexing: multimedia retrieval from the Web
Identifying and banning (illegal or unauthorized) ads and images
Trademark & copyright protection
Interactive museums
Commercial DBs
p. 4
Application examples[2]
Satellite images (military, government, …)
Medical images
Entertainment
Criminal investigation (scene analysis, face recognition, …)
…
p. 5
First generation multimedia information retrieval systems
Off-line: multimedia documents are associated with a textual description
Ex.: manual annotation (“content-descriptive metadata”), or the text surrounding an image in the document (e.g. a figure caption)
On-line: retrieval uses textual IR based on “keyword match” (Google Images)
p. 7
Limitation of textual approach
Manual annotation on large multimedia DBs is unfeasible
Describing a scene or an audio is highly subjective (different annotators might perceive/highlight different details)
p. 8
Precision might be quite low
Google Images can return up to 80% NON-RELEVANT documents even for specific queries
[1] Fergus, Fei-Fei, Perona, Zisserman, Learning Object Categories from Google’s Image Search, ICCV 2005
p. 10
Current state-of-the-art retrieval models…
“Content-based” systems:
Ignore the textual phase; the user query might be non-textual
Model perceptual similarity between the query and the multimedia document
Still limited to DBs (does not scale to the Web)
p. 11
Examples of multimedia search queries
Finding a song by singing the refrain
Retrieving a soccer action frame in a sports video
Searching for a painting with some specific detail, texture, or painting technique (e.g. chiaroscuro)
…
p. 12
Current state-of-the-art retrieval models…[2]
Automated image annotation:
Pre-processing (“information extraction”): automatically extract information from the image and associate it with textual labels
Retrieval is then “traditional” text retrieval
p. 14
Image Retrieval wrt textual Retrieval
Analysis and representation of non-symbolic information
A text can be seen as a combination of atomic symbolic elements (words or tokens)
An image is a collection of non-symbolic elements (pixels), and an audio signal is represented as a wave… there is no vocabulary of basic meaning elements, as there is for text!
p. 15
Basic elements of a Content Based Multimedia IR
On the user’s side:
The query is a multimedia object (an image, a sketch, an audio frame, …)
The output is an ordered list of elements ranked according to perceptual similarity wrt the query
There are a variety of optional interactive features to visualize image collections or give feedback to the system
p. 18
Query by image example [2]
Note that the query and the detail might not perfectly match, e.g. the query can be chosen from an image prior to a restoration of the picture
p. 20
Basic elements of a Content Based Multimedia IR [2]
From the “system” perspective:
Representation of the multimedia object (e.g. what is the feature space)
Modeling the notion of perceptual similarity (e.g., through specific matching algorithms)
Efficient indexing of the feature space (the “vocabulary” is orders of magnitude larger than for words)
Relevance feedback and visualization interface
p. 22
Representing an image through a set of features
As for text, a feature is a representation of the image (or of a detail I') through a vector of elements
If I' is an image detail, then a feature f for I' is defined as:
f(I') ∈ R^k, f(I') = (v0, …, vk-1)^T, k ≥ 1
p. 23
Representing an image through a set of features [2]
In general, a feature is a measurable characteristic of an image
The image is then represented using the measurable values of its selected features f1, …, fn
p. 24
Local and global features
I' = I: global feature (recall: I is the image, I' a detail)
I' ⊂ I: local feature
Local features: how to select the relevant image parts that we want to represent (I'1, I'2, …)?
Local features allow us to cope with missing elements, occlusions, background, …
p. 25
Main problems in image representation
Selecting features is crucial
Just as for text, the same meaning can be conveyed by apparently very different images (different according to specific features)
But the problem of “variability” is much harder
p. 33
Feature Extraction
What are image features?
General features:
Primitive features: mean color (RGB), color histogram
Semantic features: color layout, texture, etc.
Domain-specific features: face recognition, fingerprint matching, etc.
p. 34
Examples of “simple” features : gray level histogram
Pixel intensity histogram in I':
The range [0, 255] is partitioned into k bins
Assign a bin to every pixel p: I(p) → div_k(I(p))
f(I') = (v0, …, vk-1)^T, where: vi = #{ p ∈ I' : div_k(I(p)) = i }
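The binning and counting above can be sketched in a few lines of Python. This is only an illustration: the names div_k and gray_histogram are ours, and the image is assumed to be a 2-D list of intensities in [0, 255].

```python
def div_k(intensity, k):
    # Map an intensity in [0, 255] to one of k equal-width bins.
    return min(intensity * k // 256, k - 1)

def gray_histogram(image, k):
    # f(I') = (v0, ..., vk-1), where vi counts the pixels falling in bin i.
    v = [0] * k
    for row in image:
        for p in row:
            v[div_k(p, k)] += 1
    return v

image = [[0, 10, 200], [255, 130, 128]]
print(gray_histogram(image, 4))  # 4 bins of width 64 -> [2, 0, 2, 2]
```

In practice the histogram is usually normalized by the number of pixels so that images of different sizes remain comparable.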
p. 35
Frequency count of each individual color: the most commonly used color feature representation
[Example figure: an image and its corresponding histogram]
p. 38
Feature space
If we now use n features in R, then I can be represented as a feature vector x(I) = (f1(I), …, fn(I))^T
x(I) is a point in R^n, the feature space
p. 39
Feature Space [2]
More generally, if fi(I) ∈ R^k (a single feature is a k-dimensional vector), then x(I) = (f1(I)^T, …, fn(I)^T)^T is a point in R^(n*k)
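The stacking of n per-feature vectors into a single point of R^(n*k) is plain concatenation. A minimal sketch (function name and example features are illustrative):

```python
def feature_vector(feature_outputs):
    # feature_outputs: the n vectors f1(I), ..., fn(I), each of length k.
    # Returns their concatenation, a point in R^(n*k).
    x = []
    for f in feature_outputs:
        x.extend(f)
    return x

f1 = [0.2, 0.8]   # e.g. a normalized 2-bin gray histogram
f2 = [0.5, 0.5]   # e.g. a second 2-dimensional feature
x = feature_vector([f1, f2])
print(x)  # [0.2, 0.8, 0.5, 0.5] -- a point in R^4 (n=2, k=2)
```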
p. 41
Feature Space[3]
The concept of feature space is similar BUT NOT IDENTICAL TO vector space model as in traditional IR (where real values are the tf*idf of words in document collection)
It is the most common, but not the unique, representation in content-based multimedia IR
p. 43
Perceptual similarity
In text retrieval, similarity between two documents is modeled as a function of the common words in the two documents (e.g. cosine similarity with tf*idf feature vectors)
In multimedia retrieval a similar notion of “distance” between vectors is applied…
p. 44
Perceptual similarity [2]
In the feature space, similarity is (inversely) proportional to a distance measure between feature vectors (not necessarily a Euclidean distance): dist(x(I1), x(I2))
Given the query Q, the system output is an image list I1, I2, … ordered according to: I1 = arg min_I dist(x(Q), x(I)), …
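The arg-min ranking above can be sketched as a sort by distance. Euclidean distance is used here only as an example metric; the database and identifiers are made up for illustration.

```python
import math

def dist(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def rank(query_vec, db):
    # db: list of (image_id, feature_vector) pairs.
    # Returns image ids ordered by increasing distance from the query.
    return [img for img, _ in sorted(db, key=lambda e: dist(query_vec, e[1]))]

db = [("I1", [1.0, 0.0]), ("I2", [0.2, 0.1]), ("I3", [5.0, 5.0])]
print(rank([0.0, 0.0], db))  # ['I2', 'I1', 'I3']
```

Any other metric (Mahalanobis, a histogram distance, …) can be swapped in for dist without changing the ranking logic.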
p. 46
Perceptual similarity [3]
Other matching algorithms use more complex representations or more complex similarity functions, which are usually dependent on the type of multimedia object and retrieval tasks
p. 48
Indexing
Problem: how to efficiently index data in a multi-dimensional space?
Several data structures (such as the IR keyword dictionary) are indexed using some total ordering (e.g. alphabetic ordering): xi ≤ xj ∨ xj ≤ xi (0 ≤ i, j ≤ N)
In R^k this cannot be done (remember: every feature is multi-dimensional!)
p. 49
k-d Tree
It is a generalization of a binary search tree to k dimensions
At each tree level we cyclically consider one of the k features
p. 50
k-d Tree [2]
Suppose we wish to index a set of N k-dimensional points:
P1, …, PN, with Pi ∈ R^k, Pi = (x_i,1, …, x_i,k)
We select the first dimension (feature) and find the value L1, which is the median of x_1,1, …, x_N,1
p. 51
k-d Tree [3]
The root of the tree stores L1
The left sub-tree (TL) includes the points Pi s.t. x_i,1 ≤ L1
The right sub-tree (TR) includes all the other points
At level 1, we select the second feature and, separately for TL and TR, we compute L2 and L3:
L2 is the median of the second coordinates x_j,2 of the points in TL
L3 is the median of the second coordinates of the points in TR
p. 52
k-d Tree [4]
When the last (k-th) feature has been considered, we backtrack and cyclically consider the first feature again
Points are associated with the tree leaves
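The construction just described can be sketched recursively: cycle through the dimensions by depth, split each partition at its median, and store points in the leaves. This is a minimal illustration (dict-based nodes, no balancing refinements); here the median point itself goes to the right partition.

```python
def build_kdtree(points, depth=0, leaf_size=1):
    # points: list of k-dimensional tuples.
    if len(points) <= leaf_size:
        return {"leaf": points}          # points live in the leaves
    k = len(points[0])
    axis = depth % k                     # cycle over the k features
    points = sorted(points, key=lambda p: p[axis])
    m = len(points) // 2
    split = points[m][axis]              # the median value L at this node
    return {
        "axis": axis,
        "split": split,
        "left": build_kdtree(points[:m], depth + 1, leaf_size),
        "right": build_kdtree(points[m:], depth + 1, leaf_size),
    }

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree["axis"], tree["split"])  # 0 7: root splits on x at the median
```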
p. 53
Example
image taken from: Hemant M. Kakde, Range Searching using Kd Tree
We start with a set of 2-dimensional points. In L1, P5’s x coordinate is the median of the dataset. In L2, P2 is the median of the y values in its partition, and in L3 it is P7. We then consider the x values again, and in L4 the median is again P2, etc.
p. 55
So far…
We analyzed: query types, feature types, similarity functions, indexing methods
Now we present retrieval methods
Retrieval strategies clearly depend upon the multimedia object representation technique
p. 56
Retrieval by color: color histograms
We can represent an image through the color histogram of an image part I' (we have already seen how histograms are created for gray-level images):
A single pixel p can be represented with different encodings: RGB, HSV, CIE LAB, …
Each channel’s value range is partitioned into k bins:
f(I') = (r0, …, rk-1, g0, …, gk-1, b0, …, bk-1)^T, where
ri = #{ p ∈ I' : div_k(R(p)) = i }, gi = #{ p ∈ I' : div_k(G(p)) = i }, bi = #{ p ∈ I' : div_k(B(p)) = i }
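The per-channel histogram is three gray-level histograms laid side by side. A minimal sketch, assuming pixels are (R, G, B) triples in [0, 255] (names are illustrative):

```python
def div_k(value, k):
    # Map a channel value in [0, 255] to one of k bins.
    return min(value * k // 256, k - 1)

def rgb_histogram(pixels, k):
    # Returns (r0..rk-1, g0..gk-1, b0..bk-1) as one vector of length 3k.
    hist = [0] * (3 * k)
    for r, g, b in pixels:
        hist[div_k(r, k)] += 1            # R bins at offset 0
        hist[k + div_k(g, k)] += 1        # G bins at offset k
        hist[2 * k + div_k(b, k)] += 1    # B bins at offset 2k
    return hist

pixels = [(255, 0, 0), (250, 10, 5), (0, 255, 0)]
print(rgb_histogram(pixels, 2))  # [1, 2, 2, 1, 3, 0]
```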
p. 57
Color histograms [2]
Alternatively, we divide the RGB space into k^3 bins:
f(I') = (z0, …, zh-1)^T, h = k^3, the number of combinations of the 3 values
If zi represents the triple of RGB values (i1, i2, i3), then:
zi = #{ p ∈ I' : div_k(R(p)) = i1 and div_k(G(p)) = i2 and div_k(B(p)) = i3 }
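Unlike the per-channel version, the joint histogram keeps color combinations together: each pixel falls into exactly one of the k^3 bins indexed by the triple (div_k(R), div_k(G), div_k(B)). A minimal sketch with illustrative names:

```python
def div_k(value, k):
    # Map a channel value in [0, 255] to one of k bins.
    return min(value * k // 256, k - 1)

def joint_rgb_histogram(pixels, k):
    # z has k^3 entries; bin (i1, i2, i3) is flattened to (i1*k + i2)*k + i3.
    z = [0] * (k ** 3)
    for r, g, b in pixels:
        i = (div_k(r, k) * k + div_k(g, k)) * k + div_k(b, k)
        z[i] += 1
    return z

pixels = [(255, 0, 0), (250, 10, 5), (0, 255, 0)]
h = joint_rgb_histogram(pixels, 2)
print(h)  # 8 bins; the two near-red pixels share a single bin
```

The trade-off: the joint histogram is more discriminative but its dimensionality grows as k^3, which matters for indexing.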
p. 60
Statistical Approach
Tamura features: based on the analysis of the local intensity distribution of the image, they measure perceptual characteristics such as:
Contrast
Granularity (coarseness)
Directionality
p. 63
Elements of a video
Frame: a single image
Shot: a sequence of frames taken from a single camera
Scene: a set of consecutive shots that share the same space, time and action
p. 64
Video sequence segmentation
If we can automatically identify the “editing effects” (cuts, dissolves, …) between shots, we can then automatically partition a video into shots
Identifying scenes is much more complicated, since a scene is a “semantic” concept
p. 65
Video search
Videos can be represented efficiently using “key frames”, which are representative of each shot
A key frame can then be treated and processed as a “still image”: we can apply everything we have just seen for single images
p. 66
Video search [2]
Alternatively, we can search in a video a specific “motion” (e.g., a specific trajectory of a soccer action, …)
p. 67
Audio retrieval
Several types of audio:
Spoken audio
Any audio signal within the frequency range perceivable by the human ear (e.g. a thunderstorm)
Music: we must model the different instruments, musical effects, etc.
p. 68
Audio Query types
Query by example: the input is an audio file, used to search “similar” files
Query by humming: User sings the searched melody
p. 69
Representation and similarity
The feature space can be obtained using, e.g., histograms computed from the spectral representation of the signal
Perceptual similarity is computed as the distance between multidimensional points, as for images
Distance metrics: Euclidean, Mahalanobis, histogram distance measures (histogram intersection, Kullback-Leibler divergence, chi-square, etc.)
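The histogram measures listed above are a few lines each. A minimal sketch, assuming normalized histograms (entries summing to 1); function names are ours:

```python
import math

def histogram_intersection(h1, h2):
    # Similarity in [0, 1] for normalized histograms;
    # 1 - intersection can be used as a distance.
    return sum(min(a, b) for a, b in zip(h1, h2))

def chi_square(h1, h2):
    # Chi-square statistic; skips bins that are empty in both histograms.
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def kl_divergence(h1, h2):
    # Asymmetric; restricted here to bins positive in both histograms.
    return sum(a * math.log(a / b) for a, b in zip(h1, h2) if a > 0 and b > 0)

h1 = [0.5, 0.5, 0.0]
h2 = [0.25, 0.25, 0.5]
print(histogram_intersection(h1, h2))  # 0.5
```

Note that KL divergence is not symmetric, so it is often replaced by a symmetrized variant when a true distance is needed.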
p. 71
Content Based systems: limitations
All the information concerning the target multimedia objects is provided by the query (e.g., a given shape, or color, or audio signal)
p. 72
Content Based systems: limitations [2]
Even if the representation and matching techniques are sophisticated, it is difficult to distinguish shape changes that still refer to the searched object from noise
p. 73
Limitations of Content Based systems [3]
The human brain can distinguish among different shapes of the same object only after having seen several objects of the same type in different positions
To obtain similar performance, artificial systems need to be trained to recognize objects using machine learning algorithms for:
Automated image annotation
Automated image classification
p. 74
References
A. Del Bimbo, Visual Information Retrieval, Morgan Kaufmann Publishers, San Francisco, California, 1999
Forsyth, Ponce, Computer Vision: A Modern Approach, 2003
p. 75
References [2]
Smeulders et al., Content-Based Image Retrieval at the End of the Early Years, IEEE PAMI, 2000
Long et al., Fundamentals of Content-based Image Retrieval, in: D. D. Feng, W. C. Siu, H. J. Zhang (Eds.), Multimedia Information Retrieval & Management: Technological Fundamentals and Applications, Springer-Verlag, New York, 2003
Foote et al., An Overview of Audio Information Retrieval, ACM Multimedia Systems, 1998
Hemant M. Kakde, Range Searching using Kd Tree, 2005