APPENDIX A
THE NEUROBIOLOGY OF VISION AND ATTENTION
A.1 Introduction
To understand the concept of human attention, it is worth examining the processing involved in the human visual system, as discussed in the following sections.
A.2 The Human Visual System
The light that reaches the eye is projected onto the retina, and from there the optic nerve transmits the visual information to the optic chiasm. From there, two pathways go to each brain hemisphere: the collicular pathway, leading to the Superior Colliculus (SC), and, more importantly, the retino-geniculate pathway, which transmits about 90% of the visual information and leads to the Lateral Geniculate Nucleus (LGN). From the LGN, the information is transferred to the primary visual cortex (V1). Up to this point, the processing stream is also called the primary visual pathway. From V1, the information is transmitted to the "higher" brain areas V2-V4, the infero-temporal cortex (IT), the middle temporal area (MT or V5), and the posterior parietal cortex (PP).
A.3 The Eye
The light that enters the eye through the pupil passes through the lens, travels
through the clear vitreous humor that fills the central chamber of the eye and finally
reaches the retina at the back of the eye. The retina is a light-sensitive surface and is
densely covered with over 100 million photosensitive cells. The task of the
photoreceptors is to change the electromagnetic energy of photons into neural
activity that is needed as input by neurons.
There are two categories of photoreceptor cells in the retina: rods and cones.
The rods are more numerous, about 120 million, and are more sensitive to light than
the cones. However, they are not sensitive to color. The cones (about 8 million)
provide the eye's color sensitivity. Among the cones, there are three different types of color reception: long-wavelength cones (L-cones), which are sensitive primarily to the red portion of the visible spectrum (64%); middle-wavelength cones (M-cones), sensitive to the green portion (32%); and short-wavelength cones (S-cones), sensitive to the blue portion (2%). The cones are much more concentrated in the central yellow spot known as the macula. In the center of that region is the fovea centralis, or fovea, a 0.3 mm diameter, rod-free area with very thin, densely packed cones. It is the center of the eye's sharpest vision. This arrangement of cells has the effect that the visual scene is not perceived with the same resolution in all parts; rather, only a small area is perceived with high resolution, while the whole surrounding is perceived only diffusely and coarsely.
Figure A.1: Primary visual pathway in the brain
The photoreceptors are connected via bipolar cells to the ganglion cells.
Whereas photoreceptors and bipolar cells respond by producing graded potentials,
the ganglion cells are the first cells which produce spike discharges and so transform
the analog signal into a discrete one. The receptive field of a ganglion cell is circular
and separated into two areas: a center area and a surround area. There are two different types of cells: on-center cells, which respond excitatorily to light at the center, and off-center cells, which respond inhibitorily to light at the center. The area surrounding the central region always has the opposite characteristic. There are small ganglion cells and large ones. The P ganglion cells receive their input just from the cones and are more sensitive to color than to black and white, whereas the M ganglion cells receive input from both rods and cones and are more sensitive to luminance contrasts.
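The opposing center and surround responses are often modeled computationally as a difference of two Gaussians. The following Python sketch illustrates that standard model; it is an illustration rather than something described in this appendix, and the function name and σ values are arbitrary choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround_response(image, sigma_center=1.0, sigma_surround=3.0):
    """Model an on-center cell as a difference of Gaussians: a narrow
    excitatory center minus a wide inhibitory surround.
    (Illustrative sketch; the sigma values are arbitrary.)"""
    img = np.asarray(image, dtype=float)
    center = gaussian_filter(img, sigma_center)
    surround = gaussian_filter(img, sigma_surround)
    on_center = center - surround    # positive for a bright spot on a dark surround
    off_center = surround - center   # the opposite polarity
    return on_center, off_center
```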
The main contribution is the color opponency derived from the outputs of the three-cone system. The red-green contrast is derived from combining the excitatory
input from the L-cones and the inhibitory input from the M-cones, essentially
subtracting the signals from the L and M cones to compute the red-green
component of the stimulus (L-M). The green-red contrast is equally determined by
(M-L). The blue-yellow contrast is derived from the excitatory output of S-cones
and the inhibitory sum of the M and L cones (S-(L+M)) and the yellow-blue
contrast is determined by the excitatory sum of the M and L cones and the
inhibitory output of the S-cones ((M+L)-S). Finally, the luminance contrast is
derived by summing the excitation from all three cone types (S+M+L) (on-off
contrast) or by summing their inhibitory output (-S-M-L) (off-on contrast).
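These opponent contrasts are simple signed combinations of the cone outputs. A minimal sketch, assuming L, M, and S are numpy arrays of per-pixel cone responses (the function name is mine):

```python
def opponent_channels(L, M, S):
    """Opponent contrasts as described above, from per-pixel L-, M-,
    and S-cone responses (numpy arrays of equal shape)."""
    red_green   = L - M            # (L - M)
    green_red   = M - L            # (M - L)
    blue_yellow = S - (L + M)      # (S - (L + M))
    yellow_blue = (M + L) - S      # ((M + L) - S)
    lum_on  = S + M + L            # on-off luminance contrast
    lum_off = -(S + M + L)         # off-on luminance contrast
    return red_green, green_red, blue_yellow, yellow_blue, lum_on, lum_off
```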
Figure A.2: The human eye
Figure A.3: The perception of the retina, shown in (b), for the original image (a) [31]
A.4 The Optic Chiasm
The axons of the ganglion cells leave the eye via the optic nerve, which leads
to the optic chiasm. Here, the information from the two eyes is divided and
transferred to the two hemispheres of the brain: one half of each eye’s information is
crossed over to the opposite side of the brain while the other remains on the same
side. The effect is that the left half of the visual field goes to the right half of the
brain and vice versa. From the optic chiasm, two pathways go to each hemisphere:
the smaller one goes to the superior colliculus, which is involved in the control of
eye movements. The more important pathway goes to the LGN of the thalamus and
from there to higher brain areas.
Figure A.4: The double-opponency cells
A.5 The Lateral Geniculate Nucleus (LGN)
The Lateral Geniculate Nucleus (LGN) consists of six main layers composed of cells that have center-surround receptive fields similar to those of retinal ganglion cells, but larger and with a stronger surround. Four of the LGN layers consist of relatively small cells, the parvocellular cells; the other two consist of larger cells, the magnocellular cells. The parvocellular cells process mainly the information from the P-cells of the retina and are highly sensitive to color, especially to red-green contrasts, whereas the magnocellular cells transmit information from the M-cells of the retina and are highly sensitive to luminance contrasts. Below those six layers lie the koniocellular sublayers, which respond mainly to blue-yellow contrasts. From the LGN, the visual information is transmitted to the primary visual cortex at the very back of the brain.
A.6 The Primary Visual Cortex (V1)
With some 200 million cells, the primary visual cortex is the largest cortical area in primates and is also one of the best-investigated areas of the brain. It is known by many different names; besides the primary visual cortex, the most common ones are V1 and the striate cortex.
V1 is essentially a direct map of the field of vision, organized spatially in the
same fashion as the retina itself. Any two adjacent areas of the primary visual cortex
contain information about two adjacent areas of the retinal ganglion cells. However,
V1 is not exactly a point-to-point map of the visual field. Although spatial
relationships are preserved, the densest part of the retina, the fovea, takes up a much
smaller percentage (1%) of the visual field than its representation in the primary
visual cortex (25%).
The primary visual cortex contains six major layers, giving it a striped
appearance. The cells in V1 can be classified into three types: simple cells, complex
cells, and hypercomplex cells. Like the ganglion cells, the simple cells have an excitatory and an inhibitory region. Most of the simple cells have an elongated structure and are therefore orientation selective, which means they fire most rapidly when exposed to a line or edge of a particular direction [75]. Complex cells take input from many simple cells. They have larger receptive fields. Furthermore, they are highly nonlinear and sensitive to moving lines or edges. Hypercomplex cells, in turn, receive as input the signals from complex cells. These neurons are capable of detecting lines of a certain length or lines that end in a particular area.
A.7 The Extrastriate Cortex And The Visual Pathways
From the primary visual cortex, a large collection of neurons sends information to higher brain areas. These areas are collectively called the extrastriate cortex, in contrast to the striped architecture of V1. The areas belonging to the extrastriate cortex are V2, V3, V4, the infero-temporal cortex (IT), the middle temporal area (MT or V5), and the posterior parietal cortex (PP). The notation V1 to V5 comes from the former belief that visual processing was serial.
Much less is known about the extrastriate areas than about V1. It was later found that the processing of visual information is highly parallel. Some of the areas process mainly color, some form, and some motion. This functional separation already starts in the retina with the M-cells and P-cells and results in several pathways leading to different brain areas in the extrastriate cortex. Statements on the number of existing pathways differ: the most common belief is that there are three main pathways, one color pathway, one form pathway, and one motion pathway that is also responsible for depth processing. Other researchers mention four pathways by separating the motion pathway into one motion and one depth pathway, whereas some mention one color, one motion, and two form pathways. The reason for this discordance is that the investigation of the extrastriate cortex only started several years ago, and its functionality is not yet completely understood.
Figure A.5: The visual processing pathway.
The color and form pathways result from the P-cells of the retina and the parvocellular cells of the LGN, go through V1, V2, and V4, and end finally in IT, the area where the recognition of objects takes place. In other words, IT is concerned with the question of "what" is in a scene. Therefore, the color and form pathways together are also called the what pathway. Other names are the P pathway or ventral stream, because of its location on the ventral part of the body. The motion (and depth) pathway results from the M-cells of the retina and the magnocellular cells of the LGN, goes through V1, V2, V3, MT (V5), and the parieto-occipital area (PO), and ends finally in PP, which is responsible for the processing of motion and depth. Since this area is concerned with the question of "where" something is in a scene, this pathway is also called the where pathway. Other names are the M pathway or dorsal stream, because it is considered to lie dorsally. The distinction into the "where" and "what" pathways traces back to [122]; a visualization of these pathways is shown in Figure A.5.
Some cells respond to color, a few only to luminance; some have a chromatic preference for red, green, yellow, or blue and also respond to oriented edges. Some cells respond to more than one feature, and hence a given brain area may take part in processing several features. The processing of the visual information is usually bi-directional.
APPENDIX B
SCALE INVARIANT FEATURE TRANSFORM (SIFT)
B.1 Introduction
The SIFT keys are highly distinctive, which allows a single feature to
be correctly matched with high probability against a large database of features using
a staged filtering approach. The first stage identifies key locations in scale space by
looking for locations that are maxima or minima of a difference-of-Gaussian
function. Each point is used to generate a feature vector that describes the local
image region sampled relative to its scale-space coordinate frame.
The key objects in the key frame are annotated using the Scale Invariant Feature Transform (SIFT) algorithm. The input key frame is subjected to the SIFT algorithm [26] for the extraction of SIFT features. The extracted SIFT features are matched against the SIFT database, which consists of SIFT features for a trained set of objects. If there is strong evidence for the presence of an object in the key frame, the key frame is annotated with the keyword for the object. Since video contains the same objects at different scales and rotations, and they may be subject to partial occlusion, a powerful algorithm, SIFT, is used here to handle these special cases and annotate the objects. The flow chart is shown in Figure B.1.
Figure B.1: Key Objects Annotator.
(Pipeline: key frame → SIFT feature extraction → matching against the SIFT feature database (training set) → annotation.)
This approach is based on a model of the behaviour of complex cells in the
cerebral cortex of mammalian vision. The resulting feature vectors are called SIFT
keys. Following are the major stages of computation in generating the set of image
features:
Scale-space peak selection: The first stage of computation must search over
all scales and image locations, but it can be implemented efficiently by using a
difference-of-Gaussian function to identify potential interest points that are invariant
to scale and orientation.
Orientation assignment: One or more orientations are assigned to each key
point location based on local image properties. All future operations are performed
relative to the assigned orientation, scale, and location for each feature, providing
invariance to these transformations.
Key point descriptor: The local image gradients are measured at the selected
scale in the region around each key point, and transformed into a representation that
allows for local shape distortion and change in illumination.
B.2 Scale-Space Peak Selection
The first stage of key point detection is to detect locations that are invariant to scale changes of the image; this can be accomplished by searching for stable features across all possible scales, using a Gaussian function as the scale-space kernel. Therefore, the scale space of an image is defined as a function, L(x, y, σ), that is produced from the convolution of a variable-scale Gaussian, G(x, y, σ), with an input image, I(x, y):

L(x, y, σ) = G(x, y, σ) * I(x, y)    (B.1)

where * is the convolution operation in x and y, and

G(x, y, σ) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))    (B.2)
To efficiently detect stable key point locations in scale space, scale-space peaks in the difference-of-Gaussian function convolved with the image, D(x, y, σ), are used; this function can be computed from the difference of two nearby scales separated by a constant multiplicative factor k:

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) * I(x, y)    (B.3)

           = L(x, y, kσ) − L(x, y, σ)    (B.4)
Thus D can be computed by simple image subtraction of smoothed images (L).
Figure B.2: Difference-of-Gaussian pyramid.
An efficient approach to the construction of D(x, y, σ) is shown in Figure B.2.
The input image is incrementally convolved with Gaussians to produce images
separated by a constant factor k in scale space, shown stacked in the left column.
Adjacent image scales are subtracted to produce the difference-of-Gaussian images
shown on the right. Once a complete octave has been processed, the Gaussian image that has twice the initial value of σ is resampled by taking every second pixel in each row and column.
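A minimal sketch of this construction, assuming a grayscale numpy image and scipy's Gaussian filter; the defaults σ₀ = 1.6 and three scales per octave are common SIFT choices, not values stated in the text:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_pyramid(image, sigma0=1.6, scales_per_octave=3, octaves=4):
    """Build a difference-of-Gaussian pyramid as in Figure B.2: blur with
    Gaussians separated by a constant factor k, subtract adjacent blurred
    images (Eqs. B.3/B.4), and seed the next octave by taking every second
    pixel of the image with twice the initial sigma."""
    k = 2.0 ** (1.0 / scales_per_octave)
    img = np.asarray(image, dtype=float)
    pyramid = []
    for _ in range(octaves):
        sigmas = [sigma0 * k ** i for i in range(scales_per_octave + 3)]
        gaussians = [gaussian_filter(img, s) for s in sigmas]
        dogs = [g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])]
        pyramid.append(dogs)
        img = gaussians[scales_per_octave][::2, ::2]  # sigma doubled, downsample
    return pyramid
```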
B.3 Local Extrema Detection
In order to detect the local maxima and minima of D (x, y, σ), each sample
point is compared to its eight neighbors in the current image and nine neighbors in
the scale above and below (see Figure B.3). It is selected only if it is larger than all
of these neighbors or smaller than all of them. The cost of the check is reasonably
low due to the fact that most sample points will be eliminated following the first few
checks.
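The 26-neighbor comparison can be written compactly. A sketch, assuming dog is a list of equally sized DoG images from one octave and that (s, y, x) is not on a scale or image border (the helper name is mine):

```python
import numpy as np

def is_local_extremum(dog, s, y, x):
    """True if dog[s][y, x] is larger than all 26 neighbors or smaller
    than all of them: 8 in the current image and 9 each in the scales
    above and below (Figure B.3)."""
    value = dog[s][y, x]
    cube = np.stack([dog[s - 1][y - 1:y + 2, x - 1:x + 2],
                     dog[s][y - 1:y + 2, x - 1:x + 2],
                     dog[s + 1][y - 1:y + 2, x - 1:x + 2]]).ravel()
    neighbors = np.delete(cube, 13)  # index 13 is the center sample itself
    return value > neighbors.max() or value < neighbors.min()
```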
Figure B.3: Maxima and minima of the difference-of-Gaussian images.
B.4 Orientation Assignment
By assigning a consistent orientation to each key point based on local image
properties, the key point descriptor can be represented relative to this orientation and
therefore achieve invariance to image rotation. The scale of the key point is used to
select the Gaussian smoothed image, L, with the closest scale, as all computations
must be performed in a scale-invariant manner. For each image sample, L(x, y), the gradient magnitude, m(x, y), and orientation, θ(x, y), are precomputed using pixel differences:

m(x, y) = √( (L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))² )    (B.5)

θ(x, y) = tan⁻¹( (L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)) )    (B.6)
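In code, both quantities follow directly from the smoothed image. A minimal numpy sketch, using the full-range arctan2 rather than the plain inverse tangent of Eq. (B.6) to avoid quadrant ambiguity; border pixels are excluded so every difference is defined:

```python
import numpy as np

def gradient_mag_ori(L):
    """Pixel-difference gradient magnitude and orientation for a
    Gaussian-smoothed image L, following Eqs. (B.5) and (B.6)."""
    dx = L[1:-1, 2:] - L[1:-1, :-2]   # L(x+1, y) - L(x-1, y)
    dy = L[2:, 1:-1] - L[:-2, 1:-1]   # L(x, y+1) - L(x, y-1)
    m = np.sqrt(dx ** 2 + dy ** 2)    # Eq. (B.5)
    theta = np.arctan2(dy, dx)        # Eq. (B.6), in radians
    return m, theta
```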
An orientation histogram is formed from the gradient orientations of sample points within a region around the key point. The orientation histogram has 36 bins covering the 360-degree range of orientations. Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a σ that is 1.5 times that of the scale of the key point. Peaks in the orientation
histogram correspond to dominant directions of local gradients. The highest local
peak in the histogram is detected, and then any other local peak that is within 80%
of the highest peak is used to also create a key point with that orientation. Therefore,
for locations with multiple peaks of similar magnitude, there will be multiple key
points created at the same location and scale but different orientations.
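A sketch of the histogram-and-peaks logic, assuming m and theta come from the gradient computation above; for brevity the Gaussian circular window from the text is taken as already folded into the magnitude weights:

```python
import numpy as np

def dominant_orientations(m, theta):
    """36-bin orientation histogram over a region around a key point,
    weighted by gradient magnitude; returns the orientation (degrees)
    of every local peak within 80% of the highest peak."""
    bins = ((np.degrees(theta) % 360.0) / 10.0).astype(int) % 36
    hist = np.bincount(bins.ravel(), weights=m.ravel(), minlength=36)
    threshold = 0.8 * hist.max()
    peaks = [b for b in range(36)
             if hist[b] >= threshold
             and hist[b] > hist[(b - 1) % 36]
             and hist[b] > hist[(b + 1) % 36]]
    return [(b + 0.5) * 10.0 for b in peaks]  # bin centers, in degrees
```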
B.5 Key Point Descriptor
The previous operations have assigned an image location, scale, and
orientation to each key point. These parameters impose a repeatable local 2D
coordinate system in which to describe the local image region, and therefore provide
invariance to these parameters. The next step is to compute a descriptor for the local
image region that is highly distinctive yet is as invariant as possible to remaining
parameters, such as change in illumination or 3D viewpoint.
Figure B.4 illustrates the computation of the key point descriptor. First the
image gradient magnitudes and orientations are sampled around a key point, using
the scale of the key point to select the level of Gaussian blur for the image. For
efficiency, the gradients are precomputed for all levels of the pyramid as described
in Section B.4. These are illustrated with small arrows at each sample location on
the left side of Figure B.4. A Gaussian weighting function with σ equal to one half
the width of the descriptor window is used to assign a weight to the magnitude of
each sample point. This is illustrated with a circular window on the left side of
Figure B.4, although, of course, the weight falls off smoothly. The purpose of this
Gaussian window is to avoid sudden changes in the descriptor with small changes in
the position of the window, and to give less emphasis to gradients that are far from
the center of the descriptor.
Figure B.4: Key Point Descriptor.
The key point descriptor, shown on the right side of Figure B.4, uses eight directions for each orientation histogram, with the length of each arrow corresponding to the magnitude of that histogram entry. The descriptor is formed
from a vector containing the values of all the orientation histogram entries,
corresponding to the lengths of the arrows on the right side of Figure B.4. The figure
shows a 2×2 array of orientation histograms, whereas our experiments used 4×4 arrays of histograms with 8 orientation bins in each, forming a 4×4×8 = 128-element feature vector for each key point. These key point descriptors are highly distinctive,
which allows a single feature to find its correct match with good probability in a
large database of features.
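A much-simplified sketch of how the 128-element vector is assembled from a 16×16 patch of gradients around a key point; it omits the Gaussian weighting window, the rotation to the key point orientation, and the trilinear interpolation of the full algorithm, and the final normalization (a standard step against illumination change) is an assumption here:

```python
import numpy as np

def descriptor_128(m, theta):
    """Form a 4x4 array of 8-bin orientation histograms from 16x16
    gradient magnitudes m and orientations theta (radians), giving
    the 4x4x8 = 128-element descriptor."""
    assert m.shape == theta.shape == (16, 16)
    desc = np.zeros((4, 4, 8))
    bins = ((np.degrees(theta) % 360.0) / 45.0).astype(int) % 8
    for y in range(16):
        for x in range(16):
            desc[y // 4, x // 4, bins[y, x]] += m[y, x]
    v = desc.ravel()
    return v / (np.linalg.norm(v) + 1e-12)  # unit-normalize the vector
```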
B.6 Matching
The 128-dimensional key point descriptors are stored in the database for the training set of objects. Once the test image, i.e., the key frame from the video summarization module, is obtained, it is processed for SIFT features, and these features are matched against the database using Euclidean distance. If there is strong evidence for the presence of an object, the object tag is annotated with the key frame.
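A minimal sketch of this matching step. The "strong evidence" criterion is implemented here as a nearest-neighbor distance-ratio test, a common choice for SIFT matching but an assumption rather than the text's stated rule:

```python
import numpy as np

def match_descriptors(test_desc, db_desc, ratio=0.8):
    """Match each 128-D test descriptor against the database by Euclidean
    distance, accepting a match only when the nearest database entry is
    clearly closer than the second nearest."""
    test = np.asarray(test_desc, dtype=float)
    db = np.asarray(db_desc, dtype=float)
    matches = []
    for i, d in enumerate(test):
        dists = np.linalg.norm(db - d, axis=1)
        nearest, second = np.argsort(dists)[:2]
        if dists[nearest] < ratio * dists[second]:
            matches.append((i, int(nearest)))  # (test index, database index)
    return matches
```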
APPENDIX C
TOOLBOXES
C.1 Saliency Toolbox
The Saliency Toolbox is a collection of Matlab functions and scripts for
computing the saliency map for an image, for determining the extent of a proto-
object, and for serially scanning the image with the focus of attention. It has been
cited in more than 100 papers.
System requirements:
Any computer and operating system that runs Matlab Release 13 or later, with the Image Processing Toolbox. The toolbox contains pre-compiled binary mex files for MS Windows, Mac OS X (both PowerPC and Intel Macs), and Linux (32-bit and 64-bit). The source code can be compiled on any system with the GNU C compiler gcc.
The Saliency Toolbox is in part a reimplementation of the iNVT toolkit at
Laurent Itti's lab at USC. This toolbox complements the iNVT code in that it is
more compact (about 5,000 versus 360,000 lines of code) and easier to understand
and experiment with, but it only contains the core functionality for attending to
salient image regions.
Although time-critical procedures are contained in mex files, processing an image with the Saliency Toolbox in Matlab takes longer than with the iNVT code. Whenever processing speed or feature richness is paramount, the iNVT code should be preferred. For computing the saliency map or attending to salient proto-objects in an image in a transparent and platform-independent way, the Saliency Toolbox is a
good choice.
C.2 WEKA Introduction
WEKA stands for Waikato Environment for Knowledge Analysis. WEKA is
a collection of machine learning algorithms for data mining tasks. The algorithms