Multimedia Indexing and Retrieval Research at the Center for Intelligent Information Retrieval

R. Manmatha*
Multimedia Indexing and Retrieval Group
Center for Intelligent Information Retrieval
Computer Science Department
University of Massachusetts, Amherst, MA 01003
[email protected]

* This material is based on work supported in part by the National Science Foundation, Library of Congress and Department of Commerce under cooperative agreement number EEC-9209623, in part by the United States Patent and Trademarks Office and the Defense Advanced Research Projects Agency/ITO under ARPA order number D468, issued by ESC/AXS contract number F19628-95-C-0235, in part by NSF IRI-9619117 and in part by NSF Multimedia CDA-9502639. Any opinions, findings and conclusions or recommendations expressed in this material are the author's and do not necessarily reflect those of the sponsors.
Abstract

The digital libraries of the future will include not only (ASCII) text information but scanned paper documents as well as still photographs and videos. There is, therefore, a need to index and retrieve information from such multi-media collections. The Center for Intelligent Information Retrieval (CIIR) has a number of projects to index and retrieve multi-media information. These include:

1. The extraction of text from images, which may be used both for finding text zones against general backgrounds and for indexing and retrieving image information.

2. Indexing handwritten and poorly printed documents using image matching techniques (word spotting).

3. Indexing images using their content.
1 Introduction

The digital libraries of the future will include not only (ASCII) text information but scanned paper documents as well as still photographs and videos. There is, therefore, a need to index and retrieve information from such multi-media collections. The Center for Intelligent Information Retrieval (CIIR) has a number of projects to index and retrieve multi-media information. These include:
1. Finding Text in Images: The conversion of scanned documents into ASCII so that they can be indexed using INQUERY (CIIR's text retrieval engine). Current Optical Character Recognition (OCR) technology can convert scanned text to ASCII,
but it is limited to good, clean, machine-printed fonts against clean backgrounds. Handwritten text, text printed against shaded or textured backgrounds, and text embedded in images cannot be recognized well (if at all) with existing OCR technology. Many financial documents, for example, print text against shaded backgrounds to prevent copying.

The Center has developed techniques to detect text in images. The detected text is then cleaned up, binarized, and run through a commercial OCR engine. Such techniques can be applied to zoning text found against general backgrounds as well as to indexing and retrieving images using the associated text.
2. Word Spotting: The indexing of handwritten and poorly printed documents using image matching techniques. Libraries hold vast collections of original handwritten manuscripts, many of which have never been published. Word spotting can be used to create indices for such handwritten manuscript archives.

3. Image Retrieval: Indexing images using their content. The Center has also developed techniques to index and retrieve images by color and appearance.
2 Finding Text in Images

Most of the information available today is either on paper or in the form of still photographs and videos. To build digital libraries, this large volume of information needs to be digitized into images and the text converted to ASCII for storage, retrieval, and easy manipulation. For example, video sequences of events such as a basketball game can be annotated and indexed by extracting a player's number, name and team name as they appear on the player's uniform (Figure 1(b, c)). This may be combined with methods for image indexing and retrieval based on image content (see Section 3).

Current OCR technology [1, 20] is largely restricted to finding text printed against clean backgrounds, since
[Figure 1: system diagram and example images not reproduced]

Figure 1: The system, example input image, and extracted text. (a) The top-level components of the text detection and extraction system (Texture Segmentation, Chip Generation, Chip Scale Fusion, Text Clean-up, Chip Refinement, and Character Recognition); the pyramid of the input image is shown as I, I1, I2, ...; (b) an example input image; (c) the output of the system before being fed to the Character Recognition module.
in these cases it is easy to binarize the input images to extract text (text binarization) before character recognition begins. It cannot handle text printed against shaded or textured backgrounds, nor text embedded in pictures. More sophisticated text reading systems usually employ page segmentation schemes to identify text regions; an OCR module is then applied only to the text regions to improve its performance. Some of these schemes [32, 33, 21, 23] are top-down approaches, some are bottom-up methods [7, 22], and others are based on texture segmentation techniques in computer vision [8]. However, the top-down and bottom-up approaches usually require the input image to be binary and to have a Manhattan layout. Although the approach in [8] can in principle be applied to greyscale images, it was only used on binary document images; in addition, the text binarization problem was not addressed. In summary, few working systems have been reported that can read text from document pages with both structured and non-structured layouts. A brief overview of a complete automatic text reading system developed at CIIR is presented here (for more details see [34, 35]).
2.1 System Overview

The system takes advantage of the following distinctive characteristics of text which make it stand out from other image information: (1) text possesses distinctive frequency and orientation attributes; (2) text shows spatial cohesion: characters of the same text string are of similar height, orientation and spacing.

The first characteristic suggests that text may be treated as a distinctive texture, and thus segmented out using texture segmentation techniques. The first phase of the system is therefore Texture Segmentation, as shown in Figure 1(a). In the Chip Generation phase, strokes are extracted from the segmented text regions. Using reasonable heuristics on text strings based on the second characteristic, the extracted strokes are then processed to form tight rectangular bounding boxes around the corresponding text strings. To detect text over a wide range of font sizes, the above steps are applied to a pyramid of images generated from the input image, and the boxes formed at each resolution level of the pyramid are then fused at the original resolution. A Text Clean-up module, which removes the background and binarizes the detected text, is applied to extract the text from the regions enclosed by the bounding boxes. Finally, the text bounding boxes are refined (re-generated) by using the extracted items as strokes. These new boxes usually bound the text strings better. The Text Clean-up process is then carried out on the regions bounded by these new boxes to extract cleaner text, which can then be passed through a commercial OCR engine for recognition if the text is of an OCR-recognizable font. The phases of the system are discussed in the following sections.
2.2 The Texture Segmentation Module

A standard approach to texture segmentation is to first filter the image using a bank of linear filters, such as Gaussian derivatives [11] or Gabor functions, followed by a non-linear transformation such as the hyperbolic function tanh. Features are then computed from the filtered images to form a feature vector for each pixel, and these feature vectors are classified to segment the textures into different classes (for more details see [34, 35]).
Figure 2(a) shows a portion of an original input image with a variety of textual information to be extracted. There is text on a clean dark background, text printed on Stouffer's boxes, Stouffer's trademarks (in script), and a picture of the food. Figure 2(b) shows the final segmented text regions.
[Figure 2 images not reproduced]

Figure 2: Results of Texture Segmentation and Chip Generation. (a) Portion of an input image; (b) the final segmented text regions; (c) extracted strokes; (d) text chips mapped onto the input image.
[Figure 3 images not reproduced]

Figure 3: The scale problem and its solution. (a) Chips generated for the input image at full resolution; (b) at half resolution; (c) at quarter resolution; (d) chips generated at all three levels mapped onto the input image, with scale-redundant chips removed.
2.3 The Chip Generation Phase

In practice, text may occur in images with complex backgrounds and texture patterns, such as foliage, windows, and grass. Thus, some non-text patterns may pass the filters and initially be misclassified as text (Figure 2(b)). Furthermore, segmentation accuracy at texture boundaries is a well-known and difficult problem in texture segmentation. Consequently, it is often the case that text regions are connected to other regions which do not correspond to text, or one text string might be connected to another text string of a different size or intensity. This can cause problems for later processing. For example, if two text strings with significantly different intensity levels are joined into one region, a single intensity threshold might not separate both text strings from the background.

Therefore, heuristics need to be employed to refine the segmentation result. Since the segmentation process usually finds the text regions while excluding most non-text regions, these regions can be used to direct further processing (focus of attention). Furthermore, since text is intended to be readable, there is usually a significant contrast between it and the background, and this contrast can be utilized in finding text. It is also usually the case that characters in the same word, phrase or sentence are of the same font and have similar heights and inter-character spacing. Finally, characters in a horizontal text string are horizontally aligned. All of these heuristics are incorporated in the Chip Generation phase in a bottom-up fashion: significant edges form strokes (Figure 2(c)), and strokes from the segmented regions are aggregated to form chips corresponding to text strings. The rectangular bounding boxes of the chips indicate where the hypothesized (detected) text strings are (Figure 2(d)). These steps are described in detail in [34, 35].
2.4 A Solution to the Scale Problem

The frequency channels used in the segmentation process cover text over a certain range of font sizes; text of larger font sizes is either missed or fragmented. This is called the scale problem. Intuitively, the larger the font size of the text, the lower the frequency it possesses. Thus, when the text font size gets too large, its frequency falls outside the channels selected in Section 2.2.

A pyramid approach (Figure 1(a)) is used to solve the scale problem: a pyramid of the input image is formed and each image in the pyramid is processed as described in the previous sections. At the bottom of the pyramid is the original image; the image at each level (other than the bottom) has half the resolution of the image one level below.
[Figure 4 images not reproduced]

Figure 4: Binarization results before and after the Chip Refinement step. (a) Input image; (b) binarization result before refinement; (c) after refinement.
Text of smaller font sizes can be detected using the images lower in the pyramid (Figure 3(a)), while text of larger font sizes is found using the images higher in the pyramid (Figure 3(c)). The bounding boxes of detected text regions at each level are mapped back to the original input image and the redundant boxes are then removed, as shown in Figure 3(d). Details are presented in [34, 35].
2.5 Text on Complex Backgrounds

The previous sections describe a system which detects text in images and puts boxes around the detected text strings in the input image. Since text may be printed against complex image backgrounds, which current OCR systems cannot handle well, it is desirable to have the backgrounds removed first. In addition, OCR systems require that the text be binarized before actual recognition starts. In this system, background removal and text binarization are done by applying an algorithm to the text boxes individually, instead of trying to binarize the input image as a whole. This allows the process to adapt to the individual context of each text string. The details of the algorithm are in [34, 35].
2.6 Text Refinement

Sometimes non-text items are identified as text. In addition, the bounding boxes of the chips sometimes do not tightly surround the text strings. The consequence of these problems is that non-text items may occur in the binarized image produced by mapping the extracted items onto the original page. An example is shown in Figure 4(a, b). These non-text items are not desirable.

However, by treating the extracted items as strokes, the Chip Refinement module, which is essentially similar to the Chip Generation module but with stronger constraints, can be applied to eliminate the non-text items and hence form tighter text bounding boxes. This works because (1) the clean-up procedure is able to extract most characters without attaching them to nearby characters and non-text items (Figure 4(b)), and (2) most of the strokes at this stage are composed of complete or almost complete characters, as opposed to the vertical connected edges of the characters in the initial processing. Thus the correct text strokes can be expected to comply more consistently with the heuristics used in the earlier Chip Generation phase. The significant improvement is clearly shown in Figure 4(c).
2.7 Experiments

The system has been tested on a set of images from a wide variety of sources: digitized video frames, photographs, newspapers, advertisements in magazines or sales flyers, and personal checks. Some of the images have regular page layouts, others do not. It should be pointed out that all the system parameters remain the same throughout the entire set of test images, showing the robustness of the system.
Characters and words (as perceived by one of the authors) were counted in each image as ground truth. The totals over the whole test set are shown in the "Total Perceived" column of Table 1. The detected characters and words are those which are completely enclosed by the boxes produced after the Chip Scale Fusion step; their totals over the entire test set are shown in the "Total Detected" column. Characters and words clearly readable by a person after the Chip Refinement and Text Clean-up steps (the final extracted text) are also counted for each image, with the totals shown in the "Total Clean-up" column. The column "Total OCRable" shows the total numbers of cleaned-up characters and words that appear to be of OCR-recognizable fonts in the 35 binarized images. Note that only text which is horizontally aligned is counted (skew angle of the text string less than roughly 30 degrees; see footnote 1). The "Total OCRed" column shows the numbers of characters and words from the "Total OCRable" sets correctly recognized by Caere's commercial WordScan OCR engine.
Figure 5(a) is a portion of an original input image which has no structured layout. The final binarization result is shown in (b) and the corresponding OCR output is shown in (c). Notice that most of the text is detected, and most of the text in machine-printed fonts is correctly recognized by the OCR engine. It should be pointed out that the cleaned-up output looks fine to a person in the places where the OCR errors occurred.
3 Word Spotting: Indexing Handwritten Archival Manuscripts

There are many historical manuscripts written in a single hand which it would be useful to index. Examples include the W. E. B. DuBois collection at the University of Massachusetts, Margaret Sanger's collected works at Smith College, and the early Presidential libraries at the Library of Congress. Such manuscripts are valuable resources for scholars as well as others who wish to consult the original manuscripts, and considerable effort has gone into manually producing indices for them. For example, a substantial collection of Margaret Sanger's work has recently been put on microfilm (see http://MEP.cla.sc.edu/Sanger/SangBase.HTM) with an item-by-item index. These indices were created manually. The indexing scheme described here will help in the automatic creation and production of indices and concordances for such archives.

One solution is to use Optical Character Recognition (OCR) to convert scanned paper documents into ASCII.
[Footnote 1] Here, the focus is on finding horizontal, linear text strings only. The issue of finding text strings of any orientation will be addressed in future work.
Table 1: Summary of the system's performance. The full set of test images was used for detection and clean-up; out of these, 35 binarized images were used for the OCR process.

          Total      Total          Total     Total    Total
          Perceived  Detected       Clean-up  OCRable  OCRed
    Char  21820      20788 (95%)    91%       14703    12428 (84%)
    Word  4406       4139 (93%)     86%       2981     2314 (77%)
[Figure 5 images not reproduced]

Figure 5: Example 1. (a) Original image (ads11); (b) extracted text; (c) the OCR result using Caere's WordScan Plus 4.0 on (b).
Existing OCR technology works well with standard machine-printed fonts against clean backgrounds. It works poorly if the originals are of poor quality or if the text is handwritten. Since Optical Character Recognition (OCR) does not work well on handwriting, an alternative scheme based on matching the images of the words was proposed by us in [18, 17, 15] for indexing such texts. A brief summary of the work is presented here.
Since the document is written by a single person, the assumption is that the variation in the word images will be small. The proposed solution first segments the page into words and then matches the actual word images against each other to create equivalence classes. Each equivalence class consists of multiple instances of the same word, and each word has a link to the page it came from. The number of words in each equivalence class is tabulated. The classes with the largest numbers of words will probably be stopwords, i.e. conjunctions such as "and" or articles such as "the". Classes containing stopwords are eliminated, since they are not very useful for indexing. A list is made of the remaining classes, ordered by the number of words contained in each class. The user provides ASCII equivalents for a representative word in each of the top m (say m = 2000) classes. The words in these classes can now be indexed. This technique is called "word spotting", as it is analogous to word spotting in speech processing [9].
The proposed solution completely avoids machine recognition of handwritten words, as this is a difficult task [20]. Robustness is achieved compared to OCR systems for two reasons:

1. Matching is based on entire words. This is in contrast to conventional OCR systems, which essentially recognize characters rather than words.

2. Recognition is avoided. Instead, a human is placed in the loop when ASCII equivalents of the words must be provided.
Some of the matching aspects of the problem are discussed here (for a discussion of page segmentation into words, see [18]). The matching phase is expected to be the most difficult part of the problem, because, unlike machine fonts, there is some variation in even a single person's handwriting, and this variation is difficult to model. Figure 6 shows two examples of the word "Lloyd" written by the same person. The last image is produced by XOR'ing these two images; the white areas in the XOR image indicate where the two versions of "Lloyd" differ. This result is not unusual. In fact, the differences are sometimes even larger.

The performance of two different matching techniques is discussed here. The first, based on Euclidean distance mapping [2], assumes that the deformation between words can be modelled by a translation (shift). The second, based on an algorithm by Scott and Longuet-Higgins [28], models the transformation between words using an affine transform.
[Figure 6 images not reproduced]

Figure 6: Two examples of the word "Lloyd" and the XOR image.
3.1 Prior Work

The traditional approach to indexing documents involves first converting them to ASCII and then using a text-based retrieval engine [30]. Scanned documents printed in standard machine fonts against clean backgrounds can be converted into ASCII using an OCR [1]. However, handwriting is much more difficult for OCRs to handle because of the wide variability present in handwriting: not only is there variability between writers, but a given person's writing also varies.
Image matching of words has been used to recognize words in documents which use machine fonts [5, 10]; recognition rates are much higher than when the OCR is used directly [10]. Machine fonts are simpler to match than handwritten words since the variation is much smaller; multiple instances of a given word printed in the same font are identical except for noise. In handwriting, however, multiple instances of the same word on the same page by the same writer show variations. The first two pictures in Figure 6 are two instances of the same word from the same document, written by the same writer. It may thus be necessary to account for these variations.
3.2 Outline of the Algorithm

1. A scanned greylevel image of the document is obtained.

2. The image is reduced by half by Gaussian filtering and subsampling.

3. The reduced image is then binarized by thresholding.

4. The binary image is segmented into words. This is done by a process of smoothing and thresholding (see [18]).

5. A given word image (i.e. the image of a word) is used as a template and matched against all the other word images. This is repeated for every word in the document. The matching is done in two phases. First, the number of words to be matched is pruned using the areas and aspect ratios of the word images: the word to be matched cannot have an area or aspect ratio which is too different from the template's. Next, the actual matching is done using a matching algorithm. Two different matching algorithms are tried here; one accounts only for translation shifts, while the other accounts for affine matches. The matching divides the word images into equivalence classes, each class presumably containing the instances of a single word.

6. Indexing is done as follows. For each equivalence class, the number of elements in it is counted, and the top n equivalence classes are determined. The equivalence classes with the highest numbers of words (elements) are likely to be stopwords (i.e. conjunctions like "and", articles like "the", and prepositions like "of") and are therefore eliminated from further consideration. Assume that of the top n classes, m are left after the stopwords have been eliminated. The user is then shown one member of each of these m equivalence classes and assigns its ASCII interpretation. These m words can now be indexed anywhere they appear in the document (a sketch of the whole loop is given after this list).
We will now discuss the matching techniques in detail.
3.3 Determination of Equivalence Classes

The list of words to be matched is first pruned using the areas and aspect ratios of the word images. The pruned list of words is then matched using a matching algorithm.
3.4 Pruning

It is assumed that

    \frac{1}{\alpha} \le \frac{A_{template}}{A_{word}} \le \alpha    (1)

where $A_{template}$ is the area of the template and $A_{word}$ is the area of the word to be matched. Typical values of $\alpha$ used in the experiments range between 1.2 and 1.3. A similar filtering step is performed using the aspect ratios (i.e. the width/height ratios). It is assumed that

    \frac{1}{\beta} \le \frac{R_{template}}{R_{word}} \le \beta    (2)

where $R$ denotes the aspect ratio. The values of $\beta$ used in the experiments range between 1.4 and 1.7. In both of the above equations the exact factors are not critical, but they should be neither so small that valid words are omitted nor so large that too many words are passed on to the matching phase. The pruning values may be determined automatically by running statistics on samples of the document [15].
3.5 Matching

The template is then matched against the image of each word in the pruned list. The matching function must satisfy two criteria:

1. It must produce a low match error for words which are similar to the template.

2. It must produce a high match error for words which are dissimilar.
Two matching algorithms have been tried. The first, Euclidean Distance Mapping (EDM), assumes that no distortions have occurred except for relative translation, and is fast. This algorithm usually ranks the matched words in the correct order (i.e. valid words first, followed by invalid words) when the variation in the words is not too large. Although it returns the lowest errors for words which are similar to the template, it also returns low errors for some words which are dissimilar to the template. The second algorithm [28], referred to as SLH here, assumes an affine transformation between the words and thus compensates for some of the variation in the words. This algorithm not only ranks the words in the correct order for all examples tried so far, it also seems to discriminate better between valid and invalid words. As currently implemented, the SLH algorithm is much slower than the EDM algorithm (we expect to be able to speed it up).
3.6 Using Euclidean Distance Mapping for Matching

This approach is similar to that used by [6] to match machine-generated fonts. A brief description of the method follows (more details are available in [18]).

Consider two images to be matched. The matching proceeds in the following steps:

1. First, the images are roughly aligned. In the vertical direction, this is done by aligning the baselines of the two images; in the horizontal direction, the images are aligned by making their left-hand sides coincide. The alignment is therefore expected to be accurate in the vertical direction and less so in the horizontal direction. This is borne out in practice.

2. Next, the XOR image is computed by XOR'ing corresponding pixels (see Figure 6).

3. A Euclidean distance mapping [2] is computed from the XOR image by assigning to each white pixel its minimum distance to a black pixel. Thus a white pixel inside a blob is assigned a larger distance than an isolated white pixel. An error measure can now be computed by summing the distance values over all pixels.

4. Although an approximate translation has been computed in step 1, it may not be accurate and may need to be fine-tuned. Steps (2) and (3) are therefore repeated while sampling the translation space in both x and y, and the minimum error measure over all the translation samples is taken as the match error (a sketch of steps 2-4 is given after this list).
3.7 The SLH Algorithm for Matching

The EDM algorithm does not discriminate well between good and bad matches. In addition, it fails when there is significant distortion in the words; this happens with the writing of Erasmus Hudson (Figure 7). Thus a matching algorithm which models some of the variation is needed. A second matching algorithm (SLH), which models the distortion as an affine transformation, was therefore tried (note that the real variation is probably much more complex). An affine transform is a linear transformation between coordinate systems. In two dimensions, it is described by

    x' = A x + t    (3)

where $t$ is a 2-D vector describing the translation, $A$ is a 2 by 2 matrix which captures the deformation, and $x$ and $x'$ are the coordinates of corresponding points in the two images between which the affine transformation must be recovered. An affine transform allows for the following deformations: scaling in both directions, shear in both directions, and rotation.

The algorithm chosen here is one proposed by Scott and Longuet-Higgins [28] (see also [16]). The algorithm recovers the correspondence between two sets of points I and J under an affine transform.

The sets I and J are created as follows: every white pixel in the first image is a member of the set I, and similarly every white pixel in the second image is a member of the set J. First, the centroid of each point set is computed and the origin of its coordinate system is set at the centroid. The SLH algorithm is then used to compute the correspondence between the point sets.
Given the above correspondence between point sets I and J, the affine transform (A, t) can be determined by minimizing the following least-squares criterion:

    E_{aff} = \sum_i \| A p_i + t - q_i \|^2    (4)

where $p_i$ and $q_i$ are the (x, y) coordinates of the i-th pair of corresponding points in I and J respectively.

The recovered values are then plugged back into the above equation to compute the error $E_{aff}$, which is an estimate of how dissimilar two words are; the words can therefore be ranked according to it.

It is assumed that the variation for valid words is not too large. This implies that if $a_{11}$ or $a_{22}$ (the diagonal entries of $A$) is considerably different from 1, the word is probably not a valid match.

Note: the SLH algorithm assumes that pruning on the basis of the area and aspect ratio thresholds has already been performed.
3.8 Experiments

The two matching techniques were tested on two handwritten pages, each written by a different writer.
[Figure 7 image not reproduced]

Figure 7: Part of a page from the collected papers of the Hudson family.
The first page can be obtained from the DIMUND document server on the internet (http://documents.cfar.umd.edu/resources/database/handwriting.database.html); it will be referred to as the Senior document. The handwriting on this page is fairly neat (see [18] for a picture). The second page is from an actual archival collection, the Hudson collection from the library of the University of Massachusetts (part of the page is shown in Figure 7). This page is part of a letter written by James S. Gibbons to Erasmus Darwin Hudson. The handwriting on this page is difficult to read, and the indexing technique helped in deciphering some of the words.
The experiments show examples of how the matching techniques work on a few words. For more examples of the EDM technique see [18]; for more examples using the SLH technique and comparisons with the EDM technique see [16]. In general, the EDM method ranks most words in the Senior document correctly but ranks some words in the Hudson document incorrectly. The SLH technique performs well on both documents.

Both pages were segmented into words (see [18] for details), and the algorithm was then run on the segmented words. In the following figures, the first word shown is the template; the remaining words are ranked according to their match error. Note that only the first few results of the matching are shown, although the template has been matched with every word on the page. The area threshold $\alpha$ was chosen to be 1.2 and the aspect ratio threshold $\beta$ was chosen as 1.4. The translation values were sampled to within a few pixels in the x direction and one pixel in the y direction; experimentally, this gave the best results.
3.9 Results Using Euclidean Distance Mapping

The Euclidean Distance Mapping algorithm works reasonably well on the Senior document. An example is shown below.

In Figure 8, the template is the word "Lloyd". The figure shows that the four other instances of "Lloyd" present in the document are ranked before any of the other words. As Table 2 shows, the match errors for the other instances of "Lloyd" are less than those for any other word. In the table, the first column is the token number (needed for identification purposes), the second column is a transcription of the word, the third column shows the area in pixels, the fourth gives the match error, and the last two columns specify the translation in the x and y directions respectively. Note the significant change in the areas of the words.

The performance on other words in the Senior document is comparable (for other examples see [18]), because the page is written fairly neatly. The performance of the method is expected to correlate with the quality of the handwriting. This was verified by running experiments on a page from the Hudson collection (Figure 7).
[Figure 8 images not reproduced]

Figure 8: Ranked matches for template "Lloyd" using the EDM algorithm (the rankings are ordered from left to right and from top to bottom).
The handwriting in the Hudson collection is difficult to read even for humans looking at grey-level images at 300 dpi. The writing shows wide variations in size; for example, the area of the word "to" varies by as much as 100%! However, this large a variation is not expected to occur, and is not seen, when the words are larger. Since humans have difficulty reading this material, we do not expect the method to perform very well on this document.
The Euclidean Distance Mapping technique fails for the template "Standard" in the Hudson document (see Figure 9). The failure occurs because the two instances of "Standard" are written differently: the template "Standard" has a gap between the "t" and the "a", and this gap is not present in the second example of "Standard" (this is more clearly visible in Figure 10). A technique to model some distortions is therefore necessary.
[Figure 9 images not reproduced]

Figure 9: Rankings for template "Standard" using the EDM algorithm (the rankings are ordered from left to right and from top to bottom).
3.10 Experiments Using the SLH Algorithm

The SLH algorithm handles affine distortions and is therefore more powerful than the EDM algorithm. Since the current version of the SLH algorithm is slow, the initial matches were pruned using the EDM algorithm and the SLH algorithm was then run on the pruned subset.
    Token  Word    Area  Match error  Xshift  Yshift
    105    Lloyd   1360  0.000         0       0
    70     Lloyd   1224  0.174         0       0
    165    Lloyd   1230  0.175        -2       0
    197    Lloyd   1400  0.194         4       0
    239    Lloyd   1320  0.197        -3       0
    21     Maybe   1147  0.199        -1       0
    180    along   1156  0.200         1       0
    215    party   1209  0.202         1       0
    245    spurt   1170  0.205        -1       0
    121    dreary  1435  0.206         3       0

Table 2: Rankings and match errors for template "Lloyd" using the EDM algorithm.
    Token  Word     Area  CP   E_aff   A (row 1 / row 2)         T
    105    Lloyd    1368  233  0.000   1.00  0.00 / 0.00  1.00    0.00 /  0.00
    197    Lloyd    1400  199  1.302   0.96 -0.04 / 0.01  1.04    1.58 /  0.14
    70     Lloyd    1224  176  1.356   0.94  0.09 / 0.03  0.92   -1.02 / -1.38
    165    Lloyd    1230  189  1.631   1.03  0.05 / -0.01 0.87   -0.43 / -2.60
    239    Lloyd    1320  203  1.795   0.99 -0.05 / 0.03  1.07    1.44 /  2.21
    157    lawyer   1518  185  3.393   0.96 -0.03 / 0.05  1.11    1.89 /  0.03
    240    Selwyn   1564  188  3.673   0.94  0.06 / 0.05  1.05   -4.23 / -0.75
    91     thought  1178  181  3.973   0.97  0.03 / -0.01 1.08    2.33 /  2.91

Table 3: Rankings and match errors for template "Lloyd" using the SLH algorithm. CP is the number of corresponding points recovered; A (given as row 1 / row 2) and T are the recovered affine matrix and translation.
Experiments were performed using both the Senior and Hudson documents. A few examples are shown here (for more details see [16]). For the Senior document the same pruning ratios were chosen as before. To account for the large variations in the Hudson papers, the area threshold $\alpha$ was fixed at 1.3 and the aspect ratio threshold at 1.7. The value of $\sigma$ (a parameter of the SLH algorithm) depends on the expected translation; since the expected translation is small, a correspondingly small value of $\sigma$ was used, and a lower value yielded poorer results.
The matches for the template "Lloyd" are shown in Table 3. The successive columns of the table give the token number, the transcription of the word, the area of the word image, the number of corresponding points recovered by the SLH algorithm, the match error $E_{aff}$ computed using the SLH algorithm, and the recovered affine transform. The entries are ranked according to the match error $E_{aff}$. If either $a_{11}$ or $a_{22}$ is less than 0.8 or greater than 1/0.8, the word is eliminated from the rankings. A comparison with Table 2 shows that the rankings change. This is true not only of the invalid words (for example, the sixth entry in Table 2 is "Maybe" while the sixth entry in Table 3 is "lawyer") but also of the "Lloyd"s. Both tables rank the instances of "Lloyd" ahead of other words. The SLH technique also shows much greater discrimination in match error: the match error for "lawyer" is almost double the match error for the fifth "Lloyd".
The method was also run on the Hudson document (Figure 7), and it ranked most of the words correctly. As an example, consider the word "Standard", which the EDM method did not rank correctly. The SLH method produces the correct ranking in spite of the significant distortions in the word (see Figure 10).
3.10.1 Recall-Precision Results

Indexing and retrieval techniques may be evaluated using recall and precision. Recall is defined as the "proportion of relevant documents actually retrieved", while precision is defined as the "proportion of retrieved documents that are relevant" [31]. Figure 11 shows the recall-precision results for both algorithms on the Senior document.
[Figure 10 images not reproduced]

Figure 10: Rankings for template "Standard" using the SLH algorithm (the rankings are ordered from left to right and from top to bottom).
The two EDM graphs are for two different values of the area ratio (1.22 and 1.3); notice that they do not differ significantly, showing that the exact value of the area ratio is not critical. The average precision for the EDM and SLH algorithms on the Senior document is 79.7% and 86.3% respectively. Note that SLH performs significantly better than EDM. Similar results are obtained with the Hudson document.
[Figure 11: recall-precision plot not reproduced]

Figure 11: Recall-precision results for the Senior document.
4 Image Retrieval

The indexing and retrieval of images using their content is a poorly understood and difficult problem. A person using an image retrieval system usually seeks semantic information. For example, the person may be looking for a picture of a leopard from a certain viewpoint, or for a picture of Abraham Lincoln from a particular viewpoint.
Retrieving semantic information using image content is difficult. The automatic segmentation of an image into objects is a difficult and unsolved problem in computer vision. However, many image attributes like color, texture, shape and "appearance" are often directly correlated with the semantics of the problem. For example, logos or product packages (e.g., a box of Tide) have the same color wherever they are found; the coat of a leopard has a unique texture, while Abraham Lincoln's appearance is uniquely defined. These image attributes can often be used to index and retrieve images.
The Center has carried out pioneering research in this area, conducting research in both color-based image retrieval and appearance-based image retrieval (the methods applied to appearance-based image retrieval may also be applied directly to texture-based image retrieval). We now discuss appearance-based retrieval (the reader is referred to [3] for a discussion of color-based retrieval).
4.1 Retrieval by Appearance

Some attempts have been made to retrieve objects using their shape [4, 24]. For example, the QBIC system [4], developed by IBM, matches binary shapes. It requires that the database be segmented into objects. Since automatic segmentation is an unsolved problem, this requires the user to manually outline the objects in the database. Clearly this is neither desirable nor practical.
Except for certain special domains, all methods based on shape are likely to have the same problem. An object's appearance depends not only on its three-dimensional shape, but also on the object's albedo, the viewpoint from which it is imaged, and a number of other factors. It is non-trivial to separate the different factors constituting an object's appearance; for example, it is usually not possible to separate an object's three-dimensional shape from the other factors.
The Center has overcome this difficulty by developing methods to retrieve objects using their appearance [26, 27, 19, 25]. The methods involve finding objects similar in appearance to an example object specified by the query.

To the best of our knowledge, ours is the first general query-by-appearance image retrieval system. Systems have been built to retrieve specific objects like faces (e.g., [29]). However, these systems require a number of training examples, and it is not clear whether they can be generalized to retrieve other objects.
Some of the salient features of our system include:

1. The ability to retrieve "similar" images. This is in contrast with techniques which try to recover the same object: in our system, a car used as a query will also retrieve other cars, rather than only cars of a specific model.

2. The ability to retrieve images embedded in a background (see for example the cars in Figure 13, which appear against various backgrounds).

3. No prior manual segmentation of the database is required.

4. No training is required.

5. It can handle a range of variations in size.

6. It can handle 3D viewpoint changes of up to about 20 to 25 degrees.
The user constructs a query by taking an example picture and marking the regions which she considers important aspects of the object; the query may be refined later depending on the retrieval results. Consider, for example, the car shown in Figure 12. The user marks the region shown in the figure using a mouse. Notice that the region reflects the fact that wheels are central to a car. The user's query in this situation is to find visually similar objects (i.e., other cars) from a similar viewpoint (where the viewpoint can vary up to 25 degrees from the query).
The database images are filtered with derivatives of Gaussians at multiple scales; derivatives of the first and second order are used. Differential invariants (invariant to 2D rotation) are created from these derivatives [19, 25]. An inverted list is constructed from the invariants and indexed using the value of each invariant. The entire computation may be carried out off-line.
The on-line computation consists of calculating the invariants for points in the query (which is a region in the image). Points with similar invariant values are then recovered from the database by indexing on the invariant values. The points obtained by indexing must also satisfy certain spatial constraints: the values of the invariants at a pixel and at some of its neighbors must match. This ensures that the indexing scheme preserves the spatial layout of objects. Points which satisfy this spatial relationship vote, and the database images are ranked on the basis of this vote.
The scheme described above works if the object is roughly the same size in the query and in the database images. In practice it is quite common for objects to be of different sizes in a database. The variation in size is handled by searching over scale space: the query is filtered with Gaussian derivatives of different standard deviations [14, 13, 12] while the image is simultaneously warped. This allows objects over a range of sizes to be matched [26, 27].
The query is outlined by the user with a mouse (Figure 12). Figure 13 shows the results of a query: notice that a large number of cars with white wheels have been retrieved. For more examples, see [19, 25]. This retrieval was performed on a database of 1600 images taken from the Internet, the Library of Congress and other sources. The database consists of faces, monkeys, apes, cars, diesel and steam locomotives, and a few houses. Lighting and camera parameters are not known.

[Figure 12: query image not reproduced]

Figure 12: Car query for retrieval by indexing.
5 Conclusion

This paper has described the multimedia indexing and retrieval work being done at the Center for Intelligent Information Retrieval. Work on systems for finding text in images, indexing archival handwritten documents, and image retrieval by content has been described. This research is part of an ongoing effort focused on indexing and retrieving multimedia information in as many ways as possible. The work described here has many applications, principally in the creation of the digital libraries of the future.
6 Acknowledgements

This paper includes research contributions by Victor Wu and Srinivas Ravela of the multimedia indexing and retrieval group. David Hirvonen and Adam Jenkins provided programming support. Ed Riseman gave comments on some of this work. I would like to thank Bruce Croft and the CIIR for supporting this work, and Gail Giroux and the University of Massachusetts Library for the scanned page from the Hudson collection.
References

[1] M. Bokser. Omnidocument technologies. Proceedings of the IEEE, 80(7):1066–1078, 1992.

[2] Per-Erik Danielsson. Euclidean distance mapping. Computer Graphics and Image Processing, 14:227–248, 1980.

[3] M. Das, E. M. Riseman, and B. A. Draper. FOCUS: Searching for multi-colored objects in a diverse image database. Accepted to IEEE CVPR '97, June 1997.

[4] Myron Flickner et al. Query by image and video content: The QBIC system. IEEE Computer Magazine, pages 23–30, Sept. 1995.
[5] L. D. Wilcox, F. R. Chen, and D. S. Bloomberg. Spotting phrases in lines of imaged text. In Proceedings of the SPIE Conf. on Document Recognition II, volume 2422, pages 256–269, San Jose, CA, Feb. 1995.

[6] Paul Filiski and Jonathan J. Hull. Keyword selection from word recognition results using definitional overlap. In Third Annual Symposium on Document Analysis and Information Retrieval, UNLV, Las Vegas, pages 151–160, 1994.

[7] L. Fletcher and R. Kasturi. A robust algorithm for text string separation from mixed text/graphics images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6):910–918, Nov. 1988.

[8] Anil K. Jain and Sushil Bhattacharjee. Text segmentation using Gabor filters for automatic document processing. Machine Vision and Applications, 5, 1992.

[9] G. J. F. Jones, J. T. Foote, K. Sparck Jones, and S. J. Young. Video mail retrieval: The effect of word spotting accuracy on precision. In International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 309–316, 1995.

[10] Siamak Khoubyari and Jonathan J. Hull. Keyword location in noisy document image. In Second Annual Symposium on Document Analysis and Information Retrieval, UNLV, Las Vegas, pages 217–231, 1993.

[11] J. Malik and P. Perona. Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7(5):923–932, May 1990.

[12] R. Manmatha. Image matching under affine deformations. In Invited Paper, Proc. of the 27th Asilomar IEEE Conf. on Signals, Systems and Computers, pages 106–110, 1993.

[13] R. Manmatha. A framework for recovering affine transforms using points, lines or image brightnesses. In Proc. Computer Vision and Pattern Recognition Conference, pages 141–146, 1994.

[14] R. Manmatha. Measuring the affine transform using Gaussian filters. In Proc. 3rd European Conference on Computer Vision, pages 159–164, 1994.

[15] R. Manmatha and W. B. Croft. Word spotting: Indexing handwritten manuscripts. In Mark Maybury, editor, Intelligent Multi-media Information Retrieval. AAAI/MIT Press, April 1998.

[16] R. Manmatha, Chengfeng Han, and E. M. Riseman. Word spotting: A new approach to indexing handwriting. Technical Report CS-UM-95-105, Computer Science Dept., University of Massachusetts at Amherst, MA, 1995.

[17] R. Manmatha, Chengfeng Han, and E. M. Riseman. Word spotting: A new approach to indexing handwriting. In Proc. Computer Vision and Pattern Recognition Conference, pages 631–637, 1996.

[18] R. Manmatha, Chengfeng Han, E. M. Riseman, and W. B. Croft. Indexing handwriting using word matching. In Digital Libraries '96: 1st ACM International Conference on Digital Libraries, pages 151–159, 1996.

[19] R. Manmatha and S. Ravela. A syntactic characterization of appearance and its application to image retrieval. In Proceedings of the SPIE Conf. on Human Vision and Electronic Imaging II, volume 3016, San Jose, CA, Feb. 1997.

[20] S. Mori, C. Y. Suen, and K. Yamamoto. Historical review of OCR research and development. Proceedings of the IEEE, 80(7):1029–1058, July 1992.

[21] G. Nagy, S. Seth, and M. Viswanathan. A prototype document image analysis system for technical journals. Computer, pages 10–22, July 1992.

[22] Lawrence O'Gorman. The document spectrum for page layout analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 15(11):1162–1173, Nov. 1993.

[23] Theo Pavlidis and Jiangying Zhou. Page segmentation and classification. CVGIP: Graphical Models and Image Processing, 54(6):484–496, Nov. 1992.

[24] A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: Tools for content-based manipulation of databases. In Proc. Storage and Retrieval for Image and Video Databases II, SPIE, volume 185, pages 34–47, 1994.

[25] S. Ravela and R. Manmatha. Image retrieval by appearance. Accepted to the 20th Intl. Conf. on Research and Development in Information Retrieval (SIGIR '97), July 1997.

[26] S. Ravela, R. Manmatha, and E. M. Riseman. Image retrieval using scale-space matching. In Bernard Buxton and Roberto Cipolla, editors, Computer Vision - ECCV '96, volume 1 of Lecture Notes in Computer Science, Cambridge, U.K., April 1996. 4th European Conf. on Computer Vision, Springer.

[27] S. Ravela, R. Manmatha, and E. M. Riseman. Scale space matching and image retrieval. In Proc. DARPA Image Understanding Workshop, 1996.

[28] G. L. Scott and H. C. Longuet-Higgins. An algorithm for associating the features of two patterns. Proc. Royal Society of London B, 244:21–26, 1991.

[29] M. Turk and A. Pentland. Eigenfaces for recognition. J. of Cognitive Neuroscience, 3:71–86, 1991.

[30] H. R. Turtle and W. B. Croft. A comparison of text retrieval models. Computer Journal, 35(3):279–290, 1992.

[31] C. J. van Rijsbergen. Information Retrieval. Butterworths, 1979.

[32] F. Wahl, K. Wong, and R. Casey. Block segmentation and text extraction in mixed text/image documents. Computer Vision, Graphics and Image Processing, 20:375–390, 1982.

[33] D. Wang and S. N. Srihari. Classification of newspaper image blocks using texture analysis. Computer Vision, Graphics and Image Processing, 47:327–352, 1989.

[34] V. Wu, R. Manmatha, and E. M. Riseman. Finding text in images. Technical Report 97-09, Computer Science Department, University of Massachusetts, Amherst, MA, 1997.

[35] V. Wu, R. Manmatha, and E. M. Riseman. Finding text in images. Accepted to the Second ACM Intl. Conf. on Digital Libraries (DL '97), July 1997.
Figure 13: The results of the car query.