IMAGE SEGMENTATION OF HISTORICAL DOCUMENTS

Carlos A. B. Mello and Rafael D. Lins
Department of Electronics and Systems, UFPE, Brazil
{cabm, rdl}@cin.ufpe.br
ABSTRACT

This paper presents a new entropy-based segmentation algorithm for images of documents. The algorithm is used to eliminate the noise inherent to the paper itself, especially in documents written on both sides. It generates good-quality monochromatic images, increasing the hit rate of commercial OCR tools.
I. INTRODUCTION
We are interested in the processing and automatic transcription of historical documents from the nineteenth century onwards. Image segmentation [2] of this kind of document is more difficult than for more recent documents because, while the paper colour darkens with age, the printed part, either handwritten or typed, tends to fade. These two factors acting simultaneously narrow the discrimination gap between the two predominant colour clusters of documents. If a document is typed or written on both sides and the opacity of the paper is such as to allow the back printing to be visualized on the front side, the difficulty of obtaining a good segmentation increases enormously, as a new set of hues of paper and printing colours appears. Better filtering techniques are needed to filter out those pixels, reducing back-to-front noise.
The segmentation algorithm presented was applied to documents from Joaquim Nabuco's1 file [5,12], held by the Joaquim Nabuco Foundation (a research centre in Recife, Brazil). The segmentation process is used to generate high-quality greyscale or monochromatic images. Figure 1 shows the application of a nearest-colour algorithm, using Adobe Photoshop [10], for decreasing the colours of a sample document from Nabuco's bequest. As the document is written on both sides, the colour reduction process has not produced satisfactory results: the ink of one side of the paper interferes with the monochromatic image of the other side.
This paper introduces a new entropy-based segmentation algorithm and compares it with three of the most important entropy-based segmentation algorithms described in the literature. Two different grounds for comparison are presented: visual inspection of the filtered document and the response of Optical Character Recognition (OCR) tools.

1 Brazilian statesman, writer, and diplomat, one of the key figures in the campaign for freeing black slaves in Brazil, Brazilian ambassador to London (b. 1861, d. 1910).
II. ENTROPY-BASED SEGMENTATION

The documents of Nabuco's file are digitized at 200 dpi in true colour and then converted to 256-level greyscale format using the equation:

C = 0.3*R + 0.59*G + 0.11*B

where C is the new greyscale colour and R, G and B are, respectively, the Red, Green and Blue components of the palette of the original colour image.
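As an illustration, a minimal Python sketch of this conversion, assuming the image is held as an H x W x 3 NumPy array of 8-bit values (the function name and array representation are ours, not from the original system):

import numpy as np

def to_greyscale(rgb):
    # Convert a true-colour image (H x W x 3, 8-bit channels) to
    # 256-level greyscale using the luminance equation above.
    r = rgb[..., 0].astype(np.float64)
    g = rgb[..., 1].astype(np.float64)
    b = rgb[..., 2].astype(np.float64)
    c = 0.3 * r + 0.59 * g + 0.11 * b   # C = 0.3R + 0.59G + 0.11B
    return c.astype(np.uint8)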
Three segmentation algorithms based on the entropy function [1], applied to greyscale images, are studied here: Pun [9], Kapur et al. [3] and Johannsen [8].
A. Pun's Algorithm

Pun's algorithm analyses the entropy of the black pixels, Hb, and the entropy of the white pixels, Hw, bounded by the threshold value t. The algorithm chooses t so as to maximize the function H = Hb + Hw, where Hb and Hw are defined by:
$H_b = -\sum_{i=0}^{t} p[i] \log(p[i])$   (Eq. 1)

$H_w = -\sum_{i=t+1}^{255} p[i] \log(p[i])$   (Eq. 2)
where p[i] is the probability of colour i occurring in the image. The logarithm function is taken in base 256. Figure 2 presents the application of Pun's algorithm to the sample image shown in figure 1-left.
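For concreteness, a minimal Python sketch of the search just described, assuming the greyscale image is an 8-bit NumPy array (function and variable names are ours):

import numpy as np

def pun_threshold(grey):
    # Choose t maximizing H = Hb + Hw (Eqs. 1 and 2); the logarithm
    # is taken in base 256: log256(x) = ln(x) / ln(256).
    hist = np.bincount(grey.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    log256 = np.log(256.0)
    best_t, best_h = 0, -np.inf
    for t in range(255):
        pb = p[: t + 1]; pb = pb[pb > 0]   # colours 0..t (black side)
        pw = p[t + 1 :]; pw = pw[pw > 0]   # colours t+1..255 (white side)
        hb = -np.sum(pb * np.log(pb)) / log256   # Eq. 1
        hw = -np.sum(pw * np.log(pw)) / log256   # Eq. 2
        if hb + hw > best_h:
            best_t, best_h = t, hb + hw
    return best_t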
B. Kapur et al.'s Algorithm

Reference [3] defines a probability distribution A for the object and a distribution B for the background of the document image, such that:

A: $p_0/P_t,\ p_1/P_t,\ \ldots,\ p_t/P_t$
B: $p_{t+1}/(1 - P_t),\ p_{t+2}/(1 - P_t),\ \ldots,\ p_{255}/(1 - P_t)$

where $P_t = \sum_{i=0}^{t} p_i$ is the cumulative probability up to the threshold.
The entropy values Hw and Hb are evaluated using equations (1) and (2) above, with p[i] following the previous distributions. The maximization of the function Hw + Hb defines the threshold value t. The sample image of figure 1-left is segmented with this algorithm and the result is presented in figure 3.
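A corresponding sketch for this criterion, under the same assumptions as before (names ours; note that each class distribution is normalized by Pt or 1 - Pt before its entropy is taken):

import numpy as np

def kapur_threshold(grey):
    # Choose t maximizing Hb + Hw over the class-normalized
    # distributions A (object) and B (background).
    hist = np.bincount(grey.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    cum = np.cumsum(p)
    best_t, best_h = 0, -np.inf
    for t in range(255):
        Pt = cum[t]
        if Pt <= 0.0 or Pt >= 1.0:
            continue
        a = p[: t + 1] / Pt           # distribution A
        b = p[t + 1 :] / (1.0 - Pt)   # distribution B
        a = a[a > 0]; b = b[b > 0]
        h = -np.sum(a * np.log(a)) - np.sum(b * np.log(b))
        if h > best_h:
            best_t, best_h = t, h
    return best_t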
C. Johannsen's Algorithm

Another variation of an entropy-based algorithm is proposed by Johannsen, which tries to minimize the function Sb(t) + Sw(t), with:
$S_w(t) = \log\left(\sum_{i=t}^{255} p_i\right) + \frac{1}{\sum_{i=t}^{255} p_i}\left[E(p_t) + E\left(\sum_{i=t+1}^{255} p_i\right)\right]$

and

$S_b(t) = \log\left(\sum_{i=0}^{t} p_i\right) + \frac{1}{\sum_{i=0}^{t} p_i}\left[E(p_t) + E\left(\sum_{i=0}^{t-1} p_i\right)\right]$
where E(x) = -x log(x) and t is the threshold value. Figure 4 presents the application of this algorithm to the image of the document under study.
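A sketch of this minimization, following the equations as reconstructed above (names ours; degenerate thresholds where a class is empty or p_t = 0 are skipped):

import numpy as np

def johannsen_threshold(grey):
    # Choose t minimizing Sb(t) + Sw(t), with E(x) = -x log(x).
    hist = np.bincount(grey.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()

    def E(x):
        return -x * np.log(x) if x > 0.0 else 0.0

    best_t, best_s = 0, np.inf
    for t in range(1, 255):
        Pb = p[: t + 1].sum()   # sum of p_i for i = 0..t
        Pw = p[t:].sum()        # sum of p_i for i = t..255
        if p[t] <= 0.0 or Pb <= 0.0 or Pw <= 0.0:
            continue
        sb = np.log(Pb) + (E(p[t]) + E(p[: t].sum())) / Pb
        sw = np.log(Pw) + (E(p[t]) + E(p[t + 1 :].sum())) / Pw
        if sb + sw < best_s:
            best_t, best_s = t, sb + sw
    return best_t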
Figure 1. (left) Original image in 256 greyscale levels and (right) its monochromatic version generated by Photoshop.

Figure 2. Pun's algorithm applied to the document.

Figure 3. Kapur et al.'s segmentation.

Figure 4. Johannsen's segmentation.
III. A NEW SEGMENTATION ALGORITHM

The algorithm scans the image looking for the most frequent colour, which is likely to belong to the image background (the paper). This colour is used as the initial threshold value, t, to evaluate Hw and Hb as defined in equations (1) and (2).
The entropy H of the complete histogram of the image is also evaluated. It must be noticed that, in this new algorithm, the logarithmic function used to evaluate H, Hw and Hb is taken with a base equal to the product of the dimensions of the image: if the image has dimensions x by y, the logarithmic base is x.y. As can be seen in [4], this does not change the concept of entropy.
Using the value of H, two multiplicative factors, mw and mb, are defined following the rules:

If H ≤ 0.25, then mw = 2 and mb = 3
If 0.25 < H < 0.30, then mw = 1 and mb = 2.6
If 0.30 ≤ H < 0.305, then mw = 1 and mb = 2
If H ≥ 0.305, then mw = 0.8 and mb = 0.8
These values of mw and mb were found empirically after several experiments. At present, they can be applied to images of historical documents only; for any other kind of image, these values must be analysed again. We emphasise that this new algorithm was developed to work with images with the characteristics of historical documents.
The greyscale image is scanned again and each pixel i with colour[i] is turned white if:

colour[i]/256 ≥ (mw*Hw + mb*Hb)

Otherwise, its colour either remains the same (to generate a new greyscale image) or is turned to black (generating a monochromatic image).
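A minimal Python sketch of the whole procedure for the monochromatic case, following our reading of the rules above (function and helper names are ours):

import numpy as np

def entropy_segment(grey):
    # grey is an H x W uint8 image. The threshold t is the most
    # frequent colour (assumed to be paper); all entropies use the
    # logarithmic base H*W, so H, Hb and Hw fall roughly in [0, 1].
    height, width = grey.shape
    hist = np.bincount(grey.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    t = int(hist.argmax())
    base = np.log(height * width)

    def entropy(q):
        q = q[q > 0]
        return -np.sum(q * np.log(q)) / base

    H = entropy(p)             # entropy of the complete histogram
    Hb = entropy(p[: t + 1])   # Eq. 1 with the new base
    Hw = entropy(p[t + 1 :])   # Eq. 2 with the new base

    if H <= 0.25:
        mw, mb = 2.0, 3.0
    elif H < 0.30:
        mw, mb = 1.0, 2.6
    elif H < 0.305:
        mw, mb = 1.0, 2.0
    else:
        mw, mb = 0.8, 0.8

    white = (grey / 256.0) >= (mw * Hw + mb * Hb)
    return np.where(white, 255, 0).astype(np.uint8)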
This condition can be inverted, generating a new image where the pixels whose colours correspond to the ink are eliminated and only the pixels classified as paper remain. This new segmentation algorithm was used for two kinds of application: 1) to create high-quality monochromatic images of the documents, for minimum storage space and efficient network transmission; and 2) to seek better hit rates from commercial OCR tools. The application of the algorithm to the sample document of figure 1-left can be found in figure 5.
Figure 5. Application of the new segmentation algorithm to the document presented in figure 1-left.

Comparing figures 2, 3, 4 and 5, one can observe that the algorithm proposed in this paper yielded the best quality image, with most of the back-to-front interference removed. It is also important to notice that the new algorithm presented the lowest processing time amongst the algorithms analysed.
The entropy filtering presented here was applied to a set of 40 images of documents and letters from Nabuco's bequest. Unsatisfactory images, requiring the intervention of an operator, were produced in only four cases. Figure 6 zooms into one of these documents and the output obtained.
For typed documents (also from Nabuco's file), the segmentation algorithm was applied in search of better responses from commercial OCR tools. In previous tests [6], the OCR tool Omnipage [11] from Caere Corp. achieved the best hit rates1 amongst the six commercial tools analysed. These rates reached almost 99% in some cases. When applied to historical documents, however, the rates decreased to much lower values. The segmented images for a sample typed document can be seen in detail in figure 7.
Figure 6. (top left) Original image; (top right) original image in black-and-white; (centre left) original image segmented by Pun's algorithm; (centre right) application of Kapur et al.'s algorithm; (bottom left) Johannsen's algorithm; (bottom right) our algorithm applied to the original image.
The table below presents the hit rate of Omnipage for four typed documents representative of Nabuco's bequest, after segmentation with the four entropy-based algorithms presented here. They are compared with the use of the original image with no pre-processing besides that performed by the software itself (the column labelled Omnipage).
In one of the cases (the D064 image), a slight degradation in the hit rate of the software can be seen after the application of the new segmentation technique, when compared with its use on the original image. This degradation can be explained by a possible loss of parts of some characters in the segmentation process, producing errors in the character recognition process. Even so, the segmentation algorithm proposed in this paper reached the best rates on average.

1 Number of characters correctly transcribed from image to text.
Image   Omnipage   Johannsen   Pun    Kapur et al   New Scheme
D023    80.3       78.3        43.3   91.7          91.4
D064    84.4       84.5        63.7   85.2          80.1
D077    80.1       80.1        71.8   77.3          92.4
D097    75.4        5.1        69.5   73.4          88.0

Table 1. Hit rate of Omnipage (%) for images of typed historical documents.
Figure 7-bottom shows another application of the algorithm, as explained before, where the frequencies classified as ink are eliminated and only the background of the image (the paper) remains. This image is used in another part of the system, in the generation of paper textures for historical documents [7].

Figure 7. (top left) Original image; (top right) segmented image (ink); (bottom) negative segmentation (paper).
The algorithm was also tested against other segmentation methods, such as iterative selection, yielding better results both in OCR hit rates and in the visual quality of the monochromatic images.
IV. CONCLUSION

This paper introduces a new segmentation algorithm for historical documents, which is particularly suitable for reducing the back-to-front noise of documents written on both sides. Applied to a set of 40 samples from Nabuco's bequest, it worked satisfactorily in 90% of them, producing, under visual inspection, better quality images than the best-known algorithms described in the literature. The automatic image-to-text transcription of those documents using Omnipage 9.0, a commercial OCR tool from Caere Corp. [11], improved after segmentation. The algorithm presented did not work well with very faded documents. We are currently working on re-tuning the algorithm for this class of documents.
V. REFERENCES

[1] N. Abramson. Information Theory and Coding. McGraw-Hill Book Company, 1963.
[2] R. Gonzalez and P. Wintz. Digital Image Processing. Addison-Wesley, 1987.
[3] J. N. Kapur, P. K. Sahoo and A. K. C. Wong. A New Method for Gray-Level Picture Thresholding using the Entropy of the Histogram. Computer Vision, Graphics and Image Processing, 29(3), 1985.
[4] S. Kullback. Information Theory and Statistics. Dover Publications, Inc., 1997.
[5] R. D. Lins et al. An Environment for Processing Images of Historical Documents. Microprocessing & Microprogramming, pp. 111-121, North-Holland, 1995.
[6] C. A. B. Mello and R. D. Lins. A Comparative Study on Commercial OCR Tools. Vision Interface '99, pp. 224-323, Québec, Canada, 1999.
[7] C. A. B. Mello and R. D. Lins. Generating Paper Texture Using Statistical Moments. IEEE International Conference on Acoustics, Speech and Signal Processing, Istanbul, Turkey, June 2000.
[8] J. R. Parker. Algorithms for Image Processing and Computer Vision. John Wiley and Sons, 1997.
[9] T. Pun. Entropic Thresholding: A New Approach. Computer Graphics and Image Processing, 16(3), 1981.
[10] Adobe Systems Inc. http://www.adobe.com
[11] Caere Corporation. http://www.caere.com
[12] Nabuco Project. http://www.di.ufpe.br/~nabuco