text alignment

8/12/2019 text alignment

1/14

International Journal of Document Analysis (2007) 10(1):3952DOI 10.1007/s10032-006-0019-8

O R I G I N AL PA P E R

E. Micah Kornfield R. Manmatha James Allan

Further explorations in text alignment with handwritten documents

Received: 21 December 2004 / Revised: 24 July 2005 / Accepted: 15 March 2006 / Published online: 11 May 2006c Springer-Verlag 2006

Abstract Todays digital libraries increasingly include notonly printed text but also scanned handwritten pages and

other multimedia material. There are, however, few toolsavailable for manipulating handwritten pages. Here, we ex-tend our algorithm from [5] based on dynamic time warping(DTW) for a word by word alignment of handwritten doc-uments with (ASCII) transcripts. We specifically attempt toincorporate language modelling and parameter training intoour algorithm. In addition, we take a critical look at our eval-uation metrics. We see at least three uses for such alignmentalgorithms. First, alignment algorithms allow us to producedisplays (for example on the web) that allow a person to eas-ily find their place in the manuscript when reading a tran-script. Second, such alignment algorithms will allow us toproduce large quantities of ground truth data for evaluatinghandwriting recognition algorithms. Third, such algorithmsallow us to produce indices in a straightforward manner forhandwriting material. We provide experimental results ofour algorithm on a set of 100 pages of historical handwrit-ten materialspecifically the writings of George Washing-ton. Our method achieves average F -measure values of 68.3on line by line alignment and 57.8 accuracy when aligningwhole pages at time.

Keywords Aligning handwriting and transcript DynamicTime Warping

1 Introduction

A number of todays digital libraries contain handwrittenmaterial. Some of these libraries include both handwrittenmaterial and ASCII transcripts. An example of such adigital library is the Newton Project (http://www.newtonproject.ic.ac.uk) that proposes to create ASCII transcripts

E. M. Kornfield R. Manmatha J. Allan (B)Center for Intelligent Information Retrieval, Department ofComputer Science, University of Massachusetts Amherst, MA, USAE-mail: {emkorn, manmatha, allan}@cs.umass.edu

for Newtons handwritten manuscripts. Such historicalmanuscripts are hard to read. A word by word alignment

of the transcript and the handwritten manuscript wouldallow a person to easily read the manuscript. It would alsoallow him or her to find their place in the manuscript usingthe transcript. For example, one could display both themanuscript and the transcript and whenever the mouse isheld over a word in the transcript, the corresponding wordin the manuscript would be outlined using a box.

Such alignments have other applications. One such ap-plication is the ability to create ground truth data for evalu-ating handwriting recognition and retrieval algorithms [11].Effectively producing ground truth data for large collectionsof handwritten manuscripts is a manually intensive and la-borious process that requires a person to first create a tran-script based on the entire manuscript and then label indi-vidual words. The process of labelling can be avoided ifalignment algorithms are available. Alignment also allowsus to create an index for the manuscript. Specifically, thisallows one to search the manuscript by searching its ASCIItranscript. The alignment can then be used to highlight thesearch terms in the manuscript (as is done with conventionaltext search engines).

Creating such alignments is challenging since thetranscript is an ASCII document while the manuscript pageis an image. Handwriting recognition is not accurate enoughto recognize such large vocabulary historical documentcollections. We therefore propose an alternative approachto aligning such material. The handwritten page image isautomatically segmented. Features (for example box andtext position, aspect ratio etc) are then computed for boththe transcript and the page image. An algorithm based ondynamic time warping (DTW) is then used to align thewords on the page image and the transcript. We computealignments for whole pages and also for situations in whichone can assume that the beginning and end positions oflines are known. We show results on a set of 100 pages fromGeorge Washingtons handwriting.

Alignment is difficult because every step in the abovementioned approach produces errors. Segmentation of hand-


2/14

40 E. M. Kornfield et al.

writing is known to cause errorsboth over and undersegmentation occur. Since our corpus consists of scannedimages of old historical documents, there are even moreerrors. In addition, the alignment algorithm itself produceserrors.

Our prior work [5] consisted of using a DTW modelfor performing alignment. It assumed that images must be

aligned with ASCII text from the transcript. In this paperour research is extended in several ways. First, we take acloser look at the estimation of our DTW features. Sec-ond, we introduce language modelling into our frameworkby adapting methods described in [4]. Third, we train valuesfor feature weights and local path costs. Fourth, we exam-ine the possibility of using a more complex local continuityconstraint. Fifth, we introduce one more evaluation measureand provide empirical analysis of our proposed evaluationmeasures. Finally, we report results on a larger number ofdocuments (100 vs. 70).

The remainder of this paper is organized as follows.Section2discusses related work and how our approach dif-

fers. We then continue by formally defining the problem andnotation used for the rest of the paper in Sect.3. In Sect. 4 wediscuss the format of our data. Several baseline algorithmsare discussed in Sect. 5. Section6 goes over different eval-uation metrics for the alignment tasks. Our DTW algorithmis described in Sect. 7. We conclude with experimental re-sults in Sect.8 and our conclusions along with a discussionof future research paths in Sect. 9.

2 Previous work

2.1 Historical documents

Very little research has been done on aligning transcripts ofhistorical documents. As far as we know few people have ex-amined the problem of aligning transcripts with handwrittendocuments.

Tomai et al. [14] investigated aligning transcripts withhandwritten documents. The method they propose is to limitthe lexicon of a handwriting recognizer by using the tran-script. A ranked list of possible words from the lexicon isreturned for each recognized word image. Several differentlikely segmentations of a line are made. The segmentationthat has the highest probability given the transcript and pre-

vious alignments is then used. If a mapping cannot be per-formed with high enough confidence for a word then it is leftout.

Tomai et al. give a figure of 82.95% accuracy in map-ping words to a page. However, this figure makes certainassumptions. First they exclude 32 of the 249 words due totheir extreme noisiness. Including all words, their accu-racy is roughly 72%. Second, they mention that of the 180words they map, 17 are exactly mapped and 163 roughlymapped. In the absence of other information, we are un-able to decide what the term roughly mapped means andwill assume that all 180 words were accurately mapped from

transcript to manuscript. Finally, the results are reported fora single page of handwriting.

There has also been work in the areas of AutomaticSpeech Recognition (ASR) [12] and machine translation [4]on alignment. We note that these problems are somewhatdifferent. For example, in machine translation, the alignmentis between ASCII text in two different languages and addi-

tional constraints in terms of dictionary and grammar areavailable that are not available for word images.

2.2 Optical character recognition (OCR)

The document recognition community [2] has done researchinto aligning transcripts with machine printed documents forthe purposes of creating ground truth.

For example [2] tries to find a geometric transformationbetween the document description and the image of the doc-ument which minimizes a cost function. This technique as-sumes that along with the transcript there is a page descrip-tion that denotes where the words in the transcript appearon the page. The most information that might be availablein existing transcripts of historical documents is where linebreaks occur. This limited information does not appear to besufficient to make use of the algorithm proposed.

Another technique was proposed by Ho and Nagy [1].Their proposed algorithm uses a predefined lexicon to helprecognize characters. Ho and Nagys algorithm is to seg-ment a printed page into individual characters and clustereach of the segments. After clustering, character labels areassigned to the clusters by finding mappings that maximize av/p ratio. The v/p ratio measures how well a set of mappingsmatches the lexicon. This technique is not directly appli-

cable to our task because in general segmenting individualcharacters from handwritten manuscripts is very difficult.However, the idea of using the word-level language modelfrom the transcripts to make assignments is appealing andsimilar to that of [4] which we use in this paper.

3 Problem definition and notation

Given a digitized image of a page Di (the set of all pages isdenoted by D) we generate a segmentation (Di ) that pro-duces a vector of word images {b0, b1, . . . , bM}. For clarity,

a segmentation actually produces bounding boxes for a dig-itized image, the pixels within a bounding box comprise aword image. We also have a transcript Ti that is a vectorof ASCII words {w0, w1, . . . , wN} for each page. For eachbm (Di ) we wish to select a set Wm of words from thetranscript (Wm Ti {}) such thatWm contains the ASCIIequivalent to what is represented by the word imagebm . Anexample of a handwritten page and a perfect alignment forthe page is shown in Fig.1.

When performing alignment we can view a segmenteddocument (Di ) as containing multiple lines. Transcripts,however, might not contain such line breaks. In general,


3/14

Further explorations in text alignment with handwritten documents 41

Fig. 1 Handwritten page and perfect alignment

when we refer to (Di ), we view the entire document as

one long line. This is accomplished by placing each suc-cessive line at the end of the previous line. For example, ifwe have two lines {b1, . . . , bn}and {bn+1, . . . , bm}. We ad-

just every bounding box in{bn+1, . . . , bm}to have the samebaseline (y-coordinate) as the first line ({b1, . . . , bn}) and ad-

just the starting x-coordinate of each box in the second lineby adding thex-coordinate of the end of image bn .

Sometimes transcripts will have line break informa-tion. In this case, it is useful to remove the abstractionof a single long line and refer to specific lines. We de-note this as l((Di )) where l indicates that we are in-terested in only the bounding boxes on the lth line. Sim-ilarly l(Ti ) denotes we are interested only in the ASCII

words on the corresponding line l of the transcript. |(x)|gives the count of lines in either transcript or segmentationdata.

4 Data

Our data consisted of 100 digitized pages from GeorgeWashingtons archive. For each page we have two differenttypes of segmentations with annotations and a line alignedtranscript.

Table 1 Number of bounding boxes and lines in our evaluation data

Segmentation Number of boxes Number of lines

Automatic(auto) 25,213 3,379

Manual(hand) 24,671 3,425

4.1 Segmentation ()

The segmentation produces a list of bounding boxes thatwhen applied to the image should isolate all the pixels thatare part of a single word. For each bounding box we havethe coordinates that defines a rectangle and an indicator ofthe line in the digital image the bounding box occurs on.

The two different types of segmentation are described below.Figure2shows each type of segmentation. Table 1containsthe number of boxes and lines in the segmentations for the100 pages.

Automatic segmentations (auto) Automatic segmenta-tions are those generated automatically by a programthat is an improved version of [8] described in [7].These segmentations are not perfect and can containfour different types of mistakes:1. Bounding boxes will sometimes be placed around ar-

tifacts on the page that are not real words.


4/14


Fig. 2 An example of automatic and manual segmentation

2. Some words might have no bounding boxes placedaround them.

3. Bounding boxes are sometimes placed around morethen one word (under segmentation).

4. A word can sometimes be split into more then

one bounding box (over segmentation), or only bepartially included in a bounding box.

Manual segmentations (hand) Manual segmentations arecorrections of automatically segmented pages. For eachpage an annotator corrected the automatic segmenta-tions to create a one-to-one and onto mapping of wordsfrom the transcript to bounding boxes. Words in thiscase are strings made from all alphanumeric charac-ters. It is important to note that with manual segmen-tations, alignment is trivial. We only use these segmen-tations for validating different aspects of our system.That is, the manual segmentations are considered ground

truth.

4.2 Annotations(A[(D)])

Annotations consist of vectors of ASCII strings for eachbounding box in a segmentation. These labels provide uswith the true value of the contents of each bounding box,that can be used to evaluate how well or poorly an alignmentalgorithm works.

For manually segmented documents an annotation issimply the ASCII text equivalent of the word in the bounding

box. Automatically segmented pages have a slightly richerrepresentation to account for possible errors in the segmen-tation. For each bounding box that contains one or morewords, the string labels are the exact text that is locatedwithin the bounding box (if a bounding box only covers part

of a word, only the part covered is included). If a boundingbox only contains part of a word, then in addition to exactlywhat is contained inside the box, we also record the com-plete word that was split by the box.

4.3 Transcripts (T)

A transcript is an ASCII text file consisting of text that corre-sponds to a specific page. Each file is aligned in parallel, onthe line level, with the two different segmentations above. Atranscript for a document is the same thing as an annotationfor a hand segmented document image with some additional

punctuation. It contains an exact match for the text in thedocument image.

Figure 3 contains an example transcript for the threelines contained in Fig.2.

Fig. 3 Example transcript


5/14


6/14


are equal if they are the same length and all correspondingcharacters are equal. So

exact(bj) =

1 |Sj| = |Wj|,i : {1 i |Sj|}si= wi

0 otherwise (3)

Exact matching is very strict. For a perfect score, it re-

quires algorithms to not only give a reasonable alignment,but to trim words from the transcript to fit poorly segmentedwords and split words if a segmentation splits the word.This type of measure is probably best used when evaluat-ing alignments for use as training data for other retrievalmethods.

6.2 Edit distance matching (ED)

Exact match is a rigorous evaluation measure, and mightnot be suited to all applications of the alignment algorithm.We therefore propose a more relaxed definition of what

it means to get an alignment for a bounding box correct.If we concatenate the strings in both our annotation fora bounding box and the aligned text for the box we canthen use the value returned by Eq. (4) for the two result-ing strings to judge if a bounding box has the correct text init.

ED(s1, s2) =

1 max(|s1|, |s2|)/2> ED(s1, s2)

0 otherwise (4)

where ED(s1, s2) is the edit (Levenshtein) dis-tance [6] between the two strings (s1 and s2). Theedit distance between two strings is given by therecurrence:

ED(, ) = 0

ED(s, ) = ED(, s) = |s|

ED(s1+ c1, s2+ c2) = min

ED(s1, s2)+(c1, c2),

ED(s1+ c1, s2)+1,

ED(s1, s2+ c2)+1

where c1, c2 are characters, s is a string and (c1, c2) re-turns zero if the characters are equal and 1 otherwise.Edit distance matching is more relaxed then exact match-

ing. By counting bounding boxes as correct if the wordsmostly match (the edit distance is less than half of the max-imum of the lengths of the strings which are compared),it better reflects the case of using alignments for direct re-trieval. It also give a little bit of leeway in case of an-notation and transcript discrepancies caused by typograph-ical errors in the creation of either set. So if we define({st1, . . . , stn}) to be the concatenation of a set of stringsthen

ED(bj) = ED((Sj), (Wj)) (5)

6.3 Precision-recall and F -measure(Precision,Recall,F-measure)

Recall and precision are measure commonly used in the in-formation retrieval domain. We can extend them to align-ment evaluation by calculating each of the metrics on abounding box level. Precision is then defined as:

precision(Sj, Wj) =|SjWj|

|Wj|(6)

(the proportion of the words in the assignment that matchthe annotation) and recall as:

recall(Sj, Wj) =|SjWj|

|Sj|(7)

the proportion of the words in the annotation that arematched. So Precision(bj) = precision(Sj, Wj) and Recall(bj) = recall(Sj, Wj).

The F-measure is another commonly used metric used

for information retrieval. It was proposed to make compari-son of systems easier by combining recall and precision intoa single number. The general F -measure is defined as:

F-measure(bj) =1

1Precision(bj)

+(1 ) 1Recall

where is a constant that weights either precision and recalldepending on their relative importance in the evaluation. Inour case we use the standard setting of =0.5 so that:

F-measure(bj) =2Precision(bj)Recall(bj)

Precision(bj)+Recall

6.4 Tomai et al. evaluation

The evaluation metric that is used by Tomai et al. [14],is slightly different in flavor then any of our proposedevaluation metrics. Instead of looking at bounding boxesand determining which words are placed correctly within agiven box, they look at each transcript word and determineif the box it is mapped to contains the correct image. Moreformally for each word-box pair(wi , b

autoj ), the mapping is

considered correct ifwi = Sk A[hand(Di )]and

Ytopbautoj

Ytop

bhandk

Ybottom

bautoj

Ybottom

bhandk

Xstart

bautoj

Xstart

bhandk

Xend

bautoj

Xend

bhandk

Their score is calculated as the number of correct map-pings divided by the size of the transcript. After careful con-sideration we believe that Tomai et. als evaluation method


7/14


does not provide a good metric for how we have definedour task. Specifically, the constraints of the evaluation metricthat ensure boxes are placed correctly directly conflicts withour notion of trying to determine which segments containmultiple words. When we integrate our alignment and seg-mentations systems (see Section 9) then this metric wouldbe directly applicable to give us more a more complete eval-

uation of the new system.

6.5 Averaging

For any of the measures above, we can average the evalu-ation in three different ways. The first is over documents:(10).

|D|x=1

|(Dx)|i=1

(bi )

|(Dx)|

|D|(8)

That is, each page Di is weighted equally. Recall that D isthe set of handwritten document, Di is a page, l((Di ))isa line andbi is a word image. We can also weight each lineequally:

|D|x=1

|(Dx)|i=1 (

byi (i (Dx))

(by))|D|x=1|(Dx)|

(9)

or each word image equally:

|D|x=1

|(Dx)|i=1 (bi )

|D|x=1|(Dx)|

(10)

6.6 Stop words and evaluation

Evaluation depends upon the end-goal for our algorithm.We must decide which word alignments are important tous. For instance, when doing retrieval it is not important tomatch stop words correctly (because most retrieval systemsremove them from a query in a preprocessing step). Onthe other hand, when creating ground truth data, it is moredesirable to do well on all words in the document. Wetherefore analyze not only the overall system performance,but the performance after removing stop words from theevaluation as well.

7 Dynamic time warping

Dynamic Time Warping (DTW) is an algorithm for aligningtwo time series by minimizing the distance between them.A time series is a list of samples taken from a signal, orderedby the time that the respective samples were obtained. Forour alignment task, we view each ASCII word in a transcriptand each box in a segmentation as the samples that make upthe two time series we are concerned with.

Fig. 4 Two similar time series aligned via Dynamic Time Warping.The lines between the two time series depict the assignment of corre-sponding points between the two time series

Rather than mapping samples that have the same timeindex to each other, DTW allows for the fact that one timesignal may be warped with respect to the other. An exam-ple of an alignment for two series can be seen in Fig. 4.The name Time Warping is derived from the fact that thisalignment warps the time axes of the two series so thatthe corresponding samples more closely relate to our intu-ition of what a good alignment should be. Intuitively what

this means is for each possible assignment of some wi tobjwe try to determine whether we should move forward inone or both of the time series to make an optimum assign-ment (one that minimizes cost) between subsequent samplepoints. The actual set of positions we can move to in onestep of the algorithm is known as the local continuity con-straint. In [5] we assumed that at each point we could eithermove forward a single step in both the word images ((Di ))and in transcripts (Ti ), or one of them individually. This re-quired that no word or word-image would be left out of thealignment. In this work, we expand the local continuity con-straint to allow for moves of both one and two in any direc-tion (see Fig.5 for a graphical depiction of this constraint).

Intuitively, this new constraint relaxes the original constraintby allowing the algorithm to skip a box, word, or both in theassignment process. This has the ability to aid in alignmentby possibly detecting if a word was never segmented or ifa word image contains garbage. Such occurrences are rare.

Fig. 5 A graphical depiction of the new local continuity constraint forDTW (). To find the minimum cost path for word-image i and a tran-script word jthe algorithm examines previous points along the paththat are a maximum of two units away along either axis


8/14


For example, less than 1% of the words in a page are missedfor the word segmentation algorithm used here [7]. Consec-utive occurrenceswhich would require two skipsare evenrarer and hence are not considered.

Let the DTW cost between two time seriesb1. . . bMandw1. . . wN be DTW(M,N). DTW(M,N) is calculated us-ing the following recurrence relation:

DTW(i, j) = min

DTW(i, j 1)+1

DTW(i 1, j)+2

DTW(i 1, j 1)+3

DTW(i, j 2)+4

DTW(i 2, j)+5

DTW(i 2, j 2)+6

DTW(i 1, j 2)+7

DTW(i 2, j 1)+8

+d(bi , wj) (11)

whered(bi , wj)is our sample-wise cost measure:

d(bi , wj)=

||k=1

k k(bi , wj) (12)

k(b, w)is thekth word-box cost feature used (see Sect.7.2)and kis a weight for the feature. The s are costs associ-ated with moving in the given direction of the warp. Thedirectional costs (x) are present to protect against the DTWalgorithm skipping as many points as possible to minimizecost. Additionally, the directional costs can be useful in boththe original and new continuity constraints, by biasing thealgorithm to move in a given direction. Both the s and sare determined by training (see Sect. 7.1).

7.1 Training

We used the Downhill-Simplex Algorithm ([9]) for trainingall the weights in our system ( and ). Downhill-Simplexis essentially a form of hill-climbing. It seemed well suitedto our task for two reasons. First, it does not require explicitknowledge of the gradient. Second, it converges relativelyquickly when compared with other learning techniques suchas genetic algorithms. We used the F-measure (see Sect.6)evaluation technique as an objective function for Downhill-Simplex.

An important aspect of DTW is that we constrain how

much each of the time axes can be warped. This has atwofold effect. First, it reduces computation time for the al-gorithm. Second, it disallows large warps. By a large warp,we mean either assigning a single word to a large num-ber of boxes, or a large number of words to a single box.This constraint is known as a global path constraint. Thereare a variety of ways that the global path constraint can beimplemented. We chose to use the SakoeChiba [13] bandconstraint that simply limits how far off the diagonal analignment can move (see Fig. 6). The algorithm must sat-isfy both the global path constraint and the local continuityconstraints. So for example, positions which satisfy the local

Fig. 6 SakoeChiba path constraint with width ron the dynamic pro-gramming table. (1, 1) indicates the first point evaluatedin the dynamicprogram.(|Ti |, |(Di )|)is the final point evaluated in the dynamic pro-gram (i.e. the end of both the text transcript and the segmentation).Only points within the shaded regions are evaluated in the dynamicprogram

continuity constraints will be eliminated from consideration

if they lie outside the SakoeChiba band.Pseudocode for the algorithm is given in Fig.7. Assign-ments are made by back tracking through the dynamic pro-gramming table starting at point (|Ti |,|(Di )|) and findingthe preceding minimum point as defined by the recurrence.

7.2 Word-box features

Word-box features are used in calculating the cost of assign-ing a word to a given bounding box in DTW. Any combina-tion of the features listed below can be used when runningdynamic time warping. We used two distinct types of fea-tures. The first type relies on computing scalar features overthe word images and ASCII text. Once we have feature val-ues corresponding to each word in the transcript and imagein the segmentation, we can then calculate the cost of anyword-box pair using a suitable cost measure. In this case kfrom Eq. (12) is defined as cost(fk(wi ), fk(bj))where fk isa feature below and cost is a cost function. There are manypossible cost functions that can be used. [5] determined thatin general an absolute difference (cost(x,y) = |x y|)works best.

Aspect ratio For an imagebwe calculate the aspect ratio asYbottom(b)Ytop(b)

(b) . There are two possible ways to calculate

aspect ratio for text. The first is by rendering the text

in a script font and performing the computation on thebounding box of the rendered word. The second is totake the height of a word to be constant and divide bythe number of characters in the word.

Width For a word image width is calculated as (bj)b(D)(b)

.

Similar to aspect ratio, there are two methods for calcu-lating the width of ASCII words. The first is by renderingall the words and performing the computation on the ren-dered text images. The second is to use character countas the width, and perform the normalization based uponthe total number of characters in the transcript.


9/14


Fig. 7 Pseudocode for DTW (adapted from [15])

Character position We use Eqs. (1) and (2) to computecharacter positions. An alternative to calculating ASCIIcharacter position is to render all the words and use theanalogue of Eq. (2) on the rendered words.

Ascender count Some characters have ascenders that ex-tend above other characters. For instance capital letters,

l and d have ascenders. An estimation technique[3] is used to try to determine the number of ascen-ders for a word image. Characters with ascenders can bedirectly counted for words from the transcript. All val-ues are normalized to be between 0 and 1 with a meanof 0.5.

Descender count Some letters have descenders that extendbelow the baseline. For instance, g and y have de-scenders. The same techniques for finding ascenders isused for finding descenders in images and words. All val-ues are normalized to be between 0 and 1 with a mean of0.5.

The second type of cost feature does not explicitly ex-tract two scalar values that can be compared with a simplecost function. Instead the cost for assigning a given word toan image is more complex. Two features we looked at were:

Stop word matching Stop word matching (SWM) gives afixed penalty if we believe a word image contains astop word (a, the, etc.) and the corresponding ASCIIword is not aligned with the image. We target stop wordsbecause a relatively small number occur with a high fre-quency through out English documents. Our belief of thecontents of a word image is based on trained clusteringof all word images offline. More specifically we have aset of labeled clustersC such thatc C has a label rep-resenting the words in the cluster (i.e. the, a, etc).cis composed of a set of word images. S(wi , bj) is de-fined as follows: if c C such that bj c then ifwi =label(c)add a fixed penalty. Otherwise add zero.Clustering for stop word matching was done as follows:

1. Randomly arrange all word images we wish to clus-ter.2. Using training data, build a cluster for each of the

words we are interested in recognizing. We choosethe most frequently occurring words that comprise50% of the corpus. This is feasible due to the Zipfiandistribution of documents in the English language.

3. Take the next image, bi , to calculate its distancefrom each cluster: Find mincC (dist(bi ,(c))) andargcmincC (dist(bi ,(c))). Where dist is the DTWdistance [10] between the centroid of the cluster,(c), and the image.

4. If the distance in step3is less then a threshold (ob-tained through experimentation) then assign the im-agebi to clustercand update the centroid. Otherwisediscard the example.

5. If there are more images to cluster go to step3.

Word co-occurrence model We attempt to model the co-occurrences between word images and transcript wordsas an additional feature for our DTW algorithm. Weadapt the algorithm proposed by Kay and Roscheisen [4]for our task. The original algorithm is meant for align-ing sentences in a pair of parallel corpora of text in twodifferent languages. To adapt the algorithm for our pur-poses we first need a vocabulary of visterms (a labelingof word images that allows us to refer to similar word

images by the same label) to describe the word images.Creating visterms was accomplished by using untrainedclustering. The untrained clustering algorithm is thesame as that used for SWM with the exception that ifthe word image is not added to an existing cluster, a newcluster is added to the pool of clusters with its centroidset to the value of the word-image. Our visterm vocabu-lary then consists of cluster labels (each label is an arbi-trarily assigned number for each cluster).

A second consideration when adapting the algorithmfor our needs was how to determine sentence boundaries.A priori we have no knowledge of sentence boundaries


10/14


in (Di ) With transcripts we might have the necessarypunctuation but it is possible that some documents mayhave been transcribed without punctuation. Another pos-sibility might be to consider a line as a sentence, but ingeneral as mentioned above transcripts might not haveline boundary information. To solve this problem we saythat every sentence is simply a unigram from either the

transcript (a single word) or the document image (a sin-gle segment).

The algorithm [4] is an iterative process where in eachiteration we make a hypothesis about which words fromthe parallel corpora correspond to one another. We thenuse these assignments to narrow down the choice ofother possible assignments in the next iteration. The al-gorithm works as follows:

1. Enumerate all of the possible assignments of wordsto word images subject to a skew constraint and fixedpoints.

The skew constraint limits the range of possibleassignments by making the assumption that wordsand their corresponding word images are less then acertain distance away. Specifically the constraint en-forces that for a possible assignment of a word wi toa word imagebjthe following must hold|ji | .is the possible skew between the two corpora. Theconstraint is analogous to the global path constraintin DTW.

Fixed points include page boundaries and corre-spondences between words and word images deter-mined in an earlier iteration of the algorithm. Fixedpoints add an additional limitation on possible as-signments. The limiting is achieved by further prun-ing possible assignments subject to the constraintthat an assignment does not cross a fixed point. Forexample, consider a fixed point(x,y)(wherexand yare indices of a word and visterm respectively) thereare no possible assignmentswi , bj wherei xandj > y or vice versa.

2. For each word-visterm pair find the likelihood thatthe word and visterm correspond. Afterwards pruneunlikely word-visterm pairs and create a list sortedby descending correspondence likelihood.

For every word-visterm pair,(lm, vn), wherelm isa word in the lexicon of T andvn is a visterm createdby clustering, calculate the statistical similarity of thepair. Similarity is calculated by:

(lm, vn) =2

NT(lm)+ N(vn)(13)

whereis a count of the number of times the wordand its visterm might co-occur throughout the entirecorpora (as determined by the possibilities that wereenumerated in Step1), andNJ(x)is the frequency ofxin corpus J.

Eliminate all word-visterm pairs which are belowa threshold. In order to find high-quality correspon-dences we use a threshold of standard deviations

from the mean similarity of all word pairs, where is an adjustable parameter of the algorithm.

In addition to the thresholding we impose the re-quirement that all word pairs (lm, vn)satisfy the fol-lowing condition: N(vn) > 1,NT(lm) > 1. Thisconstraint eliminates singleton pairs. This is neces-sary because every singleton word-visterm pair has

a similarity () of 1.0, so by including them one isalmost guaranteed to generate spurious matches.

After calculating and culling word-visterm pairswe group them together by the frequency of words(NT(lm)). In our implementation we had threegroups, one for NT(lm) 35, one for 35 >NT(lm) 10 and one for 10 > NT(lm) The num-bers are estimated by looking at a Zipfian distribu-tion of the words in the transcript. 35 is close to theknee. The 10 is much lower down and is a conserva-tive estimate since some of the corresponding boxesin the word image may be broken up. Within eachgroup word-visterm pairs are sorted by descending

similarity. Intuitively the different groups are a mech-anism for providing different levels of confidence inthe similarity score. For words with high frequencieswe are more confident that our estimates of their sim-ilarities are accurate. In contrast, words with lowerfrequencies are more likely to co-occur as a randomevent. Therefore, the final step in making the list is tosimply order the groups from highest to lowest fre-quency and save the result.

3. Extract possible assignments from the list created inStep2.

Read through the list sequentially, for each word-visterm pair identify the possible assignments from

Step 1 that correspond to the pair. Specifically fora word-visterm pair (lm, vn) find all assignments(wi , bj)such that wi = lm andvn is the visterm forbj. If all the assignments do not conflict (conflictingmeans that an assignment crosses a fixed point as de-fined in step 1) with any previous assignments, keepthe assignments for(lm, vn).

4. Make all the assignments kept in Step3fixed points.5. Iterate through steps1through4until no new entries

are added as alignments in Step3. Once this occurslower. For our implementation we use 1.0 2.0.

After running this algorithm, we have a list of word-

visterm pairs. We then assign penalties using the samemethod from Stop Word Matching for each word imagethat is described by a visterm in the list.

8 Experiments and results

Our experiments consisted of performing alignments foreach transcript with both auto and hand. For each com-bination of transcript and segmented image we tested line-by-line (using line break information) alignment as well as


11/14


Table 2 A comparison of evaluation metrics illustrated via several baselines

Upper bound Linear alignment (from front) Character position ( T)

Exact match 81.6 55.0 45.3Recall 81.9 54.3 45.8Precision 81.9 57.8 48.4F-measure 81.9 55.4 46.4Edit distance 87.9 64 48.9

Table 3 A comparison of rendered and unrendered features using the F -measure

Rendered Unrendered

Line by line Page at a time Line by line Page at a time

Aspect ratio 56.0 35.1 45.9 8.8Character position 62.4 6.8 61.9 6.7Width 62 50.2 61.9 48.8

page at a time alignment (ignoring line break information).We first determined which method (character based or ren-dered) of computing aspect ratio, character position, andwidth provided better features for DTW alignment. We con-tinued by comparing Stop Word Matching and the Word Co-occurrence Model as features. Afterwards, we attempted totrain weights () for each feature in the cost function. Fol-lowing the training of weights we looked at the effects oftraining directional path costs () in both the original con-tinuity constraint (see [5]) and the extended constraint de-scribed in Sect.7. For all experiments we used fivefold crossvalidation on 100 pages. We had separate training runs forcases where we used line break information and those inwhich we ignored line break information.

8.1 Metric comparison

Before discussing our experimental results an examinationof our different evaluation methods is in order. When exam-ining the range of values of the different metrics discussedin Sect. 6 we see some general trends. Experimental dataconfirms our intuition about exact match and edit distancemetrics. Across the board the edit distance metric evaluatesalignments with the highest percentage correct. Also, the ex-act match measurement tends, in general, to give the mini-mum score for an alignment. In addition, precision, recalland the F-measure techniques tend to be fairly close to one

another and end up being someplace within the range of ex-act match and edit distance. For the remainder of the pa-per we report results using the F-measure. The choice ofF-measure stems from two reasons. First, it is fair in thesense that it is a median of our algorithms performance. Itis neither the maximum or minimum measure in any of ourexperiments. Secondly, it encompasses both recall and pre-cision giving a better idea of what can actually happen in thesystem. Table2 shows the different results that occur whenapplying different metrics to three baseline alignment types.Out of the different averaging methods we chose to use boxlevel averaging. The motivation for using this type of av-

eraging is ultimately we care most about how well we canalign the words with the boxes.

8.2 Rendered vs. unrendered features

In order to test which methods for calculating DTW fea-tures, rendered or unrendered, were superior we tried run-ning DTW using only a single feature. For each type of fea-ture we tried both an unrendered and rendered version. Theresults of these runs are summarized in Table 3. It is clearthat rendering text and calculating features based on the ren-dering performs equal to or better than using the simplercharacter based computations. We believe the difference isparticularly pronounced in the case of aspect ratio due to as-

cenders and descenders affecting the height component ofthe measurement significantly.

8.3 Stop word matching vs. co-occurrence model

To evaluate the impact our two features that depend on clus-tering we ran DTW using aspect ratio, width and characterposition as features combined with either stop word match-ing or co-occurrence features. We also performed alignmentruns using both stop word matching or co-occurrence fea-tures and using neither, to evaluate how complementary thetwo features are, and how much impact each provides.

As we can see from Table 4 stop word matching im-proves performance by a higher degree (49.1 vs. 48.8). Butthere is some benefit to using both models (an additional0.4 of accuracy). We believe the overall impact of both fea-tures is low due to poor clustering performance. Intuitivelythe poor clustering also explains why stop word match-ing performs better. Without accurate clustering the co-occurrence algorithm would have a more difficult time find-ing likely matches. However, inaccurate clustering wouldsimply cause some spurious identification of stop words,which can be corrected by the other features included inDTW.


12/14


Table 4 A comparison of stop word matching and the co-occurrence model as features using the F -measure

Line by line Page at a time

Neither stop word or co-occurrence 67.1 48.2Co-ocurrence 67.1 48.8Stop word 67.2 49.1Both 67.2 49.5

Table 5 F-Measure evaluation of basic alignment algorithms on aligning transcripts with automatically segmented pages

Normal Non stop words only


Linear alignment (from front) 46.6 3.6 39.1 1.6Linear alignment (from back) 43.1 7.1 33.3 1.9Character position (T auto) 52.0 7.9 48.3 5.9Character position (auto T) 55.4 8.1 52.4 9.1Upper bound 81.9 81.9 67.6 67.6

Table 6 Results (F-measure) of training on DTW



No training 67.2 55.4 57.7 46.9Feature weight training 67.2 56.3 57.5 47.5Feature and path training 67.2 57.8 57.4 48.7Feature and extended path training 68.3 55.8 56.4 48.1

8.4 ASCII (T) to automatic segmentation (auto)

The results of aligning transcripts with an automatically seg-mented page using the base-line algorithms described inSect. 5 are presented in Table 5. These results are similarwith those presented in [5].

In [5] we noted that when using line break informa-tion the character position feature helps performance butif we do not have the information character position be-comes a hindrance. Keeping this result in mind for therest of our experiments we include all features discussedin Sect. 7.2, including character position when doing lineby line alignment. When doing alignment on pages with-out line break information the same features are used exceptthat we omit character position. Where applicable we ap-ply results from Sect. 8.2 and use rendered estimation forfeatures.

Table 6 summarizes the results of training. Using nor-mal evaluation (evaluating all words) the results indicate

that training only helps the performance of the line by linealignment when we use the extended continuity constraint.

Contrastingly, for alignment without line break informa-tion training we see performance gains using training, butthe smallest gain is realized when using the extended path.Whether we used line break information or not, the perfor-mance on non-stop word alignment seems to be independentof increases in overall system performance.

We also wished to determine to what extent our cluster-ing performance affects our system performance. To deter-mine this we eliminated cluster performance issues by usingperfect clustering (based on box labels). Table7 shows theresults of retraining weights using features based upon theperfect clustering. The results seem to indicate that when wehave line break information, an increase in clustering per-formance will help only a small amount. However, whenaligning entire documents at a time we see a much largerincrease. Intuitively, this makes sense because with line byline alignment we have break points which allow us to restartthe alignment from scratch. However, when aligning a pageat a time stop word matching and the co-occurrence model

serve as pseudo-breakpoints with which that algorithm canin some sense restart itself from scratch.

Table 7 Results (F-measure) of training on DTW using perfect clustering for stop word matching and co-occurrence model





13/14


Table 8 Results (F-measure) of basic alignment algorithms on aligning transcripts with hand segmented pages



Linear alignment (from front) 100.0 100.0 100.0 100.0Linear alignment (from back) 100.0 100.0 100.0 100.0Character position (T auto) 69.8 7.7 79.0 10.2

Character position (auto

T) 84.8 17.9 92.3 23.0Upper bound 100.0 100.0 100.0 100.0

Table 9 Results (F-measure) of training on DTW with hand segmented pages

Normal Non Stop Words Only



8.5 ASCII (T) to manual segmentation (hand

)

When aligning transcripts to hand-segmented pages (seeTable8) we did not retrain any parameters. If we had doneso we would expect that training weights on the path con-straints would have simply forced the algorithm to take thediagonal path on every occasion. DTW as before performsvery well on this task (see Table9). Training enables page ata time alignment to achieve an F -measure of 98.9 accuracy.

9 Conclusion and future work

Our DTW algorithm still outperforms any of the baselinemeasures by a fair margin. Training helps increase this mar-gin slightly more in the case of page at a time alignment.But it seems that we need to augment the model to getfurther system performance. It is possible that a differentlocal continuity constraint than the one presented in this pa-per might help. In addition, different machine learning algo-rithms might be able to find better feature and path weights.More investigation is needed into both of these possibilities.

Our results show that for the page at a time approachperformance increases significantly with improvements inclustering performance. Further investigations into cluster-ing or other methods for recognizing very common words

will help improve our results further. In addition, it wouldbe helpful to start investigation into methods for splittingwords between boxes.

Ultimately, we still foresee the segmentation and align-ment system working as an iterative process where each it-eration refines the output, until no changes occur.

Further areas of research exist in trying to leverage im-perfect transcripts of documents. For instance, it might bemore expedient to read historical documents out loud andhave an automatic speech recognition (ASR) system pro-duce an ASCII transcript. Of course, ASR is not perfect andwill introduce errors in the transcript. Developing algorithms

to deal with the noisiness from both transcripts and segmen-tations will be even more challenging than the problem ad-dressed in this paper.

Another challenging task to be addressed in the areaof alignment is non-standard documents. For instance, it isnot clear that our techniques that assume documents consistof prose, will also adapt to mathematical formulas anddiagrams.

Acknowledgements This work was supported in part by the Centerfor Intelligent Information Retrieval and in part by the National Sci-ence Foundation under grant number IIS-9909073. Any opinions, find-ings and conclusions or recommendations expressed in this materialare the authors and do not necessarily reflect those of the sponsor.

References

1. Ho, T., Nagy, G.: OCR with no shape training. In: Proceedings of15th ICPR, pp. 2730. Barcelona (2000)

2. Hobby, J.D.: Matching document images with ground truth. Int. J.Doc Anal. Recognit.1(1), 5261 (1998)

3. Kane, S., Lehman, A., Partridge, E.: Indexing George Washing-tons handwritten manuscripts. Technical Report MM-34, Centerfor Intelligent Information Retrieval. University of MassachusettsAmherst (2001)

4. Kay, M., Roscheisen, M.: Text-translation alignment. Comput.Linguist.19(1), 121142 (1993)

5. Kornfield, E.M., Manmatha, R., Allan, J.: Text alignment with

handwritten documents. In: Proceedings of DIAL, pp.195211.Palo Alto, California (2004)

6. Levenshtein, V.I.: Binary codes capable of correcting spurious in-sertions and deletions of ones. Russian Problemy Peredachi Infor-matsii1, 1225 (1965) (Original in Russian. English translation inProblems of Information Transmission1, 817 (1965))

7. Manmatha, R., Rothfeder, J.: A scale space approach for automat-ically segmenting words from historical handwritten documents.IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 12121225 (2005)

8. Manmatha, R., Srimal, N.: Scale space technique for word seg-mentation in handwritten documents. In: Scale-Space Theories inComputer Vision pp. 2233 (1999)

9. Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: NumericalRecipes in C. Cambridge University Press, Cambridge, UK (1993)


14/14


10. Rath, T., Manmatha, R.: Word image matching using dynamictime warping. In: Proceedings of CVPR-03, vol. 2, pp. 521527.Madison, WI (2003)

11. Rath, T.M., Lavrenko, V., Manmatha, R.: A statistical approach toretrieving historical manuscript images without recognition. Tech-nical Report MM-42, Center for Intelligent Information Retrieval,University of Massachusetts Amherst (2003)

12. Roy, D.K., Malamud, C.: Speaker identification based text to au-

dio alignment for an audio retrieval system. In: Proceedings ofICASSP 97, pp. 10991102. Munich, Germany (1997)

13. Sakoe, H., Chiba, S.: Dynamic programming optimization for spo-ken work recognition. IEEE Trans. Acoust. Speeh Signal Process.26, 623625 (1980)

14. Tomai, C.,Zhang,B., Govindaraju, V.: Transcript mapping for his-toric handwritten document images. In: Proceedings of the 8th In-ternational Workshop on Frontiers in Handwriting Recognition,pp. 413418. Niagara-on-the-Lake, ON (2002)

15. Triebel, R.: Automatische erkennung von handgeschriebenenworten mithilfe des level-building algorithmus. Masters thesis,Institut fur Informatik, alber-Ludwigs-Universtat Freiburg (1999)(in German)

text alignment

Documents