
User-Assisted Alignment of Arabic Historical Manuscripts · 2011-09-04 · User-Assisted Alignment of Arabic Historical Manuscripts Abedelkadir Asi Irina Rabaev Klara Kedem Jihad


User-Assisted Alignment of Arabic Historical Manuscripts

Abedelkadir Asi Irina Rabaev Klara Kedem Jihad El-Sana

Ben-Gurion University of the Negev, Beer Sheva, Israel

{abedas,rabaev,klara,el-sana}@cs.bgu.ac.il

ABSTRACT
This work aims to simplify the tiresome manual comparison of two similar Arabic historical manuscripts. We developed a system that determines the differences between two manuscripts by comparing their components, while ignoring page breaks and the different warping of consecutive rows; i.e., we treat each manuscript as one long row of components. We compare two components (blocks of pixels) by extracting features from the columns of their bounding rectangles. We adopted the edit distance, computed using dynamic time warping (DTW) in the feature domain, to measure the similarity between components. The user selects the region to align in the two manuscripts, and the system returns its alignment with visual cues that indicate the distance between the aligned components. In its current implementation, our system provides good results and requires less interaction for good-quality manuscripts that do not include touching components. We tested our system on Arabic manuscripts of various qualities and obtained encouraging results.

Keywords
Historical manuscript; Handwritten manuscript alignment; Keyword spotting; Keyword searching

1. INTRODUCTION
Millions of documents were written in Arabic script between the seventh and fourteenth centuries. It has been estimated that 7-10 million documents, on various subjects, have survived the years and are stored in libraries, museums, and private collections. Before such a historical manuscript is published, it should be revised, approved against an original copy, and edited. This process is incredibly time-consuming and requires highly educated professionals, mainly because multiple copies of the same handwritten manuscript exist. Some of these manuscripts were copied by professional writers, but others were simply copied by scholars or students who sought a copy for themselves. When revising a manuscript it is essential to locate all the copies, compare them, and determine the original version. The rate at which manuscripts are revised reflects the complexity of the process: over the last century, fewer than 250 thousand manuscripts were revised and edited [1].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
HIP '11, September 16 - September 17 2011, Beijing, China
Copyright 2011 ACM 978-1-4503-0916-5/11/09...$10.00.

Figure 1: Two pages of a historical manuscript and a transcription of the same title.

The advances in digital scanning and electronic storage have driven the digitization of historical documents for preservation and analysis of cultural heritage. This development simplifies access to historical manuscripts and accelerates the search for the various copies of a manuscript. Nevertheless, comparing these copies word by word and determining and analyzing the differences between them consumes expensive scholars' time. These manuscripts are textually identical in large fractions and differ in small portions. The differences appear in several patterns: the copyist replaced individual words with synonyms common in his region, or inserted or deleted (did not copy) complete sentences. While deleting sentences is rare, adding sentences is common, e.g., to explain ideas.

Several approaches have been developed to align handwritten manuscripts to their transcriptions [2, 3, 4, 5, 6, 7] (see Section 2). However, a transcription in textual format is not always available, and there is a need to compare the available handwritten manuscripts, which are represented as sets of images. These manuscripts are very similar: large fractions are textually identical, and they differ only in small regions.


In this work we aim to simplify the procedure of comparing two similar manuscripts. We are interested in determining the regions that include the same words in the two manuscripts and those that include different words. The naive solution would be to convert the document images into text and perform the comparison at the text level. However, this approach cannot produce acceptable results, as success in off-line handwriting recognition has been limited to domains with a small vocabulary, such as mail sorting and cheque processing [8]. In addition, the degraded quality of historical documents dramatically reduces the recognition rates of traditional Optical Character Recognition (OCR) algorithms to practically unacceptable levels.

We developed a semi-automatic, user-assisted system that compares the images of two manuscripts and determines the differences between them. Our system compares the connected components of the two manuscripts, while ignoring page breaks between consecutive pages and the different warping of consecutive rows; i.e., we consider each manuscript as one long row of components. We perceive a sequence of components as a string, where each letter represents a component, and apply DTW to determine the longest common substring. We then analyze the matching matrix to determine the inserted, deleted, or substituted components. This framework depends heavily on the accuracy of the function that computes the distance between the images of two components.

The image quality of historical manuscripts plays a major role in the ability to extract their constituent rows and words correctly. Our system provides good results and requires less interaction for good-quality manuscripts that do not include touching components. The comparison of the images of two components is performed by extracting and comparing features from the two images, similar to the approach proposed by Kornfield et al. [9].

In the rest of the paper, we briefly overview closely related work, present our approach in detail, report experimental results, and finally conclude and suggest directions for future work.

2. RELATED WORK
Pattern alignment has been studied in various fields, such as speech and handwriting recognition and sequence alignment in bioinformatics [10, 3, 11]. Its complexity depends on the structure and the length of the observed sequences. In handwriting recognition it is still an open issue [5]. Alignment of handwritten documents is closely related to word spotting and keyword searching, and they often share the underlying component matching procedure [9]. Next we briefly overview related work in keyword searching and word spotting.

Keyword-searching algorithms search through a collection of document images for a pictorial representation of a keyword without considering its textual representation. Word spotting clusters similar words into groups, to which textual representations are assigned and used to index the document. The unavailability of reliable OCR algorithms for handwritten historical documents makes the word-spotting approach a practical alternative [12].

Dynamic Time Warping (DTW) has become a prevalent technique in word-spotting algorithms, as it appears to provide better results [13]. Manmatha and Rath [14, 8] generated and analyzed sets of feature vectors for word images, which are compared using DTW. Rothfeder et al. [15] adopted an algorithm that matches words according to the correspondences of points of interest in their representative images. Srihari et al. [16, 17] utilized global word shape features, such as stroke width, slant, and gaps between words, to measure similarity. Ntzios et al. [18] classified characters based on the protrusile segments and the topology of the character skeletons. Konidaris et al. [19] showed that a combination of synthetic data and user feedback leads to improved performance for keyword-guided word spotting. Gatos and Pratikakis [20] used a template matching process to compare image blocks that consider various step-wise transforms. Saabni and El-Sana [21] extracted features from the boundary contours of continuous words and applied DTW to measure the distance between them. Word spotting was also addressed in several works as a learning problem. Rath et al. [22, 23] proposed a probabilistic classifier that was trained using discrete feature vectors extracted from different word images. Lavrenko et al. [24] classified word images, in a holistic manner, using representative features and a Hidden Markov Model. Gatos et al. [25] presented a segmentation-free algorithm for typewritten keyword search in Greek historical documents.

Several approaches have been developed to align handwritten manuscripts to their transcriptions. Tomai et al. [2] segmented handwritten documents into text lines, generated different word segmentations for each line, and selected the best alignment between the words of the transcription and the result of a word recognizer. Kornfield et al. [3] segmented handwritten documents into lines and words and extracted features from the segmented words and the transcription words. They then applied DTW to compute the alignment. Rothfeder et al. [4] aligned ASCII text to words segmented from handwritten documents. To overcome the unreliability of the word segmentation procedure, they used an HMM-based algorithm for the alignment. Indermuhle et al. [5] used an HMM-based handwriting recognizer that accepts complete text lines, which are mapped to their counterparts in the printed version. Zinger et al. [26] automatically segmented and manually transcribed text lines from the historical documents to generate training data for their word recognizer. They aligned segmented words to the transcription based on the longest spaces between portions of handwriting and the relative word length. Huang and Srihari [6] applied word recognition using a small lexicon, which is reduced by utilizing the information provided by the transcript. The recognition results are aligned using a dynamic programming algorithm. Lorigo and Govindaraju [7] extended DTW to true distances when mapping multiple entries from two different sequences and concurrently mapped elements of a partially aligned third series within the main alignment.

These approaches map words or lines segmented from a handwritten document to their transcription. However, the transcription is not always available, and there is a need to compare the available manuscripts, which are mostly very similar. In this paper we aim to simplify the alignment of two historical documents.


Figure 2: The performance of individual features (top) and combination of features (bottom).

3. OUR APPROACH
We developed a semi-automatic approach to align the images of two manuscripts and determine the differences between them by comparing their pages one after the other, while ignoring page breaks. We measure the difference in textual content between two pages (images) by extracting the rows in each page image and comparing them component by component, while ignoring the different text line warps.

Historical documents appear in various qualities and usually suffer from a range of artifacts, such as faded ink, stained paper, dirt, holes, and broken or smeared characters. The image quality of these manuscripts directly affects the accuracy of word extraction. In this work we address Arabic manuscripts that do not include touching components among their constituent words. The core procedure of our algorithm is to compare the images of two text components, which is performed by extracting features from the columns of the two images and comparing them.

3.1 The Matching Algorithm
Numerous research efforts have been devoted to the development of various string matching algorithms in the text domain [27]. The comparison of two text rows in the image domain adds another dimension of complexity because of the lower level of confidence in measuring the distance between two image blocks. In addition, it is not always possible to extract individual letters, especially in inherently cursive scripts such as the Arabic script, which makes the sub-word, the word, or even the entire row the lowest possible granularity. To measure the distance between two components in the image domain, we extract a feature vector from each pixel column and compare the two arrays of feature vectors.
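As an illustration of per-column feature extraction, the following minimal sketch computes one small feature vector for every pixel column of a binarized component. The three features used here (ink count per column, gradient of the lower profile, and vertical extent between the upper and lower profiles) are assumed, simplified definitions chosen for illustration; the function name `column_features` and the 0/1 list-of-rows image format are hypothetical, and the exact feature definitions used in our system may differ.

```python
def column_features(component):
    """Return one feature tuple per pixel column of a binary component.

    `component` is a list of rows, each a list of 0/1 values (1 = ink).
    The feature definitions below are illustrative assumptions only.
    """
    height = len(component)
    width = len(component[0]) if height else 0
    features = []
    prev_lower = None
    for x in range(width):
        # Row indices (top to bottom) that contain ink in this column.
        ink_rows = [y for y in range(height) if component[y][x]]
        if ink_rows:
            upper, lower = ink_rows[0], ink_rows[-1]
            extent = float(lower - upper + 1)  # lower profile minus upper profile
        else:
            lower = -1       # sentinel for an empty column (assumption)
            extent = 0.0
        vertical_profile = float(len(ink_rows))  # amount of ink in the column
        # Change of the lower profile relative to the previous column.
        lower_gradient = 0.0 if prev_lower is None else float(lower - prev_lower)
        prev_lower = lower
        features.append((vertical_profile, lower_gradient, extent))
    return features


if __name__ == "__main__":
    tiny = [
        [0, 1, 0],
        [1, 1, 0],
        [1, 0, 0],
    ]
    for column in column_features(tiny):
        print(column)
```

Each component then becomes a sequence of such tuples, one per column, which is the input expected by a sequence-alignment method.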

We have studied several features that appear in the literature [8, 14, 21, 28], and found that the Vertical Profile (VP), Lower Gradient (LG), and Difference between Upper and Lower Profile (DULP) features provide the best performance on handwritten Arabic historical manuscripts in our datasets. Figure 2 shows the performance of each individual feature and of the combined features. A feature vector defines a symbol, $\omega$, and the feature vectors define a set of symbols, $\Sigma$. A sequence of feature vectors represents a string $s$ in $\Sigma^*$. The distance between two strings, the edit distance, is computed using dynamic time warping [29] and is based on Equation 1, where insertion, deletion, and substitution have the same cost.

$$ed(i, j) = \min\{\, ed(i, j-1),\ ed(i-1, j),\ ed(i-1, j-1) \,\} + cost_{i,j} \tag{1}$$

We compute the distance matrix, which is used by the dynamic time warping, by taking the distance $d_{ij}$ between two feature vectors $v_i$ and $v_j$ as the norm of their difference; i.e., $d_{ij} = \|v_i - v_j\|$. This is similar to the approach proposed by Rath and Manmatha [14], except that we take the norm of the difference instead of the squared norm, as it appears to provide better results on our datasets.
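The recurrence of Equation 1 with the norm-based cost can be sketched as follows. This is a minimal illustration under stated assumptions rather than our actual implementation: the function names `vector_distance` and `dtw_distance` are hypothetical, the norm is taken to be Euclidean, and no normalization or band constraint is applied.

```python
import math


def vector_distance(u, v):
    """Euclidean norm of the difference of two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of feature vectors.

    Implements ed(i, j) = min(ed(i, j-1), ed(i-1, j), ed(i-1, j-1)) + cost(i, j),
    where cost(i, j) is the distance between the i-th and j-th vectors.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # ed has an extra border row and column so that index 0 means "nothing consumed".
    ed = [[inf] * (m + 1) for _ in range(n + 1)]
    ed[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = vector_distance(seq_a[i - 1], seq_b[j - 1])
            ed[i][j] = cost + min(ed[i][j - 1], ed[i - 1][j], ed[i - 1][j - 1])
    return ed[n][m]


if __name__ == "__main__":
    a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    b = [(0.0, 0.0), (2.0, 0.0)]
    print(dtw_distance(a, a))
    print(dtw_distance(a, b))
```

Because all three predecessors are charged the same local cost, insertion, deletion, and substitution are weighted equally, as stated for Equation 1.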

Figure 3: Processing word-by-word; $S_{org}$ is the original text and $S_{ins}$ is the inspected text. (The figure shows the strings $a_0, a_1, \ldots, a_n$ and $b_0, b_1, \ldots, b_n$, with the processed symbols, the following $k$ symbols, and the pointers $lp_{org}$ and $lp_{ins}$ marked.)

It is often the case that one of the manuscripts is the original, or was approved against an original copy, and the other manuscript is to be inspected with respect to it. To simplify the discussion, we adopt this terminology and refer to one of the manuscripts as the original and to the other as the inspected, denoted by the subscripts org and ins, respectively. We seek to determine the components in the original manuscript that were removed or substituted and those that were added (inserted) to the original manuscript to reach the inspected one. We also assume that the number of consecutive insertions, deletions, or substitutions is bounded by a given $k$.

3.2 Component-by-Component Matching
Since we compare the manuscripts component by component, we can perceive each manuscript as a sequence of components. Let these components be letters in $\Sigma$, so that a sequence of components is a string $s$, where $s \in \Sigma^*$.

Our matching algorithm aims to find the best match between the letters of the two strings, and to mark the letters in the original string that do not have a match in the inspected string as removed letters and those in the inspected string that do not have a match in the original string as inserted letters. In addition, we need to indicate the letters in the original manuscript that were replaced by different letters in the inspected manuscript.

Figure 4: The minimal path spanned by (a) substitution, (b) one insertion, and (c) one deletion operation, respectively. (Each panel plots the path over axes labeled "Original words" and "Inspected words", indexed 0 to 7.)

The processed components in the inspected manuscript may appear at most at position $k$ in the original string, since we assumed insertion, deletion, or substitution of at most $k$ consecutive letters. For that purpose we consider $k$ letters at a time from the two strings $S_{org} = a_0, a_1, \ldots, a_{n-1}$ and $S_{ins} = b_0, b_1, \ldots, b_{m-1}$, which represent the original and inspected strings, respectively. Consider a general step of our algorithm, in which we have processed $i-1$ letters from the string $S_{org}$ and $j-1$ letters from the string $S_{ins}$, as shown in Figure 3. We first compute the distance matrix, $D$, between the following $k$ letters in the two strings; this is also the matrix we use to compute the edit distance between the two substrings of size $k$.

We use DTW to compute the edit distance between the two substrings and analyze the spanned path. The structure of this path is used to infer the applied operation (delete, insert, or substitute) and to determine the progress of the two pointers $lp_{org}$ and $lp_{ins}$ that mark the start of the processed component sequences.

To analyze the structure of the path, let us consider the behavior of the minimal path. Let us first consider the trivial case, where the two strings are identical and the letters in each string are different from each other. Since the edit distance between two identical letters is zero, this configuration generates a perfect diagonal line, as shown in Figure 4(a). Next we analyze the structure of the path for each of the operations (deletion, insertion, and substitution) at one position, while keeping the rest of the sequence intact. For an operation that takes place at the $i$-th letter, DTW matches the identical prefixes of the strings; i.e., it matches $org_j$ to $ins_j$, where $j = 0, \ldots, i-1$.

Substitution: The $i$-th letter $org_i$ in the original manuscript was replaced by the letter $\omega$ ($ins_i$) in the inspected manuscript. There are three possibilities for the minimal path to manifest this substitution:
(a) $\ldots, d_{i-1,i-1}, d_{i,i}, d_{i+1,i+1}, \ldots$
(b) $\ldots, d_{i-1,i-1}, d_{i-1,i}, d_{i,i+1}, d_{i+1,i+1}, \ldots$
(c) $\ldots, d_{i-1,i-1}, d_{i,i-1}, d_{i+1,i}, d_{i+1,i+1}, \ldots$
where $d_{r,c}$ are the entries of the distance matrix, $D$. The path taken is determined by the minimal edit distance between the corresponding letters; i.e., $ed(org_{i-1}, \omega)$, $ed(org_i, \omega)$, and $ed(org_{i+1}, \omega)$ (recall that these letters represent connected components).

Insertion: The insertion of the letter $\omega$ at the $i$-th position in the original manuscript. Let us assume without loss of generality that $\omega$ is not similar to any letter in the inspected manuscript. DTW matches the identical suffixes of the two strings (i.e., it matches $org_{j+x}$ to $ins_{j+1+x}$, where $x = 0, \ldots, k-j$), and the generated minimal path, which is shown in Figure 4(b), is: $\ldots, d_{i-1,i-1}, d_{i-1,i}, d_{i,i+1}, \ldots$

Deletion: The deletion of the letter $ins_j$ from the $j$-th position guides the DTW to match the identical suffixes of the two strings, and the generated minimal path, which is shown in Figure 4(c), is: $\ldots, d_{i-1,i-1}, d_{i,i-1}, d_{i+1,i}, \ldots$

The $lp$ pointers, which point to the head of the compared substring in each of the strings, are incremented according to the structure of the minimal path: we increment both pointers by one for equal or substituted letters, increment $lp_{org}$ for deleted letters, and increment $lp_{ins}$ for inserted letters. The high-confidence matches are used as anchors to resolve the medium-confidence matches; e.g., a sequence of high-confidence values with one weak match between two letters (components) in between can rule out insertion and deletion, as shown in Figure 7.
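The path analysis and pointer updates described above can be sketched as follows. This is a simplified illustration under several assumptions: rows of the distance matrix index the original components and columns index the inspected ones, ties during backtracking prefer the diagonal, and a single hypothetical `equal_threshold` separates equal from substituted letters. The names `dtw_path`, `classify_steps`, and `advance_pointers` are introduced here for illustration only.

```python
def dtw_path(dist):
    """Minimal-cost warping path through `dist` as a list of (row, col) cells."""
    n, m = len(dist), len(dist[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = dist[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # Backtrack from the last cell; on ties the diagonal step is preferred.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return path


def classify_steps(path, dist, equal_threshold):
    """Label each path cell as equal, substitute, insert, or delete."""
    ops, prev = [], None
    for r, c in path:
        if prev is None or (r == prev[0] + 1 and c == prev[1] + 1):
            # Diagonal step: both strings advance; the distance decides the label.
            ops.append("equal" if dist[r][c] <= equal_threshold else "substitute")
        elif r == prev[0]:
            ops.append("insert")  # only the inspected index advanced
        else:
            ops.append("delete")  # only the original index advanced
        prev = (r, c)
    return ops


def advance_pointers(ops):
    """Return how far lp_org and lp_ins move for a list of operations."""
    lp_org = lp_ins = 0
    for op in ops:
        if op in ("equal", "substitute"):
            lp_org += 1
            lp_ins += 1
        elif op == "delete":
            lp_org += 1
        else:
            lp_ins += 1
    return lp_org, lp_ins


if __name__ == "__main__":
    original, inspected = "abcd", "abXcd"
    dist = [[0.0 if x == y else 1.0 for y in inspected] for x in original]
    ops = classify_steps(dtw_path(dist), dist, equal_threshold=0.5)
    print(ops, advance_pointers(ops))
```

In the toy example, single characters stand in for connected components and a 0/1 distance stands in for the DTW distance between component images.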

4. EXPERIMENTAL RESULTS
We implemented our approach and performed various tests on different datasets. We adopted the method recently developed by Saabni and El-Sana [30] to extract text lines from the compared documents and generate a set of images that represent the rows of the documents. This line extraction method, which is based on the seam carving framework [31], computes an energy map of the input text block image and determines the seams that pass across text lines.

Measuring the distance $ed(w_i, w_j)$ between the images $w_i$ and $w_j$ is performed in the feature domain and quantifies the similarity of the two images. To determine whether two images are "equal" or not, we normalize the edit distance and use two thresholds, lower and upper. Distance values below the lower threshold indicate high confidence, values above the upper threshold indicate no confidence, and values in between indicate medium confidence. The user can either indicate the furthest possible match $k$ or select anchor word-parts. In the second scheme the user selects the start and end word-parts in the source manuscript and their counterparts in the inspected manuscript, and the system searches for the best alignment of the two sub-manuscripts.

Figure 5: The results with visual clues of comparing handwritten-to-handwritten manuscripts.

Figure 6: The results with visual clues of comparing handwritten-to-printed manuscripts.

Modification    Frequency (%)   Precision (%)   Recall (%)
Similar         92-97           80-95           75-90
Substitution    2-5             60-85           65-85
Insertion       1-2             80-98           70-85
Deletion        0-1             80-98           60-80

Table 1: The performance of our algorithm for comparing handwritten and printed transcriptions for each modification category.

We have also added a visualization tool that simplifies locating the differences between the two sets. We use color coding, which is based on the values of the edit distance, to mark the regions on the inspected manuscript. Low, high, and medium confidence are marked with green, blue, and yellow color levels, respectively, and inserted components are marked in red.
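The two-threshold decision that underlies this marking can be sketched as follows. This is a minimal illustration: the function name `confidence_level` and the returned labels are hypothetical, the input is assumed to be an already normalized distance, and the treatment of values that fall exactly on a threshold (counted as medium here) is an assumption.

```python
def confidence_level(normalized_distance, lower, upper):
    """Map a normalized edit distance to a confidence label.

    Values below `lower` give high confidence in a match, values above
    `upper` give no confidence, and everything in between is medium.
    """
    if not 0.0 <= lower <= upper:
        raise ValueError("thresholds must satisfy 0 <= lower <= upper")
    if normalized_distance < lower:
        return "high"
    if normalized_distance > upper:
        return "none"
    return "medium"


if __name__ == "__main__":
    # Threshold values chosen only for demonstration.
    for value in (0.1, 0.5, 0.9):
        print(value, confidence_level(value, lower=0.3, upper=0.8))
```

A visualization layer could then assign one color per label, so that adjusting the two thresholds immediately changes which components are highlighted.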

Many original manuscripts have been transcribed, and it is possible to obtain their printed (usually hard-copy) versions, which are easy to binarize and whose components are easy to extract. For these manuscripts, we align a printed manuscript with a handwritten one. We experimented with several manuscripts with an available transcription and obtained encouraging results. Table 1 summarizes the results of aligning a handwritten manuscript with a printed transcription, which may not be identical. The Frequency column shows the difference between the historical manuscript and the available printed manuscript in each category. The Precision and Recall columns report the performance of our algorithm in each category. These results were obtained with lower and upper thresholds in the ranges [0.4, 0.6] and [0.65, 0.8], respectively.

Figure 7: The manifestation of the substitution (top) and insertion (bottom) modifications on the minimal path.

Modification    Frequency (%)   Precision (%)   Recall (%)
Similar         92-97           80-98           80-90
Substitution    2-5             50-75           50-70
Insertion       1-2             80-95           60-80
Deletion        0-1             80-95           60-80

Table 2: The resulting alignment of manuscript pairs (each pair includes different versions of the same title) for each modification category.

For most original manuscripts it is not possible to obtain a transcribed copy, which dictates handwritten-to-handwritten alignment. Figures 5 and 6 present the results of our system when comparing handwritten-to-handwritten and handwritten-to-printed manuscripts, respectively. These documents do not include touching components, so we were able to extract their constituent components. Table 2 summarizes the results of comparing two similar handwritten manuscripts. The Precision and Recall columns report the performance of our algorithm in each modification category. These results were obtained with lower and upper thresholds in the ranges [0.2, 0.4] and [0.7, 0.9], respectively.
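For reference, per-category precision and recall can be computed as in the following sketch, which uses the standard definitions. It assumes that every aligned component carries one predicted label and one ground-truth label; the function name `precision_recall` and the label strings are hypothetical, and the exact evaluation protocol behind the tables may differ.

```python
def precision_recall(predicted, actual, category):
    """Standard precision and recall of one category over paired label lists."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    true_positives = sum(
        1 for p, a in zip(predicted, actual) if p == category and a == category
    )
    predicted_count = sum(1 for p in predicted if p == category)
    actual_count = sum(1 for a in actual if a == category)
    # Report 0.0 when a category never occurs, to avoid dividing by zero.
    precision = true_positives / predicted_count if predicted_count else 0.0
    recall = true_positives / actual_count if actual_count else 0.0
    return precision, recall


if __name__ == "__main__":
    predicted = ["similar", "similar", "substitution", "insertion"]
    actual = ["similar", "substitution", "substitution", "insertion"]
    for label in ("similar", "substitution", "insertion", "deletion"):
        print(label, precision_recall(predicted, actual, label))
```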

Our experiments show that classifying similar components (word-parts) as substitutions and substituted components as similar are the most common errors. Aligning similar components and aligning substituted components often result in the same structure of the minimal path, but with different values in the distance matrix. Determining the lower and upper thresholds is challenging, as it depends on the difference between the scripts (both handwritten and printed). However, using color coding to visualize the similarities and differences between the compared manuscripts simplifies the adjustment of the thresholds and the location of modifications (changes).

The value of $k$, which determines the size of the aligned sequences, plays a major role in obtaining good results. Too small a value seems to corrupt the progress of the pointers that indicate the beginning of the compared sequences, while too large a value spreads the differences over large regions and complicates the analysis of the differing regions.

5. CONCLUSIONS
We presented a novel approach to align two similar historical manuscripts and reported its performance on Arabic historical documents. Our current implementation provides good results and requires less interaction for good-quality manuscripts that do not include touching components; i.e., manuscripts for which it is possible to correctly extract the lines and continuous sub-words (connected components). To compute the distance between two components in the image domain, we extract a feature vector from each pixel column and compare the two arrays of feature vectors that represent the two components using DTW. To determine the difference between two sequences of components, we apply DTW and analyze the generated minimal path to determine the type of difference: insertion, deletion, or substitution. To simplify locating the differences between the two manuscripts, we incorporate a visualization tool within the alignment system. The visualization tool superimposes the values of the edit distance on the compared manuscripts as color codes. We experimented with different manuscripts at various image qualities and obtained encouraging results.

The scope of future work includes applying machine learning techniques that utilize the user feedback to refine the matching procedure while processing the two manuscripts.

6. ACKNOWLEDGMENT
This research was supported in part by the Israel Science Foundation grant no. 1266/09, DFG-Trilateral Grant no. 8716, and the Lynn and William Frankel Center for Computer Sciences at Ben-Gurion University, Israel. We would like to thank the reviewers for their insightful comments, which led to several improvements in the presentation of this paper.

7. REFERENCES
[1] “Personal conversations with historians that include Dr. Muamda Yeha and Dr. Husam Afan.”

[2] C. Tomai, B. Zhang, and V. Govindaraju, “Transcript mapping for historic handwritten document images,” in Frontiers in Handwriting Recognition, 2002. Proceedings. Eighth International Workshop on, 2002, pp. 413–418.

[3] E. M. Kornfield, R. Manmatha, and J. Allan, “Text alignment with handwritten documents,” in Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL’04), ser. DIAL ’04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 195–.

[4] J. Rothfeder, R. Manmatha, and T. M. Rath, “Aligning transcripts to automatically segmented handwritten manuscripts,” in Proceedings of the 7th IAPR Workshop on Document Analysis Systems, 2006, pp. 84–95.

[5] E. Indermuhle, M. Liwicki, and H. Bunke, “Combining alignment results for historical handwritten document analysis,” in Document Analysis and Recognition, 2009. ICDAR ’09. 10th International Conference on, July 2009, pp. 1186–1190.

[6] C. Huang and S. N. Srihari, “Mapping transcripts to handwritten text,” in Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition. IEEE Computer Society, 2006, pp. 15–20.

[7] L. M. Lorigo and V. Govindaraju, “Transcript mapping for handwritten Arabic documents,” in Proceedings Document Recognition and Retrieval, X. Lin and B. A. Yanikoglu, Eds., vol. 6500, no. 1. SPIE, 2007, p. 65000W.

[8] T. Rath and R. Manmatha, “Features for word spotting in historical manuscripts,” in Proceedings of the 7th International Conference on Document Analysis and Recognition, 3-6 Aug. 2003, pp. 218–222.

[9] E. M. Kornfield, R. Manmatha, and J. Allan, “Further explorations in text alignment with handwritten documents,” Int. J. Doc. Anal. Recognit., vol. 10, pp. 39–52, May 2007.

[10] A. Haubold and J. Kender, “Alignment of speech to highly imperfect text transcriptions,” in Multimedia and Expo, 2007 IEEE International Conference on, July 2007, pp. 224–227.

[11] H. Li and N. Homer, “A survey of sequence alignment algorithms for next-generation sequencing,” Briefings in Bioinformatics, vol. 11, no. 5, pp. 473–483, 2010.

[12] R. Manmatha, C. Han, and E. M. Riseman, “Word spotting: New approach to indexing handwriting,” in Proceedings of Computer Vision and Pattern Recognition, 1996, pp. 631–637.

[13] R. Manmatha and T. Rath, “Indexing of handwritten historical documents - recent progress,” in Symposium on Document Image Understanding Technology, 2003, pp. 77–85.

[14] T. Rath and R. Manmatha, “Word image matching using dynamic time warping,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, Madison, vol. 2, June 2003, pp. 521–527.

[15] J. Rothfeder, S. Feng, and T. Rath, “Using corner feature correspondences to rank word images by similarity,” in Computer Vision and Pattern Recognition Workshop, 2003, pp. 30–36.

[16] C. H. S. N. Srihari, H. Srinivasan and S. Shetty, “Spotting words in Latin, Devanagari and Arabic scripts,” Vivek: Indian Journal of Artificial Intelligence, vol. 16, no. 3, pp. 2–9, 2003.

[17] P. B. S. N. Srihari, H. Srinivasan, and C. Bhole, “Handwritten Arabic word spotting using the CEDARABIC document analysis system,” in Proc. Symposium on Document Image Understanding, College Park, MD, November 2005.

[18] K. Ntzios, B. Gatos, I. Pratikakis, T. Konidaris, and S. J. Perantonis, “An old Greek handwritten OCR system based on an efficient segmentation-free approach,” International Journal on Document Analysis and Recognition, vol. 9, no. 2-4, pp. 179–192, 2007.

[19] T. Konidaris, B. Gatos, K. Ntzios, I. E. Pratikakis, S. Theodoridis, and S. J. Perantonis, “Keyword-guided word spotting in historical printed documents using synthetic data and user feedback,” in IJDAR, vol. 9(2-4), 2007, pp. 167–177.

[20] B. Gatos and I. Pratikakis, “Segmentation-free word spotting in historical printed documents document analysis and recognition,” in ICDAR ’09, 2009, pp. 271–275.

[21] R. Saabni and J. El-Sana, “Keyword searching for Arabic handwritten documents,” in The 11th International Conference on Frontiers in Handwriting Recognition (ICFHR2008), Montreal, Canada, 2008, pp. 716–722.

[22] T. Rath, V. Lavrenko, and R. Manmatha, “Retrieving historical manuscripts using shape,” University of Massachusetts, Tech. Rep. 328, 2003.

[23] V. L. T. Rath and R. Manmatha, “A statistical approach to retrieving historical manuscript images,” Center for Intelligent Information Retrieval technical, MM, 2003.

[24] V. Lavrenko, T. Rath, and R. Manmatha, “Holistic word recognition for handwritten historical documents,” in Document Image Analysis for Libraries, 2004, pp. 278–287.

[25] B. Gatos, T. Konidaris, K. Ntzios, I. Pratikakis, and S. Perantonis, “A segmentation-free approach for keyword search in historical typewritten documents,” in Document Analysis and Recognition, 2005, pp. 54–58.

[26] S. Zinger, J. Nerbonne, and L. Schomaker, “Text-image alignment for historical handwritten documents,” in DRR, ser. SPIE Proceedings, K. Berkner and L. Likforman-Sulem, Eds., vol. 7247. SPIE, 2009, pp. 1–10.

[27] G. Navarro, “A guided tour to approximate string matching,” ACM Computing Surveys, vol. 33, pp. 31–88, March 2001.

[28] M. A. Aleksander Kolcz, Joshua Alspector, “A line-oriented approach to word spotting in handwritten documents,” Pattern Analysis and Applications, vol. 3, no. 2, pp. 153–168, 2000.

[29] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.

[30] R. Saabni and J. El-Sana, “Language-independent text lines extraction using seam carving,” in International Conference on Document Analysis and Recognition, 2011.

[31] S. Avidan and A. Shamir, “Seam carving for content-aware image resizing,” ACM Trans. Graph., vol. 26, no. 3, p. 10, 2007.