Digital Access of Handwritten Documents Venu Govindaraju Anurag Bhardwaj Huaigu Cao [email protected]

Mar 13, 2018

Page 1: Digital Access of Handwritten Documentsgovind/pdfs/and09_govindaraju.pdf[Milewski & Govindaraju, DAS 2006] Farooq et al, DAS 2008] [Cao & Govindaraju, ICDAR 2007] Outline Recognition

Digital Access of Handwritten Documents

Venu Govindaraju

Anurag Bhardwaj

Huaigu Cao

[email protected]

Page 2:

Outline

Recognition
Postal Application
Paradigms

Search
OCR accuracy

Fusion
Lexicon Reduction
Statistical Topic Models

Document Search
Word Spotting

Page 3:

Challenge of Handwriting

Page 4:

Motivation

Vast, irreplaceable, culturally vital legacy collections of historical documents are competing ineffectively for attention with billions of digital documents

Thus historical archives are threatened with neglect, perceived irrelevance, … and eventually, oblivion?

Threat: ‘If it’s not in Google, it doesn’t exist!’

[Baird 2003]

Page 5:

Postal Context (138 mil records)

ZIP Code
• 30% of ZIP Codes contain a single street name
• 5% of ZIP Codes contain a single primary number
• 2% of ZIP Codes contain a single add-on

<ZIP Code, primary number>
• Maximum number of records returned is 3,071

<ZIP Code, add-on>
• Maximum number of records returned is 3,070

LDR

Lexicon size    Top 1 (%)    Top 2 (%)
10              96.5         98.7
100             89.2         94.1
1000            75.3         86.3

Page 6:

Paradigms

Context Ranked Lexicon

Lexicon Driven OCR (LDR)

Lexicon Free OCR (LFR)

Segmentation, Recognition, Post-processing

Page 7:

Lexicon Free (LFR)

[Figure: segmentation graph over segments 1 to 8; each arc carries character hypotheses with confidences: i[.8], l[.8], u[.5], v[.2], w[.6], m[.3], w[.7], i[.7], u[.3], m[.2], m[.1], r[.4], d[.8], o[.5]]

• Image from segment 1 to 3 is a ‘u’ with 0.5 confidence
• Image from segment 1 to 4 is a ‘w’ with 0.7 confidence
• Image from segment 1 to 5 is a ‘w’ with 0.6 confidence and an ‘m’ with 0.3 confidence

Find the best path in graph from segment 1 to 8
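The best-path search can be sketched as a dynamic program over the segmentation graph. This is a minimal illustration, not the recognizer described in the talk: it assumes a path is scored by the product of its arc confidences, and every arc in `EDGES` other than the three quoted on the slide (1 to 3, 1 to 4, 1 to 5) is a hypothetical value chosen for the example.

```python
def best_path(edges, start, end):
    """Return (score, text) of the highest-scoring path from start to end.

    edges maps (i, j) with i < j to a list of (character, confidence)
    hypotheses for the image spanning segment points i to j.
    Assumption: a path is scored by the product of its confidences.
    """
    best = {start: (1.0, "")}
    # Arcs only go forward, so increasing node order is a topological order.
    for node in range(start + 1, end + 1):
        candidates = []
        for (i, j), hypotheses in edges.items():
            if j == node and i in best:
                score, text = best[i]
                for char, conf in hypotheses:
                    candidates.append((score * conf, text + char))
        if candidates:
            best[node] = max(candidates)
    return best.get(end)


# Arcs 1-3, 1-4 and 1-5 use the confidences quoted on the slide;
# all other arcs are hypothetical.
EDGES = {
    (1, 3): [("u", 0.5), ("v", 0.2)],
    (1, 4): [("w", 0.7)],
    (1, 5): [("w", 0.6), ("m", 0.3)],
    (3, 4): [("i", 0.7)],
    (4, 5): [("o", 0.9)],
    (5, 7): [("r", 0.4)],
    (7, 8): [("d", 0.8)],
}

result = best_path(EDGES, 1, 8)
print(result)
```

A product score favors paths with fewer arcs, which is one reason a lexicon-free result usually still goes through post-processing.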

Page 8:

Lexicon Driven (LDR)

[Figure: matching trellis over segment points 1 to 9 for the lexicon entry ‘word’; each arc is labeled with a character and the distance between that character and the image span it covers: w[7.6], w[7.2], w[5.0], w[8.6], w[5.0], o[7.6], o[6.6], o[6.0], o[7.2], o[10.6], o[7.8], o[8.3], o[7.7], o[6.1], r[3.8], r[6.3], r[7.5], r[6.4], r[8.6], r[7.6], r[5.8], d[4.9], d[6.5], d[4.4]]

Find the best way of accounting for characters ‘w’, ‘o’, ‘r’, ‘d’ by consuming all segments 1 to 8

Distance between lexicon entry ‘word’ first character ‘w’ and the image between:
• segments 1 and 4 is 5.0
• segments 1 and 3 is 7.2
• segments 1 and 2 is 7.6

[Kim & Govindaraju, TPAMI 1997]
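The lexicon-driven matching step can be illustrated with a small dynamic program. This is a sketch under stated assumptions rather than the cited algorithm: each character is assumed to cover between one and `max_span` consecutive segments, the total cost is the sum of character distances, and the distance table below is hypothetical apart from the three ‘w’ values quoted on the slide.

```python
def match_word(word, n_segments, dist, max_span=4):
    """Minimum total distance for matching `word` against all segments.

    Segment points are numbered 1..n_segments + 1. dist(ch, start, end)
    returns the distance between character ch and the image between
    segment points start and end. Every segment must be consumed.
    """
    inf = float("inf")
    last = n_segments + 1
    # cost[k][b]: best distance for the first k characters ending at point b
    cost = [[inf] * (last + 1) for _ in range(len(word) + 1)]
    cost[0][1] = 0.0
    for k, ch in enumerate(word, start=1):
        for end in range(2, last + 1):
            for start in range(max(1, end - max_span), end):
                if cost[k - 1][start] < inf:
                    candidate = cost[k - 1][start] + dist(ch, start, end)
                    cost[k][end] = min(cost[k][end], candidate)
    return cost[len(word)][last]


# The three 'w' entries come from the slide; the rest are hypothetical.
TABLE = {
    ("w", 1, 2): 7.6, ("w", 1, 3): 7.2, ("w", 1, 4): 5.0,
    ("o", 4, 5): 6.6, ("r", 5, 7): 3.8, ("d", 7, 9): 4.4,
}


def toy_distance(ch, start, end):
    # Unknown spans get a large hypothetical default distance.
    return TABLE.get((ch, start, end), 10.0)


print(match_word("word", 8, toy_distance))
```

Running the same function for every lexicon entry and sorting by the returned distance gives a ranked lexicon.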

Page 9:

Interactive Models (LDR)
2-way interaction

a) Amherst b) Buffalo c) Boston

a) San Jose b) Buffalo c) Washington d) None of the above

Page 10:

Search for Handwritten Documents

Lexicon       Good Quality     Historical       Medical
              10K     1K       10K     1K       4K
Top 1 (%)     57      67       12      28       20
Top 3 (%)     69      72       22      44       27
Top 10 (%)    74      75       32      72       42

• Lexicons are typically large: >5K
• Need around 70% accuracy

Strategy
• Reduce lexicon size using topic categorization (DAS 06; 08)
• Use Top-N choices returned by OCR (ICDAR 07)

[Milewski & Govindaraju, DAS 2006] [Farooq et al, DAS 2008] [Cao & Govindaraju, ICDAR 2007]

Page 11:

Outline

Recognition
Postal Application
Paradigms

Search
OCR accuracy

Fusion
Lexicon Reduction
Statistical Topic Models

Document Search
Word Spotting

Page 12:

Fusion of Recognizers (Type III)

[Figure: LDR and LFR each score every lexicon entry. Amherst: LDR 5.6, LFR .52. Buffalo: LDR 7.4, LFR .81. The figure asks: $f_1 \equiv f_N$ ?]

Combined scores: $S_1 = f(s_1^1, s_1^2)$, $S_2 = f(s_2^1, s_2^2)$

Identification task: $\arg\max_{i = 1, \dots, N} S_i$

Verification task (Amherst: 5.6, .52): Accept if $S > \theta$, otherwise Reject

Page 13:

Traditional Fusion Rules

• Sum rule: $f(s^1, s^2) = s^1 + s^2$

• Weighted sum rule: $f(s^1, s^2) = w_1 s^1 + w_2 s^2$

• Product rule: $f(s^1, s^2) = s^1 \times s^2$

• Max rule: $f(s^1, s^2) = \max(s^1, s^2)$

• Rank-based methods: $s_i^1 \rightarrow r_i^1 = rank(s_i^1, \{s_1^1, \dots, s_N^1\})$
  $f(s_i^1, s_i^2) = r_i^1 + r_i^2$
  $f(s_i^1, s_i^2) = P(r_i^1, r_i^2 \mid gen)$
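The fixed rules listed on this slide are simple enough to write out directly. The sketch below assumes two recognizers whose scores are oriented the same way (higher is better) and assumes a rank convention in which the best score receives the highest rank, so that the same arg-max applies to every rule; both conventions are choices made for this illustration.

```python
def sum_rule(s1, s2):
    return s1 + s2


def weighted_sum_rule(s1, s2, w1=0.5, w2=0.5):
    return w1 * s1 + w2 * s2


def product_rule(s1, s2):
    return s1 * s2


def max_rule(s1, s2):
    return max(s1, s2)


def ranks(scores):
    """Rank of each score within its list (assumed: best score gets rank N)."""
    return [sum(1 for other in scores if other <= s) for s in scores]


def rank_sum_rule(scores1, scores2):
    """Combined rank score per class: r_i^1 + r_i^2."""
    return [a + b for a, b in zip(ranks(scores1), ranks(scores2))]


def identify(scores1, scores2, rule):
    """Index of the class with the highest combined score."""
    combined = [rule(a, b) for a, b in zip(scores1, scores2)]
    return max(range(len(combined)), key=combined.__getitem__)
```

For example, `identify([0.52, 0.81, 0.30], [0.60, 0.40, 0.20], sum_rule)` picks the second class, while a weighted sum that trusts the second recognizer more can pick the first.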

Page 14:

Likelihood Ratio: Verification Tasks

[Figure: scatter plot of Genuine and Impostor samples; axes: Recognizer score 1, Recognizer score 2]

• 2 classes: impostor and genuine
• Pattern classification task

$f_{lr}(s^1, s^2) = \frac{p_{gen}(s^1, s^2)}{p_{imp}(s^1, s^2)}$

Minimum risk criteria: optimal decision boundaries coincide with the contours of the likelihood ratio function:

$f_V = f_{lr}$

Metaclassification with NN, SVM, etc. also possible

[Prabhakar, Jain 02] [Nandkumar, Jain, Das 08]

Page 15:

Optimal Combination functions

Identification Task Results (top choice correct rate)

LFR is correct       54.8%
LDR is correct       77.2%
Both are correct     48.9%
Either is correct    83.0%
Likelihood Ratio     69.8%
Weighted Sum         81.6%

• LR combination is worse than single matcher

Verification Task Results: $f_V = f_{LR}$

[Figure: ROC]

[Tulyakov & Govindaraju IJPRAI 2009]

Page 16:

Independence of Scores
In a single trial

$S_i = f(s_i^1, s_i^2, \dots, s_i^M, \{s_k^1, s_k^2, \dots, s_k^M\}_{k \neq i})$

[Figure: LDR and LFR scores for each lexicon entry. Amherst: LDR 5.6, LFR .52, combined by $f(s_1^1, s_1^2)$. Buffalo: LDR 7.4, LFR .81, combined by $f(s_2^1, s_2^2)$. …]

[Tulyakov & Govindaraju IJPRAI 2009]

Page 17:

Dependencies

[Figure: OCR outputs as ranked candidate lists with scores. First list: A .95, B .89, C .76, … Second list: A .80, B .54, C .43, …]

$C = \arg\max_k \prod_j \frac{p(s_j^k, t_j^k \mid C_k)}{p(s_j^k, t_j^k \mid \overline{C_k})}$

[Tulyakov & Govindaraju IJPRAI 2009]

Page 18:

Iterative Methods

• Initialize a combination function
• Get scores from the same identification trial (for all trials)
• Update function so Genuine score better than any impostor score

Best Impostor Function

$f(\cdot) = \frac{p_{gen}(s_i^1, s_i^2, \dots, s_i^M)}{p_{imp}(s_i^1, s_i^2, \dots, s_i^M)}$

Sum of Logistic Functions

$f(s^1, s^2, \dots, s^M) = \frac{1}{1 + e^{-(\alpha_1 s^1 + \alpha_2 s^2 + \dots + \alpha_M s^M)}}$, $\alpha_j \geq 0$

Top choice correct rate (%), LFR & LDR

Likelihood Ratio                  69.84
Weighted sum                      81.58
Best Impostor Likelihood Ratio    80.07
Logistic Sum                      81.43
Neural Network                    81.67

[Tulyakov & Govindaraju IJPRAI 2009]
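The three iterative steps (initialize, score whole identification trials, update until the genuine score beats every impostor) can be illustrated with a perceptron-style update on a weighted-sum combination. This is only a sketch of the idea, not the training procedure of the cited paper: the update rule, learning rate, and trial data are all assumptions made for the example.

```python
def combine(weights, scores):
    """Weighted-sum combination of one class's recognizer scores."""
    return sum(w * s for w, s in zip(weights, scores))


def train_weights(trials, n_recognizers, epochs=50, rate=0.1):
    """Adjust weights until the genuine class outscores the best impostor.

    trials is a list of (genuine_scores, [impostor_scores, ...]); each
    entry holds the scores produced in one identification trial.
    """
    weights = [1.0] * n_recognizers
    for _ in range(epochs):
        for genuine, impostors in trials:
            best_imp = max(impostors, key=lambda s: combine(weights, s))
            if combine(weights, genuine) <= combine(weights, best_imp):
                # Move the weights toward the genuine score vector and
                # away from the best impostor in this trial.
                weights = [w + rate * (g - b)
                           for w, g, b in zip(weights, genuine, best_imp)]
    return weights


def count_errors(weights, trials):
    """Number of trials in which some impostor ties or beats the genuine class."""
    return sum(
        1 for genuine, impostors in trials
        if any(combine(weights, imp) >= combine(weights, genuine)
               for imp in impostors)
    )


# Hypothetical trials: recognizer 2 is more reliable than recognizer 1.
TRIALS = [
    ([0.2, 0.9], [[0.8, 0.3], [0.5, 0.1]]),
    ([0.3, 0.8], [[0.9, 0.4]]),
]
learned = train_weights(TRIALS, 2)
print(learned, count_errors(learned, TRIALS))
```

Because each update compares scores within a single trial, the learned function reflects how genuine and impostor scores relate in the same trial rather than across pooled samples.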

Page 19:

Search Engine: Handwritten Forms

• Pre Hospital Care Report (PCR)
  WNY: 250,000 filed a year
  NYC: 50,000 filed in a day
  PDAs not popular

• OHR issues
  Loosely constrained writing style
  Large lexicons
  Heterogeneous data

6,700 carbon forms stored at 300 DPI
1000 PCR forms ground truthed

Page 20:

Search Engine for Medical Forms

• Find all people who reported asthma problems in NY
• How many people with high blood pressure are on medication X?
• Is there an epidemic breaking?

Page 21:

Lexicon Reduction

[Figure: flow diagram. Handwritten Medical Documents → Lex Free → ICR Features → Topic Category → Reduce Lexicon (Large Lexicon > 5K to ~2.5K) → Lex Driven → Improve Performance]

[Milewski, Bharadwaj, & Govindaraju, IJDAR 2009]

Page 22:

ICR Features Index

[Milewski, Bharadwaj, & Govindaraju, IJDAR 2009]

Page 23:

Topic Features

$cohesion(w_a, w_b) = z \cdot \frac{f(w_a, w_b)}{\sqrt{f(w_a) \cdot f(w_b)}}$

DIGESTIVE-SYSTEM

FQ    CHSN    PHRASE
30    0.72    PAIN INCIDENT
5     0.31    PAIN TRANSPORTED
42    0.54    PAIN CHEST
52    0.81    STOMACH PAIN
9     0.25    HOME PAIN
6     0.43    VOMITING ILLNESS

[Milewski, Bharadwaj, & Govindaraju, IJDAR 2009]

Page 24:

Topic Categorization

(Chu-Carroll, et al., 1999)

$B_{t,c} = \frac{A_{t,c}}{\sqrt{\sum_{e=1}^{n} A_{t,e}^2}}$

$IDF(t) = \log_2 \frac{n}{c(t)}$

$X_{t,c} = IDF(t) \cdot B_{t,c}$

Cosine similarity between trained topic vectors and test document
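A small sketch of this categorization step follows. It assumes `A[t][c]` is the count of term `t` in category `c`, that each term row is normalized to unit length before IDF weighting, that `n` is the number of categories, and that `c(t)` is the number of categories containing the term; these readings of the symbols, and the toy counts, are assumptions made for illustration.

```python
import math


def topic_vectors(A):
    """Return X[t][c] = IDF(t) * B[t][c] from a term-by-category count matrix."""
    n = len(A[0])
    X = []
    for row in A:
        norm = math.sqrt(sum(a * a for a in row))
        c_t = sum(1 for a in row if a > 0)
        idf = math.log2(n / c_t) if c_t else 0.0
        X.append([idf * a / norm if norm else 0.0 for a in row])
    return X


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def categorize(X, doc):
    """Index of the category whose topic vector is closest to the document."""
    n_categories = len(X[0])
    columns = [[row[c] for row in X] for c in range(n_categories)]
    sims = [cosine(doc, col) for col in columns]
    return max(range(n_categories), key=sims.__getitem__)


# Hypothetical counts. Rows: "stomach", "chest", "pain";
# columns: a digestive category and a circulatory category.
A = [[4, 0], [0, 3], [2, 2]]
X = topic_vectors(A)
print(categorize(X, [1, 0, 1]))
```

A term such as "pain" that occurs in every category gets an IDF of zero and so does not influence the decision.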

Page 25:

Results

              CLT to RLT    CL to RL    CLT to ALT    CLT to SLT
HR            ↑7.48%        ↑7.42%      ↑17.58%       ↑7.42%
Error Rate    ↓10.78%       ↓10.88%     ↓24.53%       ↓10.21%

C: complete lexicon
R: reduced lexicon
A: category given
S: features synthetic
T: truth present

[Milewski, Bharadwaj, & Govindaraju, IJDAR 2009]

Page 26:

Statistical Topic Modeling

• Train topic categorization maximum entropy model
• Generate topic distribution of test document
• Use topic distribution to score each topic as new prior
• Compute posterior probability of word recognition
• Improves from 32% to 40% on IAM dataset

[Figure: Input Word Image → Noisy Output → Correction Model → Corrected Output]

Noisy Output      Corrected Output
Toggle  0.92      Google  0.96
Google  0.90      Toggle  0.72
Noodle  0.70      Noodle  0.58
...               ...

[Bharadwaj, Farooq, Cao, & Govindaraju, AND 2008]

Page 27:

Statistical Topic Modeling

[Figure: Input Word Image → Noisy Output → Correction Model → Corrected Output]

Noisy Output: p(word-image | term)      Corrected Output: P(term | word-image)
Toggle  0.92                            Google  0.96
Google  0.90                            Toggle  0.72
Noodle  0.70                            Noodle  0.58
...                                     ...

Correction Model

P( term | word-image )
∝ P( word-image | term ) x P( term )
= P( word-image | term ) x ∑ { P( term | LMi ) x P( LMi ) }

[Bharadwaj, Farooq, Cao, & Govindaraju, AND 2008]
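The correction model can be sketched in a few lines: the recognizer score for each candidate term is multiplied by a prior that mixes per-topic language models according to the document's topic distribution, and the result is normalized. The recognizer scores below are the "noisy output" values from the slide; the two language models and the topic distribution are hypothetical, so the corrected numbers will not match the slide.

```python
def rescore(likelihoods, language_models, topic_dist):
    """Posterior over candidate terms for one word image.

    likelihoods:     term -> P(word-image | term) from the recognizer
    language_models: list of dicts, term -> P(term | LM_i)
    topic_dist:      list of P(LM_i) for the current document
    """
    unnormalized = {}
    for term, likelihood in likelihoods.items():
        prior = sum(p_topic * lm.get(term, 0.0)
                    for lm, p_topic in zip(language_models, topic_dist))
        unnormalized[term] = likelihood * prior
    total = sum(unnormalized.values())
    if total == 0:
        return unnormalized
    return {term: p / total for term, p in unnormalized.items()}


NOISY = {"Toggle": 0.92, "Google": 0.90, "Noodle": 0.70}
# Hypothetical topic language models and topic distribution.
LM_WEB = {"Google": 0.05, "Toggle": 0.01, "Noodle": 0.005}
LM_FOOD = {"Noodle": 0.06, "Google": 0.001, "Toggle": 0.001}
corrected = rescore(NOISY, [LM_WEB, LM_FOOD], [0.9, 0.1])
print(max(corrected, key=corrected.get))
```

With a document that is mostly about the first topic, "Google" overtakes "Toggle" even though the recognizer scored it slightly lower.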

Page 28:

Language Model = P( t | LMi )

Category c1 Document → Category c1 Language Model (LM1)
P(eye|c1)      0.92
P(brain|c1)    0.90
...
P(china|c1)    0.09

Category c2 Document → Category c2 Language Model (LM2)
P(trade|c2)    0.82
P(bank|c2)     0.78
...
P(eye|c2)      0.1

[Bharadwaj, Farooq, Cao, & Govindaraju, AND 2008]

Page 29:

Topic Distribution = P( LMi )

$P(c \mid d) = \frac{e^{\sum_i \lambda_i f_i(d, c)}}{\sum_c e^{\sum_i \lambda_i f_i(d, c)}}$

• Train the Max-Entropy model: fix $\lambda_i$

$f_i$ is a feature (e.g., normalized word counts)

Recognizer candidates for six word images:

Image 1     Image 2       Image 3      Image 4           Image 5       Image 6
I  0.80     JULY  0.90    CAVE 0.70    DECEIVED  0.95    FAVOR 0.70    YOUR 0.95
T  0.65     FULLY 0.75    HAVE 0.55    RECEIVED  0.55    YOUR  0.55    HAVE 0.15
H  0.35     DULY  0.65    HAS  0.15    PERCEIVED 0.30    COLOR 0.15    HAS  0.10
...         ...           ...          ...               ...           ...

count(YOUR) = 0.95 + 0.55 = 1.50

[Bharadwaj, Farooq, Cao, & Govindaraju, AND 2008]
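The soft word counts and the maximum entropy posterior can be sketched together. The candidate lists are taken from the slide; the category names and the weights passed to `maxent_posterior` are hypothetical, and the slide's feature normalization is left out for brevity.

```python
import math
from collections import defaultdict


def soft_counts(candidate_lists):
    """Sum recognizer scores per term over all word images of a document."""
    counts = defaultdict(float)
    for candidates in candidate_lists:
        for term, score in candidates:
            counts[term] += score
    return dict(counts)


def maxent_posterior(features, weights):
    """P(c | d) as a softmax over sum_i lambda_i * f_i(d, c).

    features: feature name -> value for the document
    weights:  category -> {feature name -> lambda}
    """
    logits = {c: sum(lam.get(f, 0.0) * v for f, v in features.items())
              for c, lam in weights.items()}
    top = max(logits.values())  # subtract the maximum for numerical stability
    exps = {c: math.exp(x - top) for c, x in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


CANDIDATES = [
    [("I", 0.80), ("T", 0.65), ("H", 0.35)],
    [("JULY", 0.90), ("FULLY", 0.75), ("DULY", 0.65)],
    [("CAVE", 0.70), ("HAVE", 0.55), ("HAS", 0.15)],
    [("DECEIVED", 0.95), ("RECEIVED", 0.55), ("PERCEIVED", 0.30)],
    [("FAVOR", 0.70), ("YOUR", 0.55), ("COLOR", 0.15)],
    [("YOUR", 0.95), ("HAVE", 0.15), ("HAS", 0.10)],
]
counts = soft_counts(CANDIDATES)
# Hypothetical categories and weights, for illustration only.
posterior = maxent_posterior(
    counts, {"letter": {"YOUR": 1.0}, "report": {"YOUR": -1.0}})
print(counts["YOUR"], posterior)
```

Using scores from all top candidates, rather than only the first choice, lets a correct word that was ranked second still contribute to the topic decision.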

Page 30:

Experiments

Corpus: IAM Database
Word Recognizer: WMR
Topic Categorization: Mallet
LM: CMU-Cambridge LM toolkit
# of training docs: 380
# of train categories: 13
# of test docs: 70
# of test word images: 4033

Method            Word Recognition
Raw               32.33 %
Raw + 3-LM        35.95 %
Raw + Topic-LM    40.63 %

[Bharadwaj, Farooq, Cao, & Govindaraju, AND 2008]

Page 31:

Outline

Recognition
Postal Application
Paradigms

Search
OCR accuracy

Fusion
Lexicon Reduction
Statistical Topic Models

Document Search
Word Spotting

Page 32:

Indexing Retrieval

Handwriting Recognition

Page 33:

Vector IR Model (TF-IDF)

Set of terms $\{t_i\}$; set of documents $\{d_j\}$ of length $\{L_j\}$

Term Frequency (TF): $tf_{i,j} = \frac{freq_{i,j}}{L_j}$

Inverse Document Frequency (IDF): $idf_i = \log \frac{\#\{d_j\}}{\#\{d_j \mid freq_{i,j} > 0\}}$

Query TF: $tf_{i,q} = 1$ if $t_i$ is in query $q$, $0$ otherwise

Similarity: $sim(d_j, q) = \sum_i tf_{i,j} \cdot idf_i \cdot tf_{i,q}$

Example with $q = \{$"back", "pain"$\}$:

terms    tf_{i,j}    idf_i    tf_{i,q}
back     0.024       4.1      1
pain     0.008       2.4      1
...      ...         ...      0

The rows are multiplied across and summed (Σ) to give $sim(d_j, q)$.

[Baeza-Yates 1999]
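The similarity computation can be checked with the numbers on this slide. The sketch below assumes the IDF values 4.1 and 2.4 belong to "back" and "pain" respectively, following the order in which they appear; the helper names are chosen for this example.

```python
import math


def term_frequency(freq, doc_length):
    """tf_{i,j}: occurrences of the term divided by the document length."""
    return freq / doc_length


def inverse_document_frequency(n_docs, n_docs_with_term):
    """idf_i: log of (number of documents / documents containing the term)."""
    return math.log(n_docs / n_docs_with_term)


def similarity(doc_tf, idf_by_term, query_terms):
    """sim(d_j, q): sum of tf * idf over the distinct query terms."""
    return sum(doc_tf.get(t, 0.0) * idf_by_term.get(t, 0.0)
               for t in set(query_terms))


doc_tf = {"back": 0.024, "pain": 0.008}
idf_by_term = {"back": 4.1, "pain": 2.4}
score = similarity(doc_tf, idf_by_term, ["back", "pain"])
print(score)
```

For this document the two query terms contribute 0.024 × 4.1 and 0.008 × 2.4, which sum to 0.1176.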

Page 34:

OCR-Based IR [Prior Work]

[Mittendorf, SDAIR96; Ohta, ICDAR97; Jing, CL02]
• Model common OCR errors
• Machine print

[Rath, SIGIR04]
• Learn word pdf (over image features)
• Requires large annotated training set
• ASSUME: perfect segmentation, single writer

[Howe, SIGIR2005]
• ASSUME: ranks obey Zipfian distribution
• Segmentation assumed perfect

Page 35:

Required (Assumed) Inputs

Word segmentation result: $\vec{w} = [w_1 w_2 \dots w_L]$: word images

Word recognition likelihoods

Estimation

$E\{freq_{i,j}\} = \sum_{k=1}^{L} \Pr(t_i \mid w_k)$

[Figure: Doc $d_j$ with word images $w_k$; $\Pr(\text{"pain"} \mid w_k)$ = 0.02, 0.01, 0.2, 0.01, 0.01, …; their sum (Σ) gives $E\{freq_{\text{"pain"},j}\}$]

[Rath 04, Howe 05]

Page 36:

Estimating Term Frequency

$E\{freq_{i,j}\} = \sum_{I_w} \Pr(I_w) \Pr(t_i \mid I_w) + \varepsilon$

$\{t_i\}: \{t_0 = \text{"head"}, t_1 = \text{"arm"}, t_2 = \text{"pelvis"}, \dots\}$

[Figure: for each word-image hypothesis $I_w$ in document $d_j$, its segmentation probability $\Pr(I_w)$ (1, 1, 0.5, 1, …) and a column of recognition probabilities Pr("head" | $I_w$), Pr("arm" | $I_w$), Pr("pelvis" | $I_w$), …; matrix entries as extracted: 0.2, 0.05, 0.01, 0.7, 0.07, 0.01, 0.8, 0.01, 0.002, 0.01, 0.07, 0.03]

$E\{freq_{1,j}\} = \sum_{I_w} \Pr(I_w) \Pr(\text{"arm"} \mid I_w) = 1 \times 0.05 + 1 \times 0.7 + 0.5 \times 0.01 + 1 \times 0.07 + \dots$

[Cao & Govindaraju, ICDAR 2007]
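The expected term frequency is a weighted sum over word-image hypotheses, which the following sketch reproduces for the four "arm" terms visible on the slide. The residual term ε is left out, and the data layout (a list of pairs) is an assumption made for the example.

```python
def expected_frequency(hypotheses, term):
    """Sum of Pr(I_w) * Pr(term | I_w) over word-image hypotheses.

    hypotheses is a list of (segmentation_probability, {term: probability}).
    """
    return sum(p_seg * p_terms.get(term, 0.0)
               for p_seg, p_terms in hypotheses)


# The four hypotheses whose "arm" probabilities are shown on the slide.
HYPOTHESES = [
    (1.0, {"arm": 0.05}),
    (1.0, {"arm": 0.7}),
    (0.5, {"arm": 0.01}),
    (1.0, {"arm": 0.07}),
]
partial = expected_frequency(HYPOTHESES, "arm")
print(partial)
```

These four terms alone contribute 0.825 to the expected frequency of "arm"; the slide's sum continues over the remaining hypotheses.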

Page 37:

Estimating Segmentation

Word Segmentation: gap between adjacent connected components above a threshold D ($d > D$)

Generate multiple hypotheses with multiple D

If hypothesis $I_w$ overlaps $m$ other hypotheses, then $\Pr(I_w) = \frac{1}{m + 1}$

[Figure: 3 hypotheses]

m            1      2      1
Pr(I_w)      1/2    1/3    1/2

[Cao & Govindaraju, ICDAR 2007]
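The 1/(m + 1) rule is easy to reproduce once hypotheses are represented as pixel intervals. The interval representation and the specific coordinates below are assumptions for this sketch; they are chosen so that the overlap counts match the slide's three-hypothesis example.

```python
def segmentation_probabilities(hypotheses):
    """Pr(I_w) = 1 / (m + 1), where m counts overlapping other hypotheses.

    hypotheses is a list of (start, end) horizontal extents, end exclusive.
    """
    probabilities = []
    for a, (a_start, a_end) in enumerate(hypotheses):
        m = sum(
            1 for b, (b_start, b_end) in enumerate(hypotheses)
            if b != a and a_start < b_end and b_start < a_end
        )
        probabilities.append(1.0 / (m + 1))
    return probabilities


# Two words found with a small threshold D, and one merged hypothesis
# found with a larger D that spans both (hypothetical coordinates).
spans = [(0, 60), (0, 100), (60, 100)]
print(segmentation_probabilities(spans))
```

The merged hypothesis overlaps both of the others and receives 1/3, while each of the shorter hypotheses overlaps only the merged one and receives 1/2.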

Page 38:

Word Recognition: Prior Work on $\Pr(t_i \mid I_w)$

• Top-Rank (Top-S candidates involved)
• Weighted Top-Rank
• Empirical

$\Pr(t_i \mid I_w) = \text{top-}R \text{ OCR rate} - \text{top-}(R-1) \text{ OCR rate}$, with $R = rank(t_i)$

$\Pr(t_i \mid I_w) = \frac{1}{S}$ if $1 \leq rank(t_i) \leq S$, $0$ otherwise

$\Pr(t_i \mid I_w) = \frac{\Pr(t_i) \, e^{-d_i^2 / 2\sigma^2}}{\sum_i \Pr(t_i) \, e^{-d_i^2 / 2\sigma^2}}$

[Cao & Govindaraju, ICDAR 2007]

Page 39:

Results

• 16303 word images (342 forms)
• Automatic segmentation: 63% correct; 32% under; 5% over
• Lexicon: 4405 words
• Query: 22 (1-3 words)

n                              1      2      5      10
Top n word recognition rate    20%    27%    35%    42%

[Cao & Govindaraju, ICDAR 2007]

Page 40:

MMSE Estimation

$\vec{o} = [o_1 o_2 \dots o_T]$: observation series

$\vec{w} = [w_1 w_2 \dots w_L]$: word segmentation hypothesis

$\vec{\tau} = [\tau_1 \tau_2 \dots \tau_L]$: a decoded term sequence

$\Pr(\vec{w} \mid \vec{o})$: word sequence segmentation probability

$\Pr(\vec{\tau} \mid \vec{w})$: word sequence recognition probability

$\#t_i(\vec{\tau})$: number of $t_i$ in $\vec{\tau}$

Example: #"scene"("Pt ambulatory scene in co ...") = 1

$E\{freq_{i,j}\} = \sum_{\vec{w}} \sum_{\vec{\tau}} \Pr(\vec{w} \mid \vec{o}) \cdot \Pr(\vec{\tau} \mid \vec{w}) \cdot \#t_i(\vec{\tau})$

[Figure: handwritten line reading "Pt ambulatory scene in co …"]
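The MMSE estimate of a term frequency is an expectation over segmentation hypotheses and decoded term sequences. The brute-force sketch below enumerates a small, hypothetical set of hypotheses; a practical system would need a more efficient summation, which is not shown here.

```python
def mmse_term_frequency(term, segmentations):
    """Expected count of `term` over segmentations and decodings.

    segmentations is a list of (Pr(w | o), decodings), where decodings is a
    list of (Pr(tau | w), tau) and tau is a list of decoded terms.
    """
    total = 0.0
    for p_segmentation, decodings in segmentations:
        for p_recognition, tau in decodings:
            total += p_segmentation * p_recognition * tau.count(term)
    return total


# Hypothetical probabilities for two segmentations of one text line.
SEGMENTATIONS = [
    (0.7, [(0.6, ["pt", "ambulatory", "scene"]),
           (0.4, ["pt", "ambulatory", "seen"])]),
    (0.3, [(0.5, ["scene"]),
           (0.5, ["sane"])]),
]
estimate = mmse_term_frequency("scene", SEGMENTATIONS)
print(estimate)
```

Here "scene" receives 0.7 × 0.6 from the first segmentation and 0.3 × 0.5 from the second, for an expected count of 0.57.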

Page 41:

Word Gap

fv1: Euclidean dist between bounding boxes

fv2: Shortest white run between two CC’s

fv3: Distance between convex hulls

$\Pr(Valid \mid fv) = \frac{\Pr(Valid) \, p(fv \mid Valid)}{\Pr(Valid) \, p(fv \mid Valid) + \Pr(Non\text{-}valid) \, p(fv \mid Non\text{-}valid)}$

Likelihoods estimated using Parzen window and Gaussian kernel

[Cao, Bharadwaj, & Govindaraju, ICFHR 2008]
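The word-gap posterior can be sketched with a one-dimensional Parzen window estimate using a Gaussian kernel. The slide uses three gap features; this illustration uses a single feature, and the training gaps, bandwidth `h`, and prior are hypothetical.

```python
import math


def parzen_density(x, samples, h):
    """Parzen window density estimate at x with a Gaussian kernel of width h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


def valid_gap_posterior(fv, valid, non_valid, prior_valid=0.5, h=2.0):
    """Pr(Valid | fv) by Bayes' rule with Parzen-estimated likelihoods."""
    p_valid = prior_valid * parzen_density(fv, valid, h)
    p_non_valid = (1.0 - prior_valid) * parzen_density(fv, non_valid, h)
    total = p_valid + p_non_valid
    return p_valid / total if total > 0 else prior_valid


# Hypothetical training gaps in pixels: word gaps versus intra-word gaps.
VALID_GAPS = [18.0, 20.0, 25.0, 22.0]
NON_VALID_GAPS = [3.0, 4.0, 6.0, 5.0]
print(valid_gap_posterior(21.0, VALID_GAPS, NON_VALID_GAPS))
print(valid_gap_posterior(4.0, VALID_GAPS, NON_VALID_GAPS))
```

A gap of 21 pixels is almost certainly a word gap under these samples, while a gap of 4 pixels is almost certainly not.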

Page 42:

Word Recognition Likelihood

$s(w, t_i)$

$UBM(s) = \Pr(Genuine \mid s) = \frac{\Pr(Genuine) \Pr(s \mid Genuine)}{\Pr(Genuine) \Pr(s \mid Genuine) + \Pr(Imposter) \Pr(s \mid Imposter)}$

$p(w \mid t_i) \propto UBM(s(w, t_i))$

[Kim and Govindaraju, T-PAMI 1997; Cao et al ICFHR 2008]

Page 43:

MAP and R-Precision Values of IR Tests

S: number of top-candidates retained in the OCR’ed text

VM: Vector Model

PM: a Probabilistic IR Model

Naive MMSE Estimation:

MMSE Estimation:

[Figure: bar chart of MAP and R-Prec for OCR'ed Text (S=1), OCR'ed Text (S=3), OCR'ed Text (S=7), OCR'ed Text (S=15), VM + HR Estimation, PM + Naive MMSE Estimation, and VM + MMSE Estimation. Bar labels in extraction order: 0.1614, 0.1609, 0.1145, 0.1281, 0.1269, 0.1171, 0.1577, 0.1514, 0.2042, 0.1674, 0.1491, 0.1675, 0.1635, 0.1897]

[Cao, Bharadwaj, & Govindaraju, ICFHR 2008]

Page 44:

[Cao, Bharadwaj, & Govindaraju, ICFHR 2008]

Page 45:

Word Spotting

Page 46:

Error‐prone segmentation

Manual labeling

Poor performance in multiple writer scenarios 

Image Based Methods

(Rath et al 07, IJDAR)

Page 47:

Keyword Spotting: Previous Work

Corpus: Washington’s manuscripts

MAP Performance
• 40.98% (2372 good quality images)
• 16.5% (3262 poor quality images)

Query: Both Image and Text

Script specific
Upper/lower profile structural features

Observation density: $\Pr(wrd, fv)$

Posterior word recognition probability: $\Pr(wrd \mid fv) = \frac{\Pr(wrd, fv)}{\sum_{wrd} \Pr(wrd, fv)}$

[Rath et al, CVPR 2003]

Page 48:

Keyword Spotting: Previous Work

Matching in feature space
• Matching GSC features of two word images: 512 bits
• Sensitive to noise and character segmentation

Corpus: 9312 word images (3104 for queries and 6208 for tests) from 776 individuals, 4 words

R Precision
GSC: 45.5%, 56.59%, 54.11%, 62.04%
DTW: 35.53%, 38.65%, 44.39%, 55.23%

1024-bit GSC feature

[Srihari, et al, SPIE 2004]

Page 49:

Keyword Spotting: Previous Work

Template Free Word Spotting
• Matching Gabor features of two word images
• Posterior probability estimate from SVM OCR

Corpus: 12 medical forms with 5295 character images. 101 samples of 6 keywords

MAP Performance: 67.1% compared to 12.6% by DTW

[Figure: word image split into character feature vectors V1 V2 V3 V4]

$V_w = [V_1^T \; V_2^T \; V_3^T \; V_4^T]^T$

Probabilistic Similarity

$C_P(w, V_w) = -\frac{1}{n} \sum_{i=1}^{n} \ln(P(c_i \mid v_i))$

[Cao & Govindaraju, ICAPR 2007]
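The probabilistic similarity on this slide averages the negative log posterior of each keyword character given its feature vector. The sketch below assumes the per-character posteriors are already available as dictionaries (for example from a character classifier) and uses hypothetical values; a small floor avoids taking the logarithm of zero.

```python
import math


def probabilistic_cost(keyword, posteriors, floor=1e-12):
    """C_P = -(1/n) * sum of ln P(c_i | v_i) over the keyword's characters.

    posteriors[i] maps characters to P(character | v_i) for the i-th
    character image of the candidate word. Lower cost means a better match.
    """
    if len(keyword) != len(posteriors):
        raise ValueError("keyword and word image differ in character count")
    log_sum = sum(math.log(max(p.get(c, 0.0), floor))
                  for c, p in zip(keyword, posteriors))
    return -log_sum / len(keyword)


# Hypothetical classifier posteriors for a four-character word image.
POSTERIORS = [
    {"p": 0.9, "r": 0.1},
    {"a": 0.8, "o": 0.2},
    {"i": 0.5, "l": 0.5},
    {"n": 0.9, "m": 0.1},
]
print(probabilistic_cost("pain", POSTERIORS))
print(probabilistic_cost("roll", POSTERIORS))
```

Ranking word images by this cost for a typed query needs no image template of the keyword, only character posteriors.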

Page 50:

Probabilistic Indexing

Word gap probability ($\sigma_k$) between components $c_k$ and $c_{k+1}$

Candidate word: $w = c_i, c_{i+1}, \dots, c_j$

Word-query similarity: $sim(w, q) = \sigma_{i-1} \times (1 - \sigma_i) \times \dots \times (1 - \sigma_{j-1}) \times \sigma_j \cdot \Pr(q \mid w)$

[Figure: word w made of components c1 c2 c3 c4, separated by gaps σ0 σ1 σ2 σ3 σ4]

[Cao, Bharadwaj & Govindaraju, PR 2009]
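The segmentation part of this similarity can be sketched directly from the gap probabilities: a candidate word spanning components c_i to c_j needs word gaps on both sides and no word gap inside. The function below follows that reading, which is an interpretation of the slide; the gap probabilities and the recognition probability are hypothetical.

```python
def word_probability(sigmas, i, j):
    """Probability that components c_i..c_j form exactly one word.

    sigmas[k] is the word-gap probability between c_k and c_{k+1};
    sigmas[0] is the gap before c_1.
    """
    probability = sigmas[i - 1] * sigmas[j]
    for k in range(i, j):
        probability *= 1.0 - sigmas[k]  # internal gaps must not be word gaps
    return probability


def word_query_similarity(sigmas, i, j, recognition_probability):
    """Segmentation probability times the recognition probability for the query."""
    return word_probability(sigmas, i, j) * recognition_probability


# Hypothetical gap probabilities sigma_0..sigma_4 for components c1..c4.
SIGMAS = [1.0, 0.1, 0.8, 0.2, 1.0]
print(word_probability(SIGMAS, 1, 2))
print(word_query_similarity(SIGMAS, 1, 2, 0.5))
```

Indexing every span of adjacent components this way keeps alternative segmentations available at query time instead of committing to a single one.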

Page 51:

Summary

• Handwriting Recognition remains a challenging task despite success in postal applications

• Need for improved search technologies to access handwritten documents on the web

• Statistical topic models can help in document categorization and lexicon reduction

• Document indexing can be performed by MMSE modeling that integrates segmentation, language models, and recognition

• Word spotting can be performed by indexing image-level features and OCR results

Page 52:

Thank You