2. Architecture and Data Structures A quick tour of the Tesseract Code Ray Smith, Google Inc.
Transcript
Page 1: 2 architecture anddatastructures

2. Architecture and Data Structures
A quick tour of the Tesseract Code

Ray Smith, Google Inc.

Page 2: 2 architecture anddatastructures

Tesseract Tutorial: DAS 2014 Tours France

A Note about the Coordinate System
● The pixel edges are aligned with integer coordinates.
● (0, 0) is at bottom-left.
● Width = right - left => no silly +1/-1.

Note: The API exposes a more common top-down system.

[Figure: pixel grid with axis ticks 0, 1, 2]
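
The two conventions differ only in the direction of the y axis, so converting between them needs just the image height. The following is a minimal sketch under stated assumptions: `Box`, `TopDownBox` and `ToTopDown` are hypothetical names for illustration, not Tesseract's `TBOX` or any API call.

```cpp
#include <cassert>

// Hypothetical box type for illustration; not Tesseract's TBOX.
// Coordinates name pixel edges, so a box covering one pixel has width 1.
struct Box {
  int left;
  int bottom;
  int right;
  int top;
  int width() const { return right - left; }   // no +1 needed
  int height() const { return top - bottom; }  // bottom-up: top > bottom
};

// The same box in a top-down system, where y grows downwards.
struct TopDownBox {
  int left;
  int top;
  int right;
  int bottom;
  int width() const { return right - left; }
  int height() const { return bottom - top; }  // top-down: bottom > top
};

// Flips the y axis: an edge at height y above the bottom of the image lies
// image_height - y below its top.
TopDownBox ToTopDown(const Box& box, int image_height) {
  assert(image_height >= box.top);
  return TopDownBox{box.left, image_height - box.top,
                    box.right, image_height - box.bottom};
}
```

For a 100-pixel-high image, a box with bottom 20 and top 50 maps to top 50 and bottom 80; its width and height are unchanged.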

Page 3: 2 architecture anddatastructures

Nominally a pipeline, but not really, as there is a lot of re-visiting of old decisions.

Tesseract System Architecture

Page 4: 2 architecture anddatastructures

Tesseract Word Recognizer

Page 5: 2 architecture anddatastructures

The ‘C’ Legacy

● Large chunks of the code were originally written in C.
● Major rewrite in ~1991 with new C++ code.
● C->C++ migration has been gradual since then.
● The majority of global functions now live in a convenience directory-structure class. (For thread compatibility purposes.)
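
Why wrapping former globals in a class helps with threads can be sketched as follows. Every name here is hypothetical and not taken from the Tesseract source: state that used to be a global variable becomes a data member, and the former global function becomes a member function, so each instance carries its own state.

```cpp
#include <cassert>

// Hypothetical example; the names are illustrative, not Tesseract code.

// C-style: one global shared by every caller (and every thread).
static int g_blob_count = 0;
void CountBlobGlobal() { ++g_blob_count; }

// C++-style: the same state and function wrapped in a class, so each
// instance (for example, one per thread) has an independent counter.
class BlobCounter {
 public:
  void CountBlob() { ++blob_count_; }
  int blob_count() const { return blob_count_; }

 private:
  int blob_count_ = 0;
};

// Two instances keep separate counts, which a single global cannot do.
void CheckIndependentCounters() {
  BlobCounter first, second;
  first.CountBlob();
  first.CountBlob();
  second.CountBlob();
  assert(first.blob_count() == 2);
  assert(second.blob_count() == 1);
}
```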

Page 6: 2 architecture anddatastructures

Directory Structure ~ Functional Architecture

Directory   Class
api         TessBaseAPI
ccmain      Tesseract
textord     Textord
wordrec     Wordrec
classify    Classify
dict        Dict
ccstruct    CCStruct
cutil       CUtil
ccutil      CCUtil
cube        (no class shown)

Page 7: 2 architecture anddatastructures

Key Data Structures = Page Hierarchy

Core page outlines: BLOCK, ROW, WERD, C_BLOB, C_OUTLINE

Results: PAGE_RES, BLOCK_RES, ROW_RES, WERD_RES, WERD_CHOICE, BLOB_CHOICE

Layout (old): TO_BLOCK, TO_ROW, BLOBNBOX

Layout: ColPartition, WorkingPartSet

Normalized outlines: TWERD, TBLOB, TESSLINE, EDGEPT, TPOINT

Page 8: 2 architecture anddatastructures

Software Engineering - Building Blocks

Containers: GenericVector, ELIST, CLIST

Coordinates: ICOORD, FCOORD, TBOX

Text: STRING, UNICHARSET

Page 9: 2 architecture anddatastructures

Key Parts of the Call Hierarchy

TessBaseAPI::Recognize
    Tesseract::SegmentPage
        Tesseract::AutoPageSeg
        Textord::TextordPage
    Tesseract::recog_all_words
        Tesseract::classify_word_and_language
            Tesseract::chop_word_main
                Wordrec::SegSearch
                    Classify::AdaptiveClassifier
                    LanguageModel::UpdateState

Page 10: 2 architecture anddatastructures

Tesseract’s List Implementation

● Predates STL
● Allows control over ownership of list elements
● Uses nasty macros instead of templates
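
How a macro can stand in for a template is easiest to see in a toy version. The sketch below uses hypothetical names (`TOY_LINK`, `TOY_LISTIZE`, `ToyBlob`) and is not Tesseract's `ELISTIZE` implementation; it only illustrates the general idea of elements that embed their own link and a macro that stamps out a typed list class with explicit ownership rules.

```cpp
#include <cassert>

// Elements embed their own link (an "intrusive" list), so the list
// allocates no nodes of its own.
struct TOY_LINK {
  TOY_LINK* next = nullptr;
};

// Stamps out CLASSNAME##_LIST, a typed singly linked list of CLASSNAME.
// CLASSNAME must derive from TOY_LINK. The list owns its elements and
// deletes any that are still in it when it is destroyed.
#define TOY_LISTIZE(CLASSNAME)                                     \
  class CLASSNAME##_LIST {                                         \
   public:                                                         \
    CLASSNAME##_LIST() = default;                                  \
    CLASSNAME##_LIST(const CLASSNAME##_LIST&) = delete;            \
    CLASSNAME##_LIST& operator=(const CLASSNAME##_LIST&) = delete; \
    ~CLASSNAME##_LIST() {                                          \
      while (head_ != nullptr) delete pop_front();                 \
    }                                                              \
    /* Takes ownership of item. */                                 \
    void push_front(CLASSNAME* item) {                             \
      assert(item != nullptr);                                     \
      item->next = head_;                                          \
      head_ = item;                                                \
    }                                                              \
    /* Unlinks the first element and hands ownership back. */      \
    CLASSNAME* pop_front() {                                       \
      CLASSNAME* item = static_cast<CLASSNAME*>(head_);            \
      if (item != nullptr) {                                       \
        head_ = item->next;                                        \
        item->next = nullptr;                                      \
      }                                                            \
      return item;                                                 \
    }                                                              \
    bool empty() const { return head_ == nullptr; }                \
                                                                   \
   private:                                                        \
    TOY_LINK* head_ = nullptr;                                     \
  };

// Hypothetical element type; the macro call defines ToyBlob_LIST.
struct ToyBlob : public TOY_LINK {
  explicit ToyBlob(int h) : height(h) {}
  int height;
};
TOY_LISTIZE(ToyBlob)
```

The ownership point is visible in `pop_front`: an element taken out of the list belongs to the caller again, while anything left in the list is deleted with it.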

Page 11: 2 architecture anddatastructures

List Example

tordmain.cpp:
float Textord::filter_noise_blobs(
    BLOBNBOX_LIST *src_list,      // original list
    BLOBNBOX_LIST *noise_list,    // noise list
    BLOBNBOX_LIST *small_list) {  // small blobs
  BLOBNBOX_IT src_it(src_list);   // iterators
  BLOBNBOX_IT noise_it(noise_list);
  BLOBNBOX_IT small_it(small_list);
  for (src_it.mark_cycle_pt(); !src_it.cycled_list(); src_it.forward()) {
    BLOBNBOX *blob = src_it.data();
    // extract() unlinks the current blob so it can be added to another list.
    if (blob->bounding_box().height() < textord_max_noise_size)
      noise_it.add_after_then_move(src_it.extract());
    else if (blob->enclosed_area() >=
             blob->bounding_box().area() * textord_noise_area_ratio)
      small_it.add_after_then_move(src_it.extract());
  }
  // ...
}

blobbox.h:
class BLOBNBOX : public ELIST_LINK {…};
// Defines classes:
// BLOBNBOX_LIST: a list of BLOBNBOX
// BLOBNBOX_IT: list iterator
ELISTIZEH(BLOBNBOX)

blobbox.cpp:
// Implementation of some of the
// list functions.
ELISTIZE(BLOBNBOX)
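
For readers more familiar with the STL, the extract-and-append pattern in `filter_noise_blobs` resembles moving nodes between lists with `std::list::splice`. The sketch below is an analogy only, not how Tesseract does it: `ToyBlob` and `FilterNoise` are hypothetical, and the thresholds are passed as plain parameters rather than read from the `textord_` variables.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Hypothetical stand-in for BLOBNBOX with only the fields used here.
struct ToyBlob {
  int height;
  int box_area;
  int enclosed_area;
};

// Moves blobs out of src into noise_blobs or small_blobs without copying.
// Blobs that match neither test stay in src.
void FilterNoise(std::list<ToyBlob>& src, std::list<ToyBlob>& noise_blobs,
                 std::list<ToyBlob>& small_blobs, int max_noise_size,
                 double noise_area_ratio) {
  assert(max_noise_size >= 0);
  for (auto it = src.begin(); it != src.end();) {
    auto next = std::next(it);  // splice unlinks *it from src, so step first
    if (it->height < max_noise_size) {
      noise_blobs.splice(noise_blobs.end(), src, it);
    } else if (it->enclosed_area >= it->box_area * noise_area_ratio) {
      small_blobs.splice(small_blobs.end(), src, it);
    }
    it = next;
  }
}
```

As with `extract()` followed by `add_after_then_move()`, the element itself is relinked rather than copied, so pointers to it stay valid.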

Page 12: 2 architecture anddatastructures

TessBaseAPI : Simple example

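Main API class provides initialization, image input, text/hOCR/PDF output:

#include <cstdio>
#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

int main() {
  tesseract::TessBaseAPI api;
  api.Init(NULL, "eng");                // NULL datapath, English language
  Pix* pix = pixRead("phototest.tif");  // load the image with Leptonica
  api.SetImage(pix);
  char* text = api.GetUTF8Text();       // runs recognition; caller owns text
  printf("%s\n", text);
  delete [] text;
  pixDestroy(&pix);
  return 0;
}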

Page 13: 2 architecture anddatastructures

TessBaseAPI : Multipage example

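tesseract::TessBaseAPI api;
api.Init(NULL, "eng");
tesseract::TessResultRenderer* renderer =
    new tesseract::TessPDFRenderer(api.GetDatapath());
// filename and fout (an open FILE*) are assumed to be set up by the caller.
api.ProcessPages(filename, NULL, 0, renderer);
const char* data;
inT32 data_len;
if (renderer->GetOutput(&data, &data_len)) {
  fwrite(data, 1, data_len, fout);
  fclose(fout);
}
delete renderer;  // the caller still owns the renderer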

Page 14: 2 architecture anddatastructures

ResultIterator for getting the real details

ResultIterator* it = api.GetIterator();
do {
  int left, top, right, bottom;
  if (it->BoundingBox(RIL_WORD, &left, &top, &right, &bottom)) {
    char* text = it->GetUTF8Text(RIL_WORD);
    printf("%s %d %d %d %d\n", text, left, top, right, bottom);
    delete [] text;
  }
} while (it->Next(RIL_WORD));
delete it;

Page 15: 2 architecture anddatastructures

Thanks for Listening!

Questions?