Real-time Document Localization in Natural Images by Recursive Application of a CNN
Khurram Javed∗, Faisal Shafait†
School of Electrical Engineering and Computer Science
Abstract—Seamless integration of information from digital and paper documents is crucial for efficient knowledge management. One convenient way to achieve this is to digitize a document from a natural image, which requires precise localization of the document in the image. Several methods have been proposed to solve this problem, but they rely on traditional image processing techniques that are not robust to extreme viewpoint and background variations. Deep Convolutional Neural Networks (CNNs), on the other hand, have proven extremely robust to variations in background and viewpoint in object detection and classification tasks. Inspired by their robustness and generality, we propose a novel CNN-based method to accurately localize documents in real time.
We model localization as a key-point detection problem: the four corners of the document are jointly predicted by a deep convolutional neural network. We then refine the prediction using a novel recursive application of a CNN. Performance of the system is evaluated on the ICDAR 2015 SmartDoc Competition 1 dataset. The results are comparable to the state of the art on simple backgrounds and improve the state of the art on the complex background from the previous 86% to 94%. Code, dataset, and models are available at: https://github.com/KhurramJaved96/Recursive-CNNs.
Index Terms—Document capture, real-time, document detection, machine learning, CNN.
I. INTRODUCTION
Documents exist in both paper and digital form in our everyday life. Paper documents are easier to carry, read, and share, whereas digital documents are easier to search, index, and store. For efficient knowledge management, it is often necessary to convert a paper document to digital format. In the past, this has been achieved by scanning the document with a flat-bed scanner, followed by the application of OCR and document analysis techniques to extract the content from the scanned image. Scanners, however, are slow, costly, and not portable. Smart-phones, on the other hand, have become extremely accessible in the past decade and can take high-resolution images. Consequently, there has been a recent trend to use smart-phones to digitize paper documents.
A natural image of a document taken by a smart-phone cannot be directly digitized using the methods developed for scanned images. This is because hand-held camera images pose challenges such as the presence of background texture or objects, perspective distortions, variations in light and illumination, and motion blur [1].
Document segmentation from the background is one of the major challenges of digitization from a natural image. An image of a document often contains the surface on which the document was placed to take the picture, and it can also contain other non-document objects in the frame. Precise localization of the document in such an image, often called document segmentation, is an essential first step of digitization. Some of the challenges involved in localization are variations in document and background types, perspective distortions, shadows, and other objects in the image. An ideal method should be robust to these challenges and should also be able to run on a smart-phone in reasonable time.
We present a data-driven method for document segmentation that satisfies all the requirements of a practical system. More precisely, we propose a method that uses deep convolutional neural networks, combined with a refinement algorithm, to localize documents on a variety of backgrounds. Our method outperforms the state of the art in the literature on the complex background and can run in real time on a smart-phone. Furthermore, it is more general than previous methods in the sense that it can be made robust to further challenges simply by training on more representative data. Traditional image processing methods, on the other hand, have at least some intrinsic limitations because of the assumptions they make about the environment.
II. LITERATURE REVIEW
One of the earliest schemes for document localization relied on a model of the background for segmentation [2]. The background was modeled by taking a picture of the background without the document; the document was then placed in the environment and another picture was taken. The difference between the two images was used to identify the location of the document. This method had the obvious drawbacks that the camera had to be kept stationary and two pictures had to be taken.
Different methodologies have been proposed since then for this task, which can be broadly classified as camera calibration based systems [4], [5], document content analysis based systems [6], [7], [8], [9], [10], and geometric analysis based systems [13], [14], [15], [16].
Camera calibration based systems utilize information about the position and angle of the camera relative to the expected document […]
This setup has two limitations: first, it cannot tell us how well our system generalizes to unseen backgrounds, since all backgrounds were used for training; second, it cannot tell us how well our system generalizes to new documents on similar backgrounds.
Experiment 2: To address the first limitation and quantify generalization to unseen backgrounds, we cross-validate our system by removing each background in turn from the training set and testing on the removed background.
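To make this protocol concrete, the following is a minimal sketch of a leave-one-background-out split. The record structure and the training and evaluation routines named in the comments are assumptions made for illustration, not the paper's actual code.

```python
# Minimal sketch of the leave-one-background-out protocol (Experiment 2).
# The record structure (image, corners, background_id) is an assumption
# made for illustration; it is not the paper's actual data format.

def leave_one_background_out(records, held_out_bg):
    """Split records so one background is never seen during training."""
    train = [r for r in records if r["background_id"] != held_out_bg]
    test = [r for r in records if r["background_id"] == held_out_bg]
    return train, test

# Example: cross-validate over all five SmartDoc backgrounds.
# for bg in range(1, 6):
#     train, test = leave_one_background_out(records, held_out_bg=bg)
#     model = train_model(train)      # hypothetical training routine
#     score = evaluate(model, test)   # hypothetical evaluation routine
```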
Experiment 3: To address the second limitation and quantify generalization to unseen documents, we test our system on the 'Tax' document alone. All instances of this particular document were excluded from the train, validation, and test sets.¹
B. Evaluation Protocol
We evaluate the performance of the system using the method described in the SmartDoc competition report [11], [24]. We first remove the perspective transform of the ground-truth quadrilateral (G) and the predicted quadrilateral (S) using the document size. Let the corrected coordinates be G′ and S′. Then, for each frame, we compute the Jaccard index (JI) as follows:

JI = \frac{\mathrm{area}(G' \cap S')}{\mathrm{area}(G' \cup S')}

The overall score is the average of the JI over all frames.
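As a rough illustration of this metric, the sketch below removes the perspective transform with OpenCV and computes the Jaccard index with shapely. The corner ordering and the A4 document size are assumptions made for the example; this is not the competition's official scoring code.

```python
import cv2
import numpy as np
from shapely.geometry import Polygon

def jaccard_index(gt_quad, pred_quad, doc_size=(210, 297)):
    """Score one frame, roughly as in the SmartDoc protocol.

    gt_quad, pred_quad: 4x2 arrays of corner coordinates, both assumed
    ordered top-left, top-right, bottom-right, bottom-left.
    doc_size: document size used to remove the perspective transform;
    A4 in millimetres is an assumption for this example.
    """
    w, h = doc_size
    target = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    # Undo the perspective transform defined by the ground-truth corners.
    M = cv2.getPerspectiveTransform(np.float32(gt_quad), target)
    g = cv2.perspectiveTransform(np.float32(gt_quad).reshape(-1, 1, 2), M)
    s = cv2.perspectiveTransform(np.float32(pred_quad).reshape(-1, 1, 2), M)
    G, S = Polygon(g.reshape(-1, 2)), Polygon(s.reshape(-1, 2))
    # Jaccard index of the two corrected quadrilaterals.
    return G.intersection(S).area / G.union(S).area
```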
C. Results
The results of Experiments 2 and 3 are shown in Fig. 8. It can be seen that our system generalizes to unseen document types perfectly. This is not surprising, since the low resolution of the input image makes it impossible for our model to rely on features of the document's layout or content (Fig. 9). From the results we can also infer that our system generalizes well to unseen simple backgrounds: even when backgrounds 1, 2, 3, and 4 were each removed from training, the accuracy on them dropped only slightly. Generalization to the more difficult background 5, on the other hand, is a different story.
Accuracy on background 5 dropped from 0.94 to 0.66 when all samples of background 5 were removed from training.

¹We did not feel the need to cross-validate on all document types because generalization to unseen documents was expected, given the low resolution of the input image, as shown in Fig. 9.
Fig. 8: Results of Experiments 2 and 3 (accuracy per background). The system shows good generalization to unseen simple backgrounds and to an unseen document type.

                     Bg 1   Bg 2   Bg 3   Bg 4   Bg 5
  Test set           0.98   0.97   0.98   0.97   0.95
  Unseen Document    0.98   0.98   0.98   0.97   0.95
  Unseen Background  0.98   0.97   0.97   0.96   0.67
This significant drop can be explained by the fact that, apart from the background 5 images, almost all images in our training set contain only a single object (the document) on a surface; there were only 10 exceptions. Background 5, on the other hand, has everyday objects placed on top of and near the document. This dissimilarity between the train and test sets could be the reason for the poor performance. We suspect that simply adding samples which contain arbitrary everyday objects would give much better generalization. We leave this idea untested for now.
D. Performance
The system is tested on an Intel i5-4200U 1.6 GHz CPU with 8 GB RAM on 1920 × 1080 images. Total time depends on the hyper-parameter Retain Factor (RF). The time taken with different values of RF is shown in Table II.
It should be noted that the system is not implemented with efficiency in mind. In the current implementation, the four corner images are processed sequentially. It is possible to process them simultaneously by passing them through the model as a batch of four images, which should give a noticeable performance boost.
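A minimal sketch of this batching idea follows; the model object and its predict() interface are assumptions made for illustration, not the released implementation.

```python
import numpy as np

# Hypothetical sketch: run the corner-refinement CNN on all four corner
# crops at once instead of four sequential forward passes. The model
# object and its predict() signature are assumptions for illustration.

def refine_corners_batched(model, corner_crops):
    """corner_crops: list of four HxWx3 uint8 patches (assumed 32x32)."""
    batch = np.stack(corner_crops, axis=0).astype(np.float32) / 255.0
    # One forward pass over a (4, 32, 32, 3) tensor instead of four
    # passes over (1, 32, 32, 3) tensors.
    return model.predict(batch)  # -> four (x, y) corner estimates
```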
Fig. 9: Two different documents on a similar background look identical. The low resolution prevents the model from relying on the content or structure of the document, allowing it to generalize to unseen documents.
TABLE II: Relationship between time, Retain Factor, and accuracy. It is possible to get massive performance gains at […]
Lastly, it is easy to see that, given N, the number of pixels in the image, the growth rate of the corner refinement algorithm is Θ(√N). This is because both the length and the width of the image are cut by the RF in each iteration.
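To make the growth-rate argument concrete, here is a simplified sketch of the refinement loop for a single corner, under the assumption that each iteration crops a window around the current estimate and shrinks its sides by the retain factor. The predict_corner stub and the stopping size are hypothetical stand-ins for the shallow corner CNN and its input size, not the released code.

```python
def predict_corner(crop):
    # Hypothetical stand-in for the shallow corner CNN: returns a
    # corner estimate relative to the crop (here, the crop centre).
    return crop.shape[1] / 2, crop.shape[0] / 2

def refine_corner(image, x, y, retain_factor=0.85, min_size=32):
    """Recursively shrink the search window around the corner estimate."""
    rh, rw = image.shape[:2]
    while rh > min_size and rw > min_size:
        # Each iteration keeps a region retain_factor times the previous
        # sides, so the region shrinks geometrically.
        rh, rw = int(rh * retain_factor), int(rw * retain_factor)
        # Crop an (rh, rw) window centred on the current estimate,
        # clamped to the image borders.
        top = min(max(0, int(y - rh / 2)), image.shape[0] - rh)
        left = min(max(0, int(x - rw / 2)), image.shape[1] - rw)
        crop = image[top:top + rh, left:left + rw]
        cx, cy = predict_corner(crop)
        x, y = left + cx, top + cy  # back to full-image coordinates
    return x, y
```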
VI. CONCLUSION
In this paper we present a novel approach to localizing a document in a natural image. We model localization as a key-point detection problem. Using a deep convolutional network and recursive applications of a shallow convolutional network, we demonstrate that our method gives results comparable to the state of the art on simple backgrounds and improves the state of the art on the complex background from the previous 86% to 94%. We also demonstrate that our system generalizes well to new documents and to unseen simple backgrounds. Furthermore, we suspect that generalization to unseen complex backgrounds can be improved either by adding more complex images to the training set or by synthetically degrading the training images by adding patches of different colors. We believe the latter because, in a 32 × 32 downscaled complex image, objects lose their distinctive features and appear more like a patch of their most common color.
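As a sketch of this degradation idea, one could paste random solid-color rectangles into the training images; the patch counts and size ranges below are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np

def add_color_patches(image, num_patches=3, max_frac=0.2, rng=None):
    """Paste random solid-color rectangles into a copy of the image.

    A rough stand-in for clutter: in a 32x32 downscaled image, real
    objects also reduce to roughly uniform patches of color.
    """
    rng = rng or np.random.default_rng()
    out = image.copy()
    h, w = out.shape[:2]
    for _ in range(num_patches):
        # Patch sides between 4 px and a fraction of the image side.
        ph = rng.integers(4, max(5, int(h * max_frac)))
        pw = rng.integers(4, max(5, int(w * max_frac)))
        top = rng.integers(0, h - ph)
        left = rng.integers(0, w - pw)
        out[top:top + ph, left:left + pw] = rng.integers(0, 256, size=3)
    return out
```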
REFERENCES
[1] J. Liang, D. Doermann, et al., "Camera-based analysis of text and documents: a survey," International Journal on Document Analysis and Recognition, vol. 7, no. 2, pp. 84-104, 2005.
[2] H. Lampert, T. Braun, et al., "Oblivious document capture and real-time retrieval," Proc. Camera-Based Document Analysis and Recognition, pp. 79-86, 2005.
[3] F. Chen, S. Carter, et al., "SmartDCap: semi-automatic capture of higher quality document images from a smartphone," Proc. International Conference on Intelligent User Interfaces, ACM, 2013.
[4] E. Guillou, D. Meneveaux, et al., "Using vanishing points for camera calibration and coarse 3D reconstruction from a single image," The Visual Computer, vol. 16, no. 7, pp. 396-410, 2000.
[5] C. Kofler, D. Keysers, et al., "Gestural Interaction for an Automatic Document Capture System," 2007.
[6] P. Clark and M. Mirmehdi, "Rectifying perspective views of text in 3D scenes using vanishing points," Pattern Recognition, vol. 36, no. 11, pp. 2673-2686, 2003.
[7] S. Lu, M. Chen, et al., "Perspective rectification of document images using fuzzy set and morphological operations," Image and Vision Computing, vol. 23, no. 5, pp. 541-553, 2005.
[8] S. Lu and L. Tan, "The restoration of camera documents through image segmentation," International Workshop on Document Analysis Systems, Springer Berlin Heidelberg, 2006.
[9] L. Miao and S. Peng, "Perspective rectification of document images based on morphology," International Conference on Computational Intelligence and Security, vol. 2, IEEE, 2006.
[10] N. Stamatopoulos, B. Gatos, et al., "Automatic borders detection of camera document images," 2nd International Workshop on Camera-Based Document Analysis and Recognition, 2007.
[11] J.-C. Burie, J. Chazalon, et al., "ICDAR2015 competition on smartphone document capture and OCR (SmartDoc)," 13th International Conference on Document Analysis and Recognition, IEEE, 2015.
[12] L. R. S. Leal and B. L. D. Bezerra, "Smartphone camera document detection via Geodesic Object Proposals," 2016 IEEE Latin American Conference on Computational Intelligence (LA-CCI), IEEE, 2016.
[13] W. H. Kim, J. Hwang, et al., "Document capturing method with a camera using robust feature points detection," International Conference on Digital Image Computing: Techniques and Applications, IEEE, 2011.
[14] Z. Zhang, L. He, et al., "Whiteboard scanning and image enhancement," Digital Signal Processing, vol. 17, no. 2, pp. 414-432, 2007.
[15] Y. Fujii, K. Fujii, "Perspective rectification for mobile phone camera-based documents using a hybrid approach to vanishing point detection," 2nd International Workshop on Camera-Based Document Analysis and Recognition, 2007.
[16] X. C. Yin, J. Sun, et al., "A multi-stage strategy to perspective rectification for mobile phone camera-based document images," Ninth International Conference on Document Analysis and Recognition, vol. 2, IEEE, 2007.
[17] Y. Sun, X. Wang, et al., "Deep convolutional network cascade for facial point detection," Conference on Computer Vision and Pattern Recognition, 2013.
[18] A. Krizhevsky, I. Sutskever, et al., "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1097-1105, 2012.
[19] B. Hariharan, P. Arbeláez, et al., "Hypercolumns for object segmentation and fine-grained localization," Computer Vision and Pattern Recognition, 2015.
[20] A. Newell, K. Yang, et al., "Stacked hourglass networks for human pose estimation," European Conference on Computer Vision, Springer International Publishing, 2016.
[21] N. Srivastava, G. Hinton, et al., "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, pp. 1929-1958, 2014.
[22] M. Abadi, A. Agarwal, et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.
[23] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, abs/1412.6980, 2014.
[24] M. Everingham, L. Van Gool, et al., "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303-338, 2010.