Duplicate detection for quality assurance of document image collections
Reinhold Huber-Mörk (1), Alexander Schindler (1,2), Sven Schlarb (3)
(1) Research Area Intelligent Vision Systems, Department Safety & Security, AIT Austrian Institute of Technology
(2) Department of Software Technology and Interactive Systems, Vienna University of Technology
(3) Department for Research and Development, Austrian National Library
Reinhold Huber-Mörk, Austrian Institute of Technology, presented a method for quality assurance of scanned content based on computer vision at iPres 2012, Toronto. In: iPRES 2012 – Proceedings of the 9th International Conference on Preservation of Digital Objects. Toronto 2012, 136-143. ISBN 978-0-9917997-0-1
Overview
Digital preservation & quality assurance
Digital image preservation workflows
Image duplicate detection
Keypoints and feature descriptors in Computer Vision
Bag of visual words
Results on a real-world data set
2 22.11.2012
SCAPE project and quality assurance
SCAlable Preservation Environments, EU FP7
Preservation Components:
- improve and extend existing tools,
- develop new ones where necessary,
- apply proven approaches like image and pattern analysis to the problem of ensuring quality in digital preservation
Quality assurance in image preservation
Comparison of image content
- automatic image processing workflows (e.g. format conversion)
- reacquisition of images
Duplicate detection
- within a single collection (filtering)
- between collections (merging, comparison)
Solutions:
- page segmentation + OCR
- feature based approaches
Book scan sequence with duplicates
Duplicate detection workflow
Keypoint detection and description (1)
Keypoints are detected at salient image regions
A keypoint is described by a descriptor (= vector of features)
Invariance w.r.t. rotation, scaling or translation
Keypoint detection and description (3)
All detections (ordered by scale)
Duplicate detection workflow
Bag of visual words (1)
Bag of words model in text information retrieval:
Document 1: “Peter likes to read books. Paul likes too”.
Document 2: “Peter also likes to read poems”
Bag: [ Peter, likes, to, read, books, Paul, too, also, poems ]
Histogram 1: [ 1, 2, 1, 1, 1, 1, 1, 0, 0 ]
Histogram 2: [ 1, 1, 1, 1, 0, 0, 0, 1, 1 ]
Visual analogy: bag of visual words or bag of features
Document | Image
Document made of words | Image made of descriptors
Bag of words | Bag of clustered descriptors = visual words
Word occurrence histogram | Visual word histogram / “fingerprint”
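The text example on this slide can be reproduced in a few lines of Python. This is only a minimal sketch: the function names (tokenize, build_bag, histogram) are illustrative, and it assumes words are compared case-sensitively with punctuation dropped.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)

def build_bag(documents):
    """Collect the distinct words of all documents in order of first appearance."""
    bag = []
    for doc in documents:
        for word in tokenize(doc):
            if word not in bag:
                bag.append(word)
    return bag

def histogram(document, bag):
    """Count how often each word of the bag occurs in one document."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in bag]

docs = [
    "Peter likes to read books. Paul likes too",
    "Peter also likes to read poems",
]
bag = build_bag(docs)
hist1 = histogram(docs[0], bag)
hist2 = histogram(docs[1], bag)
print(bag)
print(hist1)
print(hist2)
```

The two printed histograms match Histogram 1 and Histogram 2 from the slide. In the visual analogy, the words are replaced by clustered descriptors.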
Bag of visual words (2)
Example visual words: #104, #15, #221, #312, #424, #250
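How descriptors turn into a visual word histogram can be sketched as follows. The sketch assumes a vocabulary of cluster centers has already been computed (for instance by k-means, which is not shown), and it uses toy 2-D vectors where real descriptors have many more dimensions. The names nearest_word and visual_word_histogram are illustrative, not taken from the presentation.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Index of the cluster center closest to the descriptor (Euclidean distance)."""
    distances = [math.dist(descriptor, center) for center in vocabulary]
    return distances.index(min(distances))

def visual_word_histogram(descriptors, vocabulary):
    """Count how many descriptors of one image fall onto each visual word."""
    hist = [0] * len(vocabulary)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, vocabulary)] += 1
    return hist

# Assumed toy vocabulary of three visual words (cluster centers).
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# Assumed toy descriptors extracted from one image.
descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8), (0.2, 0.1)]

fingerprint = visual_word_histogram(descriptors, vocabulary)
print(fingerprint)
```

Each image is thereby reduced to one fixed-length count vector, regardless of how many keypoints were detected in it.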
Bag of visual words (3)
Duplicate detection workflow
Image comparison / duplicate detection schemes
Comparison of visual histograms – tf (“term frequency”) score
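The slide does not spell out how the tf score is computed, so the following is only one plausible reading: term frequencies are taken as counts divided by the total number of visual words in the image, and two images are compared by histogram intersection. Both choices, and the function names, are assumptions made for illustration.

```python
def term_frequencies(hist):
    """Normalize a visual word histogram so that its entries sum to 1."""
    total = sum(hist)
    if total == 0:
        return [0.0] * len(hist)
    return [count / total for count in hist]

def tf_similarity(hist_a, hist_b):
    """Histogram intersection of two tf vectors: 1.0 for identical proportions."""
    tf_a = term_frequencies(hist_a)
    tf_b = term_frequencies(hist_b)
    return sum(min(a, b) for a, b in zip(tf_a, tf_b))

# Assumed toy histograms over a vocabulary of four visual words.
page_a = [4, 0, 2, 1]
page_b = [8, 0, 4, 2]  # same proportions as page_a, twice as many keypoints
page_c = [0, 5, 0, 1]  # mostly different visual words

print(tf_similarity(page_a, page_b))
print(tf_similarity(page_a, page_c))
```

Normalizing by the total makes the score insensitive to the number of keypoints per image, so a rescanned page with more detections can still score close to 1.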
Bag of visual words maintains no (or limited) spatial information
Spatial verification:
1. Ranking of most similar images in a shortlist
2. Direct matching of descriptors for pairs of images
3. Overlaying of images
4. Estimation of similarity
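Steps 1 and 2 of the spatial verification could look roughly like this. It is a sketch under stated assumptions: the similarity measure (histogram intersection of normalized histograms), the nearest-neighbour matching with a distance threshold, the page identifiers, and all function names are illustrative and not confirmed by the slides. Steps 3 and 4 are not covered here.

```python
import math

def similarity(h1, h2):
    """Histogram intersection of normalized histograms (an assumed measure)."""
    s1, s2 = sum(h1), sum(h2)
    if s1 == 0 or s2 == 0:
        return 0.0
    return sum(min(a / s1, b / s2) for a, b in zip(h1, h2))

def shortlist(query_id, histograms, k):
    """Step 1: the k images most similar to the query, best first."""
    scores = [(similarity(histograms[query_id], h), other)
              for other, h in histograms.items() if other != query_id]
    scores.sort(key=lambda s: s[0], reverse=True)
    return [other for _, other in scores[:k]]

def match_descriptors(desc_a, desc_b, max_distance):
    """Step 2: pair each descriptor of image A with its nearest neighbour in B."""
    if not desc_b:
        return []
    matches = []
    for i, da in enumerate(desc_a):
        dists = [math.dist(da, db) for db in desc_b]
        j = dists.index(min(dists))
        if dists[j] <= max_distance:  # reject pairs that are too far apart
            matches.append((i, j))
    return matches

# Assumed toy collection: visual word histograms keyed by a hypothetical page id.
histograms = {
    "page_001": [4, 0, 2, 1],
    "page_002": [0, 5, 0, 1],
    "page_003": [8, 0, 4, 3],
    "page_004": [1, 1, 1, 1],
}
candidates = shortlist("page_001", histograms, k=2)

# Assumed toy 2-D descriptors for the query and one candidate.
desc_a = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
desc_b = [(1.1, 0.9), (0.1, 0.0)]
matches = match_descriptors(desc_a, desc_b, max_distance=0.5)
print(candidates, matches)
```

Restricting the expensive descriptor matching to a short list keeps the number of pairwise comparisons small compared with matching every pair in the collection.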
Spatial verification (2)
- Pair of possible duplicates
- Descriptor matching
- Estimation of affine transformation
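The affine transformation between a pair of candidate duplicates can be estimated from the matched keypoint positions. The slides do not say which estimator is used. The sketch below assumes a plain least-squares fit via the normal equations, without the outlier rejection a practical system would likely add. The function names are illustrative.

```python
def solve_3x3(m, v):
    """Solve m * p = v by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("degenerate point set (collinear or too few points)")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    p = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(a[r][c] * p[c] for c in range(r + 1, n))
        p[r] = (a[r][n] - s) / a[r][r]
    return p

def estimate_affine(src, dst):
    """Least-squares affine map with x' = a*x + b*y + tx and y' = c*x + d*y + ty.

    Returns ((a, b, tx), (c, d, ty)).
    """
    rows = [(x, y, 1.0) for x, y in src]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atx = [sum(r[i] * q[0] for r, q in zip(rows, dst)) for i in range(3)]
    aty = [sum(r[i] * q[1] for r, q in zip(rows, dst)) for i in range(3)]
    return tuple(solve_3x3(ata, atx)), tuple(solve_3x3(ata, aty))

def apply_affine(params, point):
    """Map one point through the estimated transformation."""
    (a, b, tx), (c, d, ty) = params
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)

# Assumed toy matches: the second image is scaled by 2 and shifted by (3, -1).
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
dst = [(3.0, -1.0), (5.0, -1.0), (3.0, 1.0), (5.0, 1.0)]
params = estimate_affine(src, dst)
print(params)
print(apply_affine(params, (2.0, 2.0)))
```

With the transformation in hand, one image can be mapped onto the other so that the two can be overlaid and compared.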