Emancipating Digital Data: The Lincoln Digitization Project. Peter Bajcsy, PhD - Research Scientist, NCSA - Adjunct Assistant Professor, ECE & CS at UIUC - Associate Director, Center for Humanities, Social Sciences and Arts (CHASS), Illinois Informatics Institute (I3), UIUC. National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Overview of Lincoln Paper Design

Aug 31, 2014


Technology

pbajcsy

These slides were presented to the Illinois Program for Research in the Humanities at the University of Illinois at Urbana-Champaign on 02-27-2009.
Transcript
Page 1: Overview of Lincoln Paper Design

Emancipating Digital Data: The Lincoln Digitization Project

Peter Bajcsy, PhD
- Research Scientist, NCSA
- Adjunct Assistant Professor, ECE & CS at UIUC
- Associate Director, Center for Humanities, Social Sciences and Arts (CHASS), Illinois Informatics Institute (I3), UIUC

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

Page 2: Overview of Lincoln Paper Design

Outline
• Introduction to Lincoln project
• Emancipating Digital Data: The Lincoln Digitization Project
• From Large Volumes Of Scanned Lincoln Papers To Virtual Observatories
• Image Cropping
• Georeferencing and Re-Projections of Historical Maps
• System Architecture for Web-based Delivery of Information and Services
• Delivering Layered Information and Providing Services
• Summary

Page 3: Overview of Lincoln Paper Design

Acknowledgement
• Funding Agencies:
  • NASA, NARA, NSF, NIH, NAVY, DARPA, ONR, NCSA Industrial Partners, NCSA Internal, COM UIUC, State of Illinois, UIUC Provost, NCSA International Partners, Google Summer of Code
• Full Time Employees:
  • Peter Bajcsy, Rob Kooper, Michal Ondrejcek, Kenton McHenry, Jason Kastner and Luigi Marini
• Students:
  • Andrew Spencer, Hye Jung Na, Suk Kyu Lee, Rahul Malik, William McFadden, Chandra Ramachandran, Ben Raichel, Maryam Moslemi Naeini
• Collaborators on Lincoln Project:
  • Daniel Stowell & Stacy McDermott, Lincoln Library in Springfield, IL; Vernon Burton & Kevin Franklin, I-CHASS, UIUC; Melvin Casares and Jose Castro from Instituto Tecnológico de Costa Rica (ITCR); Piotr Wendykier and James Nagy from Emory University, Atlanta

Page 4: Overview of Lincoln Paper Design

FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES

INTRODUCTION

Imaginations unbound

Page 5: Overview of Lincoln Paper Design

Background

• The Papers of Abraham Lincoln is a research initiative with the ultimate goal of making all writings by America's 16th president available on-line.
• Complex workflow process from paper documents to on-line virtual observatories
• Multiple end users
  • Researchers
  • General public

Page 6: Overview of Lincoln Paper Design

Input: Paper Copies of Docs & Metadata

Page 7: Overview of Lincoln Paper Design

Output: Multi-dimensional Views

The Lincoln Log, A Chronology compiled by the Lincoln Sesquicentennial Commission: http://www.thelincolnlog.org/

DIMENSIONS
• Hyperlink to the Lincoln Log (temporal representation)
• Hyperlink to the Markers (spatial representation)
• Hyperlinks to the Image scans (document content)

Page 8: Overview of Lincoln Paper Design

Output: Hyperlinked Multi-media Views

• Audio (e.g., music of Lincoln’s time)
• Images and maps
• Video
• 3D objects (e.g., musical instruments)

IMAGES

SONGS

Page 9: Overview of Lincoln Paper Design

Output: Services to Search, Display and Transcribe Digital Data

The Lincoln Log, A Chronology compiled by the Lincoln Sesquicentennial Commission: http://www.thelincolnlog.org/

Google service

Transcription service

Search service

Page 10: Overview of Lincoln Paper Design

Output: On-line Virtual Observatory

• Digital Information Organization
  • Multi-dimensional views in time, space and document dimensions
  • Hyperlinked multi-media views including all existing n-dimensional data
• Computational Services to Operate on Digital Data
  • Search
  • Layered display with third party data
  • Transcription of documents
• Educational Services to Enable Learning
  • Simple demonstrations
  • Homework exercises
  • Support of forensic studies

Page 11: Overview of Lincoln Paper Design

From Input to Output: A Few Key Components
1. Cropping of scanned documents (algorithm, accuracy & robustness, scalability, computational resources).
2. Cleaning and parsing of metadata obtained from The Lincoln Log and The Papers of Abraham Lincoln in Springfield (Lat, Lng, places, ASCII characters, populating MySQL Database etc.)
3. Designing an underlying architecture of information storage and retrieval
4. Geo-referencing and re-projection of historical maps.
5. Building web-based interfaces and providing services (Programming against Google Maps API, Database Ajax/Javascript requests using PHP and MySQL). http://isda.ncsa.uiuc.edu/lpapers/index.html

Page 12: Overview of Lincoln Paper Design

FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES

IMAGE CROPPING


Page 13: Overview of Lincoln Paper Design

Image Cropping: Understanding Variability of Document Scans

• Background paper color and intensity
• Ink color and intensity
• Density of writing
• Color scale bar position

• Task: Automatically classify images for pre-processing and remove the Kodak color scale bar if needed.

Page 14: Overview of Lincoln Paper Design

Image Cropping Approach

Training

Classify Crop

Output

Page 15: Overview of Lincoln Paper Design

Humanities & High Performance Computing

• Assuming that the world is perfect ...
  • Image cropping: 300,000 files times 60 seconds per file = 5,000 hours = 208.3 days
  • Other operations such as file format conversions (TIFF->PDF), pyramid construction for web deployment
  • Storage requirements for original (100K-300K images ~ 45 Terabytes), cropped (?) and pyramid representation for fast retrieval over the Internet (?)
• Need to join forces and form interdisciplinary teams
  • The storage requirements and preservation – NCSA mass storage
  • The CPU requirements – parallel codes to utilize HPC
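The cropping estimate on this slide can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name `cropping_time`, the `workers` parameter and the 256-worker figure are assumptions, and it assumes perfect parallel scaling with no I/O or scheduling overhead.

```python
import math

def cropping_time(num_files, seconds_per_file, workers=1):
    """Return (hours, days) of wall-clock time.

    Assumes every file takes the same time and that the work splits
    perfectly across `workers` (an idealization, not a measurement).
    """
    # Each worker processes its share of files one after another.
    files_per_worker = math.ceil(num_files / workers)
    total_seconds = files_per_worker * seconds_per_file
    hours = total_seconds / 3600
    return hours, hours / 24

# Serial case from the slide: 300,000 files at 60 seconds each.
serial_hours, serial_days = cropping_time(300_000, 60)
print(f"serial: {serial_hours:.0f} hours = {serial_days:.1f} days")
# prints "serial: 5000 hours = 208.3 days"

# Hypothetical cluster allocation of 256 workers (number chosen for illustration).
parallel_hours, _ = cropping_time(300_000, 60, workers=256)
print(f"256 workers: about {parallel_hours:.1f} hours")
```

Under these idealized assumptions the job drops from months to under a day, which is the motivation for parallel codes on HPC resources.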

Page 16: Overview of Lincoln Paper Design

FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES

GEO-REFERENCING AND RE-PROJECTION OF HISTORICAL MAPS

Page 17: Overview of Lincoln Paper Design

Georeferencing Historical Maps

• Goal: to overlay historical maps on top of Google Maps
• Challenges: Geodetic information is not always available.
  • The geodetic coordinate system consists of a datum, a projection, an origin, a unit system and two axes.
• Target Projection: Google Maps uses WGS84, Mercator projection and a pixel unit system.
  • Most of the maps of the United States are in a conical projection (Lambert Conformal Conic or Albers Equal Area) or in the Mollweide pseudocylindrical projection.
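The target projection named on this slide (WGS84 coordinates, Mercator projection, pixel units) can be illustrated with the standard spherical Mercator formulas. This is a minimal sketch, not the project's code: the function name `latlng_to_pixel`, the 256-pixel tile size and the sample coordinates for Springfield, IL are assumptions made for illustration.

```python
import math

TILE_SIZE = 256  # assumed width of the world map in pixels at zoom level 0

def latlng_to_pixel(lat_deg, lng_deg, zoom):
    """Project WGS84 latitude/longitude to Mercator pixel coordinates.

    The origin is the top-left corner of the world map; x grows eastward
    and y grows southward. The world is TILE_SIZE * 2**zoom pixels wide.
    """
    scale = TILE_SIZE * (2 ** zoom)
    x = (lng_deg + 180.0) / 360.0 * scale
    # Clamp near the poles, where the Mercator projection diverges.
    siny = min(max(math.sin(math.radians(lat_deg)), -0.9999), 0.9999)
    y = (0.5 - math.log((1 + siny) / (1 - siny)) / (4 * math.pi)) * scale
    return x, y

# The equator / prime meridian lands in the center of the zoom-0 map.
print(latlng_to_pixel(0.0, 0.0, 0))  # prints (128.0, 128.0)

# Approximate location of Springfield, IL (illustrative coordinates).
print(latlng_to_pixel(39.80, -89.65, 6))
```

A historical map has to be warped so that its own pixel grid agrees with this one before it can be shown as an overlay.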

Page 18: Overview of Lincoln Paper Design

Layered Geospatial Information: Google Map Example

Page 19: Overview of Lincoln Paper Design

Geospatial Characteristics: Neighborhoods

Page 20: Overview of Lincoln Paper Design

Example of Map Georeferencing

• Software: Used Global Mapper

Albers equal-area conic
Lambert's conformal conic
Mollweide pseudocylindrical
Mercator cylindrical

In our case the projection does not have to be exact. For small areas in Mollweide projection, for example, a simple perspective correction can be sufficient for the map of the US 1861-1865.
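A simple perspective correction of the kind mentioned here can be computed from four control points that are identified both on the scanned map and on the target map. The sketch below is an assumption about how such a correction could be done, not a description of what Global Mapper does internally; the function names and the sample corner coordinates are hypothetical.

```python
def solve_linear(a, b):
    """Solve a small dense system a*h = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("control points are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_perspective(src, dst):
    """Fit the 8 parameters of a perspective map from 4 point pairs."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b)

def apply_perspective(h, x, y):
    """Map a scanned-map pixel (x, y) to target-map coordinates."""
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)

# Hypothetical control points: corners of a 1000x800 scan and the
# pixel positions of the same locations on the target map.
scan = [(0, 0), (1000, 0), (1000, 800), (0, 800)]
target = [(212, 95), (1180, 130), (1150, 905), (190, 860)]
h = fit_perspective(scan, target)
print(apply_perspective(h, 500, 400))  # center of the scan on the target map
```

With exactly four point pairs the fit passes through every control point; with more pairs a least-squares fit would be the natural extension.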

Page 21: Overview of Lincoln Paper Design

FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES

DESIGNING AN UNDERLYING ARCHITECTURE OF VIRTUAL OBSERVATORIES

Page 22: Overview of Lincoln Paper Design

Software Architecture Design

The front-end consists of an HTML file with a Google Map loaded, a JavaScript script, and a search form with pre-defined data sets. The client-side HTML and JavaScript files make requests to the server. The server side consists of a PHP file which bridges the gap between the Ajax engine and the MySQL database. The result is returned as an XML response to the Ajax engine.
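The server-side step described above (query the database, return the matches as XML) can be sketched in a few lines. The original system uses PHP and MySQL; the sketch below substitutes Python and an in-memory SQLite database so that it is self-contained. The table name `documents`, its columns, the XML element names and the sample row are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

def build_marker_xml(conn, place_query):
    """Return an XML string with one <marker> element per matching document."""
    rows = conn.execute(
        "SELECT title, lat, lng FROM documents "
        "WHERE place LIKE ? ORDER BY title",
        (f"%{place_query}%",),  # parameterized to avoid SQL injection
    ).fetchall()
    root = ET.Element("markers")
    for title, lat, lng in rows:
        ET.SubElement(root, "marker", title=title, lat=str(lat), lng=str(lng))
    return ET.tostring(root, encoding="unicode")

# Stand-in database with one illustrative record (coordinates approximate).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (title TEXT, place TEXT, lat REAL, lng REAL)")
conn.execute(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    ("Letter to Lincoln", "Fort Randall", 43.05, -98.56),
)
print(build_marker_xml(conn, "Randall"))
```

On the client, the Ajax callback would parse this response and place one map marker per `<marker>` element.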


Page 23: Overview of Lincoln Paper Design

Data Storage and Organization


Page 24: Overview of Lincoln Paper Design

FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES

DELIVERING LAYERED INFORMATION AND PROVIDING SERVICES


Page 25: Overview of Lincoln Paper Design

Multi Dimensional View of Lincoln Papers

• Delivering Layers of Information (geospatial – historical maps and current maps, temporal – Lincoln log, relational – source & destination links, content – document scans)

Page 26: Overview of Lincoln Paper Design

User Interface
Information in time, space and document dimensions.

Time

Space

Search

Page 27: Overview of Lincoln Paper Design

Providing Search Services

Page 28: Overview of Lincoln Paper Design

Providing Transcription Services

Page 29: Overview of Lincoln Paper Design

Safety Guards for Transcription Services

Standards for email addresses: RFC822 (published in 1982) defines, amongst other things, the format for internet text message (email) addresses.

RFC Valid e-mail addresses
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
abc+mailbox/[email protected]
!#$%&'*+-/=?^_`.{|}[email protected]
"abc@def"@example.com
"Abc@def"@example.com
"Fred Bloggs"@example.com
"Fred \"quota\" Bloggs"@example.com
"Abc\@def"@example.com
"Joe\\Blow"@example.com
customer/[email protected]
[email protected]
!def!xyz%[email protected]
[email protected]

RFC Invalid e-mail addresses
Abc.example.com (character @ is missing)
[email protected] (character dot(.) is last in local part)
[email protected] (character dot(.) is double)
A@b@[email protected] (only one @ is allowed outside quotation marks)
()[]\;:,<>@example.com (none of the characters before the @ is allowed outside quotation marks)
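The rules quoted for the invalid addresses (exactly one @, no trailing or doubled dot in the local part, no special characters outside quotation marks) can be turned into a simple check. The sketch below is deliberately simplified and is not a full RFC parser: it handles only unquoted local parts, so quoted forms such as "Fred Bloggs"@example.com, which the RFC allows, are rejected. The function name and the example.org test addresses are hypothetical.

```python
import re

# Characters allowed in an unquoted local part, besides the separating dots.
ATOM = re.compile(r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+")
# Simplified domain label: letters, digits and hyphens only.
LABEL = re.compile(r"[A-Za-z0-9-]+")

def is_plausible_unquoted_address(address):
    """Check the slide's rules for addresses without quoted local parts."""
    if address.count("@") != 1:  # missing @ or more than one @
        return False
    local, domain = address.split("@")
    # Splitting on dots yields an empty piece for a leading, trailing or
    # doubled dot, and an empty piece never matches the patterns.
    local_ok = all(ATOM.fullmatch(atom) for atom in local.split("."))
    domain_ok = all(LABEL.fullmatch(label) for label in domain.split("."))
    return local_ok and domain_ok

for candidate in [
    "reader@example.org",        # plain address
    "Abc.example.com",           # character @ is missing
    "reader.@example.org",       # dot is last in local part
    "first..last@example.org",   # dot is double
    "()[]\\;:,<>@example.com",   # characters not allowed outside quotes
]:
    print(candidate, is_plausible_unquoted_address(candidate))
```

A check like this is cheap enough to run in the browser before a transcription is submitted and again on the server before anything is written to the database.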

Page 30: Overview of Lincoln Paper Design

What Would You Learn?

In this example a letter was sent from Fort Randall to President Abraham Lincoln on October 26, 1862. The bits of information about the document (metadata), namely the time, the location of the sender and the location of President Lincoln, are known. The letter path is visualized in Google Maps, and the document can be retrieved from the database and edited. Additionally, the user can overlay one of the historical maps. The markers are positioned with high accuracy based on the latitude and longitude of historical sites.

Page 31: Overview of Lincoln Paper Design

Summary
• Design and implementation of automated document cropping.
• Integration of spatial, temporal and document information.
• Design and prototype of a web-based user interface to heterogeneous data.
• The system is available at http://isda.ncsa.uiuc.edu/lpapers/search.html
• We would be excited if you would find the system useful in your research or education!