The Elephant in the Library - Integrating Hadoop

SCAPE

Clemens Neudecker (@cneudecker), Sven Schlarb (@SvenSchlarb)


Contents

1. Background: Digitization of cultural heritage

2. Numbers: Scaling up!

3. Challenges: Use cases and scenarios

4. Outlook

1. Background

“The digital revolution is far more significant than the invention of writing or even of printing”
Douglas Engelbart

Then

Our libraries

• The Hague, Netherlands
• Founded in 1798
• 120.000 visitors per year
• 6 million documents
• 260 FTE

www.kb.nl

• Vienna, Austria
• Founded in 14th century
• 300.000 visitors per year
• 8 million documents
• 300 FTE

www.onb.ac.at

Digitization

Libraries are rapidly transforming from physical…

to digital…

Transformation

Curation Lifecycle Model from Digital Curation Centre www.dcc.ac.uk

Now

Digital Preservation

Our data – cultural heritage

• Traditionally
  • Bibliographic and other metadata
  • Images (Portraits/Pictures, Maps, Posters, etc.)
  • Text (Books, Articles, Newspapers, etc.)

• More recently
  • Audio/Video
  • Websites, Blogs, Twitter, Social Networks
  • Research Data/Raw Data
  • Software? Apps?

2. Numbers

“A good decision is based on knowledge and not on numbers”

Plato, 400 BC

Numbers (I)
National Library of the Netherlands

• Digital objects
  • > 500 million files
  • 18 million digital publications (+ 2M/year)
  • 8 million newspaper pages (+ 4M/year)
  • 152.000 books (+ 100k/year)
  • 730.000 websites (+ 170k/year)

• Storage
  • 1.3 PB (currently 458 TB used)
  • Growing approx. 150 TB a year

Numbers (II)
Austrian National Library

• Digital objects
  • 600.000 volumes being digitised during the next years (currently 120.000 volumes, 40 million pages)
  • 10 million newspapers and legal texts
  • 1.16 billion files in web archive from > 1 million domains
  • Several 100.000 images and portraits

• Storage
  • 84 TB
  • Growing approx. 15 TB a year

Numbers (III)

• Google Books Project
  • 2012: 20 million books scanned (approx. 7,000,000,000 pages)
  • www.books.google.com

• Europeana
  • 2012: 25 million digital objects
  • All metadata licensed CC-0
  • www.europeana.eu/portal

Numbers (IV)

• Hathi Trust
  • 3,721,702,950 scanned pages
  • 477 TBytes
  • www.hathitrust.org

• Internet Archive
  • 245 billion web pages archived
  • 10 PBytes
  • www.archive.org

Numbers (V)

• What can we expect?
  • Enumerate 2012: only about 4% digitised so far
  • Strong growth of born digital information

Source: security.networksasia.net
Source: www.idc.com

3. Challenges

“What do you do with a million books?” Gregory Crane, 2006

Making it scale

Scalability in terms of …
• size
• number
• complexity
• heterogeneity

SCAPE

• SCAPE = SCAlable Preservation Environments
• €8.6M EU funding, Feb 2011 – July 2014
• 20 partners from public sector, academia, industry
• Main objectives:
  • Scalability
  • Automation
  • Planning

www.scape-project.eu

Use cases (I)

• Document recognition: From image to XML
• Business case:
  • Better presentation options
  • Creation of eBooks
  • Full-text indexing

Use cases (II)

• File type migration: JP2k ↔ TIFF
• Business case:
  • Originally migration to JP2k to reduce storage costs
  • Reverse process used in case JP2k becomes obsolete

Use cases (III)

• Web archiving: Characterization of web content
• Business case:
  • What is in a Top Level Domain?
  • What is the distribution of file formats?
  • http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits

xkcd.com/688

Use cases (IV)

• Digital Humanities: Making sense of the millions
• Business case:
  • Text mining & NLP
  • Statistical analysis
  • Semantic enrichment
  • Visualizations

Source: www.open.ac.uk/

Enter the Elephants…

Source: Biopics

Experimental Cluster

Apache Tomcat Web Application

Taverna Server (REST API)

Hadoop Jobtracker

File server

Cluster

Execution environment

• Metadata log files generated by the web crawler during the harvesting process (no mime type identification – just the mime types returned by the web server)

Scenarios (I)
Log file analysis

20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200
20110830130709 9684 46 16 3 image/jpeg http://URL at IP 22123 200
20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200
20110830130707 9684 46 16 3 image/jpeg http://URL at IP 40056 200
20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200
20110830130712 9684 46 16 3 image/gif http://URL at IP 2285 200
20110830130712 9684 46 16 3 text/html http://URL at IP 415 301
20110830130710 9684 46 16 3 text/html http://URL at IP 7873 200
20110830130712 9684 46 16 3 text/html http://URL at IP 632 302
20110830130712 9684 46 16 3 image/png http://URL at IP 679 200

→ Run file type identification on archived web content
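As a minimal sketch of the log file analysis, the MIME types reported by the web server could be tallied as follows. The column layout (MIME type as the sixth whitespace-separated field) is read off the sample lines and is an assumption, as are the names `SAMPLE_LOG` and `count_mime_types`.

```python
from collections import Counter

# A few lines in the shape of the crawler log sample (placeholders kept as-is).
SAMPLE_LOG = """\
20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200
20110830130709 9684 46 16 3 image/jpeg http://URL at IP 22123 200
20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200
20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200
20110830130712 9684 46 16 3 text/html http://URL at IP 415 301
"""


def count_mime_types(lines):
    """Count log records per server-reported MIME type."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        counts[fields[5]] += 1  # assumed position of the MIME type column
    return counts


mime_counts = count_mime_types(SAMPLE_LOG.splitlines())
for mime, n in mime_counts.most_common():
    print(mime, n)
```

These counts only reflect what the web server claimed, which is why the next scenario runs identification on the archived content itself.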

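Scenarios (II)
Web archiving: File format identification

[Diagram: a (W)ARC container holding JPG, GIF, HTM, HTM and MID records is read by a (W)ARC RecordReader (based on the HERITRIX web crawler, read/write (W)ARC). In the first MapReduce step Apache Tika detects the MIME type of each record (e.g. JPG: image/jpg); the second MapReduce step counts the records per type:]

image/jpg 1
image/gif 1
text/html 2
audio/midi 1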

→ Using MapReduce to calculate statistics
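The map/reduce flow of this scenario can be sketched in plain Python. This is an illustration only: the tiny magic-number table stands in for Apache Tika, the in-memory `records` list stands in for (W)ARC records, and all function names are hypothetical.

```python
from collections import defaultdict

# Illustrative stand-in for Apache Tika, which inspects far more than
# the first few bytes of a record.
MAGIC = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF8", "image/gif"),
    (b"MThd", "audio/midi"),
]


def detect_mime(payload):
    """Guess a MIME type from the leading bytes of a record payload."""
    for prefix, mime in MAGIC:
        if payload.startswith(prefix):
            return mime
    if payload.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return "application/octet-stream"


def map_record(name, payload):
    """Map step: emit (MIME type, 1) for one archived record."""
    yield detect_mime(payload), 1


def reduce_counts(mime, ones):
    """Reduce step: sum the occurrences of one MIME type."""
    return mime, sum(ones)


def run_job(records):
    """Group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for name, payload in records:
        for key, value in map_record(name, payload):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in sorted(groups.items()))


records = [
    ("a.jpg", b"\xff\xd8\xff\xe0rest"),
    ("b.gif", b"GIF89a..."),
    ("c.htm", b"<html><body>x</body></html>"),
    ("d.htm", b"<!DOCTYPE html><html></html>"),
    ("e.mid", b"MThd\x00\x00\x00\x06"),
]
stats = run_job(records)
print(stats)
```

On a cluster the grouping in `run_job` is done by the framework's shuffle phase; only the map and reduce functions would be supplied by the job.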

Scenarios (II)
Web archiving: File format identification

TIKA 1.0
DROID 6.01

• Risk of format obsolescence
• Quality assurance
• File format validation
• Original/target image comparison
• Imagine runtime of 1 minute per image for 200 million pages ...

Scenarios (III)
File format migration
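To make the "1 minute per image for 200 million pages" figure concrete, a quick back-of-the-envelope calculation helps. The sketch below assumes perfectly linear scaling, and the numbers of parallel tasks are hypothetical examples, not the size of the experimental cluster.

```python
PAGES = 200_000_000
MINUTES_PER_IMAGE = 1


def wall_clock_days(pages, minutes_per_image, parallel_tasks):
    """Days of wall-clock time, assuming perfectly linear scaling."""
    total_minutes = pages * minutes_per_image
    return total_minutes / parallel_tasks / (60 * 24)


serial_years = wall_clock_days(PAGES, MINUTES_PER_IMAGE, 1) / 365
print(f"1 task: ~{serial_years:.0f} years")

for tasks in (100, 1000):  # hypothetical degrees of parallelism
    days = wall_clock_days(PAGES, MINUTES_PER_IMAGE, tasks)
    print(f"{tasks} tasks: ~{days:.0f} days")
```

Processed one image at a time, the job would take roughly 380 years, which is the motivation for distributing the work.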

Parallel execution of file format validation using Mapper
● Jpylyzer (Python)
● Jhove2 (Java)
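A mapper for this kind of validation could look roughly like the sketch below. It does not call Jpylyzer or Jhove2; as a deliberately simple stand-in it only checks the 12-byte JPEG 2000 signature box at the start of each file. The function names and the tab-separated output are assumptions for illustration.

```python
import os
import tempfile

# The 12-byte JPEG 2000 signature box that opens every JP2 file.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")


def has_jp2_signature(path):
    """Very shallow validity check; a real validator inspects the whole file."""
    try:
        with open(path, "rb") as handle:
            return handle.read(len(JP2_SIGNATURE)) == JP2_SIGNATURE
    except OSError:
        return False


def map_paths(lines):
    """Mapper: one input line per file path, one 'path<TAB>status' record out."""
    for line in lines:
        path = line.strip()
        if path:
            status = "ok" if has_jp2_signature(path) else "invalid"
            yield f"{path}\t{status}"


# Small local demonstration with two temporary files.
demo_dir = tempfile.mkdtemp()
good = os.path.join(demo_dir, "good.jp2")
bad = os.path.join(demo_dir, "bad.jp2")
with open(good, "wb") as handle:
    handle.write(JP2_SIGNATURE + b"remaining boxes")
with open(bad, "wb") as handle:
    handle.write(b"not a jp2 file")

results = list(map_paths([good + "\n", bad + "\n"]))
for record in results:
    print(record)
```

Run as a streaming mapper, the same function would be fed from standard input, for example `map_paths(sys.stdin)`, so that each task validates its own slice of the path list.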

● Feature extraction requires sharing resources between processing steps
● Challenge to model more complex image comparison scenarios, e.g. book page duplicates detection or digital book comparison

Scenarios (IV)
Book page analysis

Create text file containing JPEG2000 input file paths and read image metadata using Exiftool via the Hadoop Streaming API

find

/NAS/Z119585409/00000001.jp2
/NAS/Z119585409/00000002.jp2
/NAS/Z119585409/00000003.jp2
…
/NAS/Z117655409/00000001.jp2
/NAS/Z117655409/00000002.jp2
/NAS/Z117655409/00000003.jp2
…
/NAS/Z119585987/00000001.jp2
/NAS/Z119585987/00000002.jp2
/NAS/Z119585987/00000003.jp2
…
/NAS/Z119584539/00000001.jp2
/NAS/Z119584539/00000002.jp2
/NAS/Z119584539/00000003.jp2
…
/NAS/Z119599879/00000001.jp2
/NAS/Z119589879/00000002.jp2
/NAS/Z119589879/00000003.jp2
...

...

NAS

reading files from NAS

1,4 GB 1,2 GB

Runtime: ~ 5 h + ~ 38 h = ~ 43 h
60.000 books
24 Million pages

Jp2PathCreator
HadoopStreamingExiftoolRead

Z119585409/00000001 2345
Z119585409/00000002 2340
Z119585409/00000003 2543
…
Z117655409/00000001 2300
Z117655409/00000002 2300
Z117655409/00000003 2345
…
Z119585987/00000001 2300
Z119585987/00000002 2340
Z119585987/00000003 2432
…
Z119584539/00000001 5205
Z119584539/00000002 2310
Z119584539/00000003 2134
…
Z119599879/00000001 2312
Z119589879/00000002 2300
Z119589879/00000003 2300
...

Reading image metadata
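A streaming mapper for this step might be sketched as follows. It assumes the numeric column in the output is the image width in pixels and that ExifTool is asked for its `ImageWidth` tag in value-only form (`-s3`); the function names are hypothetical. The demonstration at the end uses a fake reader so that it runs without ExifTool or the NAS.

```python
import subprocess


def page_key(path):
    """'/NAS/Z119585409/00000001.jp2' -> 'Z119585409/00000001'."""
    book, filename = path.strip().split("/")[-2:]
    return f"{book}/{filename.rsplit('.', 1)[0]}"


def exiftool_width(path):
    """Read the image width by calling the exiftool command-line tool."""
    result = subprocess.run(
        ["exiftool", "-s3", "-ImageWidth", path],  # -s3: print the bare value
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def map_widths(lines, read_width=exiftool_width):
    """Mapper: one file path per input line, 'key<TAB>width' per output line."""
    for line in lines:
        path = line.strip()
        if path:
            yield f"{page_key(path)}\t{read_width(path)}"


# Demonstration with a fake reader instead of real image files.
fake_widths = {
    "/NAS/Z119585409/00000001.jp2": 2345,
    "/NAS/Z119585409/00000002.jp2": 2340,
}
demo = list(map_widths(fake_widths, read_width=fake_widths.get))
for record in demo:
    print(record)
```

Because every call reads one small file from the NAS and starts one external process, the per-file overhead dominates, which is in line with this step being the slowest part of the workflow.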

Create text file containing HTML input file paths and create one sequence file with the complete file content in HDFS

find

/NAS/Z119585409/00000707.html
/NAS/Z119585409/00000708.html
/NAS/Z119585409/00000709.html
…
/NAS/Z138682341/00000707.html
/NAS/Z138682341/00000708.html
/NAS/Z138682341/00000709.html
…
/NAS/Z178791257/00000707.html
/NAS/Z178791257/00000708.html
/NAS/Z178791257/00000709.html
…
/NAS/Z967985409/00000707.html
/NAS/Z967985409/00000708.html
/NAS/Z967985409/00000709.html
…
/NAS/Z196545409/00000707.html
/NAS/Z196545409/00000708.html
/NAS/Z196545409/00000709.html
...

Z119585409/00000707

Z119585409/00000708

Z119585409/00000709

Z119585409/00000710

Z119585409/00000711

Z119585409/00000712

NAS

reading files from NAS

1,4 GB 997 GB (uncompressed)

Runtime: ~ 5 h + ~ 24 h = ~ 29 h
60.000 books
24 Million pages

HtmlPathCreator
SequenceFileCreator

SequenceFile creation
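The idea behind this step, packing many small files into one large key/value container, can be illustrated with a few lines of Python. Note that this is not the Hadoop SequenceFile binary format: the length-prefixed record layout and the function names below are invented for the illustration.

```python
import io
import struct

HEADER = ">II"  # key length and value length as big-endian unsigned ints


def write_records(stream, records):
    """Append (key, bytes) pairs to one container stream."""
    for key, value in records:
        key_bytes = key.encode("utf-8")
        stream.write(struct.pack(HEADER, len(key_bytes), len(value)))
        stream.write(key_bytes)
        stream.write(value)


def read_records(stream):
    """Yield the (key, bytes) pairs back in the order they were written."""
    header_size = struct.calcsize(HEADER)
    while True:
        header = stream.read(header_size)
        if len(header) < header_size:
            return  # end of container
        key_len, value_len = struct.unpack(HEADER, header)
        key = stream.read(key_len).decode("utf-8")
        yield key, stream.read(value_len)


# Page id as key, complete file content as value (contents are made up).
pages = [
    ("Z119585409/00000707", b"<html>page 707</html>"),
    ("Z119585409/00000708", b"<html>page 708</html>"),
]
container = io.BytesIO()
write_records(container, pages)
container.seek(0)
restored = list(read_records(container))
print([key for key, _ in restored])
```

The one-off cost of building the container is paid so that later jobs can stream through one large file instead of opening millions of small ones.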

Execute Hadoop MapReduce job using the sequence file created before in order to calculate the average paragraph block width

Z119585409/00000001

Z119585409/00000002

Z119585409/00000003

Z119585409/00000004

Z119585409/00000005...

Runtime: ~ 6 h
60.000 books
24 Million pages

Z119585409/00000001 2100
Z119585409/00000001 2200
Z119585409/00000001 2300
Z119585409/00000001 2400

Z119585409/00000002 2100
Z119585409/00000002 2200
Z119585409/00000002 2300
Z119585409/00000002 2400

Z119585409/00000003 2100
Z119585409/00000003 2200
Z119585409/00000003 2300
Z119585409/00000003 2400

Z119585409/00000004 2100
Z119585409/00000004 2200
Z119585409/00000004 2300
Z119585409/00000004 2400

Z119585409/00000005 2100
Z119585409/00000005 2200
Z119585409/00000005 2300
Z119585409/00000005 2400

Z119585409/00000001 2250

Z119585409/00000002 2250

Z119585409/00000003 2250

Z119585409/00000004 2250

Z119585409/00000005 2250

Map Reduce

HadoopAvBlockWidthMapReduce

SequenceFile Textfile

HTML Parsing
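The average block width calculation can be sketched as a map step that emits one width per paragraph block and a reduce step that averages them per page. The sketch assumes the HTML is hOCR-style, with paragraph elements of class `ocr_par` carrying a `bbox x0 y0 x1 y1` entry in their `title` attribute; the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class BlockWidthParser(HTMLParser):
    """Collect the pixel width of every paragraph block on a page."""

    def __init__(self):
        super().__init__()
        self.widths = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "ocr_par" not in (attributes.get("class") or "").split():
            return
        match = BBOX.search(attributes.get("title") or "")
        if match:
            x0, _, x1, _ = map(int, match.groups())
            self.widths.append(x1 - x0)


def map_page(page_id, html):
    """Map step: emit (page id, block width) for each paragraph block."""
    parser = BlockWidthParser()
    parser.feed(html)
    for width in parser.widths:
        yield page_id, width


def reduce_page(page_id, widths):
    """Reduce step: average block width of one page (integer pixels)."""
    widths = list(widths)
    return page_id, sum(widths) // len(widths)


# One made-up page with four blocks of width 2100, 2200, 2300 and 2400.
page_html = "".join(
    f'<p class="ocr_par" title="bbox 100 0 {100 + w} 50">text</p>'
    for w in (2100, 2200, 2300, 2400)
)
pairs = list(map_page("Z119585409/00000001", page_html))
average = reduce_page("Z119585409/00000001", [w for _, w in pairs])
print(average)
```

With the four example widths the reducer yields 2250 for the page, matching the shape of the result rows on the slide.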

Create Hive table and load generated data into the Hive database

Runtime: ~ 6 h
60.000 books
24 Million pages

HiveLoadExifData & HiveLoadHocrData

jid jwidth

Z119585409/00000001 2250

Z119585409/00000002 2150

Z119585409/00000003 2125

Z119585409/00000004 2125

Z119585409/00000005 2250

hid hwidth

Z119585409/00000001 1870

Z119585409/00000002 2100

Z119585409/00000003 2015

Z119585409/00000004 1350

Z119585409/00000005 1700

htmlwidth

Z119585409/00000001 1870
Z119585409/00000002 2100
Z119585409/00000003 2015
Z119585409/00000004 1350
Z119585409/00000005 1700

jp2width

Z119585409/00000001 2250
Z119585409/00000002 2150
Z119585409/00000003 2125
Z119585409/00000004 2125
Z119585409/00000005 2250

CREATE TABLE jp2width(jid STRING, jwidth INT)

CREATE TABLE htmlwidth(hid STRING, hwidth INT)

Analytic Queries

Runtime: ~ 6 h
60.000 books
24 Million pages

HiveSelect

jid jwidth

Z119585409/00000001 2250

Z119585409/00000002 2150

Z119585409/00000003 2125

Z119585409/00000004 2125

Z119585409/00000005 2250

hid hwidth

Z119585409/00000001 1870

Z119585409/00000002 2100

Z119585409/00000003 2015

Z119585409/00000004 1350

Z119585409/00000005 1700

htmlwidth
jp2width

jid jwidth hwidth

Z119585409/00000001 2250 1870

Z119585409/00000002 2150 2100

Z119585409/00000003 2125 2015

Z119585409/00000004 2125 1350

Z119585409/00000005 2250 1700

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid

Analytic Queries

Perform a simple Hive query to test if the database has been created successfully
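The join can be tried out locally on the five sample rows. The sketch below uses SQLite from the Python standard library purely as a stand-in for Hive, so the column types differ (`TEXT`/`INTEGER` instead of `STRING`/`INT`); the column names follow the tables shown on the slides.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE jp2width(jid TEXT, jwidth INTEGER);
    CREATE TABLE htmlwidth(hid TEXT, hwidth INTEGER);
""")

# Sample rows copied from the slides.
connection.executemany("INSERT INTO jp2width VALUES (?, ?)", [
    ("Z119585409/00000001", 2250),
    ("Z119585409/00000002", 2150),
    ("Z119585409/00000003", 2125),
    ("Z119585409/00000004", 2125),
    ("Z119585409/00000005", 2250),
])
connection.executemany("INSERT INTO htmlwidth VALUES (?, ?)", [
    ("Z119585409/00000001", 1870),
    ("Z119585409/00000002", 2100),
    ("Z119585409/00000003", 2015),
    ("Z119585409/00000004", 1350),
    ("Z119585409/00000005", 1700),
])

# Same join as on the slide; ORDER BY added only for stable output.
rows = connection.execute(
    "SELECT jid, jwidth, hwidth FROM jp2width "
    "INNER JOIN htmlwidth ON jid = hid ORDER BY jid"
).fetchall()
for row in rows:
    print(*row)
```

Each result row puts the image width next to the average text block width of the same page, which is the basis for the book page analysis.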

Outlook

“Progress generally appears much greater than it really is”

Johann Nestroy, 1847

What have WE learned?

• We need to carefully assess the efforts for data preparation vs. the actual processing load

• HDFS prefers large files over many small ones and is basically “append-only”

• There is still much more the Hadoop ecosystem has to offer, e.g. YARN, Pig, Mahout

What can YOU do?

• Come join our “Hadoop in cultural heritage” hackathon on 2-4 December 2013, Vienna (see http://www.scape-project.eu/events)

• Check out some tools from our github at https://github.com/openplanets/ and help us make them better and more scalable

• Follow us at @SCAPEProject and spread the word!

What’s in it for US?

• Digital (free) access to centuries of cultural heritage data, 24x7 and from anywhere

• Ensuring our cultural history is not lost

• New innovative applications using cultural heritage data (education, creative industries)

Thank you! Questions? (btw, we’re hiring)

www.kb.nl
www.onb.ac.at

www.scape-project.eu
www.openplanetsfoundation.org
