
The Elephant in the Library - Integrating Hadoop

Jun 15, 2015


The Elephant in the Library - Integrating Hadoop
[with Sven Schlarb]
Hadoop Summit Europe, Beurs van Berlage, 20-21 March 2013, Amsterdam, Netherlands.
Transcript
Page 1: The Elephant in the Library - Integrating Hadoop

SCAPE

Clemens Neudecker (@cneudecker), Sven Schlarb (@SvenSchlarb)

The Elephant in the Library - Integrating Hadoop

Page 2: The Elephant in the Library - Integrating Hadoop

Contents

1. Background: Digitization of cultural heritage

2. Numbers: Scaling up!

3. Challenges: Use cases and scenarios

4. Outlook

Page 3: The Elephant in the Library - Integrating Hadoop

1. Background

“The digital revolution is far more significant than the invention of writing or even of printing”

Douglas Engelbart

Page 4: The Elephant in the Library - Integrating Hadoop

Then

Page 5: The Elephant in the Library - Integrating Hadoop

Our libraries

• The Hague, Netherlands
• Founded in 1798
• 120.000 visitors per year
• 6 million documents
• 260 FTE

www.kb.nl

• Vienna, Austria
• Founded in 14th century
• 300.000 visitors per year
• 8 million documents
• 300 FTE

www.onb.ac.at

Page 6: The Elephant in the Library - Integrating Hadoop

Digitization

Libraries are rapidly transforming from physical…

to digital…

Page 7: The Elephant in the Library - Integrating Hadoop

Transformation

Curation Lifecycle Model from Digital Curation Centre www.dcc.ac.uk

Page 8: The Elephant in the Library - Integrating Hadoop

Now

Page 9: The Elephant in the Library - Integrating Hadoop

Digital Preservation

Page 10: The Elephant in the Library - Integrating Hadoop

Our data – cultural heritage

• Traditionally
  • Bibliographic and other metadata
  • Images (Portraits/Pictures, Maps, Posters, etc.)
  • Text (Books, Articles, Newspapers, etc.)

• More recently
  • Audio/Video
  • Websites, Blogs, Twitter, Social Networks
  • Research Data/Raw Data
  • Software? Apps?

Page 11: The Elephant in the Library - Integrating Hadoop

2. Numbers

“A good decision is based on knowledge and not on numbers”

Plato, 400 BC

Page 12: The Elephant in the Library - Integrating Hadoop

Numbers (I) – National Library of the Netherlands

• Digital objects
  • > 500 million files
  • 18 million digital publications (+ 2M/year)
  • 8 million newspaper pages (+ 4M/year)
  • 152.000 books (+ 100k/year)
  • 730.000 websites (+ 170k/year)

• Storage
  • 1.3 PB (currently 458 TB used)
  • Growing approx. 150 TB a year

Page 13: The Elephant in the Library - Integrating Hadoop

Numbers (II) – Austrian National Library

• Digital objects
  • 600.000 volumes being digitised during the next years (currently 120.000 volumes, 40 million pages)
  • 10 million newspapers and legal texts
  • 1.16 billion files in web archive from > 1 million domains
  • Several 100.000 images and portraits

• Storage
  • 84 TB
  • Growing approx. 15 TB a year

Page 14: The Elephant in the Library - Integrating Hadoop

Numbers (III)

• Google Books Project
  • 2012: 20 million books scanned (approx. 7,000,000,000 pages)
  • www.books.google.com

• Europeana
  • 2012: 25 million digital objects
  • All metadata licensed CC-0
  • www.europeana.eu/portal

Page 15: The Elephant in the Library - Integrating Hadoop

Numbers (IV)

• Hathi Trust
  • 3,721,702,950 scanned pages
  • 477 TBytes
  • www.hathitrust.org

• Internet Archive
  • 245 billion web pages archived
  • 10 PBytes
  • www.archive.org

Page 16: The Elephant in the Library - Integrating Hadoop

Numbers (V)

• What can we expect?
  • Enumerate 2012: only about 4% digitised so far
  • Strong growth of born digital information

Sources: security.networksasia.net, www.idc.com

Page 17: The Elephant in the Library - Integrating Hadoop

3. Challenges

“What do you do with a million books?” Gregory Crane, 2006

Page 18: The Elephant in the Library - Integrating Hadoop

Making it scale

Scalability in terms of …
• size
• number
• complexity
• heterogeneity

Page 19: The Elephant in the Library - Integrating Hadoop

SCAPE

• SCAPE = SCAlable Preservation Environments
• €8.6M EU funding, Feb 2011 – July 2014
• 20 partners from public sector, academia, industry
• Main objectives:
  • Scalability
  • Automation
  • Planning

www.scape-project.eu

Page 20: The Elephant in the Library - Integrating Hadoop

Use cases (I)

• Document recognition: From image to XML
• Business case:
  • Better presentation options
  • Creation of eBooks
  • Full-text indexing

Page 21: The Elephant in the Library - Integrating Hadoop

Use cases (II)

• File type migration between JP2k and TIFF
• Business case:
  • Originally migration to JP2k to reduce storage costs
  • Reverse process used in case JP2k becomes obsolete

Page 22: The Elephant in the Library - Integrating Hadoop

Use cases (III)

• Web archiving: Characterization of web content
• Business case:
  • What is in a Top Level Domain?
  • What is the distribution of file formats?
  • http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits

xkcd.com/688

Page 23: The Elephant in the Library - Integrating Hadoop

Use cases (IV)

• Digital Humanities: Making sense of the millions
• Business case:
  • Text mining & NLP
  • Statistical analysis
  • Semantic enrichment
  • Visualizations

Source: www.open.ac.uk/

Page 24: The Elephant in the Library - Integrating Hadoop

Enter the Elephants…

Source: Biopics

Page 25: The Elephant in the Library - Integrating Hadoop

Experimental Cluster

Page 26: The Elephant in the Library - Integrating Hadoop

Apache Tomcat Web Application

Taverna Server(REST API)

Hadoop Jobtracker

File server

Cluster

Execution environment

Page 27: The Elephant in the Library - Integrating Hadoop

Scenarios (I) – Log file analysis

• Metadata log files generated by the web crawler during the harvesting process (no MIME type identification, just the MIME types returned by the web server)

20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200
20110830130709 9684 46 16 3 image/jpeg http://URL at IP 22123 200
20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200
20110830130707 9684 46 16 3 image/jpeg http://URL at IP 40056 200
20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200
20110830130712 9684 46 16 3 image/gif http://URL at IP 2285 200
20110830130712 9684 46 16 3 text/html http://URL at IP 415 301
20110830130710 9684 46 16 3 text/html http://URL at IP 7873 200
20110830130712 9684 46 16 3 text/html http://URL at IP 632 302
20110830130712 9684 46 16 3 image/png http://URL at IP 679 200
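The log analysis can be pictured as a small map and reduce pair. The following Python sketch is only an illustration, not the code used in the project: it assumes that the sixth whitespace-separated field of each log line is the MIME type reported by the web server and that the last field is the HTTP status code, which matches the sample lines on the slide but is not documented there. The function names are hypothetical.

```python
from collections import Counter

# A few sample lines in the layout shown on the slide (URL and IP are placeholders there too).
SAMPLE_LOG = """\
20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200
20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200
20110830130712 9684 46 16 3 text/html http://URL at IP 415 301
20110830130712 9684 46 16 3 text/html http://URL at IP 632 302
"""


def map_line(line):
    """Map step: return (mime_type, status) for one log line, or None if it is malformed."""
    fields = line.split()
    # Assumed layout: field 6 is the MIME type, the last field is the HTTP status.
    if len(fields) < 7:
        return None
    return fields[5], fields[-1]


def reduce_counts(pairs):
    """Reduce step: count how often each MIME type and each status code occurs."""
    mime_counts, status_counts = Counter(), Counter()
    for mime, status in pairs:
        mime_counts[mime] += 1
        status_counts[status] += 1
    return mime_counts, status_counts


pairs = [pair for pair in map(map_line, SAMPLE_LOG.splitlines()) if pair is not None]
mime_counts, status_counts = reduce_counts(pairs)
print(dict(mime_counts))
print(dict(status_counts))
```

On a cluster the same two functions would run as separate mapper and reducer tasks over many log files; here they run in one process to keep the sketch self-contained.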

Page 28: The Elephant in the Library - Integrating Hadoop

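Scenarios (II) – Web archiving: File format identification

→ Run file type identification on archived web content

[Diagram: a (W)ARC container holding JPG, GIF, HTM, HTM and MID records is read by a (W)ARC RecordReader, which is based on the HERITRIX web crawler's read/write (W)ARC code. In the Map step, Apache Tika detects the MIME type of each record (e.g. JPG → image/jpeg). The Reduce step counts the records per MIME type: image/jpeg 1, image/gif 1, text/html 2, audio/midi 1.]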
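The detect-then-count flow in the diagram can be sketched in a few lines of Python. This is a simplified stand-in under stated assumptions: the slide uses Apache Tika on records read from (W)ARC containers, whereas the sketch below uses a tiny hand-written signature table (`MAGIC`, hypothetical) on in-memory byte strings, so it only illustrates the shape of the job, not Tika's detection logic.

```python
from collections import Counter

# Hypothetical signature table: a minimal stand-in for a real detector such as Apache Tika.
MAGIC = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF8", "image/gif"),
    (b"MThd", "audio/midi"),
]


def detect_mime(payload):
    """Guess a MIME type from the first bytes of a record payload."""
    for magic, mime in MAGIC:
        if payload.startswith(magic):
            return mime
    head = payload[:256].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    return "application/octet-stream"


def map_records(records):
    """Map step: emit (mime_type, 1) for every record payload."""
    for payload in records:
        yield detect_mime(payload), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts per MIME type."""
    totals = Counter()
    for mime, count in pairs:
        totals[mime] += count
    return totals


# Five made-up payloads mirroring the container on the slide (JPG, GIF, HTM, HTM, MID).
records = [
    b"\xff\xd8\xff\xe0 jpeg data",
    b"GIF89a gif data",
    b"<html><body>page one</body></html>",
    b"<!DOCTYPE html><html></html>",
    b"MThd\x00\x00\x00\x06 midi data",
]
totals = reduce_counts(map_records(records))
for mime, count in sorted(totals.items()):
    print(mime, count)
```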

Page 29: The Elephant in the Library - Integrating Hadoop

Scenarios (II) – Web archiving: File format identification

→ Using MapReduce to calculate statistics

[Charts: results for TIKA 1.0 and DROID 6.01]

Page 30: The Elephant in the Library - Integrating Hadoop

Scenarios (III) – File format migration

• Risk of format obsolescence
• Quality assurance
  • File format validation
  • Original/target image comparison
• Imagine a runtime of 1 minute per image for 200 million pages ...
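To put the "1 minute per image for 200 million pages" figure in perspective, the arithmetic can be written out as a short Python sketch. Processed one page at a time, 200 million minutes come to roughly 380 years. The parallel task counts in the loop are hypothetical, and the calculation assumes perfect linear scaling with no scheduling or I/O overhead, which a real cluster will not achieve.

```python
PAGES = 200_000_000          # from the slide
MINUTES_PER_PAGE = 1         # from the slide
MINUTES_PER_DAY = 60 * 24


def runtime_days(pages, minutes_per_page, parallel_tasks=1):
    """Idealised runtime in days, assuming the work splits evenly across tasks."""
    return pages * minutes_per_page / parallel_tasks / MINUTES_PER_DAY


sequential = runtime_days(PAGES, MINUTES_PER_PAGE)
print(f"sequential: {sequential:,.0f} days ({sequential / 365:.1f} years)")

# Hypothetical numbers of parallel tasks, for illustration only.
for tasks in (10, 100, 1000):
    days = runtime_days(PAGES, MINUTES_PER_PAGE, tasks)
    print(f"{tasks} parallel tasks: {days:,.0f} days")
```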

Page 31: The Elephant in the Library - Integrating Hadoop

Parallel execution of file format validation using Mapper
• Jpylyzer (Python)
• Jhove2 (Java)
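A validation mapper of this kind takes one file path per input line and emits one result per path, so the work can be spread over many map tasks. The Python sketch below shows that shape only. It does not call Jpylyzer or Jhove2: as a labelled stand-in it merely checks the 12-byte JPEG 2000 signature box at the start of the file, which is far weaker than real format validation. The function names are hypothetical.

```python
import os
import tempfile

# The JP2 signature box: the first 12 bytes of every JP2 file.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")


def looks_like_jp2(path):
    """Stand-in check: True if the file starts with the JP2 signature box."""
    try:
        with open(path, "rb") as handle:
            return handle.read(12) == JP2_SIGNATURE
    except OSError:
        return False


def mapper(lines):
    """Map step: one input line per file path, one 'path<TAB>result' line per file."""
    for line in lines:
        path = line.strip()
        if path:
            verdict = "valid" if looks_like_jp2(path) else "invalid"
            yield f"{path}\t{verdict}"


# Small demonstration with two temporary files.
with tempfile.TemporaryDirectory() as workdir:
    good = os.path.join(workdir, "good.jp2")
    bad = os.path.join(workdir, "bad.jp2")
    with open(good, "wb") as handle:
        handle.write(JP2_SIGNATURE + b"rest of the file")
    with open(bad, "wb") as handle:
        handle.write(b"not a JPEG 2000 file")
    results = list(mapper([good + "\n", bad + "\n"]))

for line in results:
    print(line)
```

Because each path is handled independently, no reduce step is needed for the validation itself; a reducer would only be added to summarise the verdicts.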

Page 32: The Elephant in the Library - Integrating Hadoop

• Feature extraction requires sharing resources between processing steps

• It is a challenge to model more complex image comparison scenarios, e.g. detection of duplicate book pages or comparison of digital books

Page 33: The Elephant in the Library - Integrating Hadoop

Scenarios (IV) – Book page analysis

Page 34: The Elephant in the Library - Integrating Hadoop

Create text file containing JPEG2000 input file paths and read image metadata using Exiftool via the Hadoop Streaming API

Page 35: The Elephant in the Library - Integrating Hadoop

Reading image metadata

Steps: Jp2PathCreator (find, reading files from NAS) → HadoopStreamingExiftoolRead
Data sizes: 1,4 GB, 1,2 GB
Runtime: ~ 5 h + ~ 38 h = ~ 43 h (60.000 books, 24 million pages)

Input (JPEG2000 file paths):
/NAS/Z119585409/00000001.jp2
/NAS/Z119585409/00000002.jp2
/NAS/Z119585409/00000003.jp2
…
/NAS/Z117655409/00000001.jp2
/NAS/Z117655409/00000002.jp2
/NAS/Z117655409/00000003.jp2
…
/NAS/Z119585987/00000001.jp2
/NAS/Z119585987/00000002.jp2
/NAS/Z119585987/00000003.jp2
…
/NAS/Z119584539/00000001.jp2
/NAS/Z119584539/00000002.jp2
/NAS/Z119584539/00000003.jp2
…
/NAS/Z119599879/00000001.jp2
/NAS/Z119589879/00000002.jp2
/NAS/Z119589879/00000003.jp2
...

Output (page ID, image width):
Z119585409/00000001 2345
Z119585409/00000002 2340
Z119585409/00000003 2543
…
Z117655409/00000001 2300
Z117655409/00000002 2300
Z117655409/00000003 2345
…
Z119585987/00000001 2300
Z119585987/00000002 2340
Z119585987/00000003 2432
…
Z119584539/00000001 5205
Z119584539/00000002 2310
Z119584539/00000003 2134
…
Z119599879/00000001 2312
Z119589879/00000002 2300
Z119589879/00000003 2300
...
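A Hadoop Streaming mapper for this step reads file paths line by line and writes tab-separated key and value pairs. The Python sketch below shows one way this could look; it is not the project's HadoopStreamingExiftoolRead code. It assumes the page ID is the parent folder plus the file name without extension, as in the sample output, and that the `exiftool` command-line tool is installed and prints the bare width when called with `-s3 -ImageWidth`. The demonstration swaps in a fake reader so that it runs without Exiftool or the NAS.

```python
import posixpath
import subprocess


def key_from_path(path):
    """Turn '/NAS/Z119585409/00000001.jp2' into the page ID 'Z119585409/00000001'."""
    folder, filename = posixpath.split(path)
    return f"{posixpath.basename(folder)}/{posixpath.splitext(filename)[0]}"


def exiftool_width(path):
    """Read the image width via the exiftool CLI (assumed to be installed on the node)."""
    result = subprocess.run(
        ["exiftool", "-s3", "-ImageWidth", path],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def mapper(lines, read_width=exiftool_width):
    """Map step: emit 'page_id<TAB>width' for every input path."""
    for line in lines:
        path = line.strip()
        if path:
            yield f"{key_from_path(path)}\t{read_width(path)}"


# Demonstration with a fake width reader instead of a real exiftool call.
paths = ["/NAS/Z119585409/00000001.jp2\n", "/NAS/Z119585409/00000002.jp2\n"]
fake_widths = {
    "/NAS/Z119585409/00000001.jp2": 2345,
    "/NAS/Z119585409/00000002.jp2": 2340,
}
demo_output = list(mapper(paths, read_width=fake_widths.get))
for line in demo_output:
    print(line)
```

In a streaming job the mapper would be fed from standard input, for example by iterating over `mapper(sys.stdin)` and printing each line.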

Page 36: The Elephant in the Library - Integrating Hadoop

Create text file containing HTML input file paths and create one sequence file with the complete file content in HDFS

Page 37: The Elephant in the Library - Integrating Hadoop

SequenceFile creation

Steps: HtmlPathCreator (find, reading files from NAS) → SequenceFileCreator
Data sizes: 1,4 GB, 997 GB (uncompressed)
Runtime: ~ 5 h + ~ 24 h = ~ 29 h (60.000 books, 24 million pages)

Input (HTML file paths):
/NAS/Z119585409/00000707.html
/NAS/Z119585409/00000708.html
/NAS/Z119585409/00000709.html
…
/NAS/Z138682341/00000707.html
/NAS/Z138682341/00000708.html
/NAS/Z138682341/00000709.html
…
/NAS/Z178791257/00000707.html
/NAS/Z178791257/00000708.html
/NAS/Z178791257/00000709.html
…
/NAS/Z967985409/00000707.html
/NAS/Z967985409/00000708.html
/NAS/Z967985409/00000709.html
…
/NAS/Z196545409/00000707.html
/NAS/Z196545409/00000708.html
/NAS/Z196545409/00000709.html
...

SequenceFile keys:
Z119585409/00000707
Z119585409/00000708
Z119585409/00000709
Z119585409/00000710
Z119585409/00000711
Z119585409/00000712
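The idea behind this step is to pack millions of small HTML files into one large keyed container, so that later jobs read a few big files instead of many tiny ones. The Python sketch below illustrates that idea with a simple length-prefixed record layout. It is explicitly not Hadoop's SequenceFile format and not the project's SequenceFileCreator; the layout and function names are hypothetical.

```python
import io
import struct


def write_record(stream, key, value):
    """Append one record: two 4-byte big-endian lengths, then the key and value bytes."""
    key_bytes = key.encode("utf-8")
    stream.write(struct.pack(">II", len(key_bytes), len(value)))
    stream.write(key_bytes)
    stream.write(value)


def read_records(stream):
    """Yield (key, value) pairs until the container is exhausted."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        key_len, value_len = struct.unpack(">II", header)
        key = stream.read(key_len).decode("utf-8")
        yield key, stream.read(value_len)


# Demonstration: pack two made-up pages into one in-memory container and read them back.
pages = {
    "Z119585409/00000707": b"<html><body>page 707</body></html>",
    "Z119585409/00000708": b"<html><body>page 708</body></html>",
}
container = io.BytesIO()
for page_id, content in pages.items():
    write_record(container, page_id, content)

container.seek(0)
restored = dict(read_records(container))
print(sorted(restored))
```

Keeping the page ID as the key means downstream map tasks receive each page together with its identifier, without having to look at file names again.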

Page 38: The Elephant in the Library - Integrating Hadoop

Execute Hadoop MapReduce job using the sequence file created before in order to calculate the average paragraph block width

Page 39: The Elephant in the Library - Integrating Hadoop

HadoopAvBlockWidthMapReduce: SequenceFile → HTML parsing (Map) → Reduce → Textfile

Runtime: ~ 6 h (60.000 books, 24 million pages)

Input (SequenceFile keys):
Z119585409/00000001
Z119585409/00000002
Z119585409/00000003
Z119585409/00000004
Z119585409/00000005
...

Map output (page ID, block width):
Z119585409/00000001 2100
Z119585409/00000001 2200
Z119585409/00000001 2300
Z119585409/00000001 2400
Z119585409/00000002 2100
Z119585409/00000002 2200
Z119585409/00000002 2300
Z119585409/00000002 2400
Z119585409/00000003 2100
Z119585409/00000003 2200
Z119585409/00000003 2300
Z119585409/00000003 2400
Z119585409/00000004 2100
Z119585409/00000004 2200
Z119585409/00000004 2300
Z119585409/00000004 2400
Z119585409/00000005 2100
Z119585409/00000005 2200
Z119585409/00000005 2300
Z119585409/00000005 2400

Reduce output (page ID, average block width):
Z119585409/00000001 2250
Z119585409/00000002 2250
Z119585409/00000003 2250
Z119585409/00000004 2250
Z119585409/00000005 2250
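The map and reduce steps for the average block width can be sketched in Python as follows. This is an illustration under assumptions, not the project's HadoopAvBlockWidthMapReduce job: it assumes the HTML pages are hOCR files in which each paragraph block is an element with class `ocr_par` and a `title` attribute of the form `bbox x0 y0 x1 y1`, and it takes the block width to be `x1 - x0`. The sample page is made up so that its four widths match the values in the example.

```python
from html.parser import HTMLParser
from statistics import mean


class ParagraphWidths(HTMLParser):
    """Collect the widths of all elements marked as hOCR paragraph blocks."""

    def __init__(self):
        super().__init__()
        self.widths = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "ocr_par" not in (attributes.get("class") or "").split():
            return
        # Assumed title layout: "bbox x0 y0 x1 y1", optionally followed by "; other properties".
        for prop in (attributes.get("title") or "").split(";"):
            parts = prop.split()
            if len(parts) == 5 and parts[0] == "bbox":
                x0, _, x1, _ = map(int, parts[1:])
                self.widths.append(x1 - x0)


def map_page(page_id, html):
    """Map step: emit (page_id, block_width) for every paragraph block on the page."""
    parser = ParagraphWidths()
    parser.feed(html)
    return [(page_id, width) for width in parser.widths]


def reduce_page(page_id, widths):
    """Reduce step: average all block widths of one page."""
    return page_id, round(mean(widths))


# Made-up hOCR fragment with four paragraph blocks of width 2100, 2200, 2300 and 2400.
SAMPLE_PAGE = """
<div class="ocr_page">
  <p class="ocr_par" title="bbox 100 100 2200 300">first</p>
  <p class="ocr_par" title="bbox 100 400 2300 600">second</p>
  <p class="ocr_par" title="bbox 100 700 2400 900">third</p>
  <p class="ocr_par" title="bbox 100 1000 2500 1200">fourth</p>
</div>
"""

mapped = map_page("Z119585409/00000001", SAMPLE_PAGE)
reduced = reduce_page("Z119585409/00000001", [width for _, width in mapped])
print(mapped)
print(reduced)
```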

Page 40: The Elephant in the Library - Integrating Hadoop

Create Hive table and load generated data into the Hive database

Page 41: The Elephant in the Library - Integrating Hadoop

Analytic Queries: HiveLoadExifData & HiveLoadHocrData

Runtime: ~ 6 h (60.000 books, 24 million pages)

CREATE TABLE jp2width(jid STRING, jwidth INT)

jp2width:
jid jwidth
Z119585409/00000001 2250
Z119585409/00000002 2150
Z119585409/00000003 2125
Z119585409/00000004 2125
Z119585409/00000005 2250

CREATE TABLE htmlwidth(hid STRING, hwidth INT)

htmlwidth:
hid hwidth
Z119585409/00000001 1870
Z119585409/00000002 2100
Z119585409/00000003 2015
Z119585409/00000004 1350
Z119585409/00000005 1700

Page 42: The Elephant in the Library - Integrating Hadoop

Analytic Queries: HiveSelect

Runtime: ~ 6 h (60.000 books, 24 million pages)

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid

jp2width:
jid jwidth
Z119585409/00000001 2250
Z119585409/00000002 2150
Z119585409/00000003 2125
Z119585409/00000004 2125
Z119585409/00000005 2250

htmlwidth:
hid hwidth
Z119585409/00000001 1870
Z119585409/00000002 2100
Z119585409/00000003 2015
Z119585409/00000004 1350
Z119585409/00000005 1700

Result:
jid jwidth hwidth
Z119585409/00000001 2250 1870
Z119585409/00000002 2150 2100
Z119585409/00000003 2125 2015
Z119585409/00000004 2125 1350
Z119585409/00000005 2250 1700
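The join can be tried out locally without a Hive installation. The Python sketch below loads the five sample rows into an in-memory SQLite database and runs the same inner join. SQLite is used purely as a stand-in for Hive here, so the column types (`TEXT`, `INTEGER`) differ from the Hive `STRING` and `INT` types, and the added `ORDER BY` only makes the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")
conn.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")

# Sample rows taken from the slide.
conn.executemany("INSERT INTO jp2width VALUES (?, ?)", [
    ("Z119585409/00000001", 2250),
    ("Z119585409/00000002", 2150),
    ("Z119585409/00000003", 2125),
    ("Z119585409/00000004", 2125),
    ("Z119585409/00000005", 2250),
])
conn.executemany("INSERT INTO htmlwidth VALUES (?, ?)", [
    ("Z119585409/00000001", 1870),
    ("Z119585409/00000002", 2100),
    ("Z119585409/00000003", 2015),
    ("Z119585409/00000004", 1350),
    ("Z119585409/00000005", 1700),
])

# Same join as in the Hive query: match image width and text block width per page.
rows = conn.execute(
    "SELECT jid, jwidth, hwidth "
    "FROM jp2width INNER JOIN htmlwidth ON jid = hid "
    "ORDER BY jid"
).fetchall()

for jid, jwidth, hwidth in rows:
    print(jid, jwidth, hwidth)
```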

Page 43: The Elephant in the Library - Integrating Hadoop

Perform a simple Hive query to test if the database has been created successfully

Page 44: The Elephant in the Library - Integrating Hadoop

4. Outlook

“Progress generally appears much greater than it really is”

Johann Nestroy, 1847

Page 45: The Elephant in the Library - Integrating Hadoop

What have WE learned?

• We need to carefully assess the efforts for data preparation vs. the actual processing load

• HDFS prefers large files over many small ones and is basically “append-only”

• There is still much more the Hadoop ecosystem has to offer, e.g. YARN, Pig, Mahout

Page 46: The Elephant in the Library - Integrating Hadoop

What can YOU do?

• Come join our “Hadoop in cultural heritage” hackathon on 2-4 December 2013, Vienna (see http://www.scape-project.eu/events)

• Check out some tools from our github at https://github.com/openplanets/ and help us make them better and more scalable

• Follow us at @SCAPEProject and spread the word!

Page 47: The Elephant in the Library - Integrating Hadoop

What’s in it for US?

• Digital (free) access to centuries of cultural heritage data, 24x7 and from anywhere

• Ensuring our cultural history is not lost

• New innovative applications using cultural heritage data (education, creative industries)

Page 48: The Elephant in the Library - Integrating Hadoop

Thank you! Questions? (btw, we’re hiring)

www.kb.nl
www.onb.ac.at

www.scape-project.eu
www.openplanetsfoundation.org