conversion of newspapers to digital objects, digital data preservation, and other interesting things. Frederick Zarndt, Chair, IFLA Newspapers Section. Tuesday, August 21, 2012

20120822 conversion of historic newspapers to digital objects [russian state library]

Sep 12, 2014

Transcript
Page 1: 20120822 conversion of historic newspapers to digital objects [russian state library]

conversion of newspapers to digital objects,

digital data preservation, and other interesting things

Frederick Zarndt

Chair, IFLA Newspapers Section

Tuesday, August 21, 12

Page 2: 20120822 conversion of historic newspapers to digital objects [russian state library]

[Slide image: scan of a historic newspaper page. The machine-generated transcription of the image text is illegible and is omitted here.]

why digitize newspapers?


Page 3: 20120822 conversion of historic newspapers to digital objects [russian state library]

Photo by DAVID ILIFF. License: CC-BY-SA 3.0

reading rooms by the numbers (monthly average)

Country       Population    Reading room visitors   Newspaper requests: microform   Newspaper requests: print
Australia     22,876,000    5,130                   345                             240
France        65,350,000    3,000                   2,000                           1,000
Netherlands   16,847,000    NA                      NA                              NA
New Zealand   4,414,000     NA                      NA                              NA
Norway        4,985,000     600                     400                             NA
Singapore     5,184,000     NA                      300                             NA
UK            62,262,000    2,000                   6,900                           4,816
USA           313,292,000   NA                      NA                              NA


Page 4: 20120822 conversion of historic newspapers to digital objects [russian state library]

digitised newspapers by the numbers

Digitised Historical Newspapers (monthly average)

Population    Unique Visitors   Genealogist   Other   User Age
22,876,000    150,000           50%           50%     >55
37,692,000    12,800            65%           35%     >50
5,405,000     NA                NA            NA      ?
65,350,000    22,000            NA            NA      ?
16,847,000    50,000            NA            NA      ?
4,414,000     83,333            50%           NA      >50
4,985,000     1,500             NA            NA      ?
5,184,000     12,400            NA            NA      ?
62,262,000    NA                NA            NA      ?
313,292,000   NA                NA            NA      ?


Page 5: 20120822 conversion of historic newspapers to digital objects [russian state library]

physical versus digital (monthly average)

Population    Requests for Newspapers (Paper + Microform)   Digitised Historical Newspapers (Unique Visitors)
22,876,000    585                                           150,000
37,692,000    NA                                            12,800
5,405,000     NA                                            NA
65,350,000    3,000                                         22,000
16,847,000    NA                                            50,000
4,414,000     NA                                            83,333
4,985,000     400                                           1,500
5,184,000     300                                           12,400
62,262,000    11,716                                        NA
313,292,000   NA                                            NA
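
One way to read the table above is to divide monthly unique digital visitors by monthly physical requests for the four rows where both figures are reported. The following is a minimal Python sketch of that arithmetic, not part of the original slides. The rows are keyed by population because that is how the table identifies them, and the function name is illustrative. Visitors and requests are different units, so the ratio is only a rough indicator of scale.

```python
# Monthly averages copied from the table above:
# population -> (paper + microform requests, digital unique visitors).
# Rows with NA in either column are left out.
ROWS = {
    22_876_000: (585, 150_000),
    65_350_000: (3_000, 22_000),
    4_985_000: (400, 1_500),
    5_184_000: (300, 12_400),
}


def digital_to_physical_ratio(requests, visitors):
    """Unique digital visitors per physical newspaper request."""
    return visitors / requests


for population, (requests, visitors) in ROWS.items():
    ratio = digital_to_physical_ratio(requests, visitors)
    print(f"{population:>11,}: {ratio:6.1f} digital visitors per physical request")
```

In every row with complete data the digital figure is the larger one, which is the point the slide makes.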


Page 6: 20120822 conversion of historic newspapers to digital objects [russian state library]

more numbers!

Digitised Historical Newspapers (monthly average)

Population    Collection name                ~Size [pages]   Unique Visitors   Genealogist   Other   Lines Corrected   User Age
22,876,000    Trove                          5,000,000       150,000           50%           50%     220,000           >55
37,692,000    CDNC                           495,000         12,800            65%           35%     31,000            >50
5,405,000     Historical Newspaper Library   2,000,000       NA                NA            NA      NA                ?
65,350,000    Gallica                        2,200,000       22,000            NA            NA      NA                ?
16,847,000    Historische Kranten            5,000,000       50,000            NA            NA      NA                ?
4,414,000     Papers Past                    2,213,000       83,333            50%           NA      NA                >50
4,985,000     NBDigital Aviser               8,100,000       1,500             NA            NA      NA                ?
5,184,000     Newspaper SG                   2,400,000       12,400            NA            NA      NA                ?
62,262,000    British Newspaper Archive      4,880,000       NA                NA            NA      NA                ?
313,292,000   Chronicling America            4,100,000       NA                NA            NA      NA                ?


Page 7: 20120822 conversion of historic newspapers to digital objects [russian state library]

what is Alexa?

• Alexa collects and analyzes Internet data for purposes of web analytics. Web analytics is the measurement, collection, analysis and reporting of Internet data for the purposes of understanding and optimizing web usage. Alexa is now a subsidiary of Amazon.
• Alexa was founded in 1996 by Brewster Kahle (Internet Archive) and Bruce Gilliat.
• Alexa's operations include archiving webpages as they are crawled. This database served as the basis for the creation of the Internet Archive accessible through the Wayback Machine.
• Alexa continually crawls all publicly available websites to create a series of snapshots of the web.
• Alexa gathers information from a variety of sources to provide key statistics about each site on the web, for example Traffic Rank, the number of PageViews, site Speed, and Bounce Rate. This information is derived from Alexa toolbar users (~6,000,000 worldwide).


Page 8: 20120822 conversion of historic newspapers to digital objects [russian state library]

definitions

• A PageView is a request for a file whose type is defined as a page.

• A Unique Visitor is a uniquely identified client generating requests on the web server or viewing pages within a defined time period (i.e. day, week or month). A Unique Visitor counts once within the timescale.

• A Visit is a series of page requests from the same uniquely identified client with a time of no more than 30 minutes between each page request.

• Bounce Rate is the percentage of visits where the visitor enters and exits at the same page without visiting any other pages on the site in between.

• World | Country Rank is a function of the average daily unique visits and the number of unique pages requested.

definitions adapted from Wikipedia http://en.wikipedia.org/wiki/Web_analytics
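
The definitions above can be sketched in a few lines of Python. This is an illustrative sketch only, not how Alexa computes its figures: the log format (client id, timestamp, page) and the function name are assumptions, and a bounce is simplified to a visit with a single page request.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # a longer gap between requests starts a new visit


def summarize(requests):
    """requests: iterable of (client_id, timestamp, page) page requests.

    Returns (page_views, unique_visitors, visits, bounce_rate).
    """
    by_client = {}
    for client, when, page in sorted(requests, key=lambda r: r[1]):
        by_client.setdefault(client, []).append((when, page))

    visits = []  # each visit is the list of pages requested during it
    for hits in by_client.values():
        current = [hits[0][1]]
        for (prev_time, _), (when, page) in zip(hits, hits[1:]):
            if when - prev_time > SESSION_GAP:
                visits.append(current)
                current = []
            current.append(page)
        visits.append(current)

    page_views = sum(len(v) for v in visits)
    bounces = sum(1 for v in visits if len(v) == 1)
    bounce_rate = bounces / len(visits) if visits else 0.0
    return page_views, len(by_client), len(visits), bounce_rate


# Hypothetical request log: client "a" returns after two hours, client "b" views one page.
t0 = datetime(2012, 4, 2, 9, 0)
LOG = [
    ("a", t0, "/"),
    ("a", t0 + timedelta(minutes=10), "/search"),
    ("a", t0 + timedelta(hours=2), "/"),
    ("b", t0 + timedelta(minutes=5), "/"),
]
print(summarize(LOG))
```

Client "a" counts as one unique visitor but two visits, because the second request to "/" arrives more than 30 minutes after the previous one.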


Page 9: 20120822 conversion of historic newspapers to digital objects [russian state library]

Alexa ranking world view

Alexa 3 month trailing averages 2-Apr-2012

Population    Website                                     World rank [Lo is good]
313,292,000   http://www.loc.gov/index.html/              3,122
22,876,000    http://trove.nla.gov.au/                    16,700
65,350,000    http://www.bnf.fr/                          17,096
62,262,000    http://www.bl.uk/                           27,079
4,414,000     http://www.natlib.govt.nz/                  123,976
62,262,000    http://www.britishnewspaperarchive.co.uk/   155,259
16,847,000    http://www.kb.nl/                           155,363
5,184,000     http://www.nl.sg/                           156,610
4,985,000     http://www.nb.no/                           189,940
5,405,000     http://www.nationallibrary.fi/              3,212,803


Page 10: 20120822 conversion of historic newspapers to digital objects [russian state library]

Alexa ranking country view

Alexa 3 month trailing averages 2-Apr-2012

Population    Website                                     World rank [Lo is good]   Country rank [Lo is good]
5,405,000     http://www.nationallibrary.fi/              3,212,803                 199
22,876,000    http://www.nla.gov.au/                      16,700                    375
4,414,000     http://www.natlib.govt.nz/                  123,976                   515
65,350,000    http://www.bnf.fr/                          17,096                    727
4,985,000     http://www.nb.no/                           189,940                   891
313,292,000   http://www.loc.gov/index.html/              3,122                     1,011
5,184,000     http://www.nl.sg/                           156,610                   1,208
62,262,000    http://www.bl.uk/                           27,079                    2,245
16,847,000    http://www.kb.nl/                           155,363                   3,450
62,262,000    http://www.britishnewspaperarchive.co.uk/   155,259                   15,692


Page 11: 20120822 conversion of historic newspapers to digital objects [russian state library]

where visitors go

Alexa 3 month trailing averages 2-Apr-2012

Population    World rank [Lo is good]   Country rank [Lo is good]   Where visitors go [sub-domain]       Share
5,405,000     3,212,803                 199                         NA                                   NA
22,876,000    16,700                    375                         http://trove.nla.gov.au/             57.2%
4,414,000     123,976                   515                         http://paperspast.natlib.govt.nz/    50.9%
65,350,000    17,096                    727                         http://gallica.bnf.fr/               52.0%
4,985,000     189,940                   891                         NA                                   NA
313,292,000   3,122                     1,011                       http://chroniclingamerica.loc.gov/   4.8%
5,184,000     156,610                   1,208                       http://newspapers.nl.sg/             28.0%
62,262,000    27,079                    2,245                       http://newspapers11.bl.uk/blcs/      2.5%
16,847,000    155,363                   3,450                       http://kranten.kb.nl/                22.4%
62,262,000    155,259                   15,692                      NA                                   NA


Page 12: 20120822 conversion of historic newspapers to digital objects [russian state library]

lots of numbers (sorted by time on site)

Alexa 3 month trailing averages 2-Apr-2012

Website                                     Speed [Hi is good]   Bounce rate [Lo is good]   Reputation [Hi is good]   Page views per visitor [Hi is good]   Time on site [Hi is good]
http://www.britishnewspaperarchive.co.uk/   51%                  28%                        485                       13.0                                  11m 40s
http://www.bnf.fr/                          71%                  35%                        13,744                    14.9                                  8m 30s
http://www.natlib.govt.nz/                  96%                  44%                        2,480                     5.3                                   6m 49s
http://trove.nla.gov.au/                    42%                  55%                        9,514                     5.4                                   4m 52s
http://www.loc.gov/index.html/              67%                  51%                        91,331                    5.3                                   3m 55s
http://www.kb.nl/                           89%                  54%                        3,295                     5.0                                   3m 42s
http://www.bl.uk/                           54%                  52%                        16,191                    3.8                                   3m 2s
http://www.nb.no/                           59%                  47%                        1,579                     3.0                                   2m 57s
http://www.nationallibrary.fi/              NA                   54%                        199                       3.1                                   2m 6s
http://www.nl.sg/                           72%                  65%                        802                       2.0                                   2m 4s


Page 13: 20120822 conversion of historic newspapers to digital objects [russian state library]

even more numbers (sorted by time on site)

Alexa 3 month trailing averages 2-Apr-2012

Website                                     Speed [Hi is good]   Bounce rate [Lo is good]   Reputation [Hi is good]   Page views per visitor [Hi is good]   Time on site [Hi is good]
http://www.ancestry.com/                    32%                  24%                        20,055                    29.9                                  23m 54s
http://www.familysearch.org/                50%                  18%                        9,832                     15.8                                  16m 19s
http://www.britishnewspaperarchive.co.uk/   51%                  28%                        485                       13.0                                  11m 40s
http://www.bnf.fr/                          71%                  35%                        13,744                    14.9                                  8m 30s
http://www.natlib.govt.nz/                  96%                  44%                        2,480                     5.3                                   6m 49s
http://trove.nla.gov.au/                    42%                  55%                        9,514                     5.4                                   4m 52s
http://www.loc.gov/index.html/              67%                  51%                        91,331                    5.3                                   3m 55s
http://www.kb.nl/                           89%                  54%                        3,295                     5.0                                   3m 42s
http://www.bl.uk/                           54%                  52%                        16,191                    3.8                                   3m 2s
http://www.nb.no/                           59%                  47%                        1,579                     3.0                                   2m 57s
http://www.nationallibrary.fi/              NA                   54%                        199                       3.1                                   2m 6s
http://www.nl.sg/                           72%                  65%                        802                       2.0                                   2m 4s


Page 14: 20120822 conversion of historic newspapers to digital objects [russian state library]

digital newspapers enable broader, easier, and faster access

why digitize newspaper collections?


Page 15: 20120822 conversion of historic newspapers to digital objects [russian state library]

considerations in newspaper digitization


Page 16: 20120822 conversion of historic newspapers to digital objects [russian state library]

selection criteria

• importance of title
• complete (no missing issues)
• temporal coverage
• research value
• quality / fragility of original documents
• quality of microfilm
• etc (other local criteria)


Page 17: 20120822 conversion of historic newspapers to digital objects [russian state library]

page-level versus article-level newspaper digitization

                cost   production difficulty   copyright management   usability   accessibility
page-level      $      easy                    usually simple         low         good
article-level   $$$    hard                    usually complex        excellent   excellent


Page 18: 20120822 conversion of historic newspapers to digital objects [russian state library]

preservation, access, administration

Open Archival Information System (OAIS) reference model


Page 19: 20120822 conversion of historic newspapers to digital objects [russian state library]

the digitization process

[Diagram labels: image production, images, digitization magic, text, objects, ingest, preservation, access]


Page 20: 20120822 conversion of historic newspapers to digital objects [russian state library]

standard file formats

• image file formats
  • TIFF
  • JPEG2000
  • JPEG
  • GIF
• text file formats
  • PDF, PDF/A, PDF/A-1b, PDF/A-1a
  • TEI XML
  • HTML
  • plain text
  • NITF / NewsML
• metadata
  • METS
  • MODS / PREMIS / ALTO / MIX ...


Page 21: 20120822 conversion of historic newspapers to digital objects [russian state library]

image decisions

• image production source materials
  • original documents: better quality, more expensive
  • microfiche: poorer quality, less expensive, microfiche quality varies
• bit depth
  • black-and-white (bitonal)
  • greyscale
  • color
• resolution
• compression
  • no compression
  • lossless (reversible)
  • lossy (irreversible)
• image metadata
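
Bit depth and resolution together fix the size of the uncompressed image, which is why these decisions drive storage cost. The following is a rough Python sketch of that arithmetic. The function names are illustrative, the page dimensions (7136 x 9224 pixels) are taken from the ALTO sample elsewhere in this deck, and file headers, metadata, and compression are ignored.

```python
def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Size of the raw pixel data only (no headers, metadata, or compression)."""
    return width_px * height_px * bits_per_pixel // 8


def pixels_for(inches, dpi):
    """Pixels along one edge for a physical length scanned at a given resolution."""
    return round(inches * dpi)


WIDTH, HEIGHT = 7136, 9224  # page size in pixels from the ALTO sample in this deck
for label, bits in [("bitonal", 1), ("greyscale", 8), ("color", 24)]:
    size = uncompressed_bytes(WIDTH, HEIGHT, bits)
    print(f"{label:9s} {bits:2d}-bit: {size / 1_000_000:6.1f} MB")
```

Moving from bitonal to 24-bit color multiplies the raw size by 24, before any lossless or lossy compression is applied.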


Page 22: 20120822 conversion of historic newspapers to digital objects [russian state library]

image format comparison

Wikipedia contributors, "Comparison of Graphics File Formats," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Comparison_of_graphics_file_formats (accessed August 1, 2012)

JBIG (.jbig, .jbg)
  compression: lossless
  bit depth: 1-bit
  metadata: no
  color management: no
  1st public release: 2000?

JPEG (.jpg, .jpeg)
  compression: lossy, DCT, RLE, Huffman
  bit depth: 8-bit, 12-bit, 24-bit
  metadata: yes
  color management: yes
  mimetype: image/jpeg, public.jpeg
  patent: no
  1st public release: 1992

JPEG2000 (.jp2)
  compression: many lossless and lossy compression algorithms
  bit depth: 8-bit, 16-bit, color to 48 bits
  metadata: yes
  color management: yes
  mimetype: image/jp2, public.jpeg200
  patent: yes but part 1 is patent free
  1st public release: 2000

TIFF (.tiff, .tif)
  compression: none, LZW, RLE, ZIP, other
  bit depth: 1, 2, 4, 8, 16, 24, 32 bits
  metadata: yes
  color management: yes
  mimetype: image/tiff, public.tiff
  patent: no
  1st public release: 1986


Page 23: 20120822 conversion of historic newspapers to digital objects [russian state library]

digital library standards

• METS XML for descriptive, structural, technical, and administrative metadata
• descriptive metadata
  • Metadata Object Description Schema (MODS): selected metadata from MARC
  • Dublin Core: fundamental group of text elements for describing and cataloging
• technical metadata
  • ALTO for OCR text
  • PREMIS for digital preservation
  • MIX and ANSI/NISO Z39.87 for images


Page 24: 20120822 conversion of historic newspapers to digital objects [russian state library]

Metadata Encoding and Transmission Standard

• METS is an XML standard for encoding descriptive, administrative, and structural metadata about objects within a digital library
• METS files consist of 7 sections, all optional except the structural map: header, descriptive, administrative, file section, structural map, structural link, and behavior
• METS profiles describe a class of METS documents in sufficient detail to give both document authors and programmers the guidance they need to create and process METS documents conforming to a particular profile
• current version 1.9.1
• administered by METS editorial board (international group of volunteers)
• standards hosted by Library of Congress at http://www.loc.gov/standards/mets/
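
The seven sections map onto seven top-level elements of a METS document. The following is a minimal Python sketch, using only the standard library, that builds an empty skeleton with those elements in schema order. It is meant to show the overall shape only: the skeleton is not a valid METS document, because real sections need content, and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

# Top-level METS elements: header, descriptive metadata, administrative metadata,
# file section, structural map, structural links, behavior section.
SECTIONS = ["metsHdr", "dmdSec", "amdSec", "fileSec",
            "structMap", "structLink", "behaviorSec"]


def build_skeleton():
    """Return a <mets:mets> root holding one empty element per section."""
    root = ET.Element(f"{{{METS_NS}}}mets")
    for name in SECTIONS:
        ET.SubElement(root, f"{{{METS_NS}}}{name}")
    return root


print(ET.tostring(build_skeleton(), encoding="unicode"))
```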


Page 25: 20120822 conversion of historic newspapers to digital objects [russian state library]

Graphic from Karin Bredenberg, Communicating Archival Metadata conference and workshops. Riksarkivet, 2011.

METS file structure


Page 26: 20120822 conversion of historic newspapers to digital objects [russian state library]

Metadata Object Description Schema

• MODS is an XML schema for a bibliographic element set that may be used for library applications. Derivative of MARC 21 bibliographic format. Includes a subset of MARC fields, using language-based tags rather than numeric ones
• Subset of MARC 21
• Mappings exist between MODS and MARC, Dublin Core, and RDA (conversion tools exist)
• May be used in conjunction with METS XML
• current version 3.4
• administered by Library of Congress Network Development and MARC Standards Office with help from interested users
• standards hosted by Library of Congress at http://www.loc.gov/standards/mods/


Page 27: 20120822 conversion of historic newspapers to digital objects [russian state library]

MODS metadata in METS XML

<mets:dmdSec ID="issue-nla.news-issn18368190_18740425">
  <mets:mdWrap MDTYPE="MODS">
    <mets:xmlData>
      <mods:mods xmlns="http://www.loc.gov/mods/v3">
        <mods:language>
          <mods:languageTerm type="code" authority="rfc3066">en</mods:languageTerm>
        </mods:language>
        <mods:genre>newspaper issue</mods:genre>
        <mods:originInfo>
          <mods:dateIssued>18740425</mods:dateIssued>
        </mods:originInfo>
        <mods:relatedItem type="host">
          <mods:titleInfo>
            <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
          </mods:titleInfo>
          <mods:genre>newspaper</mods:genre>
          <mods:identifier>ISSN18368190</mods:identifier>
          <mods:part>
            <mods:detail type="volume">
              <mods:number>IX</mods:number>
            </mods:detail>
          </mods:part>
          <mods:part>
            <mods:detail type="issue">
              <mods:number>12</mods:number>
            </mods:detail>
          </mods:part>
        </mods:relatedItem>
      </mods:mods>
    </mets:xmlData>
  </mets:mdWrap>
</mets:dmdSec>
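
A record like the one above can be read with any namespace-aware XML parser. The following is a minimal Python sketch using the standard library. The embedded sample is an abridged copy of the slide's record with the `mets` and `mods` prefixes declared explicitly (an assumption, since the slide shows only a fragment), and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/", "mods": "http://www.loc.gov/mods/v3"}

# Abridged copy of the record above, with both namespace prefixes declared.
SAMPLE = """\
<mets:dmdSec xmlns:mets="http://www.loc.gov/METS/"
             xmlns:mods="http://www.loc.gov/mods/v3"
             ID="issue-nla.news-issn18368190_18740425">
  <mets:mdWrap MDTYPE="MODS">
    <mets:xmlData>
      <mods:mods>
        <mods:originInfo><mods:dateIssued>18740425</mods:dateIssued></mods:originInfo>
        <mods:relatedItem type="host">
          <mods:titleInfo>
            <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
          </mods:titleInfo>
          <mods:part><mods:detail type="volume"><mods:number>IX</mods:number></mods:detail></mods:part>
          <mods:part><mods:detail type="issue"><mods:number>12</mods:number></mods:detail></mods:part>
        </mods:relatedItem>
      </mods:mods>
    </mets:xmlData>
  </mets:mdWrap>
</mets:dmdSec>
"""


def describe_issue(xml_text):
    """Pull title, issue date, volume, and issue number out of a MODS record."""
    root = ET.fromstring(xml_text)
    mods = root.find(".//mods:mods", NS)
    host = mods.find("mods:relatedItem[@type='host']", NS)

    def number(kind):
        path = f"mods:part/mods:detail[@type='{kind}']/mods:number"
        return host.findtext(path, namespaces=NS)

    return {
        "title": host.findtext("mods:titleInfo/mods:title", namespaces=NS),
        "date_issued": mods.findtext("mods:originInfo/mods:dateIssued", namespaces=NS),
        "volume": number("volume"),
        "issue": number("issue"),
    }


print(describe_issue(SAMPLE))
```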


Page 28: 20120822 conversion of historic newspapers to digital objects [russian state library]

Dublin Core metadata

• Dublin Core is a set of vocabulary terms used to describe resources for the purposes of discovery.
• Dublin Core metadata element set is endorsed in IETF RFC 5013, ISO 15836:2009, and NISO Z39.85
• Metadata terms last updated 14-Jun-2012
• May be used in conjunction with METS XML
• Dublin Core Metadata Initiative (DCMI) is an open organization, incorporated as a public, not-for-profit company in Singapore
• Dublin Core Metadata Initiative is hosted at http://dublincore.org/


Page 29: 20120822 conversion of historic newspapers to digital objects [russian state library]

Analyzed Layout and Text Object

• ALTO XML provides technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper
• commonly used in conjunction with METS XML but may be used standalone
• current version 2.0
• administered by ALTO editorial board (international group of volunteers)
• standards hosted by Library of Congress at http://www.loc.gov/standards/alto/


Page 30: 20120822 conversion of historic newspapers to digital objects [russian state library]

<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://schema.ccs-gmbh.com/metae/alto-1-4.xsd" xmlns:xlink="http://www.w3.org/1999/xlink">
<Description>
  <MeasurementUnit>pixel</MeasurementUnit>
  <sourceImageInformation>
    <fileName>//docstorage/impdata_2$/IN/NLA/db0046/batch-1109/nlaImageSeq-2349218-b.tif</fileName>
  </sourceImageInformation>
</Description>
<Styles>
  <TextStyle ID="TXT_0" FONTSIZE="7" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/>
  <TextStyle ID="TXT_1" FONTSIZE="9" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/>
</Styles>
<Layout>
  <Page ID="P1" PHYSICAL_IMG_NR="1" HEIGHT="9224" WIDTH="7136" PC="0.967">
    <TopMargin ID="P1_TM00001" HPOS="0" VPOS="0" WIDTH="7135" HEIGHT="814"/>
    <LeftMargin ID="P1_LM00001" HPOS="0" VPOS="814" WIDTH="151" HEIGHT="8194"/>
    <RightMargin ID="P1_RM00001" HPOS="6959" VPOS="814" WIDTH="176" HEIGHT="8194"/>
    <BottomMargin ID="P1_BM00001" HPOS="0" VPOS="9008" WIDTH="7135" HEIGHT="216"/>
    <PrintSpace ID="P1_PS00001" HPOS="151" VPOS="814" WIDTH="6808" HEIGHT="8194">
      <ComposedBlock ID="ART1" HEIGHT="2366" WIDTH="929" HPOS="209" VPOS="831">
        <ComposedBlock ID="ZONE1-1" HEIGHT="88" WIDTH="641" HPOS="357" VPOS="831">
          <TextBlock ID="P1_TB00004" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="88" STYLEREFS="TXT_4 PAR_LEFT">
            <TextLine ID="P1_TL00065" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="75">
              <String ID="P1_ST00404" HPOS="357" VPOS="831" WIDTH="65" HEIGHT="74" CONTENT="The" WC="0.98" CC="000"/>
              <SP ID="P1_SP00340" HPOS="422" VPOS="906" WIDTH="0"/>
              <String ID="P1_ST00405" HPOS="422" VPOS="831" WIDTH="576" HEIGHT="74" CONTENT="Queenslander." WC="0.96" CC="0000000000000"/>
            </TextLine>
          </TextBlock>
        </ComposedBlock>
        <ComposedBlock ID="ZONE1-2" HEIGHT="83" WIDTH="894" HPOS="228" VPOS="964"/>
        <ComposedBlock ID="ZONE1-3" HEIGHT="46" WIDTH="702" HPOS="331" VPOS="1087"/>
            <TextLine ID="P1_TL01143" HPOS="5946" VPOS="8957" WIDTH="881" HEIGHT="46">
              <String ID="P1_ST06356" HPOS="5946" VPOS="8965" WIDTH="3" HEIGHT="27" CONTENT="I" WC="1.00" CC="0"/>
              <SP ID="P1_SP05236" HPOS="5950" VPOS="8992" WIDTH="658"/>
              <String ID="P1_ST06357" HPOS="6608" VPOS="8957" WIDTH="219" HEIGHT="46" CONTENT="Proprietors." WC="1.00" CC="101401212010"/>
            </TextLine>
          </TextBlock>
        </ComposedBlock>
      </ComposedBlock>
    </PrintSpace>
  </Page>
</Layout>
</alto>

Analyzed Layout and Text Object
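
Because ALTO stores every word as a String element with its position and a word confidence (WC), search highlighting and OCR quality checks both reduce to walking those elements. The following is a minimal Python sketch using the standard library. The sample is a cut-down, well-formed excerpt of the slide's file, the function names and the 0.97 threshold are illustrative, and ALTO files that declare a default namespace would need namespace-qualified tag names instead.

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed excerpt of the ALTO sample (no namespace, as in the slide).
SAMPLE = """\
<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="P1_TB00004">
          <TextLine ID="P1_TL00065">
            <String ID="P1_ST00404" HPOS="357" VPOS="831" WIDTH="65" HEIGHT="74" CONTENT="The" WC="0.98"/>
            <SP ID="P1_SP00340" HPOS="422" VPOS="906" WIDTH="0"/>
            <String ID="P1_ST00405" HPOS="422" VPOS="831" WIDTH="576" HEIGHT="74" CONTENT="Queenslander." WC="0.96"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def text_lines(alto_xml):
    """Yield (line_text, words) per TextLine; each word is (content, confidence, box)."""
    root = ET.fromstring(alto_xml)
    for line in root.iter("TextLine"):
        words = []
        for s in line.iter("String"):
            box = tuple(int(s.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
            words.append((s.get("CONTENT"), float(s.get("WC")), box))
        yield " ".join(w[0] for w in words), words


def low_confidence(alto_xml, threshold=0.97):
    """Words whose OCR word confidence falls below the threshold."""
    return [w[0] for _, words in text_lines(alto_xml) for w in words if w[1] < threshold]


for text, words in text_lines(SAMPLE):
    print(text, words)
print(low_confidence(SAMPLE))
```

The bounding boxes are what let a viewer highlight a search hit on the page image, and the confidences are one possible way to pick words for manual correction.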


Page 31: 20120822 conversion of historic newspapers to digital objects [russian state library]

Analyzed Layout and Text Object

book


Page 32: 20120822 conversion of historic newspapers to digital objects [russian state library]

Analyzed Layout and Text Object

newspaper


Page 33: 20120822 conversion of historic newspapers to digital objects [russian state library]

Preservation Metadata Implementation Strategies

• PREMIS is a core set of implementable preservation metadata, broadly applicable across a wide range of digital preservation contexts and supported by guidelines and recommendations for creation, management, and use
• In 2003 OCLC and RLG jointly sponsored the formation of the PREMIS working group comprised of international experts in the use of metadata to support digital preservation activities
• PREMIS data dictionary current version 2.2
• May be used in conjunction with METS XML
• PREMIS tools are freely available
• PREMIS Maintenance Activity and Editorial Committee has international members from libraries and industry
• PREMIS data dictionary is hosted at http://www.loc.gov/standards/premis/


Page 34: 20120822 conversion of historic newspapers to digital objects [russian state library]

PREMIS data in METS file

<mets:amdSec>
  <mets:techMD ID="PREMISOBJECT1">
    <mets:mdWrap MDTYPE="PREMIS">
      <mets:xmlData>
        <premis:object xmlns:premis="http://www.loc.gov/standards/premis/v1">
          <premis:objectIdentifier>
            <premis:objectIdentifierType>National Library of Australia</premis:objectIdentifierType>
            <premis:objectIdentifierValue>nlaImageSeq-218-b.tif</premis:objectIdentifierValue>
          </premis:objectIdentifier>
          <premis:objectCategory>file</premis:objectCategory>
          <premis:objectCharacteristics>
            <premis:format>
              <premis:formatDesignation>
                <premis:formatName>TIFF</premis:formatName>
                <premis:formatVersion>TIFF 6.0</premis:formatVersion>
              </premis:formatDesignation>
            </premis:format>
          </premis:objectCharacteristics>
          <premis:relationship>
            <premis:relationshipType>derivation</premis:relationshipType>
            <premis:relationshipSubType>is derivative of</premis:relationshipSubType>
            <premis:relatedObjectIdentification>
              <premis:relatedObjectIdentifierType>National Library of Australia</premis:relatedObjectIdentifierType>
              <premis:relatedObjectIdentifierValue>nlaImageSeq-218-b.tif</premis:relatedObjectIdentifierValue>
              <premis:relatedObjectSequence>0</premis:relatedObjectSequence>
            </premis:relatedObjectIdentification>
            <premis:relatedEventIdentification>
              <premis:relatedEventIdentifierType>National Library of Australia</premis:relatedEventIdentifierType>
              <premis:relatedEventIdentifierValue>deskew-nlaImageSeq-218-b.tif</premis:relatedEventIdentifierValue>
              <premis:relatedEventSequence>0</premis:relatedEventSequence>
            </premis:relatedEventIdentification>
          </premis:relationship>
        </premis:object>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:techMD>
</mets:amdSec>


Page 35: 20120822 conversion of historic newspapers to digital objects [russian state library]


Page 36: 20120822 conversion of historic newspapers to digital objects [russian state library]

the digitization process

[Diagram labels: image production, images, digitization magic, text, objects, ingest, preservation, access]


Page 37: 20120822 conversion of historic newspapers to digital objects [russian state library]

digitization magic

[Diagram: images → digitization magic → objects]


Page 38: 20120822 conversion of historic newspapers to digital objects [russian state library]

digitization magic

[Diagram: images → image processing → layout analysis → OCR → metadata → build digital objects → objects]


Page 39: 20120822 conversion of historic newspapers to digital objects [russian state library]

digitization magic: image processing

• crop, de-skew, split images
• apply image improvement algorithms as needed
  • sharpening filters
  • local adaptive thresholding
  • remove text bleed-thru
  • etc
• create master images
• create working images
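
Local adaptive thresholding is worth a closer look because it is what keeps unevenly lit or discolored pages legible after binarization. The following is a minimal pure-Python sketch of one simple variant, a local-mean threshold. It is an illustration only, not the algorithm of any particular production system, and the function name, `radius`, and `offset` parameters are assumptions.

```python
def adaptive_threshold(image, radius=1, offset=0):
    """Binarize a greyscale image given as a list of rows of 0-255 values.

    A pixel becomes black (0) when it is darker than the mean of its
    (2 * radius + 1)-square neighbourhood minus `offset`, else white (255).
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(0 if image[y][x] < local_mean - offset else 255)
        result.append(row)
    return result


# Hypothetical page strip: the background darkens from left to right,
# with one ink pixel on the bright side (120) and one on the dark side (40).
BACKGROUND = [220, 200, 180, 160, 140, 120, 100]
PAGE = [
    list(BACKGROUND),
    [220, 120, 180, 160, 140, 40, 100],
    list(BACKGROUND),
]

for row in adaptive_threshold(PAGE, radius=1, offset=15):
    print(row)
```

In this strip no single global threshold separates ink from paper: the brighter ink pixel (120) is as bright as the darkest background columns. Comparing each pixel with its own neighbourhood avoids that problem.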


Page 40: 20120822 conversion of historic newspapers to digital objects [russian state library]

digitization magic: layout analysis

• analyze layout of text image
• estimate font types and sizes
• calculate coordinates of text blocks
• determine layout object types (text, illustration, headline, etc)


Page 41: 20120822 conversion of historic newspapers to digital objects [russian state library]

digitization magic: OCR

• perform optical character recognition (OCR)
• calculate word and character coordinates
• calculate word and character confidences
• apply language dictionaries
• correct OCR text (optional)


Page 42: 20120822 conversion of historic newspapers to digital objects [russian state library]

digitization magic: metadata

• populate metadata fields
• verify / correct page numbers
• verify / correct document structure


Page 43: 20120822 conversion of historic newspapers to digital objects [russian state library]

digitization magic: build digital objects

• create METS / ALTO XML files
• create image files and image metadata
• create PDF files (if required)
• verify digital object
• calculate file fixity checks (checksums)
• perform file validation and verification
• perform quality assurance
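
The fixity step in the list above can be sketched with Python's standard hashlib module. This is a minimal sketch: the choice of SHA-256, the chunk size, and the function names are assumptions, not a statement of what any particular digitization program uses.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hex digest of a file, read in chunks so large master images fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex, algorithm="sha256"):
    """True when the file still matches the checksum recorded at ingest."""
    return file_checksum(path, algorithm) == expected_hex.lower()
```

A typical use is to record `file_checksum(path)` for every image and XML file when the digital object is built, store the values with the preservation metadata, and call `verify` after each transfer or at regular intervals to detect silent corruption.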


Page 44: 20120822 conversion of historic newspapers to digital objects [russian state library]

• automatic production steps performed by software

• manual production steps performed by operators

real world digitization production workflow


Page 45: 20120822 conversion of historic newspapers to digital objects [russian state library]

newspaper digitization programs around the world

Europeana Newspapers Project, a collaboration of 17 organizations (http://www.europeana-newspapers.eu/)

Bibliothèque nationale de France (http://gallica.bnf.fr/)

National Library of Australia, Australian Digital Newspapers Program (http://trove.nla.gov.au/newspaper)

Singapore National Library Board (http://newspapers.nl.sg/)

National Library of New Zealand (http://paperspast.natlib.govt.nz/)

National Digital Newspaper Program, Library of Congress (http://chroniclingamerica.loc.gov/)

British Newspaper Archive, British Library (http://www.bl.uk/welcome/newspapers)

Koninklijke Bibliotheek, the Netherlands (http://kranten.kb.nl/)

National Library of Finland (http://digi.kansalliskirjasto.fi/)

National Library of Latvia (https://periodika.lndb.lv/)


Page 46: 20120822 conversion of historic newspapers to digital objects [russian state library]

image references and recommendations

• Ian Bogus et al. Minimum Digitization Capture Recommendations (draft). The Association for Library Collections and Technical Services. June 2012 (accessed 18 August 2012 at http://connect.ala.org/node/185648).
• Robert Buckley and Simon Tanner. JPEG 2000 as a Preservation and Access Format for the Wellcome Trust Digital Library. Xerox Corporation and King’s College Digital Consultancy for the Wellcome Trust Library. August 2009 (accessed 1 July 2012 at http://library.wellcome.ac.uk/assets/wtx056572.pdf).
• Paolo Buonora and Franco Liberati. A Format for Digital Preservation of Images: A Study on JPEG 2000 File Robustness. D-Lib Magazine. July/August 2008 (accessed 1 July 2012 at http://www.dlib.org/dlib/july08/buonora/07buonora.html).
• ANSI/NISO Z39.87-2006. Data Dictionary -- Technical Metadata for Digital Still Images. National Information Standards Organization, Bethesda, Maryland USA. December 2006 (accessed 1 August 2012 at http://www.niso.org/apps/group_public/download.php/6502/Data%20Dictionary%20-%20Technical%20Metadata%20for%20Digital%20Still%20Images.pdf).
• JBIG Standard (accessed 1 August 2012 at http://www.jpeg.org/jbig).
• JPEG Standard (accessed 1 August 2012 at http://www.jpeg.org/jpeg).
• JPEG2000 Standard (accessed 1 August 2012 at http://www.jpeg.org/jpeg2000/).
• TIFF 6.0 Standard (accessed 1 August 2012 at http://partners.adobe.com/public/developer/tiff).
• Many, many others....

Tuesday, August 21, 12

Page 47: 20120822 conversion of historic newspapers to digital objects [russian state library]

newspaper digitisation references

Koninklijke Bibliotheek Historische Kranten (the Netherlands) http://kranten.kb.nl/about

Australian Newspapers Digitisation Program https://www.nla.gov.au/ndp/

IFLA Newspapers Section http://www.ifla.org/en/newspapers

Library of Congress National Digital Newspaper Program http://www.loc.gov/ndnp/

IMPACT Centre of Competence http://www.digitisation.eu/

Europeana Newspapers http://www.europeana-newspapers.eu/

Tuesday, August 21, 12

Page 48: 20120822 conversion of historic newspapers to digital objects [russian state library]

http://bit.ly/russianperiodicals

Try crowdsourcing when you visit the URL above!

Russian language periodicals
METS/ALTO XML with JPEG2000 images

Learn more about the software and crowdsourcing at http://www.dlconsulting.com.

Tuesday, August 21, 12

Page 49: 20120822 conversion of historic newspapers to digital objects [russian state library]

?2

Tuesday, August 21, 12

Page 50: 20120822 conversion of historic newspapers to digital objects [russian state library]

Part 2

Short and simple:

An overview of digital preservation

Tuesday, August 21, 12

Page 51: 20120822 conversion of historic newspapers to digital objects [russian state library]

digital preservation

Preservation of software and preservation of data are two sides of the same coin. From February 2011 Workshop for Digital Curators.

Tuesday, August 21, 12

Page 52: 20120822 conversion of historic newspapers to digital objects [russian state library]

preservation

Open Archival Information System (OAIS) reference model

Tuesday, August 21, 12

Page 56: 20120822 conversion of historic newspapers to digital objects [russian state library]

digitization ≠ digital preservation!

Tuesday, August 21, 12

Page 57: 20120822 conversion of historic newspapers to digital objects [russian state library]

Vint Cerf on “bit rot”

Tuesday, August 21, 12

Page 58: 20120822 conversion of historic newspapers to digital objects [russian state library]

digital preservation

long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span the information is required

Tuesday, August 21, 12

Page 59: 20120822 conversion of historic newspapers to digital objects [russian state library]

tolerance for downtime?

• 99.999% availability required?
• length of downtime tolerated?

tolerance for data loss?

• what is the value of the data?
• is the data reproducible? at what cost?
• what is the mean time to data loss (MTTDL)?
• what is
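To make the availability question concrete, an availability target can be turned into a yearly downtime budget. The sketch below is illustrative only: the function name and the 365-day year are assumptions. Under those assumptions, 99.999% availability leaves a little over five minutes of downtime per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes per year")
```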

Tuesday, August 21, 12

Page 60: 20120822 conversion of historic newspapers to digital objects [russian state library]

availability threats

• communications failure
• internet attacks / vandalism
• hardware failure
• software failure
• power failure
• natural disaster
• etc ...

Tuesday, August 21, 12

Page 61: 20120822 conversion of historic newspapers to digital objects [russian state library]

communication failure

redundant, multiple communications channels from independent providers

Tuesday, August 21, 12

Page 62: 20120822 conversion of historic newspapers to digital objects [russian state library]

internet attacks / vandalism

• denial of service
• viruses, worms
• data vandalism
• website vandalism

Tuesday, August 21, 12

Page 63: 20120822 conversion of historic newspapers to digital objects [russian state library]

hardware failure

• hot standby redundant hardware
• cold standby redundant hardware
• backup and restore

Tuesday, August 21, 12

Page 64: 20120822 conversion of historic newspapers to digital objects [russian state library]

software failure

• rollback to known working software (some downtime)
• known working software on standby redundant hardware (little downtime)
• backup and restore (significant downtime)

Tuesday, August 21, 12

Page 65: 20120822 conversion of historic newspapers to digital objects [russian state library]

power failure

uninterruptible power supply

Tuesday, August 21, 12

Page 66: 20120822 conversion of historic newspapers to digital objects [russian state library]

natural disaster

• alternate data center
• backup and restore

Tuesday, August 21, 12

Page 67: 20120822 conversion of historic newspapers to digital objects [russian state library]

digital data risks

• standards / format obsolescence
• migration to new format, media, or hardware
• media obsolescence / decay
• bit rot

Tuesday, August 21, 12

Page 68: 20120822 conversion of historic newspapers to digital objects [russian state library]

format obsolescence

remember … WordPerfect ?

MARC records ? Adobe Flash ?

Tuesday, August 21, 12

Page 69: 20120822 conversion of historic newspapers to digital objects [russian state library]

strategies for format obsolescence

• migrate data to new formats
• create a computer software museum with virtual machines
• format registries
• format validators
• don’t worry about it!

Tuesday, August 21, 12

Page 70: 20120822 conversion of historic newspapers to digital objects [russian state library]

Jeff Rothenberg on format obsolescence

“... digital documents are evolving so rapidly that shifts in the forms of documents must inevitably arise. New forms do not necessarily subsume their predecessors or provide compatibility with previous formats.”

Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published in Scientific American. January 1995. Expanded version published February, 1999. (accessed 1 August 2012 at http://www.clir.org/pubs/archives/ensuring.pdf)

Tuesday, August 21, 12

Page 71: 20120822 conversion of historic newspapers to digital objects [russian state library]

standard model for format obsolescence

• digital format registry collects information about target format
• this information is used to build format identification and verification tools
• holders of content use these tools to extract metadata from content in target format; metadata is stored with the content
• format registry scans computing environment to determine which formats are obsolescent; notifications sent for obsolete formats
• on receiving such a notification, someone builds a tool to convert obsolete format to non-obsolete format using the format specification in the registry
• on receiving such a notification, holder of content in obsolete format uses conversion tool and content metadata to convert the file in an obsolete format to a file in a non-obsolete format
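The format identification step in the model above can be sketched as a lookup of a file's leading bytes against a small table of known signatures. This is a simplified illustration, not a description of any particular registry's tool: the table holds only a handful of formats, and the names SIGNATURES, identify_format, and identify_file are hypothetical. Real identification tools use much richer signature sets and also validate internal structure.

```python
# Leading byte signatures for a few formats common in digitization work.
SIGNATURES = [
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000 (JP2)"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"%PDF-", "PDF"),
]


def identify_format(data):
    """Return a format name for the leading bytes, or None if unrecognized."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None


def identify_file(path):
    """Identify a file by reading only its first few bytes."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(16))
```

The identified format name is the kind of metadata that, in the model above, would be stored with the content so that a later obsolescence notification can be matched to the affected files.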

Tuesday, August 21, 12

Page 72: 20120822 conversion of historic newspapers to digital objects [russian state library]

David Rosenthal on format obsolescence

“... format obsolescence is a rare problem that happens infrequently to a minority of unpopular formats ...”

David Rosenthal. Format obsolescence: Assessing the threat and the defenses. (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf).

Tuesday, August 21, 12

Page 73: 20120822 conversion of historic newspapers to digital objects [russian state library]

alternate model for format obsolescence

• store only essential data
• perform only essential tasks
• delay performing tasks as long as possible

David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library High Tech, Special Issue, vol. 28, no. 2, 2010, pp.195-210. doi:10.1108/07378831011047613 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf).

Tuesday, August 21, 12

Page 74: 20120822 conversion of historic newspapers to digital objects [russian state library]

importance of standards vis-a-vis format obsolescence

well-defined standards ...
• guide developers in creation of tools
• facilitate development of a broad range of tools for any format
• allow developers to maintain existing tools

Tuesday, August 21, 12

Page 75: 20120822 conversion of historic newspapers to digital objects [russian state library]

data migration risks

• file format changes, for example, PDF 1.4 to PDF 1.7
• file name differences, for example, case sensitive / insensitive names, new operating system
• extended file attributes
• file permissions, for example, BSD Unix drwxr-xr-x@ to Windows file permissions
• soft links / hard links
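The file name risk above can be checked before a migration. As a minimal sketch, assuming the source file system is case sensitive and the target is not, the following groups names that differ only by letter case and would therefore overwrite one another. The function name case_collisions is hypothetical.

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

For example, passing the names Page1.tif, page1.tif, and page2.tif reports the first two as one colliding group, while page2.tif is left out.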

Tuesday, August 21, 12

Page 76: 20120822 conversion of historic newspapers to digital objects [russian state library]

media obsolescence

• 5 ¼” floppy disks
• 8 track tapes
• 3 ½” floppy disks
• ZIP drives
• CD-R, CD-RW, Blu-Ray
• DAT tapes
• microfilm
• etc

Tuesday, August 21, 12

Page 77: 20120822 conversion of historic newspapers to digital objects [russian state library]

strategies for media obsolescence

• migrate data to new media, for example, floppy disks to DVD
• create and maintain a computer hardware museum

Tuesday, August 21, 12

Page 78: 20120822 conversion of historic newspapers to digital objects [russian state library]

media decay

a report by NIST and the Library of Congress says ...

• virtually all CD-Rs tested indicated an estimated life expectancy beyond 15 years

• only 47 percent of recordable DVDs indicated an estimated life expectancy beyond 15 years, some had a life expectancy as short as 1.9 years

• in practice actual lifetimes may be considerably shorter

Tuesday, August 21, 12

Page 79: 20120822 conversion of historic newspapers to digital objects [russian state library]

prevention / detection of media decay

• proper storage
• data file checksums (MD5, SHA-1, ...)
• monitor media integrity
• migrate data from old media to new media

Tuesday, August 21, 12

Page 80: 20120822 conversion of historic newspapers to digital objects [russian state library]

bit rot

gradual decay of data due to …

• storage media failure because of media quality
• storage media failure because of improper storage
• random events (bit-flip, environmental influences)
• software / hardware errors

Tuesday, August 21, 12

Page 81: 20120822 conversion of historic newspapers to digital objects [russian state library]

prevention / detection of bit rot

• data file fixity check (checksums) such as MD5, SHA-1, ...

• monitor file integrity with frequent, corrective audits

• duplicate copies, geographically distributed
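An audit of the kind listed above can be sketched as a comparison between the files on disk and a previously recorded set of checksums. The manifest format (a mapping from relative path to SHA-256 hex digest) and the function names sha256_of and audit are assumptions for illustration. The sketch only detects problems; the corrective part of the audit, restoring a damaged file from another copy, is left out.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(root, manifest):
    """Compare files under root with a {relative path: digest} manifest.

    Returns (missing, corrupted) lists of relative paths.
    """
    root = Path(root)
    missing, corrupted = [], []
    for relative_path, expected in sorted(manifest.items()):
        target = root / relative_path
        if not target.is_file():
            missing.append(relative_path)
        elif sha256_of(target) != expected:
            corrupted.append(relative_path)
    return missing, corrupted
```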

Tuesday, August 21, 12

Page 82: 20120822 conversion of historic newspapers to digital objects [russian state library]

distributed decentralized preservation

• the more copies, the safer the data
• the more independent copies, the safer the data
• the more frequently copies are audited, the safer the data

Paraphrased David Rosenthal. Keeping bits safe: How hard can it be?

Tuesday, August 21, 12

Page 83: 20120822 conversion of historic newspapers to digital objects [russian state library]

distributed decentralized preservation

• n+1 copies are safer than n copies
• n independent copies on different storage devices / media are safer than n copies on similar or identical storage devices / media

• data audited every week is safer than data audited every month
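A deliberately simple model shows why these rules hold. Assume each copy is lost with probability p between two audits, and that the losses are independent; then all n copies are lost together with probability p to the power n. Both the model and the function name below are illustrative assumptions. Copies on similar or identical devices tend to fail together, which breaks the independence assumption, and a longer interval between audits raises p.

```python
def probability_all_copies_lost(p_single, copies):
    """Chance that every copy is lost, assuming independent failures."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_single ** copies


for n in (1, 2, 3, 4):
    print(n, probability_all_copies_lost(0.01, n))
```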

Tuesday, August 21, 12

Page 84: 20120822 conversion of historic newspapers to digital objects [russian state library]

LOCKSS: Lots Of Copies Keep Stuff Safe

• It ingests content from target websites using a web crawler similar to those used by search engines.

• It preserves content by continually comparing the content it has collected with the same content collected by other LOCKSS Boxes, and repairing any differences.

• It delivers authoritative content to readers by acting as a web proxy, cache or via Metadata resolvers when the publisher’s website is not available.

• It provides management through a web interface that allows librarians to select new content for preservation, monitor the content being preserved and control access to the preserved content.

• It dynamically migrates content to new formats as needed for display.

From LOCKSS webpages http://www.lockss.org.

LOCKSS box: Open source LOCKSS software installed on a dedicated computer or virtual machine.
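The compare-and-repair idea can be illustrated with a toy majority vote. This is not the actual LOCKSS polling protocol, which is considerably more elaborate; it is a sketch under the assumption that each box holds the full bytes of one item and that a strict majority of boxes still hold a good copy. The function name majority_repair is hypothetical.

```python
import hashlib
from collections import Counter


def majority_repair(replicas):
    """Repair replicas that disagree with a strict majority.

    replicas maps a box name to the bytes it holds for one item.
    Returns (repaired replicas, sorted names of boxes that were repaired).
    Raises ValueError when no strict majority exists.
    """
    if not replicas:
        raise ValueError("no replicas to compare")
    digests = {name: hashlib.sha256(data).hexdigest()
               for name, data in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise ValueError("no strict majority; cannot choose a good copy")
    good = next(data for name, data in replicas.items()
                if digests[name] == winner)
    repaired = sorted(name for name in replicas if digests[name] != winner)
    return {name: good for name in replicas}, repaired
```

With three boxes where one holds a damaged copy, the damaged box is outvoted and receives the content held by the other two.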

Tuesday, August 21, 12

Page 85: 20120822 conversion of historic newspapers to digital objects [russian state library]

how LOCKSS works: data copied to another LOCKSS box

[diagram: LOCKSS boxes at library X, library Y, and my library; label: data]

Tuesday, August 21, 12

Page 86: 20120822 conversion of historic newspapers to digital objects [russian state library]

how LOCKSS works: data audited

[diagram: LOCKSS boxes at library X, library Y, and my library; labels: data, audit]

Tuesday, August 21, 12

Page 87: 20120822 conversion of historic newspapers to digital objects [russian state library]

how LOCKSS works: data audited

[diagram: LOCKSS boxes at library X, library Y, and my library; labels: data, audit, audit fails, audit ok]

Tuesday, August 21, 12

Page 88: 20120822 conversion of historic newspapers to digital objects [russian state library]

how LOCKSS works: data copied to another LOCKSS box

[diagram: LOCKSS boxes at library X, library Y, and my library; label: data]

Tuesday, August 21, 12

Page 89: 20120822 conversion of historic newspapers to digital objects [russian state library]

private LOCKSS networks

Alabama Digital Preservation Network (http://www.adpn.org/).

CLOCKSS (Controlled LOCKSS), a non-profit collaboration of North American, European, and Asian cultural heritage institutions whose purpose is to preserve digital content with LOCKSS (http://www.clockss.org).

MetaArchive Cooperative is a digital preservation cooperative created by cultural heritage institutions (http://www.metaarchive.org).

• Many others...

Tuesday, August 21, 12

Page 90: 20120822 conversion of historic newspapers to digital objects [russian state library]

digital preservation references

• Nancy McGovern and Katherine Skinner editors. Aligning National Approaches to Digital Preservation. Educopia Institute Publications. Atlanta Georgia. 2012. Proceedings of a conference on digital preservation held at the National Library of Estonia in May 2011. (accessed 15 August 2012 at http://www.educopia.org/sites/default/files/ANADP_Educopia_2012.pdf).

• David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library High Tech, Special Issue, v. 28, n. 2, 2010, pp.195-210. doi:10.1108/07378831011047613 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf).

• David Rosenthal. Keeping bits safe: How hard can it be? Communications of the ACM v. 53, n. 11, 2010, pp. 47-55. doi:10.1145/1839676.1839692 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/ACM2010.pdf).

• Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published in Scientific American January 1995. Expanded version published February 1999. (accessed 1 August 2012 at http://www.clir.org/pubs/archives/ensuring.pdf)

• Joint Information Systems Committee (JISC) Programme on Digital Preservation at http://www.jisc.ac.uk/preservation.

• Library of Congress on Digital Preservation at http://www.digitalpreservation.gov.
• Stanford University’s website for LOCKSS at http://www.lockss.org.

Tuesday, August 21, 12

Page 91: 20120822 conversion of historic newspapers to digital objects [russian state library]

2

?

Tuesday, August 21, 12

Page 92: 20120822 conversion of historic newspapers to digital objects [russian state library]

Part 3

The importance of communication, specifications, acceptance criteria

Tuesday, August 21, 12

Page 93: 20120822 conversion of historic newspapers to digital objects [russian state library]

the problem

Wise men learn by other men's mistakes, fools by their own.
H. G. Wells

Tuesday, August 21, 12

Page 94: 20120822 conversion of historic newspapers to digital objects [russian state library]

the 2009 CHAOS Report (The Standish Group) reports that of all software projects surveyed, 44% are “challenged”, 24% failed, and only 32% succeeded

the problem

Tuesday, August 21, 12

Page 95: 20120822 conversion of historic newspapers to digital objects [russian state library]

Roger Sessions estimates that the worldwide cost of IT failure is USD $500 billion per month

Roger Sessions: CTO of ObjectWatch. He has written seven books including Simple Architectures for Complex Enterprises and many articles. He is a founding member of the Board of Directors of the International Association of Software Architects.

the problem

Tuesday, August 21, 12

Page 96: 20120822 conversion of historic newspapers to digital objects [russian state library]

in a recent survey of 1230 IT professionals conducted by Embarcadero Technologies, 2 of the 3 biggest project challenges cited by the IT pros are “poor planning” and “poor or no requirements”

the problem

Tuesday, August 21, 12

Page 97: 20120822 conversion of historic newspapers to digital objects [russian state library]

in a March 2007 web poll conducted by the Computing Technology Industry Association "nearly 28 percent of the more than 1,000 respondents singled out poor communications as the number one cause of project failure"

the problem

Tuesday, August 21, 12

Page 98: 20120822 conversion of historic newspapers to digital objects [russian state library]

in a white paper written for Project Perfect, Taimour al Neimat lists

• poor planning
• unclear goals and objectives
• objectives changing during the project
• unrealistic time or resource estimates
• lack of executive support and user involvement
• failure to communicate and act as a team
• inappropriate skills

as primary causes for the failure of complex IT projects

the problem

Tuesday, August 21, 12

Page 99: 20120822 conversion of historic newspapers to digital objects [russian state library]

a recent tender from an (anonymous) government agency

• project to convert ~ 170,000 text images to xml

• value of project ~ USD $180,000

• 19 pages of definitions, governing law, proposal evaluation criteria, contractual conditions, instructions about tender response format, etc

• technical requirements description? < 1 page

• data acceptance criteria? “a high level of accuracy”

the problem

Tuesday, August 21, 12

Page 100: 20120822 conversion of historic newspapers to digital objects [russian state library]

a recent program established by a prominent national library

• digitize more than 20 million text pages
• high level image and xml requirements
• value of work awarded? > USD $5,000,000
• after award of work, METS xml technical requirements expand to 43+ pages from ~3 pages
• acceptance criteria? added as an afterthought and not well defined

the problem

Tuesday, August 21, 12

Page 101: 20120822 conversion of historic newspapers to digital objects [russian state library]

acceptance criteria for a digitization program at a prominent library

• character accuracy > 80%
• word accuracy > 75%
• significant word accuracy > 65%

the problem

Tuesday, August 21, 12

Page 102: 20120822 conversion of historic newspapers to digital objects [russian state library]

typical tender evaluation criteria in priority order

1. understanding of requirements
2. reputation of service bureau
3. price

the problem

Tuesday, August 21, 12

Page 104: 20120822 conversion of historic newspapers to digital objects [russian state library]

the problem

communication, requirements, acceptance

Tuesday, August 21, 12

Page 105: 20120822 conversion of historic newspapers to digital objects [russian state library]

the illusion

In theory, there's no difference between theory and practice, but in practice, there is.

Anonymous

The single biggest problem in communication is the illusion it has taken place.

George Bernard Shaw

Tuesday, August 21, 12

Page 106: 20120822 conversion of historic newspapers to digital objects [russian state library]

the illusion: waterfall requirements

for each product release
repeat {
    gather requirements
    create architecture
    design
    implement
    test
    use -or- sell
} until (company goes out of business)

Tuesday, August 21, 12

Page 107: 20120822 conversion of historic newspapers to digital objects [russian state library]

the illusion: requirements

a recent tender from an (anonymous) government agency

• project to convert ~ 170,000 text images to xml
• value of project ~ USD $180,000
• 19 pages of definitions, governing law, proposal evaluation criteria, contractual conditions, instructions about tender response format, etc
• technical requirements description? < 1 page
• data acceptance criteria? “a high level of accuracy”

Tuesday, August 21, 12

Page 108: 20120822 conversion of historic newspapers to digital objects [russian state library]

the illusion: acceptance criteria

acceptance criteria for a digitization program at a large, well-known, and internationally recognized national library

• character accuracy > 80%
• word accuracy > 75%
• significant word accuracy > 65%

Tuesday, August 21, 12

Page 109: 20120822 conversion of historic newspapers to digital objects [russian state library]

the illusion: why (better) communication is necessary

Copyright United Media. Used with permission.

Tuesday, August 21, 12

Page 110: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix

Experience is that marvelous thing that enables you to recognize a mistake when you make it again.
F. P. Jones

Tuesday, August 21, 12

Page 111: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: value of simplicity

“Perfection is attained, not when there is nothing left to add, but when there is nothing left to take away.”

Antoine de Saint-Exupéry

Tuesday, August 21, 12

Page 112: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: value of prototypes and pilot batches

“Plan to throw one away; you will anyhow. If there is anything new about the function of a system, the first implementation will have to be redone completely to achieve a satisfactory (i.e., acceptably small, fast, and maintainable) result. It costs a lot less if you plan to have a prototype.”

Butler Lampson

Butler Lampson was a founding member of Xerox PARC, worked for DEC, and now works at Microsoft Research. He is an adjunct professor at MIT and an ACM Fellow.

Tuesday, August 21, 12

Page 113: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: value of simplicity

“There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.”

C.A.R. Hoare

Professor Sir Charles Anthony Richard Hoare: Emeritus Professor at Oxford University, Senior Researcher at Microsoft Research, recipient of the ACM Turing Award, author of many books on computers and software.

Tuesday, August 21, 12

Page 114: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: good requirements

• unitary: the requirement addresses one and only one thing
• complete: the requirement is fully stated in one place with no missing information
• consistent: the requirement does not contradict any other requirement and is fully consistent with all authoritative external documentation
• atomic: it does not contain conjunctions, for example, "the code field must validate American and Canadian postal codes" should be written as two separate requirements
• traceable: the requirement meets all or part of a business need as stated by stakeholders and authoritatively documented

Tuesday, August 21, 12

Page 115: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: good requirements (continued)

• current: the requirement has not been made obsolete by the passage of time
• feasible: the requirement can be implemented within the constraints of the project
• unambiguous: the requirement is concisely stated without recourse to technical jargon, acronyms
• verifiable: the implementation of the requirement can be determined through one of four possible methods: inspection, demonstration, test, or analysis

Tuesday, August 21, 12

Page 116: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: requirements and acceptance criteria

Wikipedia on data quality: The processes and technologies involved in ensuring the conformance of data values to requirements and acceptance criteria

Tuesday, August 21, 12

Page 117: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: requirements and acceptance criteria

“a high level of accuracy”

Tuesday, August 21, 12

Page 118: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: requirements and acceptance criteria

“article titles must be 99.5% accurate”

Tuesday, August 21, 12

Page 119: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: requirements and acceptance criteria

“article title characters in each issue must be 99.5% accurate, that is, each issue may have no more than 5 errors in 1000 article title characters”
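A criterion stated this precisely can be tested mechanically. The sketch below assumes that manually keyed ground-truth titles are available for each issue and that an "error" is one character insertion, deletion, or substitution (Levenshtein distance); both are assumptions, since the slide does not define how errors are counted. The function names are hypothetical.

```python
def edit_distance(reference, candidate):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(candidate) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, cand_char in enumerate(candidate, start=1):
            cost = 0 if ref_char == cand_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def count_title_errors(pairs):
    """Return (errors, total characters) over (ground truth, OCR) title pairs."""
    total = sum(len(truth) for truth, _ in pairs)
    errors = sum(edit_distance(truth, ocr) for truth, ocr in pairs)
    return errors, total


def issue_accepted(pairs, max_errors_per_1000=5):
    """True when title errors stay within the per-1000-character budget."""
    errors, total = count_title_errors(pairs)
    return errors * 1000 <= max_errors_per_1000 * total
```

The acceptance test uses integer arithmetic so that an issue sitting exactly on the limit, 5 errors in 1000 characters, is accepted without floating-point rounding surprises.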

Tuesday, August 21, 12

Page 120: 20120822 conversion of historic newspapers to digital objects [russian state library]

the illusion: waterfall requirements

for each product release
repeat {
    gather requirements
    create architecture
    design
    implement
    test
    use -or- sell
} until (company goes out of business)

Tuesday, August 21, 12

Page 121: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: agile requirements

gather general requirements
create architecture
build prototype software
test
repeat {
    use software
    adjust prototype and/or add new feature
    test
} until (user says stop or runs out of money)

Tuesday, August 21, 12

Page 122: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: agile data conversion

create requirements and acceptance criteria
repeat {
    digitize (small) pilot batch
    test data against acceptance criteria
    adjust requirements and acceptance criteria
} until (no more adjustments are necessary)
digitize more data

Tuesday, August 21, 12

Page 124: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: why (better) communication is necessary

“projects are about communication, communication, and communication”

Elenbass, B. (2000). “Staging a Project: Are You Setting Your Project Up for Success?”. Proceedings of the Project Management Institute Annual Seminars & Symposiums.

Tuesday, August 21, 12

Page 125: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: simple principles for (good) communication

• be impeccable with your word
• don’t take anything personally
• don’t make assumptions
• always do your best
• be mindful

Tuesday, August 21, 12

Page 130: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: why (better) communication is necessary

no communication ...
little communication ...
poor communication ...
reduced communication ...

... all result in more assumptions about intent!

Tuesday, August 21, 12

Page 131: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: how do you communicate?

• communication is at most 30% verbal!
• the remainder, 70% or more, consists of gestures, facial expressions, tone of voice, posture, odors, ...
• telephone communication removes gestures, facial expressions, posture, odors, etc.; only words and tone of voice remain
• written communication (email, requirements, etc.) removes all modes of communication save for words

Tuesday, August 21, 12

Page 132: 20120822 conversion of historic newspapers to digital objects [russian state library]

the fix: how to communicate

• simple: keep it simple, stupid (KISS principle)
• repeat: say it twice in different ways
• listen: repeat what you hear
• respect: respect yourself and others

Tuesday, August 21, 12

Page 133: 20120822 conversion of historic newspapers to digital objects [russian state library]

conclusion

for future projects give especial attention to

good, open communication
clear requirements
clear acceptance criteria

Tuesday, August 21, 12

Page 134: 20120822 conversion of historic newspapers to digital objects [russian state library]

?

We all admire the wisdom of people who come to us for advice.

Jack Herbert

2

Frederick Zarndt
Chair, IFLA Newspapers Section

[email protected]

Tuesday, August 21, 12