Creating a New JHOVE2 Format Module Sheila Morrissey Portico Code4Lib 2011 Bloomington IN, February 7, 2011

Creating a New JHOVE2 Format Module

Sheila Morrissey, Portico

Code4Lib 2011, Bloomington IN, February 7, 2011

The preservation problem

Managing the gap between what you were given and what you need

– That gap is only manageable if it is quantifiable

– Characterization tells you what you have, as a stable starting point for iterative preservation planning and action

Adapted from A. Brown, “Developing Practical Approaches to Active Preservation,” IJDC 2:1 (June 2007): 3-11.

[Diagram labels: Characterization, Preservation planning, Preservation action]

“What? So what?”

Characterization is the automated determination of the intrinsic and extrinsic properties of a formatted object

– Identification: determining the presumptive format of a digital object based on suggestive extrinsic hints and intrinsic signatures

– Feature extraction: reporting the intrinsic properties of an object significant for classification, analysis, and planning

– Validation

– Assessment
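To make "extrinsic hints and intrinsic signatures" concrete, here is a minimal sketch. It is not DROID's algorithm and not JHOVE2 code: the class FormatSniffer and its methods are hypothetical. The only facts it relies on are the well-known leading bytes of PDF ("%PDF-") and TIFF ("II*\0" or "MM\0*"). The signature is tried first; the file extension is only a fallback hint.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical sketch: presumptive identification from signatures, then hints. */
final class FormatSniffer {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] TIFF_LE = {0x49, 0x49, 0x2A, 0x00}; // "II*\0"
    private static final byte[] TIFF_BE = {0x4D, 0x4D, 0x00, 0x2A}; // "MM\0*"

    /** Intrinsic signature: compare the leading bytes with known magic numbers. */
    static String bySignature(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) return "PDF";
        if (startsWith(head, TIFF_LE) || startsWith(head, TIFF_BE)) return "TIFF";
        return null; // no signature matched
    }

    /** Extrinsic hint: the file name extension (suggestive, not conclusive). */
    static String byExtension(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".csv")) return "CSV";
        if (lower.endsWith(".pdf")) return "PDF";
        return null;
    }

    /** Prefer the intrinsic signature; fall back to the extrinsic hint. */
    static String identify(String fileName, byte[] head) {
        String format = bySignature(head);
        return format != null ? format : byExtension(fileName);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

CSV has no signature at all, which is why the sketch can only reach it through the extension hint.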

Supported formats

JHOVE2 can identify (by DROID) many more formats than it can validate (by modules)

– PRONOM registry documents over 550 “formats” (http://www.nationalarchives.gov.uk/PRONOM)

Supported formats

ICC color profile (ICC.1:2004-10)

JPEG 2000: JP2 (ISO/IEC 15444-1), JPX (ISO/IEC 15444-2)

PDF: PDF 1.0 – 1.7, ISO 32000-1, PDF/A-1 (ISO 19005-1), PDF/X-1 (ISO 15930-1), -1a (ISO 15930-4), -2 (ISO 15930-5), -3 (ISO 15930-6)

SGML

Shapefile: Main, Index, dBASE, …

TIFF: TIFF 4 – 6, Class B, F, G, P, R, Y, TIFF/EP (ISO 12234-2), TIFF/IT (ISO 12639), GeoTIFF, Exif (JEITA CP-3451), DNG

UTF-8: ASCII (ANSI X3.4)

WAVE: BWF (EBU N22-1997)

XML

Zip

Contributed format modules

From Wegener Institute (http://www.awi-potsdam.de)
– netCDF
– Grib

From Nationalbibliothek / Bibliothèque nationale de France (BnF) (http://www.bnf.fr/fr/acc/x.accueil.html)
– arc
– gzip

YOU!!!
– ???

Characterization strategy

1. Identify format (if not previously identified)

2. Dispatch to appropriate format module

a) Extract format features and validate
   – If a nested source unit is found, process recursively…

b) Validate format profiles (if registered)

3. Assess

4. If unitary source unit, calculate message digests (optional)

5. If an aggregate source unit, try to identify aggregate format, and if successful, process recursively…
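The recursion in this strategy can be sketched as below. This is not JHOVE2 code: SourceUnit and StrategyWalker are hypothetical stand-ins, identification is reduced to a toy extension lookup, and step 5 (aggregate identification) is left out. The sketch only records the order in which the steps would run for a unit and its nested units.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical source unit: a file, or a directory holding child units. */
final class SourceUnit {
    final String name;
    final List<SourceUnit> children = new ArrayList<>();
    String format; // null until identified

    SourceUnit(String name) { this.name = name; }
}

/** Hypothetical walker that logs the order of the strategy's steps. */
final class StrategyWalker {
    final List<String> log = new ArrayList<>();

    void characterize(SourceUnit unit) {
        if (unit.format == null) {                        // 1. identify if needed
            unit.format = identify(unit);
            log.add("identify " + unit.name);
        }
        log.add("dispatch " + unit.name + " as " + unit.format); // 2. format module
        for (SourceUnit child : unit.children) {          // 2a. nested units, recursively
            characterize(child);
        }
        log.add("assess " + unit.name);                   // 3. assess
        if (unit.children.isEmpty()) {                    // 4. digests for unitary units
            log.add("digest " + unit.name);
        }
    }

    /** Toy identification by extension; real identification uses signatures. */
    private static String identify(SourceUnit unit) {
        if (!unit.children.isEmpty()) return "Directory";
        int dot = unit.name.lastIndexOf('.');
        return dot < 0 ? "Unknown" : unit.name.substring(dot + 1).toUpperCase(Locale.ROOT);
    }
}
```

A parent unit is assessed only after all of its nested units have been fully characterized, which is what the recursion in step 2a implies.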

Characterization strategy

directory/

abc.shp abc.shx abc.dbf abc.tif xyz.pdf

Characterization strategy

directory/

abc.shp (Main)   abc.shx (Index)   abc.dbf (dBASE)   abc.tif (GeoTIFF)   xyz.pdf (PDF)

Characterization strategy

directory/

clump (Shapefile): abc.shp (Main), abc.shx (Index), abc.dbf (dBASE)

abc.tif (GeoTIFF)

xyz.pdf (PDF)

Characterization strategy

directory/

clump (“GIS object”):
   clump (Shapefile): abc.shp (Main), abc.shx (Index), abc.dbf (dBASE)
   abc.tif (GeoTIFF)

xyz.pdf (PDF)
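The idea of recognizing a "clump" of sibling files as one aggregate can be sketched as follows. This is an illustration only, not JHOVE2's aggregate identifier: the class ShapefileClumper is hypothetical, and it assumes that a Shapefile is recognized when the .shp, .shx, and .dbf files share a base name, as in the abc.* example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch: find base names whose sibling files form a Shapefile. */
final class ShapefileClumper {
    private static final Set<String> REQUIRED = Set.of("shp", "shx", "dbf");

    static List<String> findShapefiles(List<String> fileNames) {
        // Group the extensions seen for each base name (sorted for stable output).
        Map<String, Set<String>> extensionsByStem = new TreeMap<>();
        for (String name : fileNames) {
            int dot = name.lastIndexOf('.');
            if (dot <= 0) continue; // no extension, or a dot-file
            String stem = name.substring(0, dot);
            String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
            extensionsByStem.computeIfAbsent(stem, k -> new HashSet<>()).add(ext);
        }
        // A base name is a Shapefile clump only if every required part is present.
        List<String> clumps = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : extensionsByStem.entrySet()) {
            if (entry.getValue().containsAll(REQUIRED)) clumps.add(entry.getKey());
        }
        return clumps;
    }
}
```

For the directory shown above, the three abc.* Shapefile components are grouped under the base name abc, while xyz.pdf stays on its own.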

API design idioms

Separation of concerns
– Annotation and reflection: confluence.ucop.edu/display/JHOVE2Info/Background+Papers

Inversion of control (IOC) / dependency injection
– Martin Fowler: martinfowler.com/articles/injection.html
– Spring framework: www.springsource.org/

Separation of concerns

“Let POJOs be POJOs”
– Focus on modeling the format itself

“Let the code write itself”
– Reportables “know” how to expose their properties for display
– Reference documentation generated from the code

Annotation and Reflection: Reportable properties

Each reportable property is represented by a field and accessor and mutator methods. The accessor method must be marked with the @ReportableProperty annotation.

public class MyReportable implements Reportable {
    protected String myProperty;

    @ReportableProperty(order=1, value="description", ref="reference")
    public String getMyProperty() { return this.myProperty; }

    public void setMyProperty(String property) { this.myProperty = property; }
}
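How annotated accessors let reportables "expose themselves" can be sketched with plain Java reflection. The annotation Reported below is a hypothetical stand-in for @ReportableProperty, and ReflectiveReporter is not the JHOVE2 displayer. The sketch only shows the general technique: find the annotated accessors, order them, and invoke them.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for @ReportableProperty, kept at runtime for reflection. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Reported {
    int order() default 1;
    String value() default "Not available.";
}

/** Hypothetical reportable with two annotated accessors and one plain getter. */
final class SampleReportable {
    @Reported(order = 2, value = "Number of records.")
    public int getRecordCount() { return 3; }

    @Reported(order = 1, value = "Field delimiter.")
    public String getDelimiterCharacter() { return ","; }

    public String getInternalState() { return "hidden"; } // not annotated, so not reported
}

final class ReflectiveReporter {
    /** Returns "Name = value" lines for annotated accessors, sorted by order(). */
    static List<String> report(Object reportable) {
        List<Method> accessors = new ArrayList<>();
        for (Method m : reportable.getClass().getMethods()) {
            if (m.isAnnotationPresent(Reported.class)) accessors.add(m);
        }
        accessors.sort(Comparator.comparingInt(m -> m.getAnnotation(Reported.class).order()));
        List<String> lines = new ArrayList<>();
        for (Method m : accessors) {
            String name = m.getName().replaceFirst("^(get|is)", "");
            try {
                lines.add(name + " = " + m.invoke(reportable));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read property " + name, e);
            }
        }
        return lines;
    }
}
```

The reporting code never names a concrete property, so adding a property to a module only requires a new annotated accessor.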

Dependency injection

All JHOVE2 function is embodied in pluggable modules

– Flexible customization
   Re-sequencing of pre-existing modules

– Easy extensibility
   Additional format modules and profiles
   Additional aggregate identifiers
   Additional displayers
   New behaviors

RenderabilityModule

JHOVE2 framework

Embodiment of a characterization strategy as a configurable sequence of command-invoked modules

public void characterize(Source source, Input input)
    throws IOException, JHOVE2Exception
{
    source.getTimerInfo().setStartTime();
    /* Update summary counts of source units, by type. */
    this.sourceCounter.incrementSourceCounter(source);
    for (Command command : this.commands) {
        TimerInfo time2 = command.getTimerInfo();
        time2.resetStartTime();
        try {
            command.execute(this, source, input);
        } finally {
            time2.setEndTime();
        }
    }
    source.getTimerInfo().setEndTime();
}

Characterization

Creating a New Format Module: What are the deliverables?

• Source code
• Configuration files
• Sample (test) files
• Documents

Format Module Artifacts: Source Code

• Module classes
   – Module (extends org.jhove2.module.format.BaseFormatModule)
   – Profiles (extend org.jhove2.module.format.AbstractFormatProfile) as required by format
   – Supporting classes expressing format content model as required by format
• Test classes
   – JUnit test(s)

Format Module Artifacts: Configuration Files

• Spring IOC Bean XML configuration files
   – For Module
   – For unit test as needed
   – For Assessment criteria
• Messages properties file additions if needed
• Properties files
   – Displayer
   – Units of measure
   – Module-specific

Format Module Artifacts: Sample (Test) Files

– Sample files used in unit test
   • Valid files
   • Invalid files to exercise validity constraints

Format Module Artifacts: Documentation

• Module Specification Document

See examples on the JHOVE2 wiki “Modules Documents” page: <https://bitbucket.org/jhove2/main/wiki/Module>

Format Module Artifacts List: New CSV Format Module

Source code
   src/main/java/org/jhove2/module/format/csv/CsvModule.java
   src/test/java/org/jhove2/module/format/csv/CsvModuleTest.java

Configuration files

Spring
   config/spring/module/format/csv/jhove2-csv-config.xml
   config/spring/module/assess/jhove2-ruleset-csv-config.xml
   src/test/resources/config/module/format/csv/test-config.xml

Messages
   config/messages/jhove2_message.properties (update, not new)

Display
   config/properties/module/displayer/org/jhove2/module/format/csv/CsvModule_displayer.properties
   config/properties/module/units/org/jhove2/module/format/csv/CsvModule_unitproperties (optional)

Module-specific properties files
   config/properties/module/format/csv/csv.properties (optional, implementation-determined)

Test File(s)
   src/test/resources/examples/csv/goodFile.csv
   src/test/resources/examples/csv/badFile01.csv
   src/test/resources/examples/csv/badFile02.csv
   ….

Documentation
   CSV Module specification document: JHOVE2 wiki

Format Module Artifacts: The Good News

• Generate module and profile from interfaces and base classes via inheritance
   – Classes reflect format’s own content model: cross-cutting “JHOVE2” concerns handled via annotation (persistence, serialization, generation of JHOVE2 identifiers for reportable properties)
• Template for Spring XML Module configuration files
• Utilities to generate
   – Displayer properties files
   – Units of measure properties files
   – XML assessment configuration file
• Utilities for specification document
   – Script to generate tabular content for specification document
   – Macro to import utility-generated tabular content

Format Module: Research and Analysis

• Format Definition (org.jhove2.core.format.Format)
   – Names
   – Type (format/family)
   – Ambiguity (ambiguous/unambiguous)
   – Identifiers
   – Specifications
   – Validity (comprehensive/selective)
   – Profiles (none)
• Significant (Reportable) properties (org.jhove2.module.format.csv.CsvModule)

Format Definition: CSV Names

• JHOVE2 canonical name
   – Comma Separated Values
• Format aliases
   – CSV
   – DSV

Might already be defined in config/spring/module/format/jhove2-otherFormats-config.xml

Format Definition: CSV Formal Identifiers

• JHOVE2 identifier (see org.jhove2.core.I8R$Namespace)
   – [JHOVE2] http://jhove2.org/terms/format/csv
• PRONOM (PUID) identifier (used by DROID)
   – [PUID] x-fmt/18
• MIME type identifier
   – [MIME] text/csv
• RFC identifier
   – [RFC] RFC 4180
• Other identifiers in other namespaces (see org.jhove2.core.I8R$Namespace)

Might already be defined in config/spring/module/format/jhove2-otherFormats-config.xml

If you are not using DROID, then you MUST have the identifier(s) from the namespace of your identification tool

Format Definition: CSV Formal Identifiers in Spring

<!-- Comma Separated Values JHOVE2 identifier bean -->
<!-- (canonical identifier in JHOVE2 namespace) -->
<!-- Single constructor arg defaults to JHOVE2 namespace -->
<bean id="CommaSeparatedValuesIdentifier" class="org.jhove2.core.I8R" scope="singleton">
    <constructor-arg type="java.lang.String" value="http://jhove2.org/terms/format/csv"/>
</bean>

<!-- Comma Separated Values PUID identifier bean -->
<!-- (canonical identifier in PRONOM namespace, used by DROID identifier tool) -->
<bean id="CommaSeparatedValuesPUID1" class="org.jhove2.core.I8R" scope="singleton">
    <constructor-arg type="java.lang.String" value="x-fmt/18"/>
    <constructor-arg type="org.jhove2.core.I8R$Namespace" value="PUID"/>
</bean>

Format Definition: CSV Formal Identifiers in Spring

<!-- Comma Separated Values MIME type aliasIdentifier bean -->
<bean id="CommaSeparatedValuesMIMEType" class="org.jhove2.core.I8R" scope="singleton">
    <constructor-arg type="java.lang.String" value="text/csv"/>
    <constructor-arg type="org.jhove2.core.I8R$Namespace" value="MIME"/>
</bean>

<!-- Comma Separated Values RFC aliasIdentifier bean -->
<bean id="CommaSeparatedValuesRFC4180" class="org.jhove2.core.I8R" scope="singleton">
    <constructor-arg type="java.lang.String" value="RFC 4180"/>
    <constructor-arg type="org.jhove2.core.I8R$Namespace" value="RFC"/>
</bean>

Format Definition: CSV Specifications

• For CSV, many variants
• Closest document to a format spec is RFC
   – RFC 4180 (http://www.ietf.org/rfc/rfc4180.txt)

Format Definition: CSV Specification in Spring

<bean id="CsvSpec" class="org.jhove2.core.Document" scope="singleton">
    <constructor-arg type="java.lang.String"
        value="RFC 4180 Common Format and MIME Type for CSV Files"/>
    <constructor-arg type="org.jhove2.core.Document$Type" value="Specification"/>
    <constructor-arg type="org.jhove2.core.Document$Intention" value="Authoritative"/>
    <property name="author" value="Y. Shafranovich"/>
    <property name="date" value="October 2005"/>
    <property name="identifiers">
        <list value-type="org.jhove2.core.I8R">
            <ref bean="CsvSpecificationURI"/>
        </list>
    </property>
    <property name="publisher" value="The Internet Engineering Task Force (IETF)"/>
</bean>

<!-- CSV RFC specification URI bean -->
<bean id="CsvSpecificationURI" class="org.jhove2.core.I8R" scope="singleton">
    <constructor-arg type="java.lang.String" value="http://www.ietf.org/rfc/rfc4180.txt"/>
    <constructor-arg type="org.jhove2.core.I8R$Namespace" value="URI"/>
</bean>

Format Definition: CSV Format Bean Definition in Spring

<!-- Bean for the JHOVE2 Comma Separated Values Format Bean -->
<bean id="CommaSeparatedValuesFormat" class="org.jhove2.core.format.Format" scope="singleton">
    <constructor-arg type="java.lang.String" value="Comma Separated Values"/>
    <constructor-arg ref="CommaSeparatedValuesIdentifier"/>
    <constructor-arg type="org.jhove2.core.format.Format$Type" value="Format"/>
    <constructor-arg type="org.jhove2.core.format.Format$Ambiguity" value="Unambiguous"/>
    <property name="aliasIdentifiers">
        <set value-type="org.jhove2.core.I8R">
            <ref bean="CommaSeparatedValuesIdentifier"/>
            <ref bean="CommaSeparatedValuesPUID1"/>
            <ref bean="CommaSeparatedValuesMIMEType"/>
            <ref bean="CommaSeparatedValuesRFC4180"/>
        </set>
    </property>
    <property name="aliasNames">
        <set>
            <value>CSV</value>
            <value>DSV</value>
        </set>
    </property>
    <property name="specifications">
        <list value-type="org.jhove2.core.Document">
            <ref bean="CsvSpec"/>
        </list>
    </property>
</bean>

Format Module: Format Module Recipe

• Create package
• Place in inheritance hierarchy
• Enforce persistence requirements
• Populate static (non-user-configurable) fields
• Implement 2-argument constructor
• Create module’s Spring Bean
• Define reportable properties and associated methods
• Annotate reportable properties accessors
• Configure Message properties file
• Override parse() method
• Implement Validator interface methods

Format Module: Create Package

• Package
   – org.jhove2.module.format.csv

Format Module: Inheritance Hierarchy

• Inheritance
   – Extends org.jhove2.module.format.BaseFormatModule
   – Implements org.jhove2.module.format.Validator

Format Module: Persistence requirements

• Module must be annotated with the Berkeley DB JE @Persistent annotation
• Module must have a 0-argument constructor
• Module should not contain any non-static nested (inner) classes
• Module field type must be
   – “simple” Java type or
   – Persistent type or
   – Have a com.sleepycat.persist.model.PersistentProxy implementation created for it in package org.jhove2.persist.berkeleydpl.proxies

Annotate Reportable Properties

Format Module: Persistence requirements

import com.sleepycat.persist.model.Persistent;

// Berkeley DB JE annotation
@Persistent
public class CsvModule extends BaseFormatModule implements Validator {
    /**
     * No-arg constructor required by persistence layer
     */
    @SuppressWarnings("unused")
    private CsvModule() {
        this(null, null);
    }
    …

Format Module: Non-configurable fields

@Persistent
public class CsvModule extends BaseFormatModule implements Validator {
    /** CSV module version identifier. */
    public static final String VERSION = "n.n.n";
    /** CSV module release date. */
    public static final String RELEASE = "yyyy-mm-dd";
    /** CSV module rights statement. */
    public static final String RIGHTS = "Copyright YYYY by "
        + "Copyright holder name "
        + "Available under the terms of the BSD license.";
    /** Module validation coverage. */
    public static final Coverage COVERAGE = Coverage.Inclusive;
    /** CSV validation status. */
    protected Validity validity;

Format Module: Two-argument Constructor

/**
 * @param format
 * @param formatModuleAccessor
 */
public CsvModule(Format format, FormatModuleAccessor formatModuleAccessor) {
    super(VERSION, RELEASE, RIGHTS, format, formatModuleAccessor);
    this.validity = Validity.Undetermined;
}
…

Format Module: Spring Bean

<bean id="CSVModule" class="org.jhove2.module.format.csv.CsvModule" scope="prototype">
    <constructor-arg ref="CommaSeparatedValuesFormat"/>
    <!-- persistence manager bean ref; same for all format modules -->
    <constructor-arg ref="FormatModuleAccessor"/>
    <property name="developers">
        <list value-type="org.jhove2.core.Agent">
            <ref bean="CSVAgent"/>
        </list>
    </property>
</bean>

<!-- Module author bean -->
<bean id="CSVAgent" class="org.jhove2.core.Agent" scope="singleton">
    <constructor-arg type="java.lang.String" value="CSV Author Name"/>
    <!-- Personal or Corporate -->
    <constructor-arg type="org.jhove2.core.Agent$Type" value="Personal"/>
    <property name="URI" value="http://www.csvagent.org/"/>
</bean>

Format Module: Reportable Properties: CSV Base Definition

file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D ;as per section 6.1 of RFC 2234 [2]
DQUOTE = %x22 ;as per section 6.1 of RFC 2234 [2]
LF = %x0A ;as per section 6.1 of RFC 2234 [2]
CRLF = CR LF ;as per section 6.1 of RFC 2234 [2]
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

From RFC 4180 (http://www.ietf.org/rfc/rfc4180.txt)

Format Module: Reportable Properties: CSV Complications

• Delimiter character might be “;” instead of “,”
• EOL might be “\n” instead of “\r\n”
• EOL might be embedded in contents of field
• Different implementations escape the escape character differently
   – "" vs. \"
• Last record in file might not have EOL
• All records might not have same number of fields
• Some implementations trim leading/trailing whitespace in escaped fields
• Some implementations allow characters other than ASCII-printable characters
• No syntactic way to detect if first record is “header” record
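Several of these complications can be handled by a small state machine. The sketch below is an illustration, not the CSV module's parser: CsvSplitter is a hypothetical class. It accepts a configurable delimiter, both "\n" and "\r\n" line endings, delimiters and EOLs embedded in quoted fields, the doubled-quote ("") escape, and a last record without EOL. It does not handle the backslash escape style or whitespace trimming.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a tolerant CSV record splitter. */
final class CsvSplitter {
    static List<List<String>> parse(String text, char delimiter) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean pending = false; // true once the current record has any content
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // delimiter or EOL inside quotes is data
                }
            } else if (c == '"') {
                inQuotes = true;
                pending = true;
            } else if (c == delimiter) {
                record.add(field.toString());
                field.setLength(0);
                pending = true;
            } else if (c == '\n' || c == '\r') {
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;                   // treat CRLF as a single EOL
                }
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
                pending = false;
            } else {
                field.append(c);
                pending = true;
            }
        }
        if (pending) {                     // last record without a trailing EOL
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }
}
```

Whether the first record is a header cannot be decided here, for the reason given in the last bullet: the syntax does not say.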

Format Module: CSV Reportable Properties

• Delimiter character
• EOL character(s)
• Escape character
• Escape character sequence within field
• Number of records
• Number of fields
   – First record
   – Max
   – Min
   – Per record
• Field names from header row
• Count of records with embedded EOL
• Count of records with embedded escape characters
• Count of records with leading/trailing whitespace in escaped fields
• Does last record in file have EOL?
• Does file contain characters other than ASCII-printable ones?
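The field-count properties in this list are simple to derive once the records are available. The following is a minimal sketch under that assumption: CsvStats is a hypothetical helper, not part of JHOVE2, and it takes already-split records as input.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper deriving record and field counts from parsed records. */
final class CsvStats {
    final int recordCount;
    final int fieldCountFirstRecord;
    final int fieldCountMax;
    final int fieldCountMin;
    final List<Integer> fieldsPerRecord = new ArrayList<>();

    CsvStats(List<List<String>> records) {
        int max = 0;
        int min = Integer.MAX_VALUE;
        for (List<String> record : records) {
            fieldsPerRecord.add(record.size());
            max = Math.max(max, record.size());
            min = Math.min(min, record.size());
        }
        recordCount = records.size();
        fieldCountFirstRecord = records.isEmpty() ? 0 : records.get(0).size();
        fieldCountMax = max;
        fieldCountMin = records.isEmpty() ? 0 : min; // avoid reporting MAX_VALUE for empty input
    }

    /** True when every record has the same number of fields. */
    boolean allRecordsSameWidth() {
        return fieldCountMax == fieldCountMin;
    }
}
```

Reporting min, max, and per-record counts separately lets an assessment rule decide later whether ragged records matter.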

Format Module: CSV Reportable Properties

• Add significant properties as protected fields to module class
   – Might need to create ancillary @Persistent class to reflect model of format
   – Class should extend org.jhove2.core.reportable.AbstractReportable
• Create public accessors for those fields
• Annotate accessors with @ReportableProperty annotation

Format Module: Reportable Properties: Fields

// Add significant properties as protected fields
protected String delimiterCharacter;
protected String eolString;
protected String escapeCharacter;
protected String escapeCharacterSequenceWithinField;
protected int recordCount;
protected int fieldCountFirstRecord;
protected int fieldCountMax;
protected int fieldCountMin;
protected List<Integer> fieldsPerRecord;
protected List<String> fieldNames;
protected int recordsWithEmbeddedEolCount;
protected int recordsWithEmbeddedEscapeCharCount;
protected int recordsWithUntrimmedWhitespaceCount;
protected boolean eolInLastRecord;
protected boolean containsNonAsciiPrintableChars;

Format Module: Reportable Properties: Accessors

// Create public accessors for reportable properties fields
public String getDelimiterCharacter() {...}
public String getEolString() {...}
public String getEscapeCharacter() {...}
public String getEscapeCharacterSequenceWithinField() {...}
public int getRecordCount() {...}
public int getFieldCountFirstRecord() {...}
public int getFieldCountMax() {...}
public int getFieldCountMin() {...}
public List<Integer> getFieldsPerRecord() {...}
public List<String> getFieldNames() {...}
public int getRecordsWithEmbeddedEolCount() {...}
public int getRecordsWithEmbeddedEscapeCharCount() {...}
public int getRecordsWithUntrimmedWhitespaceCount() {...}
public boolean isEolInLastRecord() {...}
public boolean isContainsNonAsciiPrintableChars() {...}

Format Module: Reportable Properties: Annotation

public @interface ReportableProperty {
    /** Default description and reference value. */
    public static final String DEFAULT = "Not available.";

    /**
     * Property type: raw or descriptive. A raw property reports itself in the exact form that was found
     * in the source unit; a descriptive property reports itself in a more human-readable form.
     */
    public enum PropertyType {Default, Raw, Descriptive}

    /**
     * Ordinal position of this property relative to all properties directly defined in a class.
     */
    public int order() default 1;

    /**
     * Property reference, a citation to an external source document that defines the property.
     */
    public String ref() default DEFAULT;

    /** Property type: raw or descriptive. */
    public PropertyType type() default PropertyType.Default;

    /** Property description. */
    public String value() default DEFAULT;
}

Format Module: Reportable Properties: Annotation

@ReportableProperty(order=10,
    value="Character used to delimit fields in record.",
    ref="RFC 4180, Section 2, paragraph 4")
public String getDelimiterCharacter() {
    return delimiterCharacter;
}

Format Module: Reportable Message Properties

import org.jhove2.core.Message;

…
// (Reportable) Message properties
protected Message delimiterCharNotFoundMessage;

Format Module: Configure Message Properties File

###########################################################################
# Message templates for class org.jhove2.module.format.csv.CsvModule
###########################################################################

org.jhove2.module.format.csv.CsvModule.DelimitorCharacterNotFoundMessage=No occurrence of delimiter character {0} found in source

#

Added to file config/messages/jhove2_messages.properties
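The {0} placeholder in the template suggests positional argument substitution. The sketch below assumes java.text.MessageFormat-style substitution, which the slides do not confirm, and uses a hypothetical helper class MessageTemplateDemo rather than JHOVE2's Message class. It loads the template from properties text and fills in the delimiter argument.

```java
import java.io.IOException;
import java.io.StringReader;
import java.text.MessageFormat;
import java.util.Properties;

/** Hypothetical sketch: look up a message template by key and fill its arguments. */
final class MessageTemplateDemo {
    static final String KEY =
        "org.jhove2.module.format.csv.CsvModule.DelimitorCharacterNotFoundMessage";

    static String render(String propertiesText, String key, Object... args) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException("Cannot read message templates", e);
        }
        String template = props.getProperty(key);
        if (template == null) {
            throw new IllegalArgumentException("No template for key " + key);
        }
        // Assumption: {0}, {1}, ... are replaced positionally, as MessageFormat does.
        return MessageFormat.format(template, args);
    }
}
```

The key used in code must match the key in the properties file exactly, including the spelling "Delimitor" used in the example.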

Format Module: Message Creation

Object[] messageArgs = new Object[]{csvDelimiterChar};

delimiterCharNotFoundMessage = new Message(
    Severity.WARNING,
    Context.OBJECT,
    "org.jhove2.module.format.csv.CsvModule.DelimitorCharacterNotFoundMessage",
    messageArgs,
    jhove2.getConfigInfo());

Format Module: Override parse() method

/**
 * Parse a source unit.
 * @param jhove2 JHOVE2 framework
 * @param source Source unit
 * @param input CSV source input
 * @return Number of bytes consumed
 * @throws EOFException
 * @throws IOException
 * @throws JHOVE2Exception
 */
@Override
public long parse(JHOVE2 jhove2, Source source, Input input)
    throws IOException, JHOVE2Exception {
    // where the real work happens
    // parse the Source (take care of those CSV complications!!)
    // populate reportable properties
    // construct any Error, Warning, or Info messages
    return 0;
}

Format Module: Override parse() method

Some Implementation Choices:

• Write from scratch
   – TIFF
   – WAV
   – UTF-8
   – ICC
• Wrap existing Java library
   – XML
   – Beware of persistence traps: inner classes, non-persistable fields
• Wrap existing non-Java library
   – SGML
   – Beware of performance hits (shell out) or memory leaks (JNI)

Format Module: Implement Validator methods

/* (non-Javadoc)
 * @see org.jhove2.module.format.Validator#getCoverage()
 */
@Override
public Coverage getCoverage() {
    // COVERAGE is a static constant, so it is referenced without "this"
    return COVERAGE;
}

/* (non-Javadoc)
 * @see org.jhove2.module.format.Validator#isValid()
 */
@Override
public Validity isValid() {
    return this.validity;
}

Format Module: Implement Validator methods

/* (non-Javadoc)
 * @see org.jhove2.module.format.Validator#validate(org.jhove2.core.JHOVE2, org.jhove2.core.source.Source, org.jhove2.core.io.Input)
 */
@Override
public Validity validate(JHOVE2 jhove2, Source source, Input input)
    throws JHOVE2Exception {
    // Parse might already have set validity; if not, test
    // reportable fields values and set
    if (this.validity.equals(Validity.Undetermined)) {
        //...
    }
    return this.validity;
}
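What the elided branch of validate() might do can be sketched as follows. The rule shown is an assumption for illustration, not the CSV module's actual validity test: a file is treated as invalid when records have differing field counts or when it contains characters outside the ASCII-printable range. The local Validity enum is a stand-in for the JHOVE2 type; only the Undetermined constant appears in the slides, so True and False are assumed names.

```java
/** Hypothetical sketch of deriving validity from already-populated reportable fields. */
final class CsvValidityRule {
    /** Stand-in for JHOVE2's Validity; True and False are assumed constant names. */
    enum Validity { True, False, Undetermined }

    static Validity decide(Validity fromParse, int fieldCountMin, int fieldCountMax,
                           boolean containsNonAsciiPrintableChars) {
        if (fromParse != Validity.Undetermined) {
            return fromParse;          // parse() already reached a verdict
        }
        if (fieldCountMin != fieldCountMax) {
            return Validity.False;     // assumed rule: ragged records are invalid
        }
        if (containsNonAsciiPrintableChars) {
            return Validity.False;     // assumed rule: outside the TEXTDATA range
        }
        return Validity.True;
    }
}
```

Keeping the rule a pure function of the reportable fields makes it easy to cover with the good and bad sample files in a unit test.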

Format Module: Unit Test

• JUnit 4
• Important to include both good and bad sample files

Format Module: Unit Test

package org.jhove2.module.format.csv;

import static org.junit.Assert.*;
import org.junit.Before;
import org.junit.Test;

public class CsvModuleTest {
    @Before
    public void setUp() throws Exception {
    }

    @Test
    public void testValidate() {
        fail("Not yet implemented");
    }

    @Test
    public void testParse() {
        fail("Not yet implemented");
    }
}

Format Module: Unit Test: Where it Goes

Unit tests: src/test/java/org/jhove2/module/format/csv

Sample (test) files: src/test/resources/examples/csv

Spring beans for unit tests: src/test/resources/config/module/format/csv
– Update Spring configuration file filepaths-config.xml with base path of your sample file

<bean id="csvDirBasePath" class="java.lang.String">
    <constructor-arg type="java.lang.String" value="examples/csv/"/>
</bean>

Format Module Artifacts: What’s Left?

• Source code
• Configuration files
• Sample (test) files
• Documents

Format Module Artifacts: Configuration Files

• Spring IOC Bean XML configuration files
   – For Module
   – For unit test as needed
   – For assessment
• Messages properties file additions if needed
• Properties files
   – Displayer
   – Units of measure
   – Module-specific

Format Module: CSV Assessment Criteria

• Delimiter character=?
• EOL character(s)=?
• Escape character =?
• Escape character sequence =?
• All records have same number of columns?
• Contains no escaped fields with untrimmed whitespace?
• Contains no characters other than ASCII-printable?
• Contains no fields with embedded EOL?

See Richard Anderson’s workshop this afternoon!!!!

Configuration Files: “We’ve got an app for that!”

• Displayer
   – jhove2_dpfg.cmd (Windows)
   – jhove2_dpfg.sh (Unix)
• Units of measure
   – jhove2_upfg.cmd (Windows)
   – jhove2_upfg.sh (Unix)

Configuration Files: Displayer Properties

USAGE:
jhove2_dpfg.cmd <fully-qualified-classname> <output-directory-path>

Configuration Files: Displayer Properties

Example:
jhove2_dpfg.cmd org.jhove2.module.format.csv.CsvModule c:\props

Command line output:
Succesfully created displayer property file for class org.jhove2.module.format.csv.CsvModule
File can be found at c:\props\org\jhove2\module\format\csv\CsvModule_displayer.properties
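
The reported location appears to mirror the Java package structure: package segments become directories under the output directory, and the file is named after the class plus a suffix. The Java sketch below shows that mapping under this assumption, which is inferred from the example output rather than taken from the JHOVE2 source. `PropertyFilePathSketch` and `propertyFilePath` are hypothetical names.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative only; the naming scheme is inferred from the example output. */
final class PropertyFilePathSketch {

    /**
     * Maps an output directory, a fully qualified class name, and a suffix
     * such as "_displayer" to the expected properties file path.
     */
    static Path propertyFilePath(String outputDir, String className, String suffix) {
        int lastDot = className.lastIndexOf('.');
        String packagePart = lastDot < 0 ? "" : className.substring(0, lastDot);
        String simpleName = className.substring(lastDot + 1);

        Path dir = Paths.get(outputDir);
        if (!packagePart.isEmpty()) {
            // Each package segment becomes one directory level.
            for (String segment : packagePart.split("\\.")) {
                dir = dir.resolve(segment);
            }
        }
        return dir.resolve(simpleName + suffix + ".properties");
    }

    public static void main(String[] args) {
        // The slide uses c:\props; a relative directory keeps this portable.
        System.out.println(propertyFilePath(
                "props", "org.jhove2.module.format.csv.CsvModule", "_displayer"));
    }
}
```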


Configuration Files: Editable File

# _displayer.properties
# The visibility directives control the display of the properties identified by URI
# The directives can be: Always, IfFalse, IfNegative, IfNonNegative, IfNonPositive,
# IfNonZero, IfPositive, IfTrue, IfZero, Never
# A property is not displayed if its value is not consistent with the directive.
# Negative means ...,-2,-1; NonNegative means 0,1,2...
# Positive means 1,2,3,...; NonPositive means ...,-2,-1,0
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/DelimiterCharacter Always | Never
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/EolString Always | Never
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/EscapeCharacter Always | Never
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/EscapeCharacterSequenceWithinField Always | Never
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/FieldCountFirstRecord Always | Never | IfNegative | IfNonNegative | IfNonPositive | IfNonZero | IfPositive | IfZero
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/FieldCountMax Always | Never | IfNegative | IfNonNegative | IfNonPositive | IfNonZero | IfPositive | IfZero
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/FieldCountMin Always | Never | IfNegative | IfNonNegative | IfNonPositive | IfNonZero | IfPositive | IfZero
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/FieldNames Always | Never
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/FieldsPerRecord Always | Never
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/RecordCount Always | Never | IfNegative | IfNonNegative | IfNonPositive | IfNonZero | IfPositive | IfZero
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/RecordsWithEmbeddedEolCount Always | Never | IfNegative | IfNonNegative | IfNonPositive | IfNonZero | IfPositive | IfZero
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/RecordsWithEmbeddedEscapeCharCount Always | Never | IfNegative | IfNonNegative | IfNonPositive | IfNonZero | IfPositive | IfZero
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/RecordsWithUntrimmedWhitespaceCount Always | Never | IfNegative | IfNonNegative | IfNonPositive | IfNonZero | IfPositive | IfZero
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/isContainsNonAsciiPrintableChars Always | Never | IfTrue | IfFalse
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/isEolInLastRecord Always | Never | IfTrue | IfFalse


Configuration Files: Editable File

# _displayer.properties
# The visibility directives control the display of the properties identified by URI
# The directives can be: Always, IfFalse, IfNegative, IfNonNegative, IfNonPositive,
# IfNonZero, IfPositive, IfTrue, IfZero, Never
# A property is not displayed if its value is not consistent with the directive.
# Negative means ...,-2,-1; NonNegative means 0,1,2...
# Positive means 1,2,3,...; NonPositive means ...,-2,-1,0

http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/DelimiterCharacter Always | Never

http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/FieldCountFirstRecord Always | Never | IfNegative | IfNonNegative | IfNonPositive | IfNonZero | IfPositive | IfZero

http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/isContainsNonAsciiPrintableChars Always | Never | IfTrue | IfFalse


Configuration Files: Editable File

# _displayer.properties
# The visibility directives control the display of the properties identified by URI
# The directives can be: Always, IfFalse, IfNegative, IfNonNegative, IfNonPositive,
# IfNonZero, IfPositive, IfTrue, IfZero, Never
# A property is not displayed if its value is not consistent with the directive.
# Negative means ...,-2,-1; NonNegative means 0,1,2...
# Positive means 1,2,3,...; NonPositive means ...,-2,-1,0

http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/DelimiterCharacter Always

http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/FieldCountFirstRecord IfPositive

http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/isContainsNonAsciiPrintableChars IfTrue
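
To make the effect of the edited directives concrete, the Java sketch below reads a displayer properties fragment with `java.util.Properties` and decides whether a value would be shown. It is an illustration of the rules stated in the file's comments, not JHOVE2's actual displayer code. `DisplayerDirectiveSketch`, `load`, and `isVisible` are hypothetical names, and hiding a non-numeric value under a numeric directive is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative only; not the JHOVE2 displayer implementation. */
final class DisplayerDirectiveSketch {

    /** Parses properties text; "http\://" keys are unescaped to "http://". */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the sign (-1, 0, 1) of a numeric value, or null if not numeric. */
    private static Integer signOf(Object value) {
        if (!(value instanceof Number)) {
            return null;
        }
        return (int) Math.signum(((Number) value).doubleValue());
    }

    /** True if the value is consistent with the directive and should be shown. */
    static boolean isVisible(String directive, Object value) {
        Integer sign = signOf(value);
        switch (directive) {
            case "Always":        return true;
            case "Never":         return false;
            case "IfTrue":        return Boolean.TRUE.equals(value);
            case "IfFalse":       return Boolean.FALSE.equals(value);
            // Assumption: a non-numeric value is hidden under numeric directives.
            case "IfNegative":    return sign != null && sign < 0;
            case "IfNonNegative": return sign != null && sign >= 0;
            case "IfPositive":    return sign != null && sign > 0;
            case "IfNonPositive": return sign != null && sign <= 0;
            case "IfZero":        return sign != null && sign == 0;
            case "IfNonZero":     return sign != null && sign != 0;
            default:
                throw new IllegalArgumentException("Unknown directive: " + directive);
        }
    }

    public static void main(String[] args) {
        String uri = "http://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/FieldCountFirstRecord";
        Properties props = load(
                "http\\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/FieldCountFirstRecord IfPositive\n");
        String directive = props.getProperty(uri);
        System.out.println(directive + " with value 3 -> " + isVisible(directive, 3));
        System.out.println(directive + " with value 0 -> " + isVisible(directive, 0));
    }
}
```

The backslash before the colon matters: in the Java properties format an unescaped `:` ends the key, so the URI must be written as `http\://`.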


Configuration Files: Units of Measure Properties

USAGE:
jhove2_upfg.cmd <fully-qualified-classname> <output-directory-path>


Configuration Files: Units of Measure Properties

Example:
jhove2_upfg.cmd org.jhove2.module.format.csv.CsvModule c:\props

Command line output:
Succesfully created unit property file for class org.jhove2.module.format.csv.CsvModule
File can be found at c:\props\org\jhove2\module\format\csv\CsvModule_unit.properties


Configuration Files: Editable File

# Units of measure properties
# Note: These unit of measure labels are descriptive only; changing the label
# does NOT change the determination of the underlying property value.
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/RecordCount
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/RecordsWithUntrimmedWhitespaceCount
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/RecordsWithEmbeddedEscapeCharCount
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/FieldCountMax
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/FieldCountMin
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/FieldCountFirstRecord
http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/RecordsWithEmbeddedEolCount


Configuration Files: Editable File

# Units of measure properties
# Note: These unit of measure labels are descriptive only; changing the label
# does NOT change the determination of the underlying property value.

http\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/RecordsWithEmbeddedEolCount record
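
Because the unit label is descriptive only, a consumer would simply append it to the reported value. The Java sketch below shows one way such a label could be applied after loading the file with `java.util.Properties`; it is an assumption about usage, not JHOVE2 code, and `UnitLabelSketch` and `withUnit` are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Illustrative only; not part of JHOVE2. */
final class UnitLabelSketch {

    /** Appends the configured unit label, if any, to a property value. */
    static String withUnit(Properties units, String propertyUri, long value) {
        // A key with no label (or a missing key) yields an empty string.
        String unit = units.getProperty(propertyUri, "").trim();
        return unit.isEmpty() ? Long.toString(value) : value + " " + unit;
    }

    public static void main(String[] args) throws IOException {
        String base = "http://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/";
        String file =
                "# Units of measure properties\n"
              + "http\\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/RecordsWithEmbeddedEolCount record\n"
              + "http\\://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/RecordCount\n";

        Properties units = new Properties();
        units.load(new StringReader(file));

        System.out.println(withUnit(units, base + "RecordsWithEmbeddedEolCount", 2));
        System.out.println(withUnit(units, base + "RecordCount", 10));
    }
}
```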


Format Module Artifacts: What’s Left?

• Source code
• Configuration files
• Sample (test) files
• Documents
– Format Module Specification Document
• “We’ve got an app for (part of) that!”


Documentation: Specification Sections

1. Introduction
2. Identification
3. References
4. Terminology and Conventions
5. Validity
6. Format Profiles
7. Reportable Properties
8. Configuration
9. Implementation Notes


Documentation: Minimal Template Edit

1. Introduction
2. Identification
3. References
4. Terminology and Conventions
5. Validity
6. Format Profiles
7. Reportable Properties
8. Configuration
9. Implementation Notes


Documentation: Sections from Tabular Data

1. Introduction
2. Identification
3. References
4. Terminology and Conventions
5. Validity
6. Format Profiles
7. Reportable Properties
8. Configuration
9. Implementation Notes


Documentation: Write “By Hand”

1. Introduction
2. Identification
3. References
4. Terminology and Conventions
5. Validity
6. Format Profiles
7. Reportable Properties
8. Configuration
9. Implementation Notes


Documentation: Module Specification Recipe

• Create module specification from Word template
• Generate tabular information (reportable properties)
• Use Word macro to format tabular information for pasting into module specification
• Complete other sections
• Add specification document to JHOVE2 wiki


Documentation: Create Tabular Data

• Generate tabular information (reportable properties) for format module specification
– jhove2_doc.cmd (Windows)
– jhove2_doc.sh (Unix)


Documentation: Create Tabular Data

USAGE:
jhove2_doc.cmd <fully-qualified-classname> <output-directory-path>


Documentation: Create Tabular Data

• Outputs
– CsvModule_id.txt (Section 2: Identification)
– CsvModule_ref.txt (Section 3: References)
– CsvModule_Reportable_properties.txt (Section 7: Reportable properties)


Documentation: Format Tabular Data with Macro

• Edit the output file in WordPad or Notepad (to save with MS line endings)
• Follow the instructions in the macro file to create formatted text
• Copy and paste into the specification document


Documentation: Create Tabular Data

In generated file:

Property DelimiterCharacter
Identifier http://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/DelimiterCharacter
Type java.lang.String
Description Character used to delimit fields in record.
Reference RFC 4180, Section 2, paragraph 4


Documentation: Create Tabular Data

Property     DelimiterCharacter
Identifier   http://jhove2.org/terms/property/org/jhove2/module/format/csv/CsvModule/DelimiterCharacter
Type         java.lang.String
Description  Character used to delimit fields in record.
Reference    RFC 4180, Section 2, paragraph 4


Documentation: Module Specification Recipe

• Create module specification from Word template
• Generate tabular information (reportable properties)
• Use Word macro to format tabular information for pasting into module specification
• Complete other sections
• Add specification document to JHOVE2 wiki


Questions?
http://jhove2.org

[email protected]@listserv.ucop.edu

CDL
Stephen Abrams
Patricia Cruse
John Kunze
Isaac Rabinovitch
Marisa Strong
Perry Willett

Stanford University
Richard Anderson
Tom Cramer
Hannah Frost

Portico
John Meyer
Sheila Morrissey

Library of Congress
Martha Anderson
Justin Littman

With help from
Walter Henry
Nancy Hoebelheinrich
Keith Johnson
Evan Owens

Advisory Board
Deutsche Nationalbibliothek
DSpace / MIT
Ex Libris
Fedora Commons / Rutgers
Florida Center for Library Automation
Harvard University
Koninklijke Bibliotheek
National Archives (UK)
National Archives (US)
National Library of Australia
National Library of New Zealand
Nationalbibliothek
Bibliothèque nationale de France (BnF)
Planets / Universität zu Köln
Tessella