Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format
Sheila Morrissey, John Meyer, Sushil Bhattarai, Sachin Kurdikar, Jie Ling, Matthew Stoeffler, Umadevi Thanneeru
Mar 19, 2016
Portico & JSTOR: Committed to Preserving the Scholarly Record
JATS-CON 2010
ITHAKA
Ithaka helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.
Digitization for Preservation & Access
Digital Preservation
“Dark Archive”
“Light Archive”
Portico Archive
• Portico’s objective is to help libraries make a secure and reliable transition from print to a reliance on e-content.
• Maintains archiving agreements with publishers to collect and preserve content.
• Receives content directly from publishers.
• Preserves:
  – Current journals (born digital)
  – Back file journals (reborn digital)
  – E-books
  – Digitized historical collections
An “Insurance Policy” for e-Content
• Provide libraries with access to archived content when it becomes lost, orphaned, or abandoned (regardless of a library’s past or current subscriptions):
  1. Publisher ceases operation
  2. Publisher discontinues title
  3. Publisher drops back file
• Provide libraries with post-cancellation access, if the publisher specifically names Portico.
• About 90% of titles in the Archive are covered by Portico post-cancellation access rights.
• Libraries are asked to pay an annual Archive support payment to defray the cost of preservation (the “insurance premium”).
Portico Archive as of July 19, 2010
Category                         Files         %
Images                           84,215,731    47.93%
Publisher Supplied Text          47,393,731    26.98%
Portico Created Archival Text    43,689,083    24.87%
Application Specific Files       232,732       0.13%
Multi-file Packages              140,333       0.08%
Videos                           20,604        0.01%
Audio                            570           <0.00%
Executable                       6             <0.00%
Total                            175,692,826   100%
• 114 publisher participants
• 11,788 committed journal titles
• 43,253 committed e-books
• 13 committed digitized collections
• >14 million articles ingested
• 688 library participants (48% outside US)
• 4 trigger events
• 15 post-cancellation access claims
Portico Preservation Infrastructure
• Publisher supplies the XML source file (including the text and images) and a PDF page rendition.
• Best approach for preserving the intellectual content of the article or book.
• Authenticate: verify that preserved content is what it purports to be.
• Verify format: ensure the file meets the syntactic and semantic rules of the format specification.
• Repair
• Normalize (XML)
• Create preservation metadata
• Assess archival robustness of file format.
• Migrate files to ensure future usability of content.
• Replicate objects and metadata to protect against bit rot and media deterioration
• Render articles to meet viewing requirements of delivery platform.
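Two of the steps above, authentication and format verification, can be illustrated with a minimal sketch. It assumes that authentication amounts to comparing a SHA-256 digest against one recorded at receipt, and that format verification starts with an XML well-formedness check; the slides do not say how Portico implements either step, and the function names here are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def fixity_digest(data: bytes) -> str:
    """Return a SHA-256 digest used as a fixity value (assumed algorithm)."""
    return hashlib.sha256(data).hexdigest()


def is_well_formed(data: bytes) -> bool:
    """Check XML well-formedness only; this is not DTD validation."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


def check_file(data: bytes, expected_digest: str) -> dict:
    """Run both checks and report the outcome of each."""
    return {
        "authentic": fixity_digest(data) == expected_digest,
        "well_formed": is_well_formed(data),
    }


sample = b"<article><front/><body/></article>"
report = check_file(sample, fixity_digest(sample))
print(report)
```

Well-formedness covers only the syntactic half of the format check; the semantic rules of a format specification would need validation against the DTD, which the Python standard library does not provide.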
Key Challenges for an Archival DTD
In December 2001, Inera’s “E-Journal Archive DTD Feasibility Study” highlighted these key challenges for an archival DTD:
• Use of generated and boilerplate text, especially in
  – Label text for figure captions
  – Citation text
  – Author name and affiliation
  – Dates
• Expression of links between author and affiliation
• Reference elements
• Expression of non-article and other content
• Abbreviations and definitions
Key Challenges for an Archival DTD
• Keywords
• Sections, including handling of sections without headers
• Placement of floating objects, such as figures, tables, graphs
• Tables, including cell formatting issues (cells with figures, content alignment, etc.)
• Math
• Intra-, inter- and extra-article linking
• Publisher-specific elements
When reviewing the minutes of the Working Group and the evolution of the DTD, we can confirm that these areas have been the main focus of discussion.
Some Design Constraints
• IMPLIED, not REQUIRED attributes
• CDATA instead of controlled list
• Optional Elements, or relaxed order of elements
• Surprising location of Elements
• No Domain Specific Elements
Publisher/Domain Specific Elements
• Custom-Meta
  – Business Data
  – Allowed in journal-meta, article-meta, front-stub
  – Name/Value pair (may contain 38 different Elements)
• Named-Content
  – Semantic Significance
  – Allowed in 112 Elements
  – May contain 59 different Elements
Challenges posed by source DTDs
Extended Semantics for Named-Content
• Price in Citation
  – Becomes <named-content content-type="price">
<citation reference="1" id="R1" type="serial">
  <author order="1">
    <name><first>S. P.</first><last>Morgan</last></name>
  </author>
  <journal>
    <sertitle>J. Appl. Phys.</sertitle>
    <URI type="ISSN">0030-3941</URI>
    <price>$01.00</price>
    <volume>29</volume>
    <pages><first>1358</first><last>1368</last></pages>
    <pubdate>1958</pubdate>
  </journal>
  <title>General solution of the Luneburg lens problem</title>
</citation>
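The mapping described on this slide, where a publisher-specific element with no direct target equivalent becomes named-content with its old name kept in content-type, might be sketched as follows. This is an illustration only, not Portico's actual transform: the helper name to_named_content is hypothetical, the choice to keep only an id attribute is an assumption, and mapping the rest of the citation to the target DTD is out of scope here.

```python
import xml.etree.ElementTree as ET


def to_named_content(elem):
    """Rename elem in place to <named-content>, keeping its old name as content-type.

    Assumption: only an id/ID attribute is carried over; other attributes are dropped.
    """
    content_type = elem.tag.lower()
    elem_id = elem.get("id") or elem.get("ID")
    elem.tag = "named-content"
    elem.attrib.clear()
    elem.set("content-type", content_type)
    if elem_id:
        elem.set("id", elem_id)
    return elem


# A fragment of the source citation shown on the slide.
journal = ET.fromstring(
    "<journal><sertitle>J. Appl. Phys.</sertitle>"
    "<price>$01.00</price><volume>29</volume></journal>"
)
for price in list(journal.iter("price")):
    to_named_content(price)

converted = ET.tostring(journal, encoding="unicode")
print(converted)
```

Because the element is renamed in place, its text, children, and position in the parent are preserved, which matters for mixed content such as an affiliation inside a footnote paragraph.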
Challenges posed by source DTDs
More Extended Semantics for Named-Content
• Affiliation in Footnotes/P
  – Becomes <named-content content-type="aff" id="AFF2">
<FOOTNOTE ID="N101" TYPE="AFF">
  <P ALPHABET="LATIN" TYPE="INDENT">
    <AFF ID="AFF2"><IT>Corresponding author address:</IT> Nicholas M. J. Hall, Dept. of Atmospheric and Oceanic Sciences, McGill University, 805 Sherbrooke St. W., Montreal PQ H3A 2K6, Canada.</AFF>
  </P>
</FOOTNOTE>
Challenges posed by source DTDs
More Extended Semantics for Named-Content
• Funding in Acknowledgments/P
  – Becomes <named-content content-type="funding">
<ack>
  <sectitle>ACKNOWLEDGMENTS</sectitle>
  <p>Q.W.’s research is partially supported by AFOSR Grant No. <funding source="USAFOSR"><contract>F49550-05-1-0025</contract></funding> and NSF Grants No. <funding source="NSF"><contract>DMS-0204243</contract></funding>, No. <funding source="NSF"><contract>DMS-0605029</contract></funding>, and No. <funding source="NSF"><contract>DMS-0626180</contract></funding>. P.Z. is partially supported by the special funds for major State Research Projects <funding source="UNSPECIFIED"><contract>2005CB321704</contract></funding> and National Science Foundation of China for Distinguished Young Scholars <funding source="NSFC"><contract>10225103</contract></funding>. H.Z.’s work is supported in part by the Naval Postgraduate School Research Initiation Program.</p>
</ack>
Challenges posed by source DTDs
More Extended Semantics for Named-Content
• Organization Division in Affiliation
  – Becomes <named-content content-type="division">
<Affiliation ID="Aff12">
  <OrgDivision>Optisches Institut</OrgDivision>
  <OrgName>Technische Universität Berlin</OrgName>
  <OrgAddress>
    <City>Berlin</City>
    <Country>Germany</Country>
  </OrgAddress>
</Affiliation>
Challenges posed by source DTDs
More Extended Semantics for Named-Content
• Generic Element (addinfo)
  – Becomes <named-content content-type="addinfo">
<ref-conf id="CIT0045">
  <ref-conf-text>
    <author-ref-text><surname>Bishop</surname> <givenname>CJ</givenname></author-ref-text>,
    <author-ref-text><surname>Aanenses</surname> <givenname>DM</givenname></author-ref-text>,
    <author-ref-text><surname>Jordan</surname> <givenname>GE</givenname></author-ref-text>,
    <author-ref-text><surname>Kilian</surname> <givenname>M</givenname></author-ref-text>,
    <author-ref-text><surname>Hanage</surname> <givenname>WP</givenname></author-ref-text>,
    <author-ref-text><surname>Spratt</surname> <givenname>BG.</givenname></author-ref-text>
    <presentationtitle>Electronic taxonomy: assigning strains to bacterial species via the internet</presentationtitle>.
    <collectworktitle>BMC Biology</collectworktitle>
    <publicationfield-text><year>2009</year>; <year>7</year></publicationfield-text>: <firstpage>3</firstpage>.
    <addinfo>doi:10.1186/1741-7007-7-3</addinfo>.
  </ref-conf-text>
</ref-conf>
Challenges posed by source DTDs
Target DTD Structural Constraints that force the use of Named-Content
• Table in Table
  – TD contains named-content, which contains a table
  <td><named-content content-type="table"><table-wrap>
• Figure in Table
  – TD contains named-content, which contains a fig
  <td><named-content content-type="figure"><fig>
• Display-Formula in Title
  – Title contains named-content, which contains a display-formula
  <title><named-content content-type="display-formula"><display-formula>
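The structural workaround above, where a child that the target DTD does not allow directly in its parent is wrapped in named-content, might look like the following sketch. The helper name wrap_in_named_content is hypothetical, and the sample cell is invented for illustration; only the element names and content-type values come from the slide.

```python
import xml.etree.ElementTree as ET


def wrap_in_named_content(parent, child_tag, content_type):
    """Wrap each direct child named child_tag in a <named-content> element.

    The wrapper takes the child's position and trailing text, so the
    surrounding mixed content of the parent is left unchanged.
    """
    for index, child in enumerate(list(parent)):
        if child.tag != child_tag:
            continue
        wrapper = ET.Element("named-content", {"content-type": content_type})
        wrapper.tail = child.tail
        child.tail = None
        parent.remove(child)
        wrapper.append(child)
        parent.insert(index, wrapper)


# Hypothetical table cell holding a nested table.
td = ET.fromstring('<td>See <table-wrap id="T2"/> here</td>')
wrap_in_named_content(td, "table-wrap", "table")
print(ET.tostring(td, encoding="unicode"))
```

The same call with ("fig", "figure") or, on a title element, ("display-formula", "display-formula") would cover the other two cases listed on the slide.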
Challenges posed by source DTDs
• Question/Answer
  – Generic and Structural
  – Is saying <list list-content="question"> enough?
<Question-Answer>
  <Q><P><L>1</L>. The major advantage of amniotic membrane transplantation in pterygium surgery is</P></Q>
  <A><P><L>A</L>. reduction in surgical time</P></A>
  <A><P><L>B</L>. preservation of conjunctiva</P></A>
  <A><P><L>C</L>. better cosmetic outcomes compared with conjunctival autografting</P></A>
  <A><P><L>D</L>. lowest recurrence rate among the surgical techniques</P></A>
</Question-Answer>
Challenges posed by source DTDs
• Synonymy
  – Domain and Semantic
  – Is saying <list list-content="synonymy"> enough?
  – Or <named-content content-type="synonymy"> because of the semantic meaning?
<SYNONYMY>
  <HEAD>ECHINOSTELIALES</HEAD>
  <ITEM><P><GENSP>Clastoderma debaryanum</GENSP> A. Blytt</P></ITEM>
  <ITEM><P><GENSP>Echinostelium apitectum</GENSP> K.D. Whitney, MC</P></ITEM>
  <ITEM><P><GENSP>Echinostelium coelocephalum</GENSP> T.E. Brooks &amp; H.W. Keller, MC</P></ITEM>
  <ITEM><P><GENSP>Echinostelium minutum</GENSP> de Bary, MC</P></ITEM>
</SYNONYMY>
Synonyms are different scientific names that pertain to the same taxon.
Challenges posed by source DTDs
• Decision Tree (Taxonomic Key)
  – Domain, Semantic, Structural, and Presentation
<KEY>
  <COUPLET><DESCR><NO>1.</NO>Hypostomal setae (Hy) shorter than half the width of labrum</DESCR> <RESP><GENSP>Sycophila mellea</GENSP> (Curtis, 1831), <GENSP>Tetramesa </GENSP>Walker, 1848</RESP></COUPLET>
  <COUPLET><DESCR><NO></NO>--Hypostomal setae longer or about as long as half the width of labrum</DESCR> <RESP>2</RESP></COUPLET>
  <COUPLET><DESCR><NO>2.</NO>More than two dorsal setae (D) present on abdominal segments A6-8</DESCR> <RESP>3</RESP></COUPLET>
  <COUPLET><DESCR><NO></NO>--At least one of abdominal segments A6-8 with only two dorsal setae</DESCR> <RESP>4</RESP></COUPLET>
  <COUPLET><DESCR><NO>3.</NO>Mandibles bidentate</DESCR> <RESP><GENSP>E. (Ahtola) atra</GENSP> (Walker, 1832)</RESP></COUPLET>
  <COUPLET><DESCR><NO></NO>--Mandibles unidentate</DESCR> <RESP><GENSP>E. nodularis</GENSP> Boheman</RESP></COUPLET>
  <COUPLET><DESCR><NO>4.</NO>Mandibles bidentate</DESCR> <RESP><GENSP>Eurytoma appendigaster</GENSP> group</RESP></COUPLET>
  <COUPLET><DESCR><NO></NO>--Mandibles unidentate</DESCR> <RESP><GENSP>Eurytoma heriadi</GENSP> Zerova</RESP></COUPLET>
</KEY>
A decision tree is a tree-like model of decisions and their possible outcomes.
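To make the decision structure in the KEY markup concrete, the sketch below reads an abridged copy of the sample (the first four couplets) into a dictionary of numbered steps. The element names come from the slide; the pairing rule (a COUPLET with an empty NO is the alternative lead of the preceding numbered step) and the reading of an all-digit RESP as a jump to another step are assumptions inferred from this one sample, and parse_key is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Abridged from the sample on the slide: steps 1 and 2 only.
KEY_XML = """<KEY>
<COUPLET><DESCR><NO>1.</NO>Hypostomal setae (Hy) shorter than half the width of labrum</DESCR> <RESP><GENSP>Sycophila mellea</GENSP> (Curtis, 1831), <GENSP>Tetramesa </GENSP>Walker, 1848</RESP></COUPLET>
<COUPLET><DESCR><NO></NO>--Hypostomal setae longer or about as long as half the width of labrum</DESCR> <RESP>2</RESP></COUPLET>
<COUPLET><DESCR><NO>2.</NO>More than two dorsal setae (D) present on abdominal segments A6-8</DESCR> <RESP>3</RESP></COUPLET>
<COUPLET><DESCR><NO></NO>--At least one of abdominal segments A6-8 with only two dorsal setae</DESCR> <RESP>4</RESP></COUPLET>
</KEY>"""


def parse_key(key_elem):
    """Group couplets into steps: {step number: [(lead text, outcome), ...]}."""
    steps = {}
    current = None
    for couplet in key_elem.iter("COUPLET"):
        no_elem = couplet.find("DESCR/NO")
        number = (no_elem.text or "").strip().rstrip(".")
        if number:
            current = number  # a numbered couplet opens a new step
        lead = (no_elem.tail or "").strip().lstrip("-")
        outcome = "".join(couplet.find("RESP").itertext()).strip()
        steps.setdefault(current, []).append((lead, outcome))
    return steps


steps = parse_key(ET.fromstring(KEY_XML))
for number, leads in steps.items():
    for lead, outcome in leads:
        kind = "go to step" if outcome.isdigit() else "taxon:"
        print(number, "|", lead, "->", kind, outcome)
```

The branching is carried entirely by element order and by the text of NO and RESP, which is why a generic list or named-content wrapper in the target DTD would keep the words but not the tree.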
Concluding Question
How to support Publisher/Domain Specific constructs in the Archival DTD?
• Continue use of Named-Content
• New Miscellaneous Element
• Support for adding namespaced elements
• Other
Questions/Answers?
Thank you
John Meyer
Director of Data Technologies
100 Campus Drive, Suite 100
Princeton, NJ 08540
609 [email protected]