NATIONAL LIBRARY OF MEDICINE
PubMed Central
Martha Fishel
National Library of Medicine
CENDI Meeting
September 15, 2004
What is PubMed Central?
• Digital archive of life sciences journals
  • includes health policy, bioinformatics and other fields
• Participation is open to journals:
  • covered by a major abstracting/indexing service
  • or that have 3 editorial board members with current grants from major non-profit funding agencies
• Free access to full-text articles and supporting data
• Integrated with PubMed and other bibliographic and factual databases in NCBI’s Entrez network
PMC Basic Policy
• Journal deposits an authoritative electronic copy that meets PMC data quality standards
  • full-text XML
  • original high-resolution graphics
  • PDF
  • supplementary data
• Journal may delay free access (up to 2 years)
  • research articles usually free in one year or less
• Copyright is retained by publisher or author
• Deposits – and free access permissions – are permanent
  • journal may stop depositing new material but may not withdraw material already deposited
Back Issue Digitization
• Objective: Create a complete digital archive of PMC journals back to volume 1
• Cover-to-cover digital copy of everything up to where journal began producing electronic copy
• (includes articles, covers, TOCs, advertisements and administrative matter)
• Publisher gets free, unencumbered copy
Back Issue Digitization
• 1st set of scanned journals covered 62 titles (and title variations) for approximately 2.5 million pages
• As of September 2004, 193,436 scanned articles are included in PMC
• Starting September 2004, a new cooperative agreement with the Wellcome Trust and JISC in the UK will cover an additional 1.7 million pages or more
Titles Scanned Back to Volume 1
• Antimicrobial Agents and Chemotherapy, v. 1, 1972
• BMLA, v. 1, 1911
• Clinical Microbiology Reviews, v. 1, 1988
• J of Bacteriology, v. 1, 1916
• J Clinical Investigation, v. 1, 1924
• J Clinical Microbiology, v. 1, 1975
• J Virology, v. 1, 1967
• Molecular and Cellular Biology, v. 1, 1981
• Nucleic Acids Research, v. 1, 1974
• Texas Heart Institute Journal, v. 1, 1974
Scanning Specifications
• 1-bit B&W 600 dpi G4 TIFFs
• 8-bit 300 dpi grayscale TIFF
• 24-bit 300 dpi color TIFF for illustrations
• Unedited prime OCR (5-pass engine)
• PDF with hidden text (searchable OCR “hidden behind” pg. images)
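The three image specifications above can be written down as data and checked mechanically. The following is a minimal sketch, not NLM's actual tooling: the names `ScanSpec`, `SPECS`, `meets_spec` and `uncompressed_bytes`, the labels "bitonal", "grayscale" and "color", and the 8.5 x 11 inch page size in the usage note are all assumptions made for illustration. The size estimate ignores TIFF headers and row padding, and G4 compression would make the stored bitonal files much smaller than the raw figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanSpec:
    bits_per_pixel: int
    dpi: int


# Values taken from the slide; the dictionary keys are hypothetical labels.
SPECS = {
    "bitonal": ScanSpec(bits_per_pixel=1, dpi=600),    # 1-bit B&W, G4 TIFF
    "grayscale": ScanSpec(bits_per_pixel=8, dpi=300),  # 8-bit grayscale TIFF
    "color": ScanSpec(bits_per_pixel=24, dpi=300),     # 24-bit color TIFF
}


def meets_spec(kind, bits_per_pixel, dpi):
    """Return True if a scan's reported bit depth and resolution match the spec."""
    spec = SPECS[kind]
    return bits_per_pixel == spec.bits_per_pixel and dpi == spec.dpi


def uncompressed_bytes(kind, width_in, height_in):
    """Approximate raw pixel data size for a page of the given size in inches."""
    spec = SPECS[kind]
    pixels = round(width_in * spec.dpi) * round(height_in * spec.dpi)
    return pixels * spec.bits_per_pixel // 8


if __name__ == "__main__":
    for kind in SPECS:
        size_mb = uncompressed_bytes(kind, 8.5, 11) / 1_000_000
        print(f"{kind}: about {size_mb:.1f} MB uncompressed per page")
```

For an assumed 8.5 x 11 inch page, the raw pixel data comes to roughly 4.2 MB for the 600 dpi bitonal image, 8.4 MB for grayscale and 25.2 MB for color, which gives a sense of why deliverables of this kind arrive on DVDs.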
Back Issue Scanning Tasks
• Secure permission to digitize
• Acquire disposable content
  • Donor sources include publishers, associations, individuals
• Create issue-level inventory
• Prepare content for digitization – create journal style sheets
• Pack and ship materials
QA Tasks
• Receive deliverables (DVDs) at NLM (NCBI)
• Run automated QA programs
• Mark random issues from each title for manual QA
• Perform QA (NLM contractor compares digitized image to original volumes pulled from NLM shelves)
• Accept or Reject a batch based on rigid criteria
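The step of marking random issues from each title for manual QA could look something like the sketch below. It is illustrative only: the function name `pick_issues_for_manual_qa`, the `per_title` sample size and the issue labels are hypothetical, and the slides do not say how many issues are actually sampled per title.

```python
import random


def pick_issues_for_manual_qa(issues_by_title, per_title=2, seed=None):
    """Pick up to `per_title` random issues from each title in a batch.

    `issues_by_title` maps a journal title to a list of issue labels.
    Passing a seed makes the selection reproducible for auditing.
    """
    rng = random.Random(seed)
    selection = {}
    for title, issues in issues_by_title.items():
        k = min(per_title, len(issues))  # never ask for more than exist
        selection[title] = sorted(rng.sample(issues, k))
    return selection


if __name__ == "__main__":
    # Hypothetical batch contents, for illustration only.
    batch = {
        "J Virology": ["v.1 no.1", "v.1 no.2", "v.1 no.3", "v.1 no.4"],
        "Nucleic Acids Research": ["v.1 no.1", "v.1 no.2", "v.1 no.3"],
    }
    print(pick_issues_for_manual_qa(batch, per_title=2, seed=2004))
```

Sampling per title, rather than across the whole batch, ensures that every journal in a delivery gets at least some manual inspection.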
QA Criteria
• XML Character and Tag accuracy – 99.95%
• Inventory Error – 100%
• Image quality – 100%
  • Distortion
  • Color
  • Visible pixelation
• OCR quality – 100% for completeness and zoning (unedited)
• PDF sequence and source accuracy – 100%
Sample Batch Status Report

Category            Check (threshold)               # of samples   # of samples failed   % failed   Status
XML                 Tag content accuracy (99.95%)   5135 tags      12 tags               0.23%      Failed
PDF                 Visible skew (99%)              675 pages      1 page                0.15%      OK
OCR file
Full Page Image
Illustration Image
Other

Recommended action: Reject
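The arithmetic behind the sample report can be reproduced in a few lines. This sketch assumes that each percentage in parentheses is a minimum required accuracy (so the allowed failure rate is its complement) and that a single failed check is enough to reject the batch. Both are readings of the slide, not documented NLM rules, and the names `QACheck` and `recommend` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QACheck:
    name: str
    samples: int
    failed: int
    required_accuracy: float  # percent, e.g. 99.95

    @property
    def percent_failed(self) -> float:
        return 100.0 * self.failed / self.samples

    @property
    def passed(self) -> bool:
        # Assumption: allowed failure rate = 100% minus required accuracy.
        return self.percent_failed <= 100.0 - self.required_accuracy


def recommend(checks):
    """Assumed rule: reject the batch if any single check fails."""
    return "Accept" if all(c.passed for c in checks) else "Reject"


# Figures from the sample report.
checks = [
    QACheck("Tag content accuracy", samples=5135, failed=12, required_accuracy=99.95),
    QACheck("Visible skew", samples=675, failed=1, required_accuracy=99.0),
]

if __name__ == "__main__":
    for c in checks:
        status = "OK" if c.passed else "Failed"
        print(f"{c.name}: {c.percent_failed:.2f}% failed, {status}")
    print("Recommended action:", recommend(checks))
```

With these numbers, 12 of 5135 tags is about 0.23% failed against an allowed 0.05%, so the XML check fails, while 1 of 675 pages is about 0.15% against an allowed 1%, so the skew check passes. The batch as a whole is rejected.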
Final Data Preparation
• Update Inventory database indicating issues returned
• Format TIFFs, PDFs, Organize journal parts
• Load to Preview Site for publisher review
• Load to Live PMC site
• Retain indefinitely!
Progress to Date
25,000: Issues received
1.8 million: pages scanned
156,000: XML Citations created
Challenges To Date
• Locating old, rare copies in good condition
• Scanning and delivering fill-in pages at NLM
• Feeding the pipeline
• Maintaining even workflow at NLM
• Quality Assurance (understanding requirements)