Automated Metadata Creation: Possibilities and Pitfalls
Presented by Wilhelmina Randtke
June 10, 2012
Nashville, Tennessee
At the annual meeting of the North American Serials Interest Group.
Materials posted at www.randtke.com/presentations/NASIG.html
This program presents an overview of automated indexing and automated metadata creation, and then discusses a project completed last summer at the Florida State University Law Research Center (formerly the Law Library), which used computer-created metadata to index individual pages of a looseleaf resource. The program will cover an overview of machine-created metadata. Internet search engines rely on it almost exclusively, and some library projects and database companies also use automated indexing. The program will highlight an index and search designed to retrieve pages from a looseleaf resource as each page appeared on a specific date over a 20-year period. This search is located at www.fsulawrc.com. The project was indexed using scripting to extract most metadata. Staff then completed missing metadata fields and audited for errors. I will present on the cost-effectiveness of automated metadata creation, given error rates and costs for human- and machine-produced metadata, and give an overall assessment of its potential for digital library projects. The goal is to help catalogers know what is possible, what is difficult, and what is easy when using techniques for automated metadata creation. Presenter: Wilhelmina Randtke, Florida State University Libraries - Law Research Center
• Write step-by-step instructions about how to process the Excel file
• Remember, NO DISCRETION, computers do not take well to discretion.
• Good steps:
• Go to the last line of the worksheet
• Look for the letter a or A
• Copy starting from the first number in the cell, up to and including the last number in the cell.
• Bad steps:
• Find the author’s name (this step needs to be broken into small “stupid” steps)
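The "good steps" above are small enough to translate almost line for line into code. A minimal sketch in Python, assuming the worksheet has already been read into a list of cell strings; the function name and the sample rows are hypothetical and are not taken from the project's spreadsheet, and the reading of step 2 (skip the cell if it has no letter a or A) is one possible interpretation:

```python
import re

def extract_number_span(rows):
    """Follow the three 'good steps' literally on a list of cell strings."""
    if not rows:
        return None
    # Step 1: go to the last line of the worksheet.
    last = rows[-1]
    # Step 2: look for the letter a or A (assumed here to be a required marker).
    if "a" not in last.lower():
        return None
    # Step 3: copy from the first number in the cell,
    # up to and including the last number in the cell.
    match = re.search(r"\d(?:.*\d)?", last, flags=re.DOTALL)
    return match.group(0) if match else None

# Hypothetical sample data, for illustration only.
sample_rows = ["Chapter 6A", "text of the page", "Page 12-3A (Supp. 45)"]
print(extract_number_span(sample_rows))
```

Each step needs no judgment from the computer. "Find the author’s name" has no equivalent until it is rewritten as checks of this kind.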
Writing the program
• Identify appropriate advisors.
• Remember, most IT staff on a campus just install computers in offices, etc. Programming and database planning are rare skills. The worst IT personnel will not realize that they do not have these skills.
• If an IT staff member tells you they do not know how to do something, then go back to that person for advice on all future projects.
• Try to find entry level material on coding.
• (Sadly, most computer programming instructions already assume you know some programming.)
• If outsourcing or collaborating, remember, the index is the ultimate goal. Understanding of the index needs to be in the picture. You probably have to bring it in.
Finding Advisors: Most campus IT is about carrying heavy objects
Perfection?
How close to perfection can you get?
Let’s run some code:
A spreadsheet with extracted text: http://fsulawrc.com/excelVBAfiles/23batch6A.xls
• 20 hours (high skill, high training). Category: Database work
• Coding for the automated indexing: 35 hours (would be faster for someone with a programming background). Category: Automated metadata
• Running script, and cleaning up metadata: 35 hours (skilled staff). Category: Automated metadata
• Loading database and metadata on a server: 10 hours (would be about twice as fast for someone with more database design experience). Category: Database work
• Coding online forms to speed data entry: 15 hours (skilled staff). Category: Manual metadata
• Training on documents and database design: 15 hours (unskilled staff, but done before the student assistant got set up with computer forms and permissions). Category: Manual metadata
• Metadata entry for fields the computer didn’t get: 98.25 hours (unskilled staff). Category: Manual metadata
• Auditing the database against instruction sheets which went out with supplements: 342.75 hours (skilled staff; includes training time for student assistant). Category: Auditing
Where did the time go?
[Chart: Tasks and Hours, broken down by Database Work, Digitization, Auditing, Manual Metadata Creation, and Automated Metadata Creation]
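The hours listed for each task above can be totaled by category to see where the time went. A short sketch in Python; the hours and categories are copied from the task list, while the variable names are illustrative. Digitization appears in the chart but no hours for it appear in the task list, so it is left out here:

```python
from collections import defaultdict

# (category, hours) pairs copied from the task list above.
tasks = [
    ("Database work", 20),
    ("Automated metadata", 35),
    ("Automated metadata", 35),
    ("Database work", 10),
    ("Manual metadata", 15),
    ("Manual metadata", 15),
    ("Manual metadata", 98.25),
    ("Auditing", 342.75),
]

# Add up the hours for each category.
totals = defaultdict(float)
for category, hours in tasks:
    totals[category] += hours

grand_total = sum(totals.values())

# Print the categories from largest to smallest share of the listed hours.
for category, hours in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category}: {hours:g} hours ({hours / grand_total:.1%})")
print(f"Total: {grand_total:g} hours")
```

On these figures, auditing accounts for roughly 60 percent of the listed hours, more than all metadata creation combined.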
Error rates
• Automated metadata for Supplement Number: 2.4%
• Human metadata for Supplement Number: 0.8%
• Automated metadata for Page Number, with systematic error: 1.0%
• Automated metadata for Page Number, with the systematic error removed: 0.3%
• Human metadata for Page Number: 3.1%
• Error rates for the thesis indexer on GitHub: 5% - 6%
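To make these percentages concrete, they can be converted into an expected number of wrong values per batch of records. A small sketch in Python; the rates are the ones reported above, but the batch size of 1,000 is a hypothetical figure chosen only for illustration and is not the size of the project's database:

```python
# Error rates (percent) as reported above.
error_rates = {
    "Supplement Number, automated": 2.4,
    "Supplement Number, human": 0.8,
    "Page Number, automated (with systematic error)": 1.0,
    "Page Number, automated (systematic error removed)": 0.3,
    "Page Number, human": 3.1,
}

BATCH_SIZE = 1000  # hypothetical batch size, for illustration only

def expected_errors(rate_percent, records=BATCH_SIZE):
    """Expected number of wrong values in a batch at a given error rate."""
    return round(records * rate_percent / 100, 1)

for field, rate in error_rates.items():
    print(f"{field}: about {expected_errors(rate):g} errors per {BATCH_SIZE} records")
```

Neither method wins on both fields: the script did better on Page Number, and the human did better on Supplement Number.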
Do error rates matter?
For the automated rates, the errors measured might really be OCR errors.
Most metadata will be words, not numbers.
• Words are easier for a computer to pull out. Misspellings are obvious when reviewing output.
• Words are easier for a person to pull out. Less fatigue.
Recommendations
• For practitioners:
• Consider automating a process. Is it possible to index this without human involvement?
• Understand what IT support is available. Support can be someone who picks the appropriate tool, then you apply it.
• For administrators:
• Allow work time for this type of experimentation.
Good resources to get started
• A-PDF to Excel Extractor
• A program that takes text from PDFs and puts it in Excel.
• www.a-pdf.com/to-excel/download.htm
• This is an easy start to get source material into a format you can work with.
• Excel Visual Basic (VBA) Tutorials by Pan Pantziarka
• Almost all training material on coding assumes you already know how to code. These tutorials are good, because they do not assume you already know anything.
• Takes you through how webcrawlers work, using the programming language Python. (A website is a string of text only, nothing more, so these concepts are similar to metadata extraction.)
• This was good, because it doesn’t assume that you know how to code already.
• These six links go to lists of all the things you can do to strings. (Remember, a string is a string of letters – it’s what you will be working with.)
• Use the terminology from here to know what term of art to put into a search engine so that you can find instructions on how to do that in whatever code you choose.
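As an illustration of the kind of string operations such lists cover, here is a short sketch in Python. The sample text is hypothetical; the comments give the usual term of art for each operation, which is the wording to put into a search engine for any other language:

```python
# Hypothetical line of extracted text, for illustration only.
line = "Supp. 45  Page 12-3A"

length = len(line)                  # length of a string
lowered = line.lower()              # case conversion
position = line.find("Page")        # search: index of a substring, -1 if absent
page_label = line[position:]        # substring (called a "slice" in Python)
parts = line.split()                # split on whitespace
expanded = line.replace("Supp.", "Supplement")  # replace
is_supplement = line.startswith("Supp")         # prefix test

print(length, position)
print(lowered)
print(page_label)
print(parts)
print(expanded)
print(is_supplement)
```

Most languages, including Excel VBA, offer the same operations under slightly different names.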