Automated metadata creation - Possibilities and pitfalls

Presented by Wilhelmina Randtke

June 10, 2012

Nashville, Tennessee

At the annual meeting of the North American Serials Interest Group.

Materials posted at www.randtke.com/presentations/NASIG.html

Teaser: Preview of the sample project.

http://www.fsulawrc.com

Background: What is “metadata”?

Metadata = any indexing information

Examples:

MARC records

color, size, etc. to allow clothes shopping on a website

writing on the spine of a book

food labels

What we'll cover

• Automated indexing: Human vs machine indexing

• Range of tools for automated metadata creation: Techy and less techy.

• Sample projects

• A little background on relational databases

• Database design for a looseleaf (a resource that changes state over time).

• Sample project: The Florida Administrative Code 1970-1983

Automated Indexing: What’s easy for computers?

Computers like black and white decisions.

Computers are bad with discretion.

Word search vs. Subject headings

One Trillion

1,000,000,000,000

webpages indexed in Google

… 4 years ago …

Nevertheless…

… Human indexing is alive and well

How to fund indexing?

http://www.ebay.com/sch/Dresses-/63861/i.html?_nkw=summer+dress

Who made the metadata: Human or Machine?

How GoogleBooks gets its metadata: http://go-to-hellman.blogspot.com/2010/01/google-exposes-book-metadata-privates.html

Not automated indexing, but a related concept….

Always try to think about how to reuse existing metadata.

High Tech automated metadata creation

The high end: Assigning subject headings with computer code

Some technologies:

• UIMA (Unstructured Information Management Architecture)

• GATE (General Architecture for Text Engineering)

• KEA (Keyphrase Extraction Algorithm)

Computer Program for Automated Indexing

Inputs: Item; Ontology or Thesaurus

Output: Subject Headings

Person’s role:

• Select an appropriate ontology.

• Configure the program so that it’s looking at outside sources.

• Review the results and make sure the assigned subject headings are good.

Program’s role:

• Take ontology or thesaurus and apply it to each item to give subject headings.
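As a rough illustration of the program's role, the following Python sketch applies a small thesaurus to an item's text to assign subject headings. It is a toy under stated assumptions: the thesaurus entries, the sample text, and the function name assign_subject_headings are all invented for illustration, and plain phrase matching is far simpler than what tools such as UIMA, GATE, or KEA actually do.

```python
import re

# Hypothetical thesaurus: preferred subject heading -> phrases that signal it.
# These entries are invented for illustration, not taken from a real ontology.
THESAURUS = {
    "Administrative law": ["administrative code", "rulemaking", "agency rule"],
    "Education": ["school", "teacher certification", "curriculum"],
    "Public health": ["sanitation", "immunization", "health department"],
}


def assign_subject_headings(text, thesaurus=THESAURUS):
    """Return the sorted list of headings whose phrases appear in the text."""
    # Lowercase and collapse runs of whitespace so OCR spacing does not matter.
    normalized = re.sub(r"\s+", " ", text.lower())
    headings = set()
    for heading, phrases in thesaurus.items():
        for phrase in phrases:
            # Match whole words only, so "school" does not match "schooling".
            if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
                headings.add(heading)
                break
    return sorted(headings)


sample = "Rules on teacher   certification adopted under the Administrative Code."
print(assign_subject_headings(sample))
```

Even in this toy form, the person's role is visible: someone still has to choose the thesaurus and review whether the assigned headings are any good.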

http://www.nzdl.org/Kea/examples1.html

The lower end: Deterministic fields

There’s an app for that

Scripts for extracting fields from a thesis posted on GitHub: https://github.com/ao5357/thesisbot

Batch OCR

Many tools exist to extract text from PDFs to Excel

Walkthrough – examining the extracted spreadsheets

http://fsulawrc.com/excelVBAfiles/index.html

How to plan the program

• Look for patterns

• Write step-by-step instructions about how to process the Excel file

• Remember, NO DISCRETION: computers do not take well to discretion.

• Good steps:

  • Go to the last line of the worksheet

  • Look for the letter a or A

  • Copy starting from the first number in the cell, up to and including the last number in the cell.

• Bad steps:

  • Find the author’s name (this step needs to be broken into small “stupid” steps)
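The "good steps" are mechanical enough to translate almost line for line into code. The sketch below does so in Python rather than Excel VBA, treating the worksheet as a plain list of cell strings. The sample rows and the function names extract_number_run and process_worksheet are made up for illustration; this is not the project's actual script.

```python
import re


def extract_number_run(cell):
    """Copy from the first digit in the cell through the last digit, inclusive."""
    # "\d.*\d" grabs first digit .. last digit; the "|\d" handles a lone digit.
    match = re.search(r"\d.*\d|\d", cell)
    return match.group(0) if match else None


def process_worksheet(rows):
    """Apply the three 'good steps' to a worksheet given as a list of cell strings."""
    if not rows:
        return None
    last_cell = rows[-1]                  # Step 1: go to the last line
    if "a" not in last_cell.lower():      # Step 2: look for the letter a or A
        return None
    return extract_number_run(last_cell)  # Step 3: copy first digit .. last digit


# Hypothetical extracted text, one string per worksheet row.
rows = ["STATE BOARD OF EDUCATION", "Chapter 6A-1"]
print(process_worksheet(rows))
```

Note that every step is a yes/no test or a fixed copy rule. Nothing here asks the computer to decide what "looks like" a chapter number.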

Writing the program

• Identify appropriate advisors.

• Remember, most IT staff on a campus just install computers in offices, etc. Programming and database planning are rare skills. The worst IT personnel will not realize that they do not have these skills.

• If an IT staff member tells you they do not know how to do something, then go back to that person for advice on all future projects.

• Try to find entry level material on coding.

• (Sadly, most computer programming instructions already assume you know some programming.)

• If outsourcing or collaborating, remember, the index is the ultimate goal. Understanding of the index needs to be in the picture. You probably have to bring it in.

Finding Advisors: Most campus IT is about carrying heavy objects

Perfection?

How close to perfection can you get?

Let’s run some code:

A spreadsheet with extracted text: http://fsulawrc.com/excelVBAfiles/23batch6A.xls

Visual Basic script: http://fsulawrc.com/excelVBAfiles/VBAscriptForFAC.docx

The files: You can retrieve some of these same files by searching 6A-1 in the main search for the database at www.fsulawrc.com

How much metadata was missing?

(27,992 fields total, after preliminary removal of blank pages)

Field | Number of empty fields | Percent of field filled
Chapt. no. before dash | 183 | 99.3%
Chapt. no. after dash | 2179 | 92.2%
Page no. | 1766 | 93.6%
Supp. no. (i.e., date page went into the looseleaf) | 3242 | 88.4%
Replacing supplement (i.e., date page was removed from the looseleaf) | All (however, 105 fields were entered manually in order to demonstrate the interface and get funding for manual metadata creation) | 0%

Cheap and fast and incomplete

This is a search engine built on an index of the automated metadata only:

http://fsulawrc.com/automatedindex.php

It’s better than a shuffled pile of 30,000 pages.

It’s not very good.

If you are thousands of miles away, then this is better than print. If you are in the same room as organized print, print might be better.

Filling in the gaps

Code helps speed workflow, but it is still time consuming.

http://fsulawrc.com/phptest/chaptbeforedashfill.php

This is editing a copy of the automated metadata database. You can enter as much as you like, and not break anything.

Last step: Auditing for missing pages, by comparing instruction sheets that went out with supplements

http://www.fsulawrc.com/supplementinstructionsheets.pdf
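In code terms, this kind of audit is a set comparison: the pages the instruction sheets say should exist versus the pages actually in the database. The Python sketch below shows the idea. The record layout (chapter, page number, supplement number), the function name audit_pages, and all sample values are hypothetical; they are not the project's real schema or data.

```python
def audit_pages(expected, indexed):
    """Compare pages listed on instruction sheets with pages in the database.

    Each page is a (chapter, page_number, supplement_number) tuple.
    Returns (missing, unexpected) as sorted lists.
    """
    expected_set = set(expected)
    indexed_set = set(indexed)
    missing = sorted(expected_set - indexed_set)     # on a sheet, not in the database
    unexpected = sorted(indexed_set - expected_set)  # in the database, not on any sheet
    return missing, unexpected


# Hypothetical data: pages an instruction sheet says were inserted ...
instruction_sheet_pages = [("6A-1", 1, 23), ("6A-1", 2, 23), ("6A-1", 3, 23)]
# ... and pages that actually made it into the database.
database_pages = [("6A-1", 1, 23), ("6A-1", 3, 23), ("6A-1", 4, 23)]

missing, unexpected = audit_pages(instruction_sheet_pages, database_pages)
print("Missing from database:", missing)
print("Not on any instruction sheet:", unexpected)
```

The comparison itself is instant. The slow part of a real audit is transcribing the instruction sheets and resolving each mismatch by hand.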

Task | Hours spent | Category of work
Inspecting looseleaf and planning a database | 20 (high skill, high training) | Database work
Digitization with sheetfed scanner | 35 (low skill, low training) | Digitization
Planning the code for automated indexing | 20 (high skill, high training) | Database work
Coding for the automated indexing | 35 (would be faster for someone with a programming background) | Automated metadata
Running script, and cleaning up metadata | 35 (skilled staff) | Automated metadata
Loading database and metadata on a server | 10 (would be about twice as fast for someone with more database design experience) | Database work
Coding online forms to speed data entry | 15 (skilled staff) | Manual metadata
Training on documents and database design | 15 (unskilled staff, but done before the student assistant got set up with computer forms and permissions) | Manual metadata
Metadata entry for fields the computer didn’t get | 98.25 (unskilled staff) | Manual metadata
Auditing the database against instruction sheets which went out with supplements | 342.75 (skilled staff; includes training time for student assistant) | Auditing
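Adding up the hours by category shows where the time went. The short Python sketch below does the arithmetic using the hours and categories from the task table; only the variable names are introduced here.

```python
from collections import defaultdict

# (hours, category) pairs copied from the task table, in table order.
TASKS = [
    (20, "Database work"),
    (35, "Digitization"),
    (20, "Database work"),
    (35, "Automated metadata"),
    (35, "Automated metadata"),
    (10, "Database work"),
    (15, "Manual metadata"),
    (15, "Manual metadata"),
    (98.25, "Manual metadata"),
    (342.75, "Auditing"),
]

# Sum the hours for each category of work.
hours_by_category = defaultdict(float)
for hours, category in TASKS:
    hours_by_category[category] += hours

total_hours = sum(hours_by_category.values())

# Print categories from most to least time-consuming.
for category, hours in sorted(hours_by_category.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {hours:g} hours ({hours / total_hours:.0%} of total)")
```

On these numbers, auditing accounts for a little over half of the 626 total hours, and the two automated metadata tasks together take 70.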

Where did the time go?

[Chart: Tasks and Hours, by category: Database Work, Digitization, Auditing, Manual Metadata Creation, Automated Metadata Creation]

Error rates

Automated metadata for Supplement Number: 2.4%

Human metadata for Supplement Number: 0.8%

Automated metadata for Page Number

with systematic error: 1.0%

with the systematic error removed: 0.3%

Human metadata for Page Number: 3.1%

Error rates for the thesis indexer on GitHub: 5% - 6%

Do error rates matter?

For the computer's error rates, we might really be measuring OCR errors.

Most metadata will be words, not numbers.

• Words are easier for a computer to pull out. Misspellings are obvious when reviewing output.

• Words are easier for a person to pull out. Less fatigue.

Recommendations

• For practitioners:

• Consider automating a process. Is it possible to index this without human involvement?

• Understand what IT support is available. Support can be someone who picks the appropriate tool, then you apply it.

• For administrators

• Allow work time for this type of experimentation.

Good resources to get started

• A-PDF to Excel Extractor

• A program that takes text from PDFs and puts it in Excel.

• www.a-pdf.com/to-excel/download.htm

• This is an easy start to get source material into a format you can work with.

• Excel Visual Basic (VBA) Tutorials by Pan Pantziarka

• Almost all training material on coding assumes you already know how to code. These tutorials are good, because they assume you do not already know how to code.

• www.techbookreport.com/tutorials/excel_vba1.html

• For more advanced instructions, use a search engine to read message boards.

• eHow instructions on how to turn on the Developer Ribbon in Excel 2007

• http://www.ehow.com/how_7175501_turn-developer-tab-excel-2007.html

(use these same instructions for Excel 2010; older versions of Excel have the developer ribbon turned on by default)

• How to get to the tab where you can do simple coding.

• How to Build a Search Engine

• http://www.udacity.com/overview/Course/cs101/CourseRev/apr2012

• Takes you through how webcrawlers work, using the programming language Python. (A website is a string of text only, nothing more, so these concepts are similar to metadata extraction.)

• This was good, because it doesn’t assume that you know how to code already.

• Wikipedia section on string processing algorithms.

• http://en.wikipedia.org/wiki/String_%28computer_science%29#String_processing_algorithms

• These six links go to lists of all the things you can do to strings. (Remember, a string is a string of letters – it’s what you will be working with.)

• Use the terminology from here to know what term of art to put into a search engine so that you can find instructions on how to do that in whatever code you choose.

• Wikipedia page on relational databases

• http://en.wikipedia.org/wiki/Relational_database

• It will be useful for you to understand primary keys, foreign keys, and tables referencing each other.
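To make primary keys and foreign keys concrete, here is a minimal sketch using Python's built-in sqlite3 module. It models a looseleaf as two tables, where each page row points at the supplement that added it and, optionally, the supplement that removed it. The table names, column names, dates, and sample rows are assumptions for illustration; they are not the schema or data of the database at www.fsulawrc.com.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
CREATE TABLE supplement (
    supp_no INTEGER PRIMARY KEY,      -- primary key: one row per supplement
    issued  TEXT NOT NULL             -- hypothetical issue date
);
CREATE TABLE page (
    page_id         INTEGER PRIMARY KEY,
    chapter         TEXT NOT NULL,
    page_no         INTEGER NOT NULL,
    added_by_supp   INTEGER NOT NULL REFERENCES supplement(supp_no),
    removed_by_supp INTEGER REFERENCES supplement(supp_no)  -- NULL = never removed
);
""")

# Hypothetical sample data; supplement numbers are assumed to increase over time.
conn.executemany("INSERT INTO supplement VALUES (?, ?)",
                 [(22, "1972-01-01"), (23, "1972-07-01")])
conn.executemany(
    "INSERT INTO page (chapter, page_no, added_by_supp, removed_by_supp) "
    "VALUES (?, ?, ?, ?)",
    [("6A-1", 1, 22, 23),     # page 1 as added by supp 22, removed by supp 23
     ("6A-1", 1, 23, None),   # its replacement, still in force
     ("6A-1", 2, 22, None)],
)


def pages_in_force(conn, supp_no):
    """Return (chapter, page_no, added_by_supp) for pages in the binder as of a supplement."""
    rows = conn.execute(
        """SELECT chapter, page_no, added_by_supp FROM page
           WHERE added_by_supp <= ?
             AND (removed_by_supp IS NULL OR removed_by_supp > ?)
           ORDER BY chapter, page_no""",
        (supp_no, supp_no),
    )
    return rows.fetchall()


print(pages_in_force(conn, 22))
print(pages_in_force(conn, 23))
```

Storing both the adding and the removing supplement on each page is what lets one query reconstruct the state of the looseleaf at any point in time.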






Special thanks to:

Jason Cronk

Anna Annino


