The task of cleaning and enriching large collections: what aspects can we share?

Feb 24, 2016



Page 1: The task of cleaning  and enriching  large collections

The task of cleaning and enriching large collections

what aspects can we share?

Page 2: The task of cleaning  and enriching  large collections

Contributing to this work

UIUC English: Ted Underwood, Jordan Sellers, Mike Black

UIUC Library: Harriett Green

I3: Loretta Auvil, Boris Capitanu

Andrew W. Mellon Foundation

Page 3: The task of cleaning  and enriching  large collections

“Enrich” as well as “clean.”

Page 4: The task of cleaning  and enriching  large collections

Yearly values of a ratio between two wordlists in three different genres. 4,275 volumes. 1700-1899.
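
A minimal sketch of how a yearly ratio between two wordlists might be computed. The slide does not define the ratio, so this assumes it is the share of list-A occurrences among all occurrences of either list; the function name, the input shape, and the wordlists are hypothetical.

```python
from collections import defaultdict

def yearly_ratio(volumes, list_a, list_b):
    """Return {year: count_a / (count_a + count_b)}, pooled over all volumes.

    `volumes` is an iterable of (year, tokens) pairs, tokens already lowercased.
    A word present in both lists is counted for list A only (an assumption).
    """
    set_a, set_b = set(list_a), set(list_b)
    counts = defaultdict(lambda: [0, 0])  # year -> [hits in A, hits in B]
    for year, tokens in volumes:
        for tok in tokens:
            if tok in set_a:
                counts[year][0] += 1
            elif tok in set_b:
                counts[year][1] += 1
    # Skip years where neither list occurs, to avoid dividing by zero.
    return {
        year: a / (a + b)
        for year, (a, b) in sorted(counts.items())
        if a + b > 0
    }
```

Running this once per genre would give one series per genre, which is the shape of the plot described here.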

Page 5: The task of cleaning  and enriching  large collections
Page 6: The task of cleaning  and enriching  large collections

“representative?”

Page 7: The task of cleaning  and enriching  large collections

analyzing the data

cleaning the data

Page 8: The task of cleaning  and enriching  large collections

“clean” is relative

Page 9: The task of cleaning  and enriching  large collections

different projects will strike a different balance between precision and recall

which makes it tricky to share resources
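
To make the trade-off concrete, here is a small sketch of the two measures for a cleaning filter. The framing (a filter that flags items, checked against a hand-labeled set) and the function name are assumptions for illustration.

```python
def precision_recall(flagged, relevant):
    """Return (precision, recall) for a set of flagged items.

    precision = share of flagged items that are truly relevant
    recall    = share of truly relevant items that were flagged
    """
    flagged, relevant = set(flagged), set(relevant)
    true_pos = len(flagged & relevant)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall
```

A strict filter that flags only two of four real errors, and nothing else, has perfect precision but 0.5 recall. A looser filter can raise recall at the cost of flagging clean text. A project that hand-checks every flag may prefer the second; one that corrects automatically may prefer the first.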

Page 10: The task of cleaning  and enriching  large collections

Cleaning the data

1. Clean up the OCR / assess error.
2. Identify parts of a volume (e.g., articles in a serial, poetry/prose).
3. Remove library bookplates and running headers, after using them for (2).
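
One possible way to find and strip running headers, sketched under a simple assumption: a header is a first line that recurs, ignoring page numbers and case, on a large share of pages. The function names and the `min_share` threshold are hypothetical, not the project's actual method.

```python
import re
from collections import Counter

def normalize(line):
    # Lowercase and drop digits/punctuation so "REVIEW 12" matches "REVIEW 13".
    return re.sub(r"[^a-z ]", "", line.lower()).strip()

def strip_running_headers(pages, min_share=0.5):
    """Remove recurring first lines; return (cleaned_pages, headers).

    `pages` is a list of pages, each a list of text lines. The detected
    headers are returned as well, so they can still be used as evidence
    before being thrown away.
    """
    first_lines = Counter(normalize(p[0]) for p in pages if p)
    threshold = min_share * len(pages)
    headers = {h for h, n in first_lines.items() if h and n >= threshold}
    cleaned = [
        p[1:] if p and normalize(p[0]) in headers else list(p)
        for p in pages
    ]
    return cleaned, headers
```

Returning the headers rather than silently discarding them reflects the point that they carry useful information about the volume's structure.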

Page 11: The task of cleaning  and enriching  large collections
Page 12: The task of cleaning  and enriching  large collections

period-specific lexica, incl. foreign lang.

collection-level observations: proper nouns, words that appear mainly in dirty docs

context of an individual doc: It is furely a mortal fin to ...

Correction rules
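
A minimal sketch of one lexicon-driven correction rule, for the long-s confusion in the example sentence (OCR reading "s" as "f"). The lexicon contents and function names are hypothetical, and tokens are assumed to be lowercase.

```python
from itertools import product

def long_s_candidates(token):
    """Yield variants of `token` with any non-empty subset of 'f' replaced by 's'."""
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    for choice in product((False, True), repeat=len(positions)):
        if not any(choice):
            continue  # skip the unchanged token
        chars = list(token)
        for pos, swap in zip(positions, choice):
            if swap:
                chars[pos] = "s"
        yield "".join(chars)

def correct_token(token, lexicon):
    """Correct a token only if it is unknown and has exactly one known variant."""
    if token in lexicon:
        return token
    matches = sorted({c for c in long_s_candidates(token) if c in lexicon})
    return matches[0] if len(matches) == 1 else token
```

With a lexicon containing "surely", "sin" and "fin", this turns "furely" into "surely" but leaves "fin" alone, because "fin" is itself a valid word. That is the case where a lexicon is not enough and the context of the individual document has to decide.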

Page 13: The task of cleaning  and enriching  large collections

Cleaning/enriching the metadata

1. “18??”
2. Discard duplicate volumes / select early editions?
3. Add metadata that you need for interpretive purposes, like
   - gender (see Ben Schmidt’s technique),
   - genre.
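
The slide does not say how dates like “18??” are handled. One possible treatment, sketched here as an assumption, is to turn each catalog date into an explicit range of years so that uncertain volumes can be filtered or binned deliberately. The function name is hypothetical.

```python
import re

def date_range(raw):
    """Map a four-character catalog date to an (earliest, latest) year pair.

    Trailing question marks stand for unknown digits: '18??' -> (1800, 1899).
    Anything that does not fit that pattern returns None.
    """
    value = raw.strip()
    if len(value) != 4 or not re.fullmatch(r"\d{1,4}\?{0,3}", value):
        return None
    return int(value.replace("?", "0")), int(value.replace("?", "9"))
```

A fully known year comes back as a range of width zero, so the same code path serves clean and uncertain records.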

Page 14: The task of cleaning  and enriching  large collections

first stab at genre – naive Bayes
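
For readers unfamiliar with the technique, a minimal multinomial naive Bayes classifier with add-one smoothing is sketched below. It illustrates the general method only; the class name, the features (bare tokens), and the smoothing choice are assumptions, not the project's implementation.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        """docs: list of token lists; labels: one genre label per doc."""
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of docs
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, tokens):
        """Return {label: log prior + summed log likelihoods}."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)
```

The gap between the top two scores gives a rough confidence signal for each prediction.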

Page 15: The task of cleaning  and enriching  large collections

Things we could share

period lexicons / variant spellings
gazetteers of proper nouns
OCR correction rules for a period
document segmentation and/or cleaned and segmented text
ferberization
cleaned / enriched metadata
code to do all of the above

Page 16: The task of cleaning  and enriching  large collections

get clues from metadata

break vols into parts

ensemble / boosting

active learning

Page 17: The task of cleaning  and enriching  large collections

active learning: documents classified as “fiction,” plotted by confidence in classification (y axis). Red points are misclassified.
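
The slide does not name a selection strategy. A common one is uncertainty sampling: send the least confidently classified documents to a human labeler, then retrain. The sketch below assumes that strategy; the function name and the tuple layout are hypothetical.

```python
def least_confident(predictions, batch_size):
    """Return the ids of the `batch_size` least confidently classified docs.

    `predictions` is a list of (doc_id, predicted_label, confidence) tuples,
    with confidence between 0 and 1.
    """
    ranked = sorted(predictions, key=lambda p: p[2])  # lowest confidence first
    return [doc_id for doc_id, _label, _conf in ranked[:batch_size]]
```

If misclassified documents cluster at the low-confidence end of such a plot, hand-labeling effort spent there is the most likely to correct real errors.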