BABY ELEPHÃT – BUILDING AN ANALYTICAL BIBLIOGRAPHY FOR A PROSOPOGRAPHY IN EARLY ENGLISH IMPRINT DATA
NUSHRAT KHAN
Oxford-Illinois Digital Libraries Placement Programme
Jan 19, 2016
ABOUT EEBO-TCP
- Collaboration between the universities of Oxford and Michigan from 1999-2015
- Early English Texts between 1473-1700
- 25000 texts made available online
- Full text searching available through EEBO-TCP Database
WHY HISTORIC TEXTS ARE INTERESTING
Historic Datasets
Accessibility
Reveal Historical Information
Semantic Web
Technical Interoperability
Future Research
WORKSET CONSTRUCTOR
Enables workset creation from Person, Place, Subject, Genre and Dates parameters (http://eeboo.oerc.ox.ac.uk/)
HOW DOES IT WORK?
Metadata extracted from TEI → Data Clean Up → Link the Data
Workflow of Publishing Structured Metadata
Available Metadata Fields
• Title
• Author Name
• Date (Precise Birth, Precise Death, precise-floruit-from, precise-floruit-to)
• Raw Publication Place
• Raw Publication Date
• Publisher
SAMPLE PUBLISHER DATA
Publisher
By Rycharde Iugge, printer to the Quenes Maiestie,
Printed by I[ohn] C[harlewood] for Iohn Hinde, dwelling in Paules Church-yarde, at the signe of the golden Hinde,
Printed by Benjamin Took and John Crook, and are to be sold by Mary Crook & Andrew Crook ...,
Printed by Peter Smith, and at Saint-Omer at the English College Press],
s.n.],
[By J. Charlewood] for Edward White, dwelling at the little North doore of Paules Church, at the signe of the Gunne,
Imprinted by Richard Field, and are to be sold by Richard Garbrand [, Oxford],
[By I. Jaggard?] for M. S[parke.,
Imprinted by E: G[riffin]: for Iohn Budge, and Ralph Mab,
By [J. King for?] Iohn waley dwellyng in Foster lane,
INSIDE THE DATA
Work
Printed By
Sold At
Printed For
Sold By
Printed At
Punctuation found in the field: ? : . [ ] , … “ ”
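The punctuation listed above has to be dealt with before names can be extracted. The slides do not say which cleaning rules were applied, so the following is only a minimal sketch under assumed rules: drop editorial brackets but keep their contents, drop question marks and ellipses, and trim stray commas. The function name `clean_imprint` is hypothetical.

```python
import re

def clean_imprint(raw):
    """Normalise one raw publisher string (assumed rules, not the project's own)."""
    text = raw.replace("[", "").replace("]", "")        # drop brackets, keep what they enclose
    text = text.replace("?", "")                        # drop uncertainty markers
    text = text.replace("...", " ").replace("…", " ")   # drop ellipses
    text = re.sub(r"\s+", " ", text)                    # collapse runs of whitespace
    return text.strip(" ,")                             # trim stray spaces and commas at the ends

print(clean_imprint("Printed by I[ohn] C[harlewood] for Iohn Hinde,"))
```

Dropping "?" and the brackets discards the cataloguer's uncertainty, so a fuller pipeline would probably record that information somewhere before cleaning.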
WORKFLOW
Data Cleaning
Named Entity Extraction (Person – Printed by, Printed for and Sold by)
Store triples and generate RDF
Happy querying!
ENTITY RECOGNITION APPROACHES
NLTK Entity Extractor
Regular Expression
REVERB
For automatically identifying and extracting binary relationships from English sentences
Input: Raw text
Bananas are an excellent source of potassium
Output: Argument1, Relation Phrase, Argument2
(bananas, be source of, potassium)
OPEN CALAIS
Not as effective on short texts, e.g. Printed by A. Bells
Input text too short
Example sentence: Printed by Melchisedech Bradwood for William Aspley
Cannot detect as a person
NLTK ENTITY RECOGNIZER
Step 1: Extracted all the entities labeled as PERSON for each sentence
work_000001|Rycharde Iugge
work_000003|Pauls
work_000004|Iohn Charlewood
work_000004|Iohn Hinde
work_000005|Ioan Danter
work_000006|Francis Grove
work_000007|Henry Goddus
work_000008|Arthur Iohnson
work_000012|Leonard Lichfield
work_000013|Langly Curtis
work_000014|Benjamin Took
work_000014|John Crook
work_000014|Mary Crook
work_000014|Andrew Crook
work_000015|William Keblewhite
All the entities NLTK can extract for each record (with some limitations)
LIMITATIONS OF NLTK
• NLTK does not identify initials as names, e.g. A. B.
• Extracts only the surname from expressions like A. Bells or Edw: Allde
• Identifies the word “Printer” as a name when it is capitalized and follows ‘by’, e.g. Printed by John Bill, Printer to the King's most Excellent Majesty
• In complex sentences containing multiple names, it cannot reliably detect and extract all the names
FINDING RELATIONSHIPS WITHIN SENTENCES
("'Printed", 'JJ')
('by', 'IN')
(PERSON Benjamin/NNP Took/NNP)
('and', 'CC')
(PERSON John/NNP Crook/NNP)
('and', 'CC')
('are', 'VBP')
('to', 'TO')
('be', 'VB')
('sold', 'VBN')
('by', 'IN')
(PERSON Mary/NNP Crook/NNP)
('&', 'CC')
(PERSON Andrew/NNP Crook/NNP)
Look for preceding preposition
Separate the entities based on ‘by’ or ‘for’
“You're having a hard time because it's hard. This is really not an easy task to approach.” – jonrsharpe, Jul 31 '14
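The preceding-preposition rule can be sketched in plain Python. This is an illustration rather than the project's code: the `Person` type, the `group_by_relation` function and the simplified token list (PERSON chunks already collapsed to a single name) are all hypothetical, and the default of "printed" for an imprint that opens with a bare "By ..." is an assumption.

```python
from collections import defaultdict
from typing import NamedTuple

class Person(NamedTuple):
    """Stand-in for a PERSON chunk from the entity recognizer (hypothetical type)."""
    name: str

# Map the verbs seen in imprints to the relation they signal.
VERBS = {"printed": "printed", "imprinted": "printed", "sold": "sold"}
PREPOSITIONS = {"by", "for"}

def group_by_relation(tokens):
    """Attach each person to the most recent 'verb + preposition' pair before it."""
    groups = defaultdict(list)
    verb = "printed"   # assumed default when the imprint starts with a bare "By ..."
    relation = None
    for tok in tokens:
        if isinstance(tok, Person):
            if relation is not None:
                groups[relation].append(tok.name)
            continue
        word = tok[0].strip("'\"").lower()
        if word in VERBS:
            verb = VERBS[word]
        elif word in PREPOSITIONS:
            relation = verb + " " + word
    return dict(groups)

tokens = [
    ("'Printed", "JJ"), ("by", "IN"), Person("Benjamin Took"), ("and", "CC"),
    Person("John Crook"), ("and", "CC"), ("are", "VBP"), ("to", "TO"),
    ("be", "VB"), ("sold", "VBN"), ("by", "IN"), Person("Mary Crook"),
    ("&", "CC"), Person("Andrew Crook"),
]
print(group_by_relation(tokens))
```

Names that the recognizer misses or mislabels pass straight through this step, so it inherits every limitation of the entity extraction before it.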
DATA REFINING
Printed & Sold by
Sold By
Printed By
Printed For
De-duplicate the ‘Sold by’
Put back the ones in ‘Printed and Sold by’
Extracted separately using ‘Regex’
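The slides do not show the regular expression itself. As a rough sketch of how the combined ‘Printed and Sold by’ case might be picked out, the pattern below assumes the names run from the phrase up to the next comma and are joined by "and" or "&". The pattern, the function name `printed_and_sold_by` and the sample string are illustrative assumptions.

```python
import re

# Matches "Printed and sold by", "Printed & Sold by", "Imprinted and sold by", ...
# and captures the text up to the next comma (assumed to hold the names).
PRINTED_AND_SOLD = re.compile(
    r"\b(?:im)?printed\s*(?:and|&)\s*sold\s+by\s+([^,]+)",
    re.IGNORECASE,
)

def printed_and_sold_by(imprint):
    """Return the names following 'Printed and sold by', or [] if the phrase is absent."""
    match = PRINTED_AND_SOLD.search(imprint)
    if match is None:
        return []
    # Split joint names on "and" / "&".
    names = re.split(r"\s+(?:and|&)\s+", match.group(1))
    return [name.strip() for name in names if name.strip()]

print(printed_and_sold_by("Printed and sold by Henri Hills,"))
```

People found this way can then be added to both the ‘Printed by’ and ‘Sold by’ groups, which is one reading of the "put back" step above.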
GENERATING UNIQUE URI
Ideal case: assign one unique URI to each person, shared by every mention of that person
Why this was not feasible here:
• Few authoritative sources to refer to
• Time-consuming validation
• Very limited information available about each person
Instead, assigned a unique URI to every instance
Python uuid module – uuid4() function
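A minimal sketch of per-instance URI minting with `uuid.uuid4()`. The base URI and the function name `mint_person_uri` are hypothetical, since the slides do not give the real namespace.

```python
import uuid

# Hypothetical namespace; the real base URI is not shown in the slides.
BASE_URI = "http://example.org/eeboo/person/"

def mint_person_uri():
    """Return a fresh URI for one person mention, built from a random UUID."""
    return BASE_URI + str(uuid.uuid4())

# The same name mentioned twice gets two different URIs.
mentions = ["Benjamin Took", "John Crook", "Benjamin Took"]
uris = [(name, mint_person_uri()) for name in mentions]
for name, uri in uris:
    print(name, uri)
```

Because every mention gets its own identifier, merging mentions of the same person is left to a later reconciliation step.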
WORKING WITH ONTOLOGY
Checked existing ontologies for ‘Printed by’ and ‘Printed for’ relationships: MODS, MADS, BIBFRAME, etc.
EEBOO Ontology
Modify the existing ontology to define the new relationships
Work
Author
Printed By
Printed For
Sold By
STORING TRIPLES AND GENERATING RDF
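The slides do not show how the triples were serialized. As one possible illustration, the sketch below writes a work-to-person relationship as N-Triples lines using only the standard library. All namespaces and the property names `printedBy`, `printedFor` and `soldBy` are hypothetical stand-ins for the EEBOO ontology terms.

```python
import uuid

# Hypothetical namespaces and property names (the real ontology terms are not shown).
WORK_NS = "http://example.org/eeboo/work/"
PERSON_NS = "http://example.org/eeboo/person/"
ONTOLOGY_NS = "http://example.org/eeboo/ontology#"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

PROPERTIES = {"printed by": "printedBy", "printed for": "printedFor", "sold by": "soldBy"}

def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')

def triples_for(work_id, relation, name):
    """Return two N-Triples lines: work -> person, and person -> name label."""
    work = "<" + WORK_NS + work_id + ">"
    person = "<" + PERSON_NS + str(uuid.uuid4()) + ">"   # one new URI per mention
    prop = "<" + ONTOLOGY_NS + PROPERTIES[relation] + ">"
    label = '"' + escape_literal(name) + '"'
    return [
        work + " " + prop + " " + person + " .",
        person + " <" + RDFS_LABEL + "> " + label + " .",
    ]

for line in triples_for("work_000014", "printed by", "Benjamin Took"):
    print(line)
```

An RDF library would normally handle escaping and serialization; the hand-written version here only keeps the sketch free of dependencies.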
QUERYING ON THE DATA 1
Top 20 Publishers
Top 20 Printed for
Top 20 Sold By
QUERYING ON THE DATA 2
The sellers for the works published by Henri Hills
Both Printed and Sold by Henri Hills
Sellers who worked with Henri Hills: Will Larner, Jane Underhill, Francis Smith
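The query language used is not shown in the slides. To illustrate the logic of the "sellers for a printer" question, here is a plain-Python stand-in over a small in-memory list of triples. The work identifiers and the pairing of people to works are toy data made up for the example, and `sellers_for_printer` is a hypothetical name.

```python
# Toy (work, relation, person) triples; work ids and pairings are invented for illustration.
triples = [
    ("work_a", "printed by", "Henri Hills"),
    ("work_a", "sold by", "Will Larner"),
    ("work_b", "printed by", "Henri Hills"),
    ("work_b", "sold by", "Henri Hills"),
    ("work_c", "printed by", "Benjamin Took"),
    ("work_c", "sold by", "Mary Crook"),
]

def sellers_for_printer(triples, printer):
    """Return the sorted names of everyone who sold a work printed by `printer`."""
    works = {w for w, rel, p in triples if rel == "printed by" and p == printer}
    return sorted({p for w, rel, p in triples if rel == "sold by" and w in works})

print(sellers_for_printer(triples, "Henri Hills"))
```

Since every mention has its own URI, a real query would have to match people by name label, as this sketch does, rather than by identifier.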
FUTURE DIRECTION
• Train NLTK to capture the names properly
• Extract specific place names from the publisher field, e.g. sold at Golden Hinde
• For initials, work out how to identify the full name, e.g. whether R. Charles is Robert Charles or Ruth Charles; possibly request help from a domain expert
• Analyze how name expressions have changed over time
• Identify the authors using authoritative sources and domain-specific knowledge, e.g. London Book Trades Index, British Book Trade Index
• Analyze and visualize the data by mapping
GRATITUDE
• Terhi Nurmikko-Fuller
• David M. Weigl
• Professor David De Roure
• Kevin Page
• Pip Willcox
And everybody else at OeRC!