Page 1
IBM Haifa Research Lab
© 2014 IBM Corporation
PreMapper: Improving Entity Extraction Accuracy in the Digital Humanities
Cormac Hampson ([email protected]), Ella Rabinovich ([email protected]), Sara Porat ([email protected]), Maya Koleva ([email protected]), Ivan Uzunov ([email protected]), Owen Conlan ([email protected])
Page 2
What is CULTURA
• Digital humanities portal supporting the exploration of cultural heritage collections by a range of different users
  • Professional researchers and historians
  • Students with little or no experience of a particular archive
• There are three digitised collections in the portal
  • 1641 Depositions (http://cultura-project.eu/1641/)
  • Bureau of Military History - 1916 Rising (http://cultura-project.eu/1916)
  • IPSA Collection (http://cultura-project.eu/ipsa)
Page 3
Smart Content Analysis with Entity-Relationship Extraction
• A powerful technique for injecting semantics into unstructured text
  • Employs Natural Language Processing (NLP)
  • Involves training on a dataset and/or using prior knowledge (e.g., dictionaries) so that specific entities can be identified within the text
• Each collection introduces its own entity-relationship model
  • Entities, e.g., Person, Location, Event
  • Entity attributes, e.g., Person.occupation, Deposition.mentioned_date
  • Relations between entities, e.g., Person at Location
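An entity-relationship model of this kind can be sketched as a pair of small data types. This is only an illustration, not the CULTURA schema: the class names, field names and sample values (the occupation and the location) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_type: str                                 # e.g. "Person", "Location", "Event"
    name: str
    attributes: dict = field(default_factory=dict)   # e.g. {"occupation": ...}


@dataclass
class Relation:
    source: Entity
    relation_type: str                               # e.g. "at"
    target: Entity


# Hypothetical sample values, used only to show the shape of the model.
person = Entity("Person", "Robert Andrew", {"occupation": "knight"})
location = Entity("Location", "Dublin")
relation = Relation(person, "at", location)          # Person at Location
```

Keeping attributes in a free-form dictionary lets each collection define its own attribute names without changing the types.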
Page 4
Entity Extraction – Example
<title> first-name last-name
sir Robert Andrew
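A pattern such as <title> first-name last-name can be approximated with a regular expression. The sketch below is a minimal illustration, not the extractor used in CULTURA: the list of titles and the sample sentence are assumptions, and real deposition text would need far more tolerant matching.

```python
import re

# Hypothetical dictionary of titles; a real system would use a curated list.
TITLES = ("sir", "lord", "lady", "captain")

# Title is matched case-insensitively; names must start with a capital letter.
PERSON_PATTERN = re.compile(
    r"\b(?P<title>(?i:" + "|".join(TITLES) + r"))\s+"
    r"(?P<first_name>[A-Z][a-z]+)\s+"
    r"(?P<last_name>[A-Z][a-z]+)\b"
)


def extract_people(text):
    """Return one dict per <title> first-name last-name match in the text."""
    return [match.groupdict() for match in PERSON_PATTERN.finditer(text)]


# Made-up sentence built around the name shown on the slide.
people = extract_people("the said sir Robert Andrew knight")
```

A rigid pattern like this misses variant spellings, which is one reason the extracted entities later need manual correction.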
Page 5
Manual Updates of Extracted Entities - Motivation
• Automatic entity extraction cannot achieve full accuracy
• The 1641 Depositions collection introduces additional difficulty due to noisy text and inconsistent grammar and spelling
• Extraction errors can damage a curator’s trust in the automatic processing, as well as an end user’s overall confidence in the system
• Approaches that improve the accuracy of entity extraction are of major benefit to the CULTURA environment
Page 6
Entities Visualisation and Modification with PreMapper
• PreMapper is a web-based visualisation and analysis tool that is integrated into the CULTURA environment
• Provides visualisation and editing of entities, maps, flows and relationships between individuals and groups
• Entities (people, organisations) are represented by nodes; links represent relationships between these nodes
Page 7
Manual Changes of Extracted Entities
PreMapper enables curators of the collection to make manual changes to the extracted entities using a GUI
• Add/delete/update entity
• Merge two entities into a single entity (entity disambiguation)
• Add/delete relationship between entities
The entity “sir phelim” can be merged with the entity “phelim neil” if an expert deems that these entities refer to the same person
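The merge operation can be pictured on a simple node-and-link graph. The sketch below is an assumption about how such an edit might work, not PreMapper's implementation: the class, its methods, the entity ids and the location node are all hypothetical.

```python
class EntityGraph:
    """Toy graph: entities are nodes, relationships are labelled links."""

    def __init__(self):
        self.nodes = {}     # entity id -> set of names seen for that entity
        self.links = set()  # (source id, relation, target id)

    def add_entity(self, entity_id, name):
        self.nodes[entity_id] = {name}

    def add_relationship(self, source, relation, target):
        self.links.add((source, relation, target))

    def merge(self, keep, drop):
        """Fold entity `drop` into `keep` and re-point its links."""
        self.nodes[keep] |= self.nodes.pop(drop)
        rewired = set()
        for source, relation, target in self.links:
            source = keep if source == drop else source
            target = keep if target == drop else target
            if source != target:  # skip self-loops created by the merge
                rewired.add((source, relation, target))
        self.links = rewired


graph = EntityGraph()
graph.add_entity("p1", "sir phelim")
graph.add_entity("p2", "phelim neil")
graph.add_entity("l1", "a location")      # hypothetical location node
graph.add_relationship("p2", "at", "l1")
graph.merge("p1", "p2")                   # expert decides both are one person
```

Keeping every name of the merged entity preserves the original spellings, so a later reviewer can still see what was combined.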
Page 8
General Flow
Page 9
Entity Disambiguation via PreMapper
• The task of determining the identity of entities mentioned in the text
  • e.g., based on an entity’s key attributes
• Entity disambiguation in historical content is one of the main challenges for CULTURA’s professional users
  • Are “sir Phelim” and “Phelim o neil” the same person?
  • Are “Rob. Meredith” and “Robert Meredith” the same person?
  • Entity scope matters (disambiguation of entities found in the same deposition vs. entities found in different depositions)
• Non-functional challenges
  • Authorisation – who is allowed to make changes?
  • Personalisation – what is the scope of a specific change (specific researcher, group of researchers, the entire professional community)?
  • Verification – who verifies the changes?
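Questions like the two name pairs above could be pre-screened automatically before an expert decides. The sketch below is one possible heuristic, not a method described in this deck: it drops titles and treats two names as a merge candidate when every token of the shorter name is a prefix of, or is prefixed by, some token of the longer one. The title list and function names are assumptions.

```python
import re

TITLES = {"sir"}  # hypothetical list of titles to ignore when comparing


def name_tokens(name):
    """Lower-case alphabetic tokens of a name, with titles removed."""
    return [t for t in re.findall(r"[a-z]+", name.lower()) if t not in TITLES]


def prefix_compatible(x, y):
    """True if one token abbreviates the other, e.g. 'rob' and 'robert'."""
    return x.startswith(y) or y.startswith(x)


def is_merge_candidate(name_a, name_b):
    """Suggest (not decide) that two names may refer to the same person."""
    shorter, longer = sorted((name_tokens(name_a), name_tokens(name_b)), key=len)
    if not shorter:
        return False
    return all(any(prefix_compatible(s, l) for l in longer) for s in shorter)


candidates = [
    is_merge_candidate("sir Phelim", "Phelim o neil"),
    is_merge_candidate("Rob. Meredith", "Robert Meredith"),
    is_merge_candidate("Robert Meredith", "Phelim o neil"),
]
```

The heuristic is deliberately permissive and ignores scope, so its output is only a list of suggestions for a curator to confirm or reject.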
Page 10
Summary and Future Work
• Entity-relationship extraction is a powerful technique for extracting structured information from unstructured documents
• PreMapper is a visualisation tool that allows domain experts to improve the accuracy of the entity-relationship data
• Domain experts’ feedback is important in refining the user interface with the CULTURA environment
  • This feedback becomes vital when entity extraction is error-prone, as with the 1641 Depositions collection, which contains a lot of noise and misspellings
• Future work includes further exploration and design of the fully integrated end-to-end solution
http://staging1.commetric.com:8080/cultura/?q=1641&ids=836062r034&nodeTypeId=7&layout=circle#svg-graph-editor-switch