Natural Language Processing Applied to Archival Description of Textual E-records
William Underwood
Georgia Tech Research Institute
Atlanta, Georgia
WVU/NETL/ERA Workshop on Digital Preservation of Complex Engineering Data
WVU NRCCE, Morgantown, West Virginia
April 20-21, 2009
Overview
Archival Description
Method for extracting metadata from textual e-records
Use of the metadata in archival description
Next Steps
Archival Description
Archival Description includes:
◦ The titling of records that do not have titles
◦ The summary of the content of records, folders of records and series of records
◦ When time allows, the creation of other finding
Archivists cannot describe a series until the record series has been manually read and reviewed.
◦ With increasing volumes of e-records, it may be decades, even centuries, before new acquisitions are described.
In responding to FOIA requests, archivists need to be able to search collections of e-records with high precision and recall.
◦ However, at the time of responding to FOIA requests, archivists have not read all of the records, so they cannot index the records and search on document types, dates of records, authors' and addressees' names, and the topics of records.
◦ The result set of a query is a list of file names, not record titles and summaries of content.
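Precision and recall, the two search-quality measures named above, can be made concrete with a small calculation. The sketch below is illustrative only: the file names and relevance judgments are hypothetical and are not taken from any actual collection or FOIA request.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved records that are relevant
    recall    = fraction of relevant records that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result and hypothetical ground truth.
retrieved = ["a.doc", "b.doc", "c.doc", "d.doc"]
relevant = ["a.doc", "b.doc", "e.doc"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved files are relevant, and two of the three relevant files were found.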
Archival Description: Item Scope and Content Note
Descriptions of records include names of author(s) and addressees, topics, actions and sometimes dates.
Example of an item (record) description from NARA’s Archival Research Catalog (ARC)
This letter was typewritten by President George H. W. Bush and addressed to his children: George, Jeb, Neil, Marvin, and Doro. He expresses his happiness at their Christmas celebration held at Camp David, then writes concerning his conflicted feelings as he prepares for the possibility of war with Iraq.
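Once fields such as document type, author, addressees and topics have been extracted, a draft scope and content note could be assembled from a template. The following is a minimal sketch under that assumption; the field names (`doc_type`, `authors`, `addressees`, `topics`) and the sentence template are hypothetical, and the sample values are simply taken from the ARC description quoted above.

```python
def join_names(items):
    """Join a list as 'A', 'A and B', or 'A, B, and C'."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def scope_and_content_note(meta):
    """Build a draft item description from extracted metadata (hypothetical fields)."""
    note = f"This {meta['doc_type']} was written by {join_names(meta['authors'])}"
    if meta.get("addressees"):
        note += f" and addressed to {join_names(meta['addressees'])}"
    note += "."
    if meta.get("topics"):
        note += f" It concerns {join_names(meta['topics'])}."
    return note


meta = {
    "doc_type": "letter",
    "authors": ["President George H. W. Bush"],
    "addressees": ["George", "Jeb", "Neil", "Marvin", "Doro"],
    "topics": ["a Christmas celebration at Camp David",
               "the possibility of war with Iraq"],
}
note = scope_and_content_note(meta)
print(note)
```

A draft produced this way would still need review by an archivist; the template only arranges fields that were extracted upstream.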
A Method for Extracting Metadata for Archival Description
Input: Textual Document
1. Information Extraction
2. Document Type Recognition
3. Speech Act Transducer
4. Discourse Analysis for Topic Recognition
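The four steps above can be read as a pipeline in which each stage adds to a shared metadata record. The skeleton below sketches only that control flow; the stage bodies are empty placeholders, and the function names and metadata keys are hypothetical rather than those of the system described in the slides.

```python
from typing import Callable, Dict, List

Metadata = Dict[str, object]
Stage = Callable[[str, Metadata], Metadata]


def information_extraction(text: str, meta: Metadata) -> Metadata:
    meta["entities"] = []      # placeholder: names, job titles, dates, ...
    return meta


def document_type_recognition(text: str, meta: Metadata) -> Metadata:
    meta["doc_type"] = None    # placeholder: e.g. "Memo"
    return meta


def speech_act_transducer(text: str, meta: Metadata) -> Metadata:
    meta["speech_acts"] = []   # placeholder: actions expressed in the record
    return meta


def topic_recognition(text: str, meta: Metadata) -> Metadata:
    meta["topics"] = []        # placeholder: topics found by discourse analysis
    return meta


# Stages run in the order listed on the slide.
PIPELINE: List[Stage] = [
    information_extraction,
    document_type_recognition,
    speech_act_transducer,
    topic_recognition,
]


def extract_metadata(text: str) -> Metadata:
    """Run every stage in order over one textual document."""
    meta: Metadata = {}
    for stage in PIPELINE:
        meta = stage(text, meta)
    return meta


result = extract_metadata("Sample document text.")
print(list(result))
```

Passing the accumulated record to each stage lets later stages build on earlier output, as document type recognition does with the annotated text from the information extractor.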
Information Extraction: Method
Information extraction (semantic tagging) is a technology used to identify and annotate semantic categories in text (e.g. names of persons, organizations and locations, job titles, dates).
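As a toy illustration of semantic tagging, the sketch below annotates dates, titled person names and job titles using regular expressions and a small word list. This is a simplified stand-in and not the extractor used in the work presented here; the patterns, the `JOB_TITLES` list and the sample sentence are all hypothetical.

```python
import re

# Hypothetical, deliberately tiny gazetteer of job titles.
JOB_TITLES = ["President", "Chief of Staff", "Press Secretary"]

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")
# Person names are recognized only when introduced by an honorific.
PERSON_RE = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\. [A-Z][a-z]+(?: [A-Z][a-z]+)*")


def annotate(text):
    """Return a sorted list of (start, end, category, matched_text) annotations."""
    annotations = []
    for m in DATE_RE.finditer(text):
        annotations.append((m.start(), m.end(), "Date", m.group()))
    for m in PERSON_RE.finditer(text):
        annotations.append((m.start(), m.end(), "Person", m.group()))
    for title in JOB_TITLES:
        for m in re.finditer(r"\b" + re.escape(title) + r"\b", text):
            annotations.append((m.start(), m.end(), "JobTitle", m.group()))
    return sorted(annotations)


sample = "On December 31, 1990, Dr. Jane Smith met the Press Secretary."
annotations = annotate(sample)
for start, end, category, span in annotations:
    print(category, repr(span), start, end)
```

Keeping character offsets with each annotation allows later stages to refer back to positions in the original text.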
Annotated Person Names and Job Titles
Information Extraction: Performance
Document Types
Agenda
Bar Chart
Biography
Briefing Memo
Decision Memo
Correspondence
Diary
Executive Order
Information Memo
Job Application
List of Candidates for Federal Office
Mailing List
Memo
Minutes of Meeting
National Security Directive (NSD)
Newsletter
Nomination to Federal Office
Notes
Presidential Statement
Press Pool Report
Press Release
Referral Memo
Resume
Schedule
Signature Memo
Situation Report
Summary
Transcript of Speech
Telephone Call Recommendation
Transcript of News Conference
Document Type Recognition
Input: Annotated text from Information Extractor
1. Intellectual Element Annotator + Intellectual Element Rules
2. SUPPLE Parser/Interpreter + Document Type Grammars augmented with Semantics
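The two steps above can be illustrated with a much simpler stand-in: first mark the intellectual elements present in a document, then decide the document type from which elements were found. The sketch below uses plain regular expressions and required-element sets in place of the SUPPLE parser and document type grammars; the element names, patterns and type rules are hypothetical.

```python
import re

# Step 1 (stand-in): hypothetical rules that detect intellectual elements.
ELEMENT_RULES = {
    "ToLine": re.compile(r"^TO:", re.MULTILINE),
    "FromLine": re.compile(r"^FROM:", re.MULTILINE),
    "SubjectLine": re.compile(r"^SUBJECT:", re.MULTILINE),
    "Salutation": re.compile(r"^Dear\b", re.MULTILINE),
    "Closing": re.compile(r"^(?:Sincerely|Respectfully)\b", re.MULTILINE),
}

# Step 2 (stand-in): a document type is recognized when all of its
# required elements are present. A real grammar would also check order.
DOCUMENT_TYPE_RULES = [
    ("Memo", {"ToLine", "FromLine", "SubjectLine"}),
    ("Correspondence", {"Salutation", "Closing"}),
]


def find_elements(text):
    """Return the set of intellectual element names detected in the text."""
    return {name for name, pattern in ELEMENT_RULES.items() if pattern.search(text)}


def recognize_type(text):
    """Return the first document type whose required elements are all present."""
    elements = find_elements(text)
    for doc_type, required in DOCUMENT_TYPE_RULES:
        if required <= elements:
            return doc_type
    return None


memo = "TO: The President\nFROM: Chief of Staff\nSUBJECT: Schedule\n\nBody text."
letter = "Dear George,\n\nThank you for your note.\n\nSincerely,\nBill"
print(recognize_type(memo), recognize_type(letter), recognize_type("Loose notes."))
```

Set membership ignores the order and nesting of elements, which is the kind of structure a grammar-based parser would be able to capture.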