Natural Language Processing Applied to Archival Description of Textual E-records. William Underwood, Georgia Tech Research Institute, Atlanta, Georgia. WVU/NETL/ERA Workshop on Digital Preservation of Complex Engineering Data, WVU NRCCE, Morgantown, West Virginia, April 20-21, 2009.

Mar 26, 2015

Page 1: Natural Language Processing Applied to Archival Description of Textual E-records William Underwood Georgia Tech Research Institute Atlanta, Georgia WVU/NETL/ERA.

Natural Language Processing Applied to Archival Description of Textual E-records

William Underwood
Georgia Tech Research Institute
Atlanta, Georgia

WVU/NETL/ERA Workshop on Digital Preservation of Complex Engineering Data
WVU NRCCE, Morgantown, West Virginia
April 20-21, 2009

Page 2

Overview

Archival Description
Method for extracting metadata from textual e-records
Use of the metadata in archival description
Next Steps

Page 3

Archival Description

Archival Description includes:
◦ The titling of records that do not have titles
◦ The summary of the content of records, folders of records and series of records.
◦ When time allows, the creation of other finding aids such as subject indexes to record series.

Page 4

Archival Description: Research Motivation

Archivists cannot describe a series until the record series has been manually read and reviewed.
◦ With increasing volumes of e-records, it may be decades, even centuries, before new acquisitions are described.

In responding to FOIA requests, archivists need to be able to search collections of e-records with high precision and recall.
◦ However, at the time of responding to FOIA requests, archivists have not read all of the records, so they cannot index the records and search on document types, dates of records, authors' and addressees' names and the topics of records.
◦ The result set of a query is a list of file names, not record titles and summaries of content.

Page 5

Archival Description: Item Scope and Content Note

Descriptions of records include names of author(s) and addressees, topics, actions and sometimes dates.

Example of an item (record) description from NARA’s Archival Research Catalog (ARC)

This letter was typewritten by President George H. W. Bush and addressed to his children: George, Jeb, Neil, Marvin, and Doro. He expresses his happiness at their Christmas celebration held at Camp David, then writes concerning his conflicted feelings as he prepares for the possibility of war with Iraq.

Page 6

A Method for Extracting Metadata for Archival Description

Input: Textual Document

1. Information Extraction
2. Document Type Recognition
3. Speech Act Transducer
4. Discourse Analysis for Topic Recognition

Output: [document(e1), author(e1, S), addressee(e1, H), act(e1, F(P)), topic(e1, T), date(e1, D)]
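The output predicates can be pictured as a small record type. The sketch below is only an illustration: the class name `RecordMetadata` and its `to_predicates` method are hypothetical and are not taken from the slides. It shows how the values found by the four steps could be collected for one document event and printed in the predicate notation used above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecordMetadata:
    """Hypothetical container for the metadata of one document event."""
    event: str                        # document event identifier, e.g. "e1"
    author: Optional[str] = None      # S
    addressee: Optional[str] = None   # H
    act: Optional[str] = None         # F(P): illocutionary force applied to content
    topic: Optional[str] = None       # T
    date: Optional[str] = None        # D

    def to_predicates(self) -> List[str]:
        """Render the fields that were found as predicates over the event."""
        predicates = [f"document({self.event})"]
        for name in ("author", "addressee", "act", "topic", "date"):
            value = getattr(self, name)
            if value is not None:     # a step may fail to find a value
                predicates.append(f"{name}({self.event}, {value})")
        return predicates


record = RecordMetadata("e1", author="S", addressee="H", act="F(P)", topic="T", date="D")
print(record.to_predicates())
```

Fields default to `None` so that a record can still be emitted when a step finds nothing, for example when a document has no identifiable addressee.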

Page 7

Information Extraction: Method

Information extraction (semantic tagging) is a technology used to identify and annotate semantic categories in text (e.g. names of persons, organizations and locations, job titles, dates).

1. Document Reader
2. English Tokenizer
3. Wordlist Lookup + enhanced wordlists
4. Sentence Splitter
5. Hepple POS Tagger + lexicon
6. Semantic Tagger + Named Entity Rules

Page 8

Information Extraction: Wordlist Lookup

Person_female_first.lst (8,263)
Person_female_first_ambig.lst (117)
Person_male_first.lst (3,704)
Person_male_first_ambig.lst (1,117)
Person_surname.lst (83,805)
Person_surname_ambig.lst (6,802)
Person_headofstate_90.lst (478)
Location_city_US.lst (33,017)
Location_city_us_ambig.lst (5,478)
Location_foreign_city.lst (3,802)
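Wordlist lookup can be sketched as matching each token against named lists. The example below is a minimal Python illustration, not the lookup component used in the slides: the list contents are invented stand-ins for the .lst files, the function names are hypothetical, and it assumes (the slides do not say) that the `_ambig` lists hold entries that may also be ordinary words or belong to another category.

```python
import re
from typing import Dict, List, Set, Tuple

# Tiny in-memory stand-ins for the .lst files; entries are illustrative only.
WORDLISTS: Dict[str, Set[str]] = {
    "Person_male_first": {"george", "sam"},
    "Person_surname": {"bush", "skinner"},
    "Person_surname_ambig": {"gray"},      # assumed: also an ordinary word
    "Location_city_US": {"atlanta"},
}


def tokenize(text: str) -> List[str]:
    """Split text into alphabetic tokens (a crude stand-in for a tokenizer)."""
    return re.findall(r"[A-Za-z]+", text)


def lookup(text: str) -> List[Tuple[str, str, bool]]:
    """Return (token, list name, is_ambiguous) for each token found in a list."""
    hits = []
    for token in tokenize(text):
        for list_name, entries in WORDLISTS.items():
            if token.lower() in entries:
                hits.append((token, list_name, list_name.endswith("_ambig")))
    return hits


for hit in lookup("Sam Skinner met Boyden Gray in Atlanta"):
    print(hit)
```

Keeping the ambiguous entries in separate lists lets later rules demand extra evidence, such as a preceding job title, before accepting the match as a name.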

Page 9

Java Annotation Pattern Engine (JAPE) Rules

Page 10

Annotated Person Names and Job Titles

Page 11

Information Extraction: Performance

Page 12

Document Types

Agenda
Bar Chart
Biography
Briefing Memo
Decision Memo
Correspondence
Diary
Executive Order
Information Memo
Job Application
List of Candidates for Federal Office
Mailing List
Memo
Minutes of Meeting
National Security Directive (NSD)
Newsletter
Nomination to Federal Office
Notes
Presidential Statement
Press Pool Report
Press Release
Referral Memo
Resume
Schedule
Signature Memo
Situation Report
Summary
Transcript of Speech
Telephone Call Recommendation
Transcript of News Conference

Page 13

Document Type Recognition

Input: Annotated text from Information Extractor

1. Intellectual Element Annotator + Intellectual Element Rules
2. SUPPLE Parser/Interpreter + Document Type Grammars augmented with Semantics
3. Extract Metadata

Output: [document(e1), author(e1, S), addressee(e1, H), topic(e1, T), date(e1, D)]
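The idea of recognizing a document type from its intellectual elements can be sketched with a much simpler technique than the one named above. The Python example below swaps in regular expressions for the SUPPLE grammars, and its header labels (`TO:`, `FROM:`, `SUBJECT:`, `DATE:`), function name and sample text are all assumptions made for illustration.

```python
import re
from typing import Dict, Optional

# Assumed header labels standing in for a memo's intellectual elements.
ELEMENT_PATTERNS = {
    "date": re.compile(r"^DATE:[ \t]*(.+)$", re.MULTILINE),
    "addressee": re.compile(r"^TO:[ \t]*(.+)$", re.MULTILINE),
    "author": re.compile(r"^FROM:[ \t]*(.+)$", re.MULTILINE),
    "topic": re.compile(r"^SUBJECT:[ \t]*(.+)$", re.MULTILINE),
}


def recognize_memo(text: str) -> Optional[Dict[str, str]]:
    """Return memo metadata if the header elements are present, else None."""
    found: Dict[str, str] = {}
    for name, pattern in ELEMENT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1).strip()
    # Require sender, recipient and subject before calling the text a memo.
    if {"author", "addressee", "topic"}.issubset(found):
        return {"document_type": "memo", **found}
    return None


sample = (
    "DATE: January 5, 1990\n"
    "TO: JANE DOE\n"
    "FROM: JOHN ROE\n"
    "SUBJECT: Budget Review\n\n"
    "Body text."
)
print(recognize_memo(sample))
```

A grammar-based recognizer can additionally check the order and nesting of elements, which this flat pattern match ignores.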

Page 14

Document Types: Intellectual Element Recognition

Page 15

Document Types: Grammar for the Structure of a Memorandum

Page 16

Document Types: Grammar for Memorandum with Semantic Rules

Page 17

Parse Tree and Semantics of a Document

Page 18

Extracted Metadata and Item Description

Document_Type = memo
Date = April 27, 1992
Author = SAM SKINNER
Addressee = EDE HOLIDAY
Topic = California Earthquake

A memorandum dated April 27, 1992 from Ede Holiday to Sam Skinner regarding California Earthquake.

Page 19

Speech Act Transducer

1. Annotation of Explicit Speech Acts
2. Annotation of Implicit Speech Acts
3. Annotation of Speech Acts Indicated by Text Structure
4. Annotation of Indirect Speech Acts
5. Annotation of the Primary Speech Acts

Page 20

Speech Acts

Performative verb: a verb whose action is accomplished merely by saying it or writing it.
◦ I recommend that you attend the conference.

Illocutionary force of a message
◦ recommend

Propositional content of a message
◦ you attend the conference

An explicit performative sentence is a sentence in which the illocutionary force is made explicit by naming the force.
◦ I promise to be there.

An implicit performative sentence is a sentence in which the illocutionary force is not made explicit by naming the force.
◦ I shall be there.
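The split of an explicit performative sentence into illocutionary force and propositional content can be sketched with a pattern match. This Python example is an assumption-laden illustration, not the transducer described in the slides: the verb list is a small sample, and the pattern only handles first-person sentences of the form "I <verb> (that) ...".

```python
import re
from typing import Optional, Tuple

# A small sample of performative verbs; a full lexicon would be much larger.
PERFORMATIVE_VERBS = ("recommend", "promise", "request", "suggest", "urge")

EXPLICIT = re.compile(
    r"^I\s+(" + "|".join(PERFORMATIVE_VERBS) + r")\s+(?:that\s+)?(.+?)[.!]?$",
    re.IGNORECASE,
)


def explicit_speech_act(sentence: str) -> Optional[Tuple[str, str]]:
    """Return (force, propositional content), or None if no force is named."""
    match = EXPLICIT.match(sentence.strip())
    if match is None:
        return None
    return match.group(1).lower(), match.group(2)


print(explicit_speech_act("I recommend that you attend the conference."))
print(explicit_speech_act("I promise to be there"))
print(explicit_speech_act("I shall be there"))
```

The last sentence returns `None` because it names no force, so it would be left for the implicit speech act step.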

Page 21

Speech Acts: Implicit

Declarative, imperative and interrogative sentences also express speech acts.

Declarative (state)
◦ You completed the report.

Imperative (request)
◦ Please, complete the report.

Interrogative (ask)
◦ Did you complete the report?
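The mapping from sentence type to implicit speech act can be sketched with a crude surface heuristic. In the Python example below, the function name and the short list of imperative openers are hypothetical; a real annotator would more plausibly rely on part-of-speech tags to detect an imperative.

```python
# Assumed openers that signal an imperative; illustrative, not exhaustive.
IMPERATIVE_STARTS = {"please", "complete", "send", "review"}


def implicit_speech_act(sentence: str) -> str:
    """Map sentence type to a speech act: ask, request or state."""
    text = sentence.strip()
    if text.endswith("?"):
        return "ask"                       # interrogative
    words = text.split()
    first_word = words[0].strip(",").lower() if words else ""
    if first_word in IMPERATIVE_STARTS:
        return "request"                   # imperative
    return "state"                         # declarative by default


for example in (
    "You completed the report.",
    "Please, complete the report.",
    "Did you complete the report?",
):
    print(example, "->", implicit_speech_act(example))
```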

Page 22

Speech Acts

An indirect speech act is a speech act that is performed indirectly by way of performing another.
◦ Can you pass the salt? (ask)
in the appropriate context means
◦ Please, pass the salt. (request)

Textual structure can also indicate illocutionary force.
◦ Example: a section heading RECOMMENDATIONS can indicate that the sentences in a section have the illocutionary force recommend.
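The section heading example can be sketched as carrying a force from a heading down to the sentences beneath it. This Python illustration makes several assumptions: an all-caps line is treated as a heading, the function name is hypothetical, and the only heading-to-force mapping is the RECOMMENDATIONS example given above.

```python
from typing import Dict, List, Optional, Tuple

# Only the mapping named in the slide; other headings carry no force here.
HEADING_FORCE: Dict[str, str] = {"RECOMMENDATIONS": "recommend"}


def annotate_by_structure(lines: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each sentence with the force implied by its section heading."""
    current_force: Optional[str] = None
    annotated = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.isupper():                 # assumed: all-caps line is a heading
            current_force = HEADING_FORCE.get(text.rstrip(":"))
            continue
        annotated.append((text, current_force))
    return annotated


memo_body = [
    "BACKGROUND",
    "The report is late.",
    "RECOMMENDATIONS",
    "Extend the deadline.",
    "Notify the agency.",
]
for sentence, force in annotate_by_structure(memo_body):
    print(force, "|", sentence)
```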

Page 23

Speech Acts in Presidential Records

assert, deny, state, declare(1), tell(1), report, advise(1), remind, inform, certify(1), agree(1), acknowledge, praise(1), commit, pledge, direct, request, ask(1), ask(2), urge, encourage, invite, order(1), prohibit, suggest(2), propose, recommend, declare(2), resign, confirm, nominate, appoint, authorize, pray, terminate, veto, approve(1), disapprove, revoke, mourn, congratulate, thank, apologize, and welcome(2).

concur, salute, amend, counsel, welcome(1), tender(2), call on, block, retire, proclaim, delegate, designate, determine, find, reject(2), endorse, appreciate, regret, trust(1), believe, want, desire, and intend.

Page 24

Uses of Extracted Metadata in Automatic Description

Signature Memorandum from Boyden Gray to the President recommending the nomination of Ronald B. Leighton to be a US District Judge.

Letter from President Bush to President Mikhail Gorbachev suggesting an informal meeting.

Memorandum from President Bush to Boyden Gray requesting an analysis of the War Powers Resolution.

Letter from Susan Black to President Bush expressing appreciation for nomination and commitment to serve.

Referral Memorandum from Sally Kelley to FEMA requesting appropriate action to a letter from Beryl Anthony to the President.
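Descriptions of this shape can be produced from the extracted metadata with a simple template. The Python sketch below is a hedged illustration rather than the generator used in the project: the function name, its parameters and the verb-to-gerund table are hypothetical, and only three speech acts are covered.

```python
from typing import Optional

# Hypothetical verb-to-gerund table covering a few of the listed speech acts.
GERUNDS = {
    "recommend": "recommending",
    "request": "requesting",
    "suggest": "suggesting",
}


def describe(
    doc_type: str,
    author: str,
    addressee: str,
    act: Optional[str] = None,
    content: Optional[str] = None,
    topic: Optional[str] = None,
) -> str:
    """Build a one-sentence item description from extracted metadata."""
    sentence = f"{doc_type} from {author} to {addressee}"
    if act in GERUNDS and content:
        sentence += f" {GERUNDS[act]} {content}"   # speech act and its content
    elif topic:
        sentence += f" regarding {topic}"          # fall back to the topic
    return sentence + "."


print(describe(
    "Memorandum", "President Bush", "Boyden Gray",
    act="request", content="an analysis of the War Powers Resolution",
))
```

When no speech act has been recognized, the function falls back to a "regarding <topic>" phrase, which is the form a description takes when only document type, names and topic are available.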

Page 25

Next Steps

Inducing grammars for documentary form from samples
Create rules for annotating implicit speech acts and speech acts indicated by textual structure
Evaluate performance of speech act recognition method
Recognition of the topics of sentences
Discourse analysis to identify primary topic(s) of records
Generate item, folder and series descriptions and evaluate the method

Page 26

Additional Information

Website: perpos.gtri.gatech.edu

W. Underwood and S. Isbell. Semantic Annotation of Presidential E-Records. Technical Report ITTL/CSITD 08-01, May 2008.

W. Underwood and S. Laib. Automatic Recognition of Documentary Forms. Technical Report ITTL/CSITD 08-02, May 2008.

W. Underwood. Recognizing Communication Acts in Presidential E-Records. Technical Report ITTL/CSITD 08-03, October 2008.