4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007 From Synergy to Knowledge: Integrating multiple language resources Part I: Language Resources and Tools Chu-Ren Huang Academia Sinica http:// cwn.ling.sinica.edu.tw/huang/huang.htm
4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007
From Synergy to Knowledge: Integrating multiple language resources
Publication of the first Chinese dictionary compiled directly from a corpus (Huang et al.’s Mandarin Daily Classifier Dictionary and Noun-Classifier Collocation Dictionary)
p. 14C.R. Huang, 4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007
What Did CLP Development Show? Resources Lead
When tools and standards complete a comprehensive infrastructure,
research will bloom
Resources Development
Towards a Sharable and Sustainable Model of Resources Development
OLAC
Open Language Archives Community
http://www.language-archives.org
OLAC Aims
OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by:
developing consensus on best current practice for the digital archiving of language resources;
developing a network of interoperating repositories and services for housing and accessing such resources.
OLAC Organization
Coordinators: Steven Bird & Gary Simons
Council: Anthony Aristar (Linguist List), Christopher Cieri (LDC), Gary Holton (Alaska Native Language Center), Chu-Ren Huang (Academia Sinica), Heidi Johnson (Archive of the Indigenous Languages of Latin America), Laurent Romary (Atilf, University of Nancy), Joan Spanne (SIL), Martin Wynne (Oxford Text Archive)
Foundation 2: OAI Service Providers and Data Providers
Foundation 3: OLAC & OAI
Recall: OAI data providers must support:
Dublin Core Metadata
OAI Metadata harvesting protocol
BUT: OAI data providers can support:
a more specialized metadata format
a more specialized harvesting protocol
What OLAC does:
specialized metadata for language resources
specialized harvesting (extra validation)
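As a hedged illustration of the harvesting layer described above, the sketch below builds OAI-PMH `ListRecords` request URLs, once for the generic Dublin Core format and once for a specialized format. The base URL is a hypothetical placeholder, and the `olac` metadata prefix is an assumption about how a given repository names its specialized format; neither comes from the slides.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "http://archive.example.org/oai"


def list_records_url(base_url: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH ListRecords request URL for one metadata format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Dublin Core: the format every OAI data provider must support.
dc_url = list_records_url(BASE_URL, "oai_dc")
# Specialized format (prefix name assumed) for language-resource metadata.
olac_url = list_records_url(BASE_URL, "olac")

print(dc_url)
print(olac_url)
```

The same request shape serves both formats; only the `metadataPrefix` value changes, which is what lets a specialized community layer sit on top of the generic protocol.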
OLAC Standards
Aside:
standards = the protocols and interfaces that allow the community to function
recommendations = "standards" for representing linguistic content
OLAC has three primary standards:
OLACMS: the OLAC Metadata Set (Qualified DC)
OLAC MHP: refinements to the OAI protocol
OLAC Process: a procedure for identifying Best Common Practice Recommendations
The OLAC Metadata Set
The three categories of metadata:
Work language: describes information entities and their intellectual attributes
e.g. names of works and their creators
Document language: describes and provides access to the physical manifestation of information
e.g. format, publisher, date, rights
Subject language: describes what a document is about
e.g. subject, description
OLACMS and Controlled Vocabularies
Language:
A language of the intellectual content of the resource (OLAC-Language)
Subject.language:
A language which the content of the resource describes or discusses (OLAC-Language)
OLAC-Language:
A vocabulary for identifying the language(s) that the data is in, or that a piece of linguistic description is about, or that a particular tool can process
Summary: With the software in place, we have a complete platform
[Diagram labels: CREATE, CONVERT, FORMAT, EXPORT, DELIVER; CONTENT, METADATA; OAI, OLAC; OLAC PROC, OLAC MHP, OAI, MS, DC; Software, Recommendations, Initiatives, Standards]
Summary: Repositories completely bridge the gap, letting us consistently organize and archive our resources
[Diagram labels: CREATE, CONVERT, FORMAT, EXPORT, DELIVER; CONTENT, METADATA; OLAC REPOSITORIES; OAI, OLAC; OLAC PROC, OLAC MHP, OAI, MS, DC; Software, Recommendations, Initiatives, Standards]
[Diagram labels: CREATE, CONVERT, FORMAT, EXPORT, DELIVER; CONTENT, METADATA; OLAC REPOSITORIES, OLAC SERVICES, USER SERVICES; OAI, OLAC; OLAC PROC, OLAC MHP, OAI, MS, DC; Software, Recommendations, Initiatives, Standards]
Acknowledgements: ISLE and TalkBank projects (NSF), participants of the Philadelphia workshop, Eva Banik (programmer), Hernando de Soto (the analogy)
OLACMS Helps Archive Versatility
Given a shared metadata standard:
New language archives can be created on the fly by harvesting existing archives
Rich information can be inferred by establishing temporal and geographic anchors for each document.
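A minimal sketch of the first point, assuming harvested records are already available as simple dictionaries: a new "virtual" archive is then just a filtered view over records pooled from existing archives. The record fields, archive names, and sample titles are hypothetical; the codes ALV and AIS are the Ethnologue codes quoted later in the talk, and ENG is an illustrative placeholder.

```python
from typing import Dict, List, Set

# Hypothetical records pooled from two existing archives.
harvested: List[Dict[str, str]] = [
    {"archive": "A", "title": "Amis word list", "subject_language": "ALV"},
    {"archive": "B", "title": "Nataoran field notes", "subject_language": "AIS"},
    {"archive": "B", "title": "English grammar", "subject_language": "ENG"},
]


def virtual_archive(records: List[Dict[str, str]], codes: Set[str]) -> List[Dict[str, str]]:
    """Select the records whose subject language is one of the given codes."""
    return [record for record in records if record["subject_language"] in codes]


# A new archive assembled on the fly, spanning both source archives.
amis_archive = virtual_archive(harvested, {"ALV", "AIS"})
print([record["title"] for record in amis_archive])
```

The filter only works because every source archive describes its holdings with the same element and the same code vocabulary, which is the role the shared metadata standard plays here.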
OLAC Infrastructure
Helps to Solve Language Archive Problems such as
Language Identification
and
Metadata Set for Multi-lingual Language Archives
The Language Identification Problem
The DC code (e.g. ‘en’ for English) is not enough to describe all the languages in the world
Ethnologue (http://www.ethnologue.org) is comprehensive but not complete
Potential problems of using Ethnologue (or any existing language list)
A Fundamental Solution to Language Identification Problems
Registering language groups with an OLAC registration service
An OLAC language classification server would house a comprehensive list of language family names (defined by users) and their extensional definitions (i.e. sets of Ethnologue codes)
AS:Amis = {ALV, AIS}
ALV= Amis, AIS= Nataoran
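The extensional definition above can be sketched as a lookup table from a group name to a set of codes. The group `AS:Amis` and its two member codes come from the slide; the dictionary-based registry and the function names are assumptions made for illustration, not a description of any actual OLAC service.

```python
from typing import Dict, FrozenSet

# Extensional definitions: user-defined group name -> set of Ethnologue codes.
LANGUAGE_GROUPS: Dict[str, FrozenSet[str]] = {
    "AS:Amis": frozenset({"ALV", "AIS"}),
}

# Names of the individual codes, as given on the slide.
CODE_NAMES: Dict[str, str] = {"ALV": "Amis", "AIS": "Nataoran"}


def expand(group: str) -> FrozenSet[str]:
    """Return the set of language codes that a registered group stands for."""
    return LANGUAGE_GROUPS[group]


def belongs_to(code: str, group: str) -> bool:
    """Check whether a single language code falls under a registered group."""
    return code in expand(group)


print(sorted(expand("AS:Amis")))
print(belongs_to("AIS", "AS:Amis"))
```

Because a group is defined only by the set of codes it contains, different users can register different groupings over the same underlying code list without changing that list.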
Describing Multi-Lingual Resources in OLACMS
Directionality is crucial in multilingual resources
However, OLAC metadata is flat and unordered
Bi-directional MT
<Language code="X"/>
<Language code="Y"/>
<Subject.language code="X"/>
<Subject.language code="Y"/>
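The point about flat, unordered metadata can be illustrated with a small sketch. Modelling a record as a set of (element, code) pairs is an assumption made here for illustration; X and Y are the placeholder codes used on the slide.

```python
from typing import FrozenSet, Tuple

Element = Tuple[str, str]  # (element name, language code)


def flat_record(*elements: Element) -> FrozenSet[Element]:
    """Model a flat, unordered metadata record as a set of elements."""
    return frozenset(elements)


# The same four elements as on the slide, listed in two different orders.
x_first = flat_record(
    ("Language", "X"), ("Language", "Y"),
    ("Subject.language", "X"), ("Subject.language", "Y"),
)
y_first = flat_record(
    ("Language", "Y"), ("Language", "X"),
    ("Subject.language", "Y"), ("Subject.language", "X"),
)

# Element order carries no meaning, so it cannot encode "from X to Y".
print(x_first == y_first)  # True
```

Under this model, nothing in the record pairs a source language with a target language, so the direction of translation has to be stated some other way.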
Multi-lingual Resources II
Text: language
Bitext (bilingual aligned corpus)
There is always a directionality
Original: language
Translation: Subject.language
Language Description (Field Notes)
Elicitation, transcription, translation, notes
Multiple related resources
Language Archives Project of Taiwan
Part of the National Digital Archives Project (NDAP)
Pilot Stage 2000-2001
First Phase: 2002-2006
Both Language Archives
And Linguistic Anchor
Language and Digital Archives
[Figure labels: Where, Historical Maps, Language Changes, Language Variations, Language, When, Digital Archives, How and What]
Digital Archives are Linguistically Anchored
• Archives are anchored with a Lexical KnowledgeBase (LKB)
- because the LKB, as a collection of lexical types instantiated in archives, uniquely defines each archive
- and each lexical item is the conceptual atom projecting knowledge from archive to archive
Multi-anchor Knowledge Linking
Geographical anchor based on GIS (geographic information system)
-Ecology (Fauna, Weather, Geology etc.)
-Socio-Anthropological classification
Linguistic anchor based on LKB
-etymology, language grouping, loan words,
Institute of Linguistics
Language Archives
Two branch projects:
1 Chinese Archives -- 5 sub-projects:
• Early-Mandarin Chinese Lexicon
• Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts
• Modern Chinese Corpus and Treebank
• New Age Corpus: Linguistic Representations and Archives of Multimedia Data
• Southern-Min Archive: A Database of Historical Change in Language Distribution
2 Formosan Language Archives.
Early-Mandarin Chinese Lexicon
GOAL:
1. Collect the corpus and the lexicon of the Early Mandarin Chinese period.
2. Provide a systematic knowledge thesaurus as well as a powerful instrument for the study of grammatical development.
Archives Description:
1. Digitization of texts (10,000,000 characters).
2. Tagging of grammatical markers (3,500,000 characters).
3. Construction of the lexical database.
http://www.sinica.edu.tw/Early_Mandarin
Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts
Archives Description:
• Digitization of the bronze inscriptions from the Shang to the Eastern Chou dynasties.
• Construction of a typological lexicon of bronze inscriptions and bamboo scripts.
• Accurate encoding and analysis for the bronze inscriptions and Chu scripts.
Achievement:
• Proof-read bronze inscriptions (12,113 pieces).
http://Inscription.sinica.edu.tw