Slide 1
An Infrastructure of Language Resources & Language Technologies: Why we need it? Priorities & Challenges
Nicoletta Calzolari
Istituto di Linguistica Computazionale - CNR - Pisa
Emerging technologies for Digital Libraries, Poland, November 2006
Slide 2
What have we (LT & LR) been assembling … for many years?
Lexicons & their Ontologies
TOP Concepts: Object, Artifact, Building
WordNets: synsets linked by semantic relations
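The idea of synsets linked by semantic relations can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Synset` class, the identifiers, the extra lemmas and the single `hypernym` relation are hypothetical and do not reproduce the data model of any actual wordnet. Only the chain Object, Artifact, Building comes from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    """A set of synonymous word forms plus typed links to other synsets."""
    ident: str
    lemmas: List[str]
    # relation name -> identifiers of the target synsets
    relations: Dict[str, List[str]] = field(default_factory=dict)


# Toy network (hypothetical identifiers) mirroring the slide's top concepts.
SYNSETS: Dict[str, Synset] = {
    "object": Synset("object", ["object", "physical object"]),
    "artifact": Synset("artifact", ["artifact", "artefact"], {"hypernym": ["object"]}),
    "building": Synset("building", ["building", "edifice"], {"hypernym": ["artifact"]}),
}


def hypernym_chain(ident: str) -> List[str]:
    """Follow the first 'hypernym' link upwards until a top concept is reached."""
    chain = [ident]
    while True:
        parents = SYNSETS[chain[-1]].relations.get("hypernym", [])
        if not parents:
            return chain
        chain.append(parents[0])


print(hypernym_chain("building"))  # ['building', 'artifact', 'object']
```

A real wordnet would carry many more relation types (for instance meronymy or antonymy) and allow several hypernyms per synset; the sketch follows only the first one to keep the traversal short.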
Slide 6
Terminological Wordnets: e.g. Jur-WordNet
Jur-WordNet: Extension for the juridical domain of ItalWordNet (With ITTIG-CNR - Istituto di Teoria e Tecniche dell’Informazione Giuridica)
Knowledge base for multilingual access to sources of legal information
Source of metadata for semantic markup of legal texts
To be used, together with the generic ItalWordNet, in applications of Information Extraction, Question Answering, Automatic Tagging, Knowledge Sharing, Norm Comparison, etc.
Slide 7
PAROLE-SIMPLE-CLIPS Lexicon: … harmonised model for 12 European languages
Slide 8
In the ’90s: there was a global vision of the field & its main components: Standards, Creation of LRs, Automatic acquisition, Distribution
Today: the wealth of data & basic technology is such that we should reflect again on the field as a whole & ask if these are still “the” important components, or how they have changed/must change
… Which new challenges for a mature infrastructure of LRs & LT?
Slide 12
Challenges & Priorities for LRs, with technological and/or organisational/political aspects
Basic LR coverage for all languages (BLARK/ELARK)
Specific (new) types of LRs: opinion, sentiment, emotion, subjectivity; “Example-based” context sensitive LRs; Lexicon & Corpus together, dynamically created; new ways to extract value from large linguistic repositories: Web exploited as a multilingual corpus
Tools to quickly develop LRs (acquisition, annotation, porting between domains/languages); Coordinate the development of LTs & LRs (also across languages)
Knowledge transfer across languages; Maintenance of LRs
Cooperation between communities of HLT & Semantic Web/Ontologists
'Open Source' concept for LRs & LT, Open & distributed architectures for LRs and LT, wiki-mode? Collaborative Infrastructures; Interoperability & Standards
GRID technology …
[Labels: Multilinguality; Unifying frameworks]
Slide 13
LT & “new” topics
Subjectivity, opinion, sentiment, emotion: orthogonal issue wrt objective content. Detection/separation of subjective from objective content, opinion mining, extraction of positive & negative perceptions, have obvious and big impact in many applications, e.g. business intelligence
Commonsense understanding, with major implications: allows commonsense reasoning/inference (plausible vs logical) for fail-soft applications; can be pursued in distributed and collaborative fashion by the community as a whole; relation of this with how an agent might put together SW services to accomplish high-level goals for the user
Temporal structure, for which de facto standards are emerging (TimeML)
Integration of text, speech and gesture
Strategies for handling miscommunication
Hybrid approaches, Interdisciplinary approaches
…
[Label: Multimodality]
Slide 14
In the Semantic Web vision ...
… need to tackle the twofold challenge of content availability & multilinguality
Slide 15
Issues in LR & LT research agenda converging with Semantic Web needs
From LT:
Meaning & content
Knowledge
Semantic markup: Concept-based Text
open access
Interoperability & standards
to add meaning to Web data & make it usable for processing, mining, add spatial & temporal metadata, …
Slide 16
Computational Lexicons: challenges from the Semantic Web
The Semantic Web Vision: turning the WWW into a machine understandable knowledge base
[Diagram labels: Semantic Web; Ontologies; Knowledge Markup; Intelligent Agents; Applications; Documents; Databases; Computational Lexicons; Linguistic Markup]
Slide 17
Ontologies and Computational Lexicons
[Diagram labels: Language/s; Concept Space; Ontology; Computational Lexicon; Semantics; Syntax; Morphology; Multilinguality; polysemy, context-sensitiveness, etc.]
Slide 18
term extraction from text
Slide 22
A new paradigm of R&D in LRs & LT
Open & distributed linguistic infrastructures for LRs & LT
adopting the paradigm of accumulation of knowledge so successful in more mature disciplines, based on sharing LRs & tools
ability to build on each other's achievements, results accessible to various systems, allowing controlled & effective cooperation of many groups on common tasks (see HGP HLP)
Emerging concept of collective intelligence
Emphasize interoperability among LRs, LT & knowledge bases
e.g. initiatives aimed at achieving international consensus on annotation guidelines: to merge annotation efforts, produce coherent, comprehensive linguistic annotations to be readily disseminated throughout the community
Slide 23
ISO & LIRICS: Meta-model & Data Categories
e.g. Proposal for an ISO standard for NLP lexica
Define a Lexical Markup Framework, a general & abstract meta-model & a set of structural nodes relevant for linguistic description
Define a flexible environment, enabling specific implementations of user-defined mark-up languages (called LML) on the basis of common DCs
Objectives: Design of the abstract lexical meta-model; Definition of the common set of related Data Categories
The field is mature
Builds also on EAGLES/ISLE
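As a rough illustration of how an abstract meta-model and a common set of Data Categories can work together, the sketch below decorates structural nodes with features that must come from a shared registry. It is a sketch under stated assumptions, not the ISO proposal itself: the class names (`Node`, `Sense`, `LexicalEntry`), the registry contents and the validation rule are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Hypothetical shared registry: data category name -> permitted values.
DATA_CATEGORIES: Dict[str, Set[str]] = {
    "partOfSpeech": {"noun", "verb", "adjective"},
    "grammaticalNumber": {"singular", "plural"},
}


def check_features(features: Dict[str, str]) -> None:
    """Reject any feature that is not drawn from the shared registry."""
    for name, value in features.items():
        if name not in DATA_CATEGORIES:
            raise ValueError(f"unknown data category: {name}")
        if value not in DATA_CATEGORIES[name]:
            raise ValueError(f"invalid value {value!r} for {name}")


@dataclass
class Node:
    """Generic structural node decorated with data-category features."""
    features: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Validation runs for every subclass instance as well.
        check_features(self.features)


@dataclass
class Sense(Node):
    gloss: str = ""


@dataclass
class LexicalEntry(Node):
    lemma: str = ""
    senses: List[Sense] = field(default_factory=list)


entry = LexicalEntry(
    features={"partOfSpeech": "noun"},
    lemma="building",
    senses=[Sense(gloss="a structure with walls and a roof")],
)
print(entry.lemma, entry.features["partOfSpeech"])
```

Keeping the structure (nodes) separate from the vocabulary (registry) is what would let two lexicons with different layouts still agree on what `partOfSpeech` means; a misspelt value such as `"nuon"` raises `ValueError` at construction time.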
Slide 24
MILE Lexical Model
Data Categories for Content Interoperability
e-Science: GRID technology for large-scale distributed collaborative processing of huge quantities of “facts & their relations” (development of large-scale annotated LRs, linking them across different sources, …)
problem of how to coordinate different information sources
new ways of extending large-scale LRs and knowledge bases relying on volunteer labour, wiki-mode?
[Label: interoperability]
Towards: Large online “open source” collaborative projects
Slide 28
Need for tools to make this vision operational & concrete
E.g. new prototype built in Pisa (http://xmlgroup.iit.cnr.it:98/MILE/lexflow/demo.xhtml): LeXFlow, a web-based collaborative environment for semi-automatic management of lexical resources
It is intended to fulfil the requirements posed by innovative types of LRs by supporting:
Dynamic language resources, integrating tools for automatic acquisition of information from corpora and cross-fertilization of lexicons
Content interoperability of resources, by supporting ISLE/ISO standards
Cooperative & collective creation and management of LRs, by providing a web-based environment for the collaboration and interaction of distributed agents and resources
Slide 29
Why an infrastructure of LRs?
Because what is special in Language data … is what is more difficult wrt hard sciences, i.e. “language” and its “ambiguity”
Already in the ENABLER Mission:
Availability of LRs is also a “sensitive” issue, touching the sphere of linguistic & cultural identity, but also with economic & political implications
Putting together technical, organisational, strategic, political issues of LRs
Slide 30
Why an infrastructure of LRs?
Many dimensions around the notion of language
Cultural issues: Language … and cultural identity; Language … and the Humanities
Economic, social issues: Applications, Services
Technical issues
Interdisciplinarity & Multidisciplinarity
Political issues: e.g. a commonly agreed list of minimal requirements for “national” LRs: BLARK
Multilingualism
Need of bodies for a broad research agenda & strategic actions for LT&LRs (W/S/MM)
Putting together technical, organisational, strategic, political issues of LRs
Slide 31
Humanities, Social Sciences, Digital Libraries, Cultural Heritage, …
Many application domains (eculture, egovernment, ehealth, …)
[Diagram labels: core; Multilinguality; Enabling infrastr…; for; on]
Focus on cooperation
Technologies exist, but the infrastructure that puts them together and sustains them is still missing