European Language Resource Association
A European Infrastructure for Language Resource Distribution and HL Technology Evaluation
Khalid CHOUKRI, ELRA/ELDA
55 Rue Brillat-Savarin, F-75013 Paris, France
Tel. +33 1 43 13 33 33 -- Fax. +33 1 43 13 33 30
Email: [email protected] -- Web: http://www.elda.fr/
Bergen 2002/10/24-25, Norwegian Language Bank
Outline
• ELRA Catalogue -- a quick overview; BLARK …
• Activities in Europe: European & national scenes, and the role of ELRA
• The ENABLER Initiative
• Conclusion
European Language Resource Association: An Improved Infrastructure for Data Sharing
Centralized not-for-profit organization for the collection, distribution, and validation of Language Resources and tools.
Operational agency: ELDA (Evaluation & Language Resources Distribution Agency)
The Association
• Membership drive:
– ELRA is open to European & non-European institutions
– Resources are available to members & non-members
– Pay per resource
• Some of the benefits of becoming a member:
– Substantial discounts on LR prices (over 70%)
– Legal and contractual assistance with respect to LR matters
– Access to validation and production manuals (quality assessment)
– Figures and facts about the market (results of ELRA surveys)
– Newsletter and other publications
Quick Overview: Basic Language Resources (Spoken & Written Resources)
What should be available for all languages:
· Articulatory databases (e.g. ACCOR)
· Basic speech data: some phonetic material and some phonetic sequences, by a small number of speakers, recorded in a quiet environment (EUROM 1 & BABEL)
· Pronunciation lexicon (BDLEX, PHONOLEX)
· Proper names pronunciation lexicon (ONOMASTICA)
· Newspaper read text (BREF, Siemens-100, Apasci)
· Basic telephone speech (SPEECHDAT)
· Telephone-based speaker verification (PolyVar)
· Text corpora for language models (MLCC, Le Monde …)
BLARK: Basic LAnguage Resource Kit
Columns: fre-fr, spa-es, nor-no, ger-de

Speech Resources
Broadcast speech: -- e--- e
Articulatory database: E E E
Microphone/desktop speech: E E e E
Read newspaper texts: E E
Telephone speech database: E E E E
Mobile-radio speech:
Pronunciation lexicon: E e E
Onomasticon: e e e E
Speaker identification speech corpus:

Text Corpora
Broadcast text corpus:
Conversation text corpus: e
Newswire text corpus: E E E
Monolingual corpus: E e e E
Multilingual and parallel corpus: E E e E
Treebank: e
Basic Speech resources -- (Europe)
Columns: UK I F SF BF G EI SP Cat Bq Pt Gr ThN Dan Sw Nor FinnChSpCSp Cz Hun Est RomLatv PLSlova
Articulatory database: A E E E E E E
Basic speech data: A A A E E E E E E E E E S S S S
Pronunciation lexicon: A A A A
Proper names pron. lex.: E E E A*** E E E E A A A A A A
Newspaper read text: A A A E E A
Basic telephone speech: A A A A U A U A A E A A A S U A A U U U U
Teleph. speaker verif.: A A
Text corpora for language models: A A A A A A

Legend:
A  Available through ELRA
S  Available through ELRA within the next quarter
E  Exist/identified but not (never!) available
U  Under completion / well-advanced project with distribution plans
"blank"  Probably not available / has not been identified
**  We exclude the lexicons that come with SpeechDat
***  Available through German telecoms
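The availability codes in the legend lend themselves to a simple tally. The following is a minimal Python sketch and not part of the original slides: the names `LEGEND` and `tally` are hypothetical, and the sample marks are copied from the "Basic telephone speech" row in reading order, without attempting to restore the column (language) alignment that this transcript has lost.

```python
from collections import Counter

# Availability codes as defined in the slide's legend.
LEGEND = {
    "A": "Available through ELRA",
    "S": "Available through ELRA within the next quarter",
    "E": "Exist/identified but not available",
    "U": "Under completion, with distribution plans",
}


def tally(marks: str) -> Counter:
    """Count how often each legend code occurs in a whitespace-separated row."""
    codes = marks.split()
    unknown = [c for c in codes if c not in LEGEND]
    if unknown:
        raise ValueError(f"codes not in legend: {unknown}")
    return Counter(codes)


# Marks of the "Basic telephone speech" row, in reading order
# (which language each mark belongs to is not recoverable here).
telephone_row = "A A A A U A U A A E A A A S U A A U U U U"
counts = tally(telephone_row)
for code, n in counts.most_common():
    print(f"{code} ({LEGEND[code]}): {n}")
```

A tally like this only says how many languages fall under each status; mapping marks back to individual languages would require the original slide layout.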
SpeechDat Family
SpeeCon Project
Dialectal zone   Language     Region                          Remarks
Esl_ES           Spanish      Spain                           (excluding Latin America)
Rus_RU 1)        Russian      Russia
Ita_IT           Italian      Italy
Sve_SE_FI        Swedish      Sweden and Finland
Deu_DE_AT        German       Germany and Austria             (excluding e.g. Belgium, Luxembourg, Switzerland)
Eng_GB           English      United Kingdom
Dan_DK           Danish       Denmark
Dut_BE           Dutch        Belgium
Fra_CA           French       Canada
Fra_FR           French       France                          (excluding e.g. Belgium, Luxembourg, Switzerland)
Fin_FI           Finnish      Finland
Zho_CN_HK        Mandarin     P. R. China (incl. Hongkong)    (excluding e.g. Taiwan)
Dut_NL           Dutch        The Netherlands
Jpn_JP           Japanese     Japan
Pol_PL           Polish       Poland
Por_PT           Portuguese   Portugal                        (excluding Brazil)
Deu_CH           German       Switzerland
Eng_US           English      USA                             (excluding e.g. Canada)
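The dialectal zone identifiers above appear to follow a regular pattern: a three-letter language code followed by one or more two-letter region codes, joined by underscores. As a sketch under that assumption (the function name `parse_zone` is hypothetical and not part of SpeeCon), the identifiers can be split like this:

```python
def parse_zone(zone: str) -> tuple[str, list[str]]:
    """Split an identifier such as 'Deu_DE_AT' into ('deu', ['DE', 'AT'])."""
    language, *regions = zone.strip().split("_")
    # Assumed shape: 3-letter language code, then 1+ two-letter region codes.
    if len(language) != 3 or not regions or any(len(r) != 2 for r in regions):
        raise ValueError(f"unexpected zone identifier: {zone!r}")
    return language.lower(), [r.upper() for r in regions]


# Identifiers taken from the table; footnote markers such as "1)" are left out.
for zone in ["Esl_ES", "Sve_SE_FI", "Deu_DE_AT", "Zho_CN_HK"]:
    print(zone, "->", parse_zone(zone))
```

Keeping the regions as a list makes multi-country zones such as Sve_SE_FI (Sweden and Finland) explicit instead of treating the identifier as an opaque string.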
SpeechDat Family: SALA-II, what you may get with PRIVATE funding
SALA II cellular/mobile network (1000 speakers)
Partner      Latin America   US and Canada
ATLAS        Venezuela
Loquendo     Chile           US English South
Lucent       Argentina
Microsoft    Peru            US English North
NSC          Mexico          US English Midland
Philips      Brazil          US Spanish West
Siemens      Colombia        US English West
Telisma      Costa Rica      Canadian French
Brief overview of recent activities at national level
Top-down vs bottom-up approaches
Examples of National Projects/programs
Over nine national projects, among which:
• Netherlands & Belgium: continuing, now at Release 5; data available via ELRA (release of April 2002)
• France: Action Techno-Langue
• Italy: Infrastruttura nazionale per le risorse linguistiche nel settore del trattamento automatico della lingua naturale parlata e scritta (national infrastructure for language resources in the field of automatic processing of spoken and written natural language)
• Norway: Norwegian Language Bank
Dutch & Flemish
Release 1 (March 2000)
· 62 hours of speech samples, orthographically transcribed (615,000 words), 90,000 words enriched with Part-of-Speech tags;
· annotation CD with the first version of PRAAT (annotation tool) and the first version of the documentation (in Dutch), including relevant information on the speakers (e.g. gender, age, socio-economic class; in anonymous form) and on the samples (e.g. recording conditions, equipment).
Release 2 (October 2000)
· over 150 hours of speech samples, orthographically transcribed (over 1,500,000 words), approximately 750,000 words enriched with Part-of-Speech tags;
· annotation CD with annotation protocols and relevant information on the speakers (e.g. gender, age, socio-economic class; in anonymous form) and on the samples (e.g. recording conditions, equipment).
Release 3 (April 2001)
· more orthographically transcribed data enriched with Part-of-Speech tags;
· the first broad phonetic transcriptions, word alignments, syntactic annotations, and lexicon link-up;
· annotation CD with documentation, including relevant information on the speakers (e.g. gender, age, socio-economic class) and on the samples (e.g. recording conditions, equipment); this release includes the first version of Corex, the exploitation tool.
National Projects/programs: Example of Italy
With contribution from N. Calzolari and A. Zampolli
2 national projects under 2 different "Programs".
The Programs were not specific to HLT, but general: one for industrial R&D and the other for the South of Italy.
Both projects are coordinated by A. Zampolli in Pisa.
Goal: to extend core resources built in EU projects and to create new LR, the tools needed to manage the resources, a platform for NLP development, and technology transfer towards SMEs.
TAL - Infrastruttura nazionale per le risorse linguistiche nel settore del trattamento automatico della lingua naturale parlata e scritta
(with 13 partners from private organisations).
Duration: 2 years, finished in 2002.
Partners:
CPR - Consorzio Pisa Ricerche; ITC - Istituto Trentino di Cultura; CSELT - Centro Studi e Laboratori Telecomunicazioni; SYNTHEMA; CVR - Consorzio Venezia Ricerche; CERTIA - Centro per la Ricerca, Sviluppo, Formazione nelle Tecnologie e Applicazioni Informatiche; QUINARY; ALCEO; COMPUTER SHARING; DELCO; GST - Gruppo Soluzioni Tecnologiche; INTERACTIVE MEDIA; NECSY - Network Control Systems
LCRMM - Linguistica computazionale: ricerche monolingui e multilingui (computational linguistics: monolingual and multilingual research)
(cluster "Linguistica", legge 488, with 16 partners from private and public organisations).
• Duration: 3 years; will finish in 2003.
Partners include: Istituto Universitario Orientale, Napoli; Dipartimento di Scienze Storiche del Mondo Antico, Università di Pisa; Sportello per la Cooperazione Scientifica e Tecnologica con i Paesi del Mediterraneo (SMED) del CNR, Napoli.
Italy: Infrastruttura nazionale per le risorse linguistiche nel settore del trattamento automatico della lingua naturale parlata e scritta
• ItalWordNet (~50,000 entries).
• Corpus di italiano parlato (corpus of spoken Italian): 100 hours of speech consisting of:
a) 10h radio-TV broadcast data (news bulletins, interviews, talk shows),
b) 60h Map Task-like collection,
c) 5h lab data for lexical coverage,
d) 10h telephone conversational speech,
e) 10h domain-specific speech (finance, tourist information, etc.)
• Annotated dialogues for speech interfaces, H-H and H-M interactions (Dialoghi annotati per applicazioni di interfacce vocali avanzate): 450 dialogues annotated at all levels (morphological … prosody … semantics …)
National Projects/programs: Example of Italy
Goal: to extend core resources built in EU projects and to create new LR, the tools needed to manage the resources, a platform for NLP development, and technology transfer towards SMEs.
The total cost was about 7 million euro, of which almost 5 million euro was funded.
The costs were equally divided between the Spoken & Written areas.
In both projects the consortia agreed to distribute the LR through ELRA (with a special price for Italian users).
Now, after the TIPI conference in Rome, under the sponsorship of the Ministry of Communications, the topic of HLT has been included in the Framework Programme for the financing of R&D in Italy.
It was also decided to set up a Forum for HLT, of which Zampolli is president. The Forum will start working soon, among other things to prepare new national initiatives, to maintain LR, to write a white paper on HLT in Italy, and to coordinate with national activities in other EU countries.
With contribution from J. Mariani
Ministère de la Culture et de la Communication
Ministère de la Jeunesse, de l’Education Nationale et de la Recherche
Ministère de l’Economie, des Finances et de l’Industrie
Language Technologies
« TechnoLangue » Action
• Report to Prime Minister (November 2000)
• Meeting Min. Industry, Research, Culture: June 2001
• Action: Technology survey and evaluation
• Basic Technological Research
• Coordinate with present actions
[Figure: relations between Basic Research, Technology Development, and Application Development. Labels: bottleneck identification; research results in quantitative evaluation; technologies necessitated for applications; technologies which have been validated for applications; long term / high risk, large return of investment; evolutionary; usability; acceptability; meeting points with technology development; Quantitative Evaluation; Usage Evaluation.]
« TechnoLangue » action
• Organization
– Executive Committee (EC) chaired by C. Fluhr (CEA)
– Comprising 15 members:
  • 3 RRIT representatives: B. Bachimont (INA - RIAM), C. Sedogbo (Thalès - RNTL), C. Waast (IBM - RNRT)
  • 3 Public research: C. Fluhr (CEA), E. Geoffrois (DGA), P. Paroubek (Limsi-CNRS)
  • 5 Industrials: K. Choukri (ELDA), B. Normier (Lingway), J.-J. Rigoni (Elan Informatique), F. Segond (Xerox) + C. Sorin (FT R&D)
  • 4 Administrations: S. Chaudiron (MR), J. Mariani (MR), D. Malbert (MCC), J. Mathieu (MinEFI)
– Good balance between research & industry, and between written & spoken
« TechnoLangue » action
• Install a User Committee
– Ministry of Foreign Affairs
  • Automatic translation, multilingualism…
– Ministry of Public Administration
  • Simplification of the administrative language...
– Ministry of National Education
  • Training technologies, language training...
– …
« TechnoLangue » Call
• International cooperation
– Cooperation mechanisms within TechnoLangue
  • foreign entities may participate in the projects
  • financing from their own funds
– Future cooperation among similar national programs
  • EU Countries (Italy, Germany, Norway, Spain, Greece, The Netherlands, Switzerland…)
  • Prepare the construction of the European Research Area (ERA)
    – The EC supports the cost of coordination and generic technologies
    – Each country supports the cost of covering its own language(s): specific technology development/adaptation, (annotated) corpora (spoken/written), lexicons (incl. pronunciation), dictionaries...
  • USA, Japan, South Africa…
« TechnoLangue » Call
• 4 meetings of the Executive Committee
• A Call for Proposals with 4 parts
– Part 1: Language resources
– Part 2: Evaluation
– Part 3: Norms & standards
– Part 4: Technological survey
• Calendar:
– Launched April 15, 2002
– Deadline: May 31 / June 10 (Electronic) - June 17 (Paper)
– Results: July 19, 2002
« TechnoLangue » Call
– Language resources
  • Spoken/written data (corpus, dictionaries, terminological data…)
  • Basic Language Processing Tools (Open Source)
  • Production, validation, distribution (incl. legal, economical aspects)
  • For a large use by a large community (education, training…)
– Evaluation
  • Technology (evaluation campaign)
  • Applications (evaluation toolkits)
  • Methodology (metrics / protocols)
– Norms & standards
  • Shared effort to improve French participation
– Technological survey
  • In relationship with on-going actions (Euromap...)
Part 1: Language Resources
• Stimulate the production and the distribution of language resources for:
– answering minimal needs (Basic LAnguage Resource Kit) for the French language;
– promoting resource reusability;
– supporting research;
– helping industrial applications development;
– decreasing the cost of entering the sector for newcomers
• Should include the French language, possibly in connection with other languages
Part 1: Language Resources
• Spoken and written data:
  • oral corpus, pronunciation lexicons, etc.
  • databases for speech synthesis;
  • monolingual and multilingual text corpus (parallel, comparable...);
Part 1: Language Resources
• Encourage and facilitate the use of those resources
– Putting them in new (young) user hands
– Same approach as for GUIs: “VUIs”
– Language Technology Kits with “User’s guide”
  • Distribution towards specialized education entities (NLP, Document Engineering…) and more largely towards training centers (Universities, Technical Universities, Engineering schools...)
  • While ensuring feedback from experience
– Open Source software economic model
Part 2: Evaluation
• 3 areas:
– Technology evaluation
– Application evaluation
– Evaluation methodologies
Part 2: Evaluation
• Technology evaluation
– Organization of comparative evaluation campaigns for technologies presently not covered by European or international programs, or with a complementary approach
– Includes the production of the data necessary for the evaluation, in a monolingual, multilingual or crosslingual context
– The scientific and industrial interest of the evaluation should be apparent (large enough number of participants)
– The projects must define the evaluation methodology and justify the practical organization aspects
Part 2: Evaluation
• Application evaluation
– The objective is to develop evaluation methodologies for industrial or pre-industrial products
– The methodologies may result in “toolboxes”, also bringing together user-oriented methodologies and protocols, or in test software packages
– The methodologies should be generic (class of applications)
– The proposals should demonstrate the economic and industrial interest of the project, and the means of distributing the “toolboxes”
Part 2: Evaluation
• Evaluation methodologies
– Improve the present evaluation methodologies
– Identify new (quantitative and qualitative) approaches for already evaluated technologies:
  • socio-technical and psycho-cognitive aspects
  • cognitive modeling of evaluation
– Identify protocols for new technologies and applications
  • Virtual Reality, Multimodal interaction, Language on the Internet...
Part 3: Standards
• Support the participation of French actors in normalization and standardization bodies
– Presently weak participation of French actors in normalization and standardization bodies
– Of strategic importance
– Variety of places where the normalization activities are taking place: official or non-official committees, forums, projects,...
Part 3: Standards
• Actions:
– Support the creation of consortia to reinforce the French presence in various bodies (ISO, CEN, W3C,...)
– Help share efforts among French participants
– Identify a topic and ensure a permanent participation in all related bodies: character sets, exchange format, phonetic alphabet transcription, etc.
– Need to coordinate the project with the French bodies already involved: AFNOR, W3C French Chapter,...
Part 4: Survey
• Part 4 - Set up an information survey
– Create a portal on Language Engineering in order to give access to:
  • a panorama of the industrial and technological offer
  • the state of the art in science and technology
  • identification of language resources
  • identification of technological bottlenecks
  • a list of Calls for Proposals
  • a presentation of key market figures
  • information on norms and standards (with Internet links)
– Should be linked with existing sites (Euromap,...)
French Techno-Langue Conclusions
• Launch a large national program on Language Technology (TechnoLangue)
• With a view to setting up a permanent infrastructure for Language Resources, Evaluation, Standards and Survey
• Hope that it can contribute to the construction of the European Research Area
• And that it ties in well with international activities
National Projects/programs: Example of NORWAY
Norway: Norwegian Language Bank, language technology resources in Norway
Launch conference: 24-25 October 2002 (Bergen, Norway)
The language bank will contain three types of data: spoken data, text, and lexical resources.
It will be organized as a foundation with state ownership.
The estimated budget is about NOK 100 million (12 M€).
ENABLER -- A new Initiative
European National Activities for Basic Language Engineering & Resources
• Survey of existing national activities
• Fostering common research and compatibility of LR
• Suggestions for and contributions to international cooperation
• Identification of existing resources (Universal Catalogue)
• The Basics (e.g. standards, tools, evaluation procedures, …)
Extension foreseen / planned
Next meeting: Pisa, 1st December 2002
Information Dissemination
• Newsletter (bilingual English/French; issued each quarter)
• Catalogue
• Web Site (bilingual: English/French): http://www.elda.fr/
• ELRA Conference (LREC): International Language Resources & Evaluation