Dan Cristea
Alexandru Ioan Cuza University of Iasi
Romanian Academy – Institute of Computer Science
dcristea@info.uaic.ro



According to Ethnologue – Languages of the World (SIL):
◦ Spoken in: Romania (22 million), Moldova (2.7 million), Serbia and Montenegro (300,000), Ukraine (250,000), Israel (250,000), Hungary (100,000), USA, Canada, Spain, Italy, etc.
◦ Native speakers: 24 million, plus 4 million as a second language
◦ Romanian (Rumanian, Moldavian, Moldovan, Daco-Romanian)
◦ Linguistic lineage: Indo-European > Italic > Romance > Eastern
◦ Dialects: Istro-Romanian (Croatia), Macedo-Romanian (Greece), Megleno-Romanian (Greece)
◦ Lexical similarity: 77% with Italian, 75% with French, 74% with Sardinian, 73% with Catalan, 72% with Portuguese and Rhaeto-Romance, 71% with Spanish
◦ Other influences: Slavic, Hungarian, Turkish, etc.

LT Days, Luxembourg, 14-15 Jan, 2009

Since 1900: linguistics & lexicography research (in the Academy and the universities)

1960: early trials of Machine Translation; after that – no financing for more than 45 years

1980s: first NLP models and systems
◦ semantic networks, dialogue systems (IURES, QUERNAL), paradigmatic morphology and morphological analysers, unification-based formalisms, generation, grammars and parsers, etc.

Good computer science and computer engineering schools (in Bucharest, Iasi, Cluj-Napoca, Timisoara)


Master level:
◦ Iasi (UAIC-FII, since 2001), University of Bucharest
PhD level:
◦ Bucharest (RACAI), Iasi (UAIC-FII)
◦ 6 PhD theses will be defended this year
Summer schools, international and national conferences:
◦ EUROLAN (since 1993) – second in significance in Europe (after ESSLLI)
◦ SPED (since 2001) – Speech Technology and Human-Computer Dialogue conferences
◦ ConsILR (since 2002) – the national conference of the Consortium for Informatisation of the Romanian Language
Alumni:
◦ >30 PhDs and PhD students doing LT all over the world


Bucharest
◦ Romanian Academy, RACAI (acad. Dan Tufis): 10 researchers (3 PhDs); Romanian resources, language-independent tools, human-computer interfaces, statistical models of Romanian, NLP Web services
◦ Romanian Academy, Institute of Linguistics (acad. Marius Sala): lexicography, corpora of old Romanian texts
◦ University of Bucharest: formal models, resources
◦ Technical University of Bucharest & Military Academy: speech processing (prof. Corneliu Burileanu, prof. Olteanu)

Iasi
◦ Alexandru Ioan Cuza University – Dept. of Computer Science (UAIC-FII, my group): 8 PhDs (2 in co-tutelle with prof. E. Munteanu, Dept. of Letters), 4 researchers, >20 masters in CL, undergraduate projects; resources, language-independent tools in written LT, NLP Web services, computational lexicography, multimodal interfaces, NL user interfaces
◦ Romanian Academy, Institute of Computer Science (acad. Horia-Neculai Teodorescu): 4 PhDs, 8 researchers; speech processing and resource building, tools and annotated resources in written language processing
◦ Romanian Academy, Institute of Philology: lexicography, old manuscripts (including in old Cyrillic)


Word Alignment (Ro-En):
◦ RACAI 2003, 2005: ranked first
Question Answering (CLEF - Ro, En):
◦ RACAI 2006: Ro-En 7/13, 2007: Ro-Ro 1/2
◦ UAIC 2008: Ro-Ro 1/2
Answer Validation Exercise (CLEF - En):
◦ UAIC 2007: 1/7, 2008: 1/7
Anaphora Resolution Exercise (En):
◦ UAIC 2007: ranked first
Textual Entailment (En):
◦ UAIC 2007: 2-way task – 3/26, 3-way task – 4/10
◦ UAIC 2008: 2-way task – 2/26, 3-way task – 1/13


Morphological and POS tagger (En/Ro)
Lemmatizer (En/Ro)
Dependency Linker (En/Ro)
Sentence splitting (En/Ro)
Spell checker (Ro)
Word aligner (En-Ro)
Anaphora resolver (En/Ro)
Discourse parser (En/Ro)
Summarisation (En/Ro)
Q&A (En/Ro)
SMT (En-Ro-En, En-Gr-En, En-Sl-En)
Definitions extractor (En/Ro)
Information Retrieval (Ro Wikipedia)


Ro WordNet aligned with Princeton En WN (ILI)
◦ the second largest in the world (55,000 synsets)
Mono and multilingual corpora
◦ various RO classical novels (about 3,000,000 words)
◦ richest annotation: Orwell’s “1984” (110,000 words)
◦ tagged, lemmatized, chunked, word-aligned (XCES):
  Semcor (En, Ro): 1,000,000 words
  Ev.Zilei (En, Ro): 1,000,000 words
  Acquis Communautaire (22 languages), Ro: 30,832,212 words
  Wikipedia-Ro (fragment): 3,405,324 words
◦ dictionaries: Dictionary of Modern Romanian – DEX, Thesaurus Dictionary of Romanian Language (eDTLR)
Language models, grammars, NE lists, complete inflexional lists, AR models, sentence splitting models, discourse cue words, etc.


European past:
◦ ELSNET (ESPRIT), ELSNET-Goes-EAST (Copernicus), TELRI (Copernicus), FF-POIROT (FP5), Balkanet (FP5), RolTech (INTAS), LT4eL (FP6)… (more than 30 projects, see lists at www.racai.ro, www.info.uaic.ro/~dcristea)
European active:
◦ CLARIN: design & build the European LT infrastructure for HSS (representation in SB and EB, 2 partners and 5 member institutions)
◦ FlareNet: Nicoletta’s speech
◦ ALEAR: models of language evolution in humanoid agents (robots): unification optimisation and discourse modelling


Language Technology and preservation of national heritage – national priorities in the Ro research plan

Massive financing over the last 2 years (compared to previous years)…


◦ Under the Ministry of Culture and Arts (dir. Dan Matei)

◦ Digitisation of the Ro literature


@ RACAI
A follow-up of a successful SEE-ERA.net project (Ro, Bg, Gr, Sl, Sr)
Encouraging pilot experiments for Ro-En-Ro, Gr-En-Gr, Sl-En-Sl


Language pair          Google translation     RACAI translation
                       NIST score  BLEU score  NIST score  BLEU score
English to Greek       3.5705      0.2934      3.9730      0.3533
English to Slovene     3.5340      0.2653      3.6719      0.2450
English to Romanian    4.4057      0.4508      4.9348      0.5464
Greek to English       3.5427      0.2868      3.7733      0.2981
Slovene to English     4.0424      0.2215      4.0589      0.2293
Romanian to English    4.3573      0.2827      4.5426      0.4604
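The comparison in the table can be made explicit with a little arithmetic. The sketch below only subtracts the Google scores from the RACAI scores for each language pair; the figures are copied from the table, while the names `SCORES`, `score_deltas` and `google_leads_bleu` are illustrative and not part of the original presentation.

```python
# Scores copied from the table: (Google NIST, Google BLEU, RACAI NIST, RACAI BLEU).
SCORES = {
    "English to Greek": (3.5705, 0.2934, 3.9730, 0.3533),
    "English to Slovene": (3.5340, 0.2653, 3.6719, 0.2450),
    "English to Romanian": (4.4057, 0.4508, 4.9348, 0.5464),
    "Greek to English": (3.5427, 0.2868, 3.7733, 0.2981),
    "Slovene to English": (4.0424, 0.2215, 4.0589, 0.2293),
    "Romanian to English": (4.3573, 0.2827, 4.5426, 0.4604),
}


def score_deltas(scores):
    """Return {pair: (NIST delta, BLEU delta)}, computed as RACAI minus Google."""
    return {
        pair: (round(r_nist - g_nist, 4), round(r_bleu - g_bleu, 4))
        for pair, (g_nist, g_bleu, r_nist, r_bleu) in scores.items()
    }


deltas = score_deltas(SCORES)

# Language pairs where the Google BLEU score is higher than the RACAI one.
google_leads_bleu = [pair for pair, (_, d_bleu) in deltas.items() if d_bleu < 0]

for pair, (d_nist, d_bleu) in deltas.items():
    print(f"{pair:<22} NIST {d_nist:+.4f}  BLEU {d_bleu:+.4f}")
```

On these figures the RACAI system has the higher NIST score for every pair, and the higher BLEU score for every pair except English to Slovene.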

ALPE: a model for anchoring the specifications of NLP applications on XML annotation schemas (standards)
build a pipeline/parallel architecture without any need to program
just input your own file and indicate the form of the output
use the federation of tools as bricks for new applications
cooking analogy: the more ingredients you have, the longer the list of possible recipes you may go for
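One way to picture "input your own file and indicate the form of the output" is as a path search: each tool turns one annotation level into another, and a pipeline is a chain of tools leading from the input level to the requested output level. The following is a minimal sketch under that assumption, not the actual ALPE implementation. The tool names echo the tools listed earlier, but the annotation-level names and the registry `TOOLS` are hypothetical.

```python
from collections import deque

# Hypothetical registry: tool name -> (annotation level required, annotation level produced).
TOOLS = {
    "sentence_splitter": ("raw", "sentences"),
    "pos_tagger": ("sentences", "tagged"),
    "lemmatizer": ("tagged", "lemmatized"),
    "dependency_linker": ("lemmatized", "dependencies"),
}


def find_pipeline(tools, source, target):
    """Breadth-first search for the shortest chain of tools from source to target.

    Returns a list of tool names, an empty list if source already equals target,
    or None when no chain of registered tools connects the two levels.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        level, chain = queue.popleft()
        if level == target:
            return chain
        for name, (needs, gives) in tools.items():
            if needs == level and gives not in seen:
                seen.add(gives)
                queue.append((gives, chain + [name]))
    return None


print(find_pipeline(TOOLS, "raw", "lemmatized"))
```

Registering one more tool adds one more edge to the graph, which is the "more ingredients, more recipes" point in miniature: every new tool can only enlarge the set of reachable output forms.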


◦ Explosion of formats: difficulty of standardisation
◦ Standards are like laws: they help to organise society, but they also reduce freedom
◦ Standards usually come late
◦ We are in a hurry to do things instantly
Invent heuristics able to guess the semantics of new formats
‘Compute’ wrappers to transform non-standard input into standard
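As a rough illustration of what such a wrapper might do once the semantics of a format has been guessed, the sketch below renames the elements and attributes of a non-standard XML annotation to the names a target schema expects. It assumes the correspondence is already known; the guessing heuristics themselves are not shown. The mappings `TAG_MAP` and `ATTR_MAP`, the function `wrap` and the sample markup are all hypothetical and do not come from any real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical correspondences between a non-standard format and a target schema.
TAG_MAP = {"sent": "s", "tok": "w"}
ATTR_MAP = {"pos": "ana", "lem": "lemma"}


def wrap(xml_text, tag_map, attr_map):
    """Rename elements and attributes according to the given mappings.

    Names absent from the mappings are left unchanged, as are text and structure.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = tag_map.get(elem.tag, elem.tag)
        renamed = {attr_map.get(key, key): value for key, value in elem.attrib.items()}
        elem.attrib.clear()
        elem.attrib.update(renamed)
    return ET.tostring(root, encoding="unicode")


sample = '<doc><sent><tok pos="N" lem="cat">cats</tok></sent></doc>'
converted = wrap(sample, TAG_MAP, ATTR_MAP)
print(converted)
```

In this picture, "computing" a wrapper amounts to producing the two mappings automatically instead of writing them by hand.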
