Readability Assessment and Text Simplification for Basque in the Ixa Group

Itziar Gonzalez-Dios
Supervisors: María Jesús Aranzabe and Arantza Díaz de Ilarraza

IXA NLP Group, University of the Basque Country (UPV/EHU)
ixa.eus/Ixa
@IxaGroup

Pisa, 2015
Outline

1 Ixa Group

2 Basque language

3 Readability Assessment

4 Text Simplification
    Lexical Simplification
    Syntactic Simplification

5 Current and near future work

Ixa Group

Research group at the University of the Basque Country (UPV/EHU)

Since 1988

64 members

10 subgroups

Computer Science Faculty of Donostia-San Sebastian

Our philosophy

Bottom-up design (progressive development)

Reuse of resources and tools

Open source: Ixa pipes http://ixa2.si.ehu.es/ixa-pipes/

Research lines

Creation of basic resources (linguistic resources and processors):

    Corpora, dictionaries, ontologies
    Computational lexicography, morphology, syntax, semantics, pragmatics and discourse

Operational aspects (integration of language tools):

    Corpus processing
    Parallel processing
    Corpus annotation

Language technology applications:

    Information extraction and question answering
    Machine translation
    Language teaching/learning

Research lines

Projects (now):

    European: 4
    National: 4
    Regional: 3

PhD theses:

    In progress: 19
    Done: 38

Languages:

    Mainly Basque
    English, Spanish
    Quechua

Basque language

Origin:

    Pre-Indo-European language
    Language isolate

Today, 5 dialects + standard (+ 2 almost lost, + another one documented)

Geographical domain:

Sociolinguistic info

800,000 native speakers; 1 million understand or speak some Basque

Official: Araba, Bizkaia and Gipuzkoa (the Autonomous Community of the Basque Country); the north of Navarre

Not official: Lapurdi, Behe-Nafarroa and Zuberoa (together with Béarn, Pyrénées-Atlantiques); Navarre

Typology

Agglutinative; Case system: ergative-absolutive; 18 case endings

Head final; Free word order at sentence level

6 vowels, 25 consonants

Example sentences

(1) Mutilak  sagarraø   jan      du.
    boy-erg  apple-abs  eat-prf  aux-3sgerg3sgabs.prs.ind

    ‘The boy has eaten an apple.’

(2) Sagardoaø  dastatzeko     prestø     dagoenean
    cider-abs  taste-ven.adn  ready-abs  stare-3sgabs.prs.comp.loc

    irekitzen  dira           sagardotegiakø,      normalean
    open-ipf   be-3plabs.prs  cider-house-pl.abs,  normal-loc

    urtarrilaren  20tik   Aste Santura.
    january-gen   20-abl  Easter-adl

    ‘Cider houses open when the cider is ready to taste, usually from the 20th of January to Easter.’

Readability Assessment

IAS
    Essay scoring system

ErreXail
    Simple vs. complex

Ion Madrazo’s work
    B1, B2, C1, C2

IAS

Idazlanen Autoebaloaziorako Sistema (IAS), a system for the auto-evaluation of essays (Castro-Castro et al., 2008)

    Number of clauses per sentence
    Types of sentences (questions, negations...)
    Clause types (temporal, causal...)
    PoS types
    Number of lemmas

ErreXail

Readability assessment system (Gonzalez-Dios et al., 2014)

    Measures 96 ratios based on linguistic information
    Uses machine learning techniques
    Built on two collected corpora of popular science texts (for adults and for children)

ErreXail: Linguistic features

Global features: sentence length, word length, sentence number (3 ratios)

Lexical features: PoS, lemmas, named entities... (39 ratios)

Morphological features: case markers, verb types, verb morphology... (24 ratios)

Morphosyntactic features: noun phrases, verb phrases, appositions (5 ratios)

Syntactic features: subordinate clauses (10 ratios)

Pragmatic features: types of connectors and conjunctions (12 ratios)
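
As an illustration of what a ratio-based feature looks like, the sketch below computes three ratios of the kinds listed above from a tokenised, lemmatised and PoS-tagged text. It is not the ErreXail implementation: the input format, the tag names (`NOUN`, `PROPN`, ...), the feature names and the toy analysis are all assumptions made for the example.

```python
from collections import Counter


def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def ratio_features(sentences):
    """Compute a few ratio features from analysed text.

    `sentences` is a list of sentences; each sentence is a list of
    (word, lemma, pos) tuples. Tag and feature names are hypothetical.
    """
    tokens = [token for sentence in sentences for token in sentence]
    pos_counts = Counter(pos for _, _, pos in tokens)
    lemmas = [lemma for _, lemma, _ in tokens]
    return {
        "words_per_sentence": ratio(len(tokens), len(sentences)),
        "proper_per_common_nouns": ratio(pos_counts["PROPN"], pos_counts["NOUN"]),
        "unique_per_all_lemmas": ratio(len(set(lemmas)), len(lemmas)),
    }


# Toy analysis of two short sentences (hand-made, for illustration only).
example = [
    [("Mutilak", "mutil", "NOUN"), ("sagarra", "sagar", "NOUN"),
     ("jan", "jan", "VERB"), ("du", "ukan", "AUX")],
    [("Beasain", "Beasain", "PROPN"), ("Gipuzkoan", "Gipuzkoa", "PROPN"),
     ("dago", "egon", "VERB")],
]
features = ratio_features(example)
```

Each document then becomes one vector of such ratios, which is what a classifier would receive as input.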

ErreXail: Classification results

Experiments                        Accuracy (%)
All features                       89.50
Lexical features                   90.75
Lex+Morph+Morpho-synt+Synt         93.50

Table: Classification results with SMO and 10-fold cross-validation
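
For readers unfamiliar with the evaluation protocol, here is a minimal sketch of k-fold cross-validation in plain Python. To stay self-contained it swaps the SMO support vector classifier for a simple nearest-class-mean rule on a single feature; the function names and the toy data are assumptions for illustration and do not reproduce the experiments in the table.

```python
def train_nearest_mean(xs, ys):
    """Stand-in classifier (NOT SMO): store the mean feature value per class."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return means


def predict_nearest_mean(means, x):
    """Predict the class whose stored mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))


def k_fold_accuracy(samples, labels, train, predict, k=10):
    """Average accuracy over k folds; fold f holds out items with index % k == f.

    Assumes at least k samples, so that no fold is empty.
    """
    accuracies = []
    for fold in range(k):
        test_idx = [i for i in range(len(samples)) if i % k == fold]
        train_idx = [i for i in range(len(samples)) if i % k != fold]
        model = train([samples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Invented toy data: one ratio feature per document, two well-separated classes.
samples = [0.10 + 0.01 * i for i in range(10)] + [0.30 + 0.01 * i for i in range(10)]
labels = ["simple"] * 10 + ["complex"] * 10
accuracy = k_fold_accuracy(samples, labels, train_nearest_mean, predict_nearest_mean)
```

Every document is used for testing exactly once, and the reported figure is the mean accuracy over the ten held-out folds.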

ErreXail: Most predictive features

Features and groups                                               Relevance (InfoGain)
Proper nouns / common nouns ratio (Lex.)                          0.2744
Appositions / noun phrases ratio (Morpho-synt.)                   0.2529
Appositions / all phrases ratio (Morpho-synt.)                    0.2529
Named entities / common nouns ratio (Lex.)                        0.2436
Unique lemmas / all the lemmas ratio (Lex.)                       0.2394
Acronyms / all the words ratio (Lex.)                             0.2376
Causative verbs / all the verbs ratio (Lex.)                      0.2099
Modal-temporal clauses / subordinate clauses ratio (Synt.)        0.2056
Destinative case endings / all the case endings ratio (Morph.)    0.1968
Connectors of clarification / all the connectors ratio (Prag.)    0.1957

Table: Most predictive features
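
Information gain measures how much knowing a feature reduces the entropy of the class label. The sketch below computes it for one numeric feature split at a single threshold; the threshold-based binarisation and the toy values are assumptions for the example, and the scores in the table were not obtained with this code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Information gain of the binary split `value <= threshold`."""
    low = [y for x, y in zip(values, labels) if x <= threshold]
    high = [y for x, y in zip(values, labels) if x > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder


labels = ["simple", "simple", "complex", "complex"]
# A feature that separates the classes perfectly at 0.4 ...
perfect = information_gain([0.1, 0.2, 0.6, 0.7], labels, threshold=0.4)
# ... and one that tells us nothing about the class.
useless = information_gain([0.1, 0.6, 0.2, 0.7], labels, threshold=0.4)
```

With two balanced classes the gain ranges from 0 bits (uninformative) to 1 bit (perfect separation), which gives a scale for reading the relevance column.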

Ion Madrazo’s master thesis (2014)

More linguistic features

    Dependencies
    Depth of the syntactic tree
    N-grams at PoS and dependency level
    Use of synonyms
    Latent Semantic Analysis

Other ML techniques

    Algorithms to choose the features (Information Gain and Correlation Feature Selection)
    Meta-algorithms for classification (Ordinal Classification and Cost-Sensitive Learning)
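
Two of the additional features, PoS n-grams and syntactic tree depth, can be sketched as follows. The input encoding (a list of PoS tags and a list of 1-based head indices with 0 for the root) and the toy dependency analysis are assumptions for the example, not the representation used in the thesis.

```python
from collections import Counter


def pos_ngrams(tags, n=2):
    """Count the n-grams of PoS tags in one sentence."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def tree_depth(heads):
    """Depth of a dependency tree.

    `heads[i - 1]` is the 1-based index of the head of token i; 0 marks the
    root. The root alone has depth 1. Assumes a well-formed tree (no cycles).
    """
    def depth(token):
        steps = 1
        while heads[token - 1] != 0:
            token = heads[token - 1]
            steps += 1
        return steps

    return max(depth(token) for token in range(1, len(heads) + 1))


# "Mutilak sagarra jan du." with a hand-made analysis (illustrative only):
# every other token attaches directly to the main verb "jan".
tags = ["NOUN", "NOUN", "VERB", "AUX"]
heads = [3, 3, 0, 3]
bigrams = pos_ngrams(tags)
depth = tree_depth(heads)
```

Deeper trees and rarer tag sequences are the kind of signal such features are meant to capture.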

Ion Madrazo’s master thesis: Results

Most significant features for each level (B1, B2, C1, C2)

Best results with multinomial Naive Bayes: 61.69% accuracy

State-of-the-art results

Similar results with meta algorithms

Lexical simplification

Begun in 2015

Maria Eguimendia’s work for her master’s thesis

Resources:

    A lemma frequency list from the Corpus Lexikoaren Behatokia (41,773,391 words)
    Basque WordNet
    UKB (Word Sense Disambiguation)
    NAF as input (multilingual)

Syntactic Simplification

Begun in 2011

Two main lines:

    Linguistic analysis of complex sentences to propose simplification rules
    Developing or adapting the tools to perform the automatic simplification (architecture of the EuTS system)

Linguistic analysis: Resources

Corpora:

    Reference Corpus for the Processing of Basque
    Consumer Corpus (used in Machine Translation)
    Wikipedia
    Elhuyar Corpus (popular science)

Grammar:

Descriptive Grammar of Basque by Euskaltzaindia (Academy)

Linguistic analysis: Tasks

Analysis of each clause type to propose simplification rules

Define a simplification process

Analysis of the frequency and position of each adverbial structure found in the grammar

Check if the proposed rules are also valid in other domains

Linguistic analysis: Simplification process

Splitting: Make as many new sentences as there are clauses in the original

Reconstruction: Two operations take place:

    Removing morphological features that are no longer needed
    Adding adverbs or phrases to maintain the meaning

Reordering: Reorder the elements in the new sentences, and order the sentences in the text

Correction: Correct possible grammar and spelling mistakes, and fix punctuation and capitalisation

Example

Simplification proposal of concessive clauses

(3) a. Hasiera batean aste honetan partidurik ez jokatzea aurreikusita zegoen arren, azken orduan ostiralean partidu bat jokatu nahi izan du Athleticek.
       (Although it was not foreseen to play a match this week, at the last moment Athletic Bilbao has decided to play one on Friday.)

    b. i. Hasiera batean aste honetan partidarik ez jokatzea aurreikusita zegoen.
          (It was not foreseen to play a match this week.)

       ii. Hala ere, azken orduan ostiralean partida bat jokatu nahi izan du Athleticek.
           (However, at the last moment Athletic Bilbao has decided to play one on Friday.)
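
The proposal in (3) can be mimicked on the surface string with a toy rule: split at the concessive marker "arren", close the first sentence, and open the second with the connector "Hala ere". This is only a sketch under strong assumptions, not the EuTS implementation: it works on raw text rather than on a clause-boundary analysis, the function name and parameters are hypothetical, and it does no morphological generation (in this example the verb form "zegoen" happens to stay the same once "arren" is removed) and no spelling correction.

```python
def split_concessive(sentence, marker=" arren, ", connector="Hala ere, "):
    """Split 'X arren, Y.' into ['X.', 'Hala ere, Y.'].

    Returns the sentence unchanged (as a one-item list) when the marker
    does not occur.
    """
    if marker not in sentence:
        return [sentence]
    subordinate, main = sentence.split(marker, 1)
    first = subordinate.rstrip() + "."     # splitting + punctuation fix
    second = connector + main.lstrip()     # reconstruction: add a connector
    return [first, second]


original = ("Hasiera batean aste honetan partidurik ez jokatzea aurreikusita "
            "zegoen arren, azken orduan ostiralean partidu bat jokatu nahi "
            "izan du Athleticek.")
simplified = split_concessive(original)
```

The output matches (3b) except for the spelling changes (partidurik/partidarik, partidu/partida), which belong to the correction step.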

Linguistic analysis: Simplification levels

1 Syntactic Substitution Simplification (SSS): Frequency-based simplification of syntactic structures

2 Natural Simplification (NS): Compound and complex sentences with finite verbs are simplified following the simplification process, together with the SSS

3 Strong or absolute simplification (AS): Everything is simplified (finite and non-finite verbs + SSS)

4 Tailored or customised simplification (CS): Only the needed or required phenomena

Syntactic Substitution Simplification (SSS)

-tzearren ‘(in order) to’: 1.86% (15 instances)

-tzeko ‘(in order) to’: 88.38% (791 instances)

SSS of non-finite purpose clauses

(4) a. Abuztuaren amaieran beste goi bilera bat egitea aztertzen ari dira Israel eta PAN Palestinako Aginte Nazionala, Ekialde Erdiko bake prozesua suspertzearren.
       (Israel and the PNA, Palestinian National Authority, are considering organising another summit at the end of August to promote the peace process in the Middle East.)

    b. i. Abuztuaren amaieran beste goi bilera bat egitea aztertzen ari dira Israel eta PAN Palestinako Aginte Nazionala, Ekialde Erdiko bake prozesua suspertzeko.
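
The frequency-based substitution in (4) can be sketched as a lookup: among equivalent structures, keep the most frequent one and rewrite the others to it. The code below uses only the two instance counts given above; the function names are hypothetical, the rule is applied blindly to word endings without any syntactic analysis, and other allomorphs of the suffixes are not handled.

```python
# Instance counts of the two purpose-clause endings, as reported above.
COUNTS = {"tzearren": 15, "tzeko": 791}


def substitute_purpose_suffix(word, counts=COUNTS):
    """Replace a rarer purpose suffix with the most frequent one."""
    preferred = max(counts, key=counts.get)
    stripped = word.rstrip(".,;:")          # keep trailing punctuation aside
    trailing = word[len(stripped):]
    for suffix in counts:
        if suffix != preferred and stripped.endswith(suffix):
            return stripped[:-len(suffix)] + preferred + trailing
    return word


def substitute_in_sentence(sentence):
    """Apply the substitution to every whitespace-separated word."""
    return " ".join(substitute_purpose_suffix(word) for word in sentence.split())


clause = "Ekialde Erdiko bake prozesua suspertzearren."
rewritten = substitute_in_sentence(clause)
```

Here `rewritten` ends in "suspertzeko.", as in (4b); a real system would first need to confirm that the word heads a purpose clause.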

Architecture of the EuTS system

Architecture of the EuTS system: Developed or adapted tools

Improved the clause boundary detection grammar

Developed an apposition detector

Developed a readability assessment system, ErreXail

Implemented a splitting algorithm (and reconstruction for the relative clauses)

Biografix, a (multilingual) tool that simplifies biographical data

SSS

Biografix: Example

Living people (original)

(5) Karlos Argiñano Urkiola, nazioartean Karlos Arguiñano grafiaz ezagunagoa, (Beasain, Gipuzkoa, 1948ko irailaren 6a) sukaldari, aktore eta enpresaburu euskalduna da.
    ‘Karlos Argiñano Urkiola, internationally known with the Karlos Arguiñano spelling, (Beasain, Gipuzkoa, 6th September, 1948) is a Basque chef, actor and businessman.’

Biografix: Example

Living people (simplified)

(6) a. Karlos Argiñano Urkiola, nazioartean Karlos Arguiñano grafiaz ezagunagoa, sukaldari, aktore eta enpresaburu euskalduna da.
       ‘Karlos Argiñano Urkiola, internationally known with the Karlos Arguiñano spelling, is a Basque chef, actor and businessman.’

    b. Karlos Argiñano 1948ko irailaren 6an Beasainen jaio zen.
       ‘Karlos Argiñano was born on the 6th of September, 1948 in Beasain.’

    c. Beasain Gipuzkoan dago.
       ‘Beasain is in Gipuzkoa.’

Available at:
    http://ixa.si.ehu.es/Ixa/Produktuak/1403535629
    https://github.com/itziargd/Biografix
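
The transformation from (5) to (6) can be illustrated on the English glosses with a regular expression that pulls the parenthetical "(town, province, date)" out of the first sentence and turns it into separate sentences. This is a sketch of the idea only, not the Biografix code linked above: the pattern, the function name and the `short_name` parameter are assumptions, and the Basque output would additionally require case inflection (Beasainen, Gipuzkoan, 6an), which is not attempted here.

```python
import re

# Hypothetical pattern: "<before>[,] (<town>, <province>, <date>) <after>"
PATTERN = re.compile(
    r"^(?P<before>.*?)(?P<comma>,?)\s*"
    r"\((?P<town>[^,()]+),\s*(?P<province>[^,()]+),\s*(?P<date>[^()]+)\)\s*"
    r"(?P<after>.*)$"
)


def simplify_biography(sentence, short_name):
    """Move birth data from a parenthetical into sentences of their own."""
    match = PATTERN.match(sentence)
    if match is None:
        return [sentence]
    parts = match.groupdict()
    return [
        f"{parts['before']}{parts['comma']} {parts['after']}",
        f"{short_name} was born on {parts['date']} in {parts['town']}.",
        f"{parts['town']} is in {parts['province']}.",
    ]


original = ("Karlos Argiñano Urkiola, internationally known with the "
            "Karlos Arguiñano spelling, (Beasain, Gipuzkoa, 6th September, "
            "1948) is a Basque chef, actor and businessman.")
sentences = simplify_biography(original, short_name="Karlos Argiñano")
```

The three resulting sentences correspond to (6a), (6b) and (6c) in the English glosses.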

Evaluation

Extrinsic and manual evaluation of Biografix

Manual evaluation of SSS

Planned evaluations:

    Compare our rules with various approaches to simplification (Corpus of Simplified Text)
    Extrinsic evaluation through machine translation (which translator?)
    Comprehension tests (crowdsourcing platforms)

Corpus of Simplified Text

First phase:

    3 popular science texts (medicine, technology and history)
    3 annotators (different backgrounds):
        A court translator with no prior knowledge of simplification
        A teacher of Basque as a foreign language
        A philosopher/writer who writes literature in easy Basque (intuitive)

    Which operations do they perform?
    Do they perform the same operations?
    Are those operations similar to ours?

Second phase: other domains

Current and near future work

Implementation of EuTS:

    Adaptation of the analysis output for the morphology generator
    Formalisation of the rules written after the linguistic analysis

Waiting for the annotators of the Corpus of Simplified Text, then analysis of the operations

Exploring the other evaluation possibilities
