AUTOMATIC IDENTIFICATION OF PROVERB VARIANTS: AN EXPERIMENT WITH BRAZILIAN PORTUGUESE Amanda Rassi Jorge Baptista Oto Vale PROPOR 2014 - International Conference on Computational Processing of Portuguese October 6-9, 2014 USP-São Carlos, SP, Brazil

Rassi et-al propor-2014

Dec 05, 2014


Jorge Baptista

This paper describes a methodology for automatically identifying proverbs and their variants in running texts. The methodology is based on existing compilations of proverbs: it explores the regular syntactic structures that most proverbs present and intersects these syntactic structures with the lexical units of the proverbs. Based on the syntactic regularities, we divided the data into 13 different classes. Finite-state automata are used to represent the regular patterns found in the classes. The results showed a precision of 74.68% when tested on a Brazilian Portuguese journalistic corpus.
Page 1: Rassi et-al propor-2014

AUTOMATIC IDENTIFICATION OF PROVERB VARIANTS: AN EXPERIMENT WITH BRAZILIAN PORTUGUESE

Amanda Rassi Jorge Baptista

Oto Vale

PROPOR 2014 - International Conference on Computational Processing of Portuguese October 6-9, 2014 USP-São Carlos, SP, Brazil

Page 2: Rassi et-al propor-2014

Proverbs

• Definition
  • a type of multiword expressions (micro-texts)
  • special citation status
  • express atemporal truths
  • combinatorial and lexical constraints
  • sentences syntactically identical to ordinary sentences
  • common lexicon

• Delimitation
  • Proverbs ≠ frozen sentences (or idioms)
  • Proverbs: subject position necessarily filled by a fixed element vs. Idioms: subject position distributionally free (in most cases)

Page 3: Rassi et-al propor-2014

Goals

• Automatically detect proverbs in texts, even when they are not introduced by any linguistic “quoting” devices:
  Como dizem ‘as they say’
  Como dizia minha avó ‘as my grandmother used to say’
  Dizem por aí ‘people say/they say’
  Costuma dizer-se ‘it is often said’

• Identify the variants of proverbs, considering both formal and lexical variations.

Page 4: Rassi et-al propor-2014

Related work

• For French and Italian: Conenna (1998, 2000, 2004) and Lacavalla (2007)
• For French and Spanish: Brotons (2008)
• For European Portuguese (EP): Chacoto (2006, 2007, 2008)
• For Brazilian Portuguese (BP): no formal description

Page 5: Rassi et-al propor-2014

Motivation

• though relatively rare, proverbs are “islands of meaning” in texts (citation status)
• often difficult to spot:
  • lack formal marks
  • formal and lexical variation
• often take part in wordplay
• discursive function is complex (entailment)
• relation with other textual elements is disrupted (no coreference)

Page 6: Rassi et-al propor-2014

Methods

• create a database with proverbs;
• define syntactic criteria to organize the collected proverbs into similar formal classes;
• organize the elements according to POS;
• produce tables of core elements;

with Unitex 3.1 (Paumier 2003, 2014):

• create reference graphs with the basic syntactic structures for each class;
• intersect the graphs with the tables of the proverbs’ core elements to produce finite-state transducers, which can then be applied to texts.
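The last two steps can be approximated outside Unitex. The following is only a minimal Python sketch, not the authors' Unitex graphs: a class template (a list of slot names) is combined with one row of a core-element table to yield a regular expression. The slot names, the table layout, the `max_insert` default, and the assignment of ID 0023 to "Quem conta um conto aumenta um ponto" are assumptions made for illustration.

```python
import re

# Hypothetical template for the P2F4 class: an ordered list of core-element slots.
P2F4_SLOTS = ["QU", "V1", "N1", "V2", "N2"]

# Hypothetical core-element table: proverb ID -> slot fillers.
CORE_TABLE = {
    "0023": {"QU": "quem", "V1": "conta", "N1": "conto",
             "V2": "aumenta", "N2": "ponto"},
}

def build_pattern(slots, row, max_insert=2):
    """Combine a class template with one table row into a regex.

    Up to `max_insert` free words may occur between consecutive core elements.
    """
    gap = r"\s+(?:\w+\s+){0,%d}" % max_insert
    core = [re.escape(row[slot]) for slot in slots]
    return re.compile(r"\b" + gap.join(core) + r"\b", re.IGNORECASE)

def find_proverbs(text, slots=P2F4_SLOTS, table=CORE_TABLE):
    """Return (proverb ID, matched string) pairs found in `text`."""
    hits = []
    for proverb_id, row in table.items():
        for match in build_pattern(slots, row).finditer(text):
            hits.append((proverb_id, match.group(0)))
    return hits

print(find_proverbs("Como dizem, quem conta um conto sempre aumenta um ponto."))
```

Because only the core elements are fixed, the variant with the inserted adverb "sempre" is still matched, while a string that lacks one of the core elements is not.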

Page 7: Rassi et-al propor-2014

Collection of proverbs

• 5 different sources:
  • list of proverbs in Wikipedia
  • grand book of proverbs (Teixeira, 1942)
  • 1001 proverbs (Steinberg, 1985)
  • book of proverbs (Pinto, 2003)
  • dictionary of proverbs (Magalhães Jr., 1974)

• Original list of 3,502 proverbs (and their variants)
• Final list of 594 proverbs (types or base-forms)

Page 8: Rassi et-al propor-2014

Classification criteria

• number of verb phrases/clauses (P1, P2 and P3)

• in P1
  • impersonal constructions
  • the verb is a copula verb
  • obligatory negation (Neg)
  • obligatory fronting of PP verb complement

• in P2
  • comparatives
  • coordinate/subordinate clauses
  • verbless coordinated phrases
  • obligatory fronting of 2nd verb phrase

• in P3 (no subclasses)

Page 9: Rassi et-al propor-2014

Formal classes


[…] (i) number of propositions (one, two, or three clauses or clause-like units); (ii) coordination (in multiple-clause proverbs); (iii) order of the main vs. the subordinate clauses (in multiple-clause proverbs); (iv) order of the constituents (in single-clause proverbs); (v) impersonal constructions; and (vi) obligatory negation.

Proverbs that did not fit in any of the categories above were added to a residual class. Table 1 shows the breakdown of the proverbs (base-forms) per class.

Table 1. Formal Classification of Brazilian Portuguese Proverbs

Class | Structure | Example (approximate translation) | Count
P1F1 (impersonal) | Ø V w | Não há parto sem dor ‘There is no painless childbirth’ | 20
P1F2 | N0 Vcop Adj/N w | O silêncio é de ouro ‘Silence is golden’ | 53
P1F3 | N0 V w | Uma mão lava a outra ‘One hand washes the other’ | 80
P1F4 | N0 Neg V w | Cão que ladra não morde ‘A barking dog seldom bites’ | 53
P1F5 | Prep Ni N0 V w | Em terra de cego, quem tem um olho é rei ‘In the land of the blind, the one-eyed is king’ | 45
P2F1 (comparatives) | F1 Conjs-comp F2 | Antes só que mal acompanhado ‘Better alone than in bad company’ | 39
P2F2 (coordinated) | F1 Conjc F2 | Aqui se faz e aqui se paga ‘What goes around comes around’ | 71
P2F3 | NP1, NP2 | Cada cabeça, uma sentença ‘Each head its sentence’ | 48
P2F4 (subordinated) | Qu- F1 F2 | Quem ri por último ri melhor ‘Who laughs last laughs best’ | 90
P2F5 (subordinated) | F1 Conjs F2 | Pense duas vezes antes de agir ‘Look before you leap’ | 20
P2F6 (fronted subord.) | Conjs F2, F1 | Quando o gato sai de casa, os ratos fazem festa ‘When the cat’s away, the mice will play’ | 28
P3 | F1, F2, F3 | Mãos frias, coração quente, amor ardente ‘Cold hands, warm heart, burning love’ | 24
Residual | not specified | Comer e coçar é só começar ‘To keep eating and scratching, just start’ | 43
Total | | | 614

In this Table, the left column shows the conventional codes for designating each class; the structure of the proverbs’ class is indicated as follows: Adj for adjective; Conjc for coordinative conjunctions; Conjs for subordinative conjunctions; Conjs-comp specifically for comparative conjunctions; F1, F2 and F3 for the first, second and third clause (or clause-like units), respectively; N0 for the subject; Ni for a noun in any syntactic position; NP1 and NP2 for nomi-

Page 10: Rassi et-al propor-2014

Core elements

• Noun phrases (NP), subject (N0) or complement (N1):
  • noun (N) or pronoun (PRO)
  • adjective (Adj)
  • optional determiners (Det) or modifiers (Mod)

• Verbal phrases (VP):
  • main verb (V)
  • optional auxiliaries (Aux)
  • adverbial modifiers (Mod)

Page 11: Rassi et-al propor-2014

Graphs and Transducers

Quem conta um conto aumenta um ponto
‘Who tells a tale adds a point’

[Figure: example of a reference graph for P2F4 class]

[Figure: example of a FS transducer for proverb 0023 in P2F4 class]

Page 12: Rassi et-al propor-2014

Concordance

[Figure: concordance; each line has the form: [proverb ID = core elements] matched string]

Page 13: Rassi et-al propor-2014


Table 3. Results of automatic identification of proverbs by class

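Class | Proverbs | Matches | Types | True-Positives | False-Positives
P1F1 | 20 | 15 | 4 | 13 | 2
P1F2 | 53 | 91 | 21 | 75 | 16
P1F3 | 80 | 153 | 24 | 98 | 55
P1F4 | 53 | 61 | 15 | 61 | 0
P1F5 | 45 | 63 | 5 | 57 | 6
P2F1 | 39 | 40 | 7 | 39 | 1
P2F2 | 71 | 14 | 3 | 5 | 9
P2F3 | 48 | 40 | 8 | 15 | 25
P2F4 | 90 | 56 | 37 | 30 | 26
P2F5 | 20 | 3 | 1 | 3 | 0
P2F6 | 28 | 1 | 1 | 1 | 0
P3 | 24 | 0 | 0 | 0 | 0
Residual | 43 | 20 | 8 | 19 | 1
Total | 614 | 557 | 134 | 416 | 141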

We emphasize two advantages of the proposed syntactic classification: (i) the definition of an adequate extent of a window for insertions (words and punctuation), which varies depending on the formal class; and (ii) each class has specific properties, so we could apply specific transformations in each class, e.g. the mirror-permutation (P1F2 class), and the zeroing of negation elements (in P1F4 class) or of the main clause in double-clause classes (P2F1 and P2F2 classes).

Furthermore, we point out that (i) the methodology presented here to identify the proverbs’ core elements is mostly language-independent and can be replicated for larger databases than the one used in this work; and (ii) finite-state automata can be applied to lexical matrices for automatic extraction of the proverbs and their variants in large corpora.

In future work, we intend to annotate a corpus or a sample of PLN.Br Full, aiming to compare the results of the automatic task with the human annotation, and then evaluate the work in terms of recall and F-measure. In the same way as we varied the window length by class, we want to experiment with varying this length in different syntactic slots within the same structure and, eventually, determine which proverbs are more or less prone to formal variation.

We also intend to expand the database to include proverbs from European Portuguese [19], and then automatically build a proverbial database by using finite-state automata and local grammars with discursive markers that introduce proverbs and variants.

Acknowledgments. This work was supported by national funds through FCT - Fundação para a Ciência e a Tecnologia, under project PEst-OE/EEI/LA0021/2013 and by Capes/PDSE under Process BEX 12751/13-8. The authors would also like to thank the anonymous reviewers for their comments, which helped to improve this paper.

Precision = 74.7%
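As a quick check of the arithmetic, the precision figure can be re-derived from the Table 3 counts as true positives divided by matches. The snippet below is only an illustrative sketch; the dictionary layout and the function name are assumptions, while the counts are copied from Table 3.

```python
# (matches, true positives) per class, copied from Table 3.
TABLE3 = {
    "P1F1": (15, 13), "P1F2": (91, 75), "P1F3": (153, 98),
    "P1F4": (61, 61), "P1F5": (63, 57), "P2F1": (40, 39),
    "P2F2": (14, 5), "P2F3": (40, 15), "P2F4": (56, 30),
    "P2F5": (3, 3), "P2F6": (1, 1), "P3": (0, 0),
    "Residual": (20, 19),
}

def precision(matches, true_positives):
    """Precision = true positives / matches; None when there are no matches."""
    return true_positives / matches if matches else None

total_matches = sum(m for m, _ in TABLE3.values())
total_true_positives = sum(tp for _, tp in TABLE3.values())
overall = precision(total_matches, total_true_positives)

for name, (m, tp) in TABLE3.items():
    p = precision(m, tp)
    print(name, "n/a" if p is None else f"{p:.1%}")
print("Overall", f"{overall:.1%}")
```

The per-class values make the spread visible: P1F4 has no false positives, whereas P2F2 and P2F3 fall well below the overall figure.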

Page 14: Rassi et-al propor-2014

Error analysis

• Specific subsets in P2F4 class:
  Quem <MOT>* V <MOT>* V    e.g. Quem tem boca vai a Roma
  Quem <V> <V>              e.g. Quem cala consente

• Constraints on V tense:
  Quem (<V:P3s>+<V:J3s>+<V:W>) (<V:P3s>+<V:F3s>+<V:W>)

P2F4 class | Matches | FP | Precision (P2F4) | Precision (all classes)
Quem <MOT>* V <MOT>* V | 276 | 200 | 27.5% | 60.15%
Quem V V (no insertions) | 56 | 26 | 53.57% | 73.55%
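The stricter "Quem V V" pattern with its tense constraint can be pictured as a check over POS-tagged tokens. This is a hedged Python sketch rather than the Unitex grammar itself: the input format (word, tag) is hypothetical, and the reading of the codes (P3s as present 3rd singular, J3s as preterite 3rd singular, F3s as future 3rd singular, W as infinitive) is an assumption about the dictionary's inflection codes.

```python
# Verb codes allowed in each slot, taken from the constraint on the slide.
# Their interpretation (present / preterite / future / infinitive) is assumed.
FIRST_VERB = {"P3s", "J3s", "W"}
SECOND_VERB = {"P3s", "F3s", "W"}

def matches_quem_v_v(tagged):
    """Check for 'Quem V V' with no insertions.

    `tagged` is a list of (word, tag) pairs; tag is None for non-verbs.
    """
    for i in range(len(tagged) - 2):
        (w0, _), (_, t1), (_, t2) = tagged[i:i + 3]
        if w0.lower() == "quem" and t1 in FIRST_VERB and t2 in SECOND_VERB:
            return True
    return False

strict_hit = [("Quem", None), ("cala", "P3s"), ("consente", "P3s")]
with_insertions = [("Quem", None), ("tem", "P3s"), ("boca", None),
                   ("vai", "P3s"), ("a", None), ("Roma", None)]

print(matches_quem_v_v(strict_hit))       # True
print(matches_quem_v_v(with_insertions))  # False
```

The second example shows the trade-off reported in the table: forbidding insertions raises precision but no longer matches proverbs such as "Quem tem boca vai a Roma".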

Page 15: Rassi et-al propor-2014

Discussion - New variants

• The matches found allowed us to identify other variants of the same proverb that were not in the initial list:

  Antes tarde do que nunca
  ‘Better late than never’

[Figure: matches, with the new variants marked]

Page 16: Rassi et-al propor-2014

Discussion (cont.) - New proverbs

• It was also possible to find proverbs that were not in the previous list.

P2F4 class: quem V V ‘who V V’

  Quem sabe faz
  ‘Who knows makes’

  Quem sabe faz ao vivo
  ‘Who knows makes it live’

Page 17: Rassi et-al propor-2014

Discussion (cont.) – Window insertion length

• The length of the insertion window can vary, depending on the type of proverb involved (in general, at most 5 words).

  O buraco [das negociações com o Congresso] é muito mais embaixo
  ‘the hole [in negotiations with Congress] is much more down’

  a justiça [que o brasileiro tanto almeja] começa dentro de casa
  ‘the justice [that the Brazilian so much craves] begins at home’
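A bounded insertion window can be sketched with a regular expression that allows at most `max_insert` free words between consecutive core elements. This is an illustrative Python approximation, not the Unitex graphs used in the work; the choice of core words for this proverb and the function name are assumptions.

```python
import re

def windowed_pattern(core_words, max_insert=5):
    """Regex allowing up to `max_insert` free words between core elements."""
    gap = r"\s+(?:[\w-]+\s+){0,%d}" % max_insert
    body = gap.join(re.escape(word) for word in core_words)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

# Hypothetical core elements for "O buraco é mais embaixo".
CORE = ["buraco", "é", "mais", "embaixo"]
TEXT = "O buraco das negociações com o Congresso é muito mais embaixo"

wide = windowed_pattern(CORE, max_insert=5).search(TEXT)
narrow = windowed_pattern(CORE, max_insert=4).search(TEXT)

print(wide.group(0) if wide else None)
print(narrow.group(0) if narrow else None)
```

The inserted segment here is exactly five words long, so the match succeeds with a window of 5 and fails with a window of 4, which is why the window size needs tuning per class.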

Page 18: Rassi et-al propor-2014

Discussion (cont.) – Separators

• In Portuguese proverbs, the use of the comma is not systematic, and in many cases it can be considered optional.
• The reference graphs allow the optional presence of punctuation between the core words.

  Quem sai ao vento (,) perde o assento (comma optional)
  ‘Who leaves to the wind, loses the seat’

  Quando a esmola é demais (,) o santo desconfia (comma optional)
  ‘When the alms are too much, the saint suspects’

Page 19: Rassi et-al propor-2014

Discussion (cont.) – Transformations

• Some proverbs of P1F2 class allow a mirror permutation:

  O ataque é a melhor defesa [Mirror Permut.] = A melhor defesa é o ataque
  ‘The attack is the best defense = The best defense is the attack’
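The mirror permutation amounts to swapping the two noun phrases around the copula. The following is a small Python sketch under the assumption that the proverb has already been split into subject, copula, and predicate chunks; the function name and the input format are hypothetical.

```python
def mirror_permutation(subject, copula, predicate):
    """Return both orders of an 'N0 Vcop N1' proverb given as three chunks."""
    def sentence(first, second):
        text = f"{first} {copula} {second}"
        return text[0].upper() + text[1:]  # capitalize the sentence start
    return [sentence(subject, predicate), sentence(predicate, subject)]

variants = mirror_permutation("o ataque", "é", "a melhor defesa")
print(variants)
# ['O ataque é a melhor defesa', 'A melhor defesa é o ataque']
```

Generating both orders up front lets the same matching step cover the base form and its permuted variant. As the slide notes, only some P1F2 proverbs allow this transformation.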

Page 20: Rassi et-al propor-2014

Discussion (cont.) - Negation

• Negation need not be treated as an obligatory element; wordplay often removes it to produce some type of effect:

  Beleza não põe mesa
  ‘Beauty does not set the table’

  Como a maioria das outras entrevistadas, Astrid diz que beleza põe mesa, sim
  ‘Like most other interviewees, Astrid says that beauty does set the table’
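Treating the negation as optional can be sketched as a pattern in which the negative particle is an optional group, so that both the canonical form and the wordplay variant are matched and can be told apart. This Python snippet is an illustrative assumption, not the authors' P1F4 graph; the pattern and the function name are hypothetical.

```python
import re

# "não" is optional; the named group records whether it was present.
NEG_OPTIONAL = re.compile(
    r"\bbeleza\s+(?P<neg>não\s+)?põe\s+mesa\b", re.IGNORECASE
)

def find_with_negation_flag(text):
    """Return (matched string, negation present?) or None if no match."""
    match = NEG_OPTIONAL.search(text)
    if match is None:
        return None
    return match.group(0), match.group("neg") is not None

print(find_with_negation_flag("Beleza não põe mesa"))
print(find_with_negation_flag(
    "Como a maioria das outras entrevistadas, Astrid diz que beleza põe mesa, sim"
))
```

Keeping the flag makes it possible to report the second match as a variant with zeroed negation rather than as the base form.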

Page 21: Rassi et-al propor-2014

Discussion (cont.) - Implicit clauses

• Some proverbs in P2F2 class, formed by two propositions, may result from coordinating two simple proverbs with one proposition each:

  Quem casa não pensa, quem pensa não casa
  ‘Who gets married doesn't think, who thinks doesn't get married’

  Quem casa não pensa
  ‘Who gets married doesn't think’

  Quem pensa não casa
  ‘Who thinks doesn't get married’

Page 22: Rassi et-al propor-2014

Synopsis

(1) the formal (syntactic) classification of proverbs in 13 classes: this classification may serve as a starting point for deeper analysis on each one of these proverbial structures;

(2) the identification of the core elements of each proverb: the methodology presented to extract keywords can be replicated for other corpora in order to check different text types and domains;

(3) the definition of an adequate length for the insertion window (words and punctuation), which may vary depending on the class of proverbs.

Page 23: Rassi et-al propor-2014

Thank you! Questions, please!