TOWARDS A CORPUS-BASED ONLINE DICTIONARY OF ITALIAN WORD COMBINATIONS The CombiNet project SARA CASTAGNOLI FRANCESCA MASINI (UNIVERSITY OF BOLOGNA) MALVINA NISSIM (UNIVERSITY OF GRONINGEN) GIANLUCA E. LEBANI ALESSANDRO LENCI (UNIVERSITY OF PISA) ENeL meeting @ Herstmonceux Castle, 13 August 2015 VALENTINA PIUNNO (UNIVERSITY OF ROMA TRE)
TOWARDS A CORPUS-BASED ONLINE DICTIONARY OF
ITALIAN WORD COMBINATIONS
The CombiNet project
SARA CASTAGNOLI
FRANCESCA MASINI
(UNIVERSITY OF BOLOGNA)
MALVINA NISSIM
(UNIVERSITY OF GRONINGEN)
GIANLUCA E. LEBANI
ALESSANDRO LENCI
(UNIVERSITY OF PISA)
VALENTINA PIUNNO
(UNIVERSITY OF ROMA TRE)
ENeL meeting @ Herstmonceux Castle, 13 August 2015
THIS PRESENTATION
• INTRODUCING CombiNet, an ongoing project aimed at building a corpus-based lexicographic resource for Italian Word Combinations (Universities of Roma Tre, Pisa, Bologna)
  • an innovative resource for the Italian language
  • relevance for ENeL-WG3:
    • an electronic resource
    • an integrated computational-lexicographic approach:
      1) automatic extraction of candidate WoCs from corpora
      2) manual evaluation and compilation
• OUTLINE:
  • our view of Word Combinations (WoCs)
  • AKA: extracting WoCs from corpora – methods
  • evaluation of AKA: automatic and manual
WORD COMBINATIONS (WoCs)
The whole range of combinatory possibilities associated with a word, including:
• Multiword Expressions (MWEs), i.e. a variety of WoCs characterised by different degrees of fixedness and idiomaticity that act as a single unit at some level of linguistic analysis, e.g.:
  • idioms
  • phrasal lexemes
• More abstract combinations, i.e. the distributional properties of a word at the level of e.g.:
Using POS PATTERNS (P-BASED methods)
VERB DET (ADJ) NOUN: costruire un piccolo impero ‘build a small empire’
Using SYNTACTIC INFO (S-BASED methods)
- parsed corpus
- list of syntactic relations
SUBJ – VERB: guerra – scoppiare ‘war – burst’
VERB – OBJ: perdere – vista ‘lose – (one’s) sight’
VERB – COMP_DI: parlare – di sport ‘talk – about sport’
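As a rough illustration of the syntax-based view, relations like the ones above can be stored as (head, relation, dependent) triples and grouped by slot. This is a minimal sketch on invented toy data; the function name `slot_fillers` and the extra example lemmas are assumptions made for illustration, not part of the CombiNet tools.

```python
from collections import Counter, defaultdict

# Toy dependency triples: (head lemma, relation, dependent lemma).
# Labels follow the slide examples; the data itself is invented.
triples = [
    ("scoppiare", "SUBJ", "guerra"),
    ("scoppiare", "SUBJ", "guerra"),
    ("scoppiare", "SUBJ", "incendio"),
    ("perdere", "OBJ", "vista"),
    ("perdere", "OBJ", "tempo"),
    ("parlare", "COMP_DI", "sport"),
]


def slot_fillers(triples):
    """Group dependents by (head, relation) and count each filler lemma."""
    profile = defaultdict(Counter)
    for head, rel, dep in triples:
        profile[(head, rel)][dep] += 1
    return profile


profile = slot_fillers(triples)
for (head, rel), fillers in sorted(profile.items()):
    print(head, rel, dict(fillers))
```

Because the triples carry no linear order, a pair such as guerra – scoppiare is counted the same way whether the subject precedes or follows the verb, which is what lets this kind of method reach discontinuous and flexible combinations.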
COMPARING EXTRACTION METHODS
Using POS PATTERNS (P-BASED methods)
- satisfactory results for relatively fixed, adjacent, short WoCs
- patterns need to be specified a priori
- noise, even after applying association measures (AMs)
- cannot capture complex and flexible WoCs
- dismissing abstract combinatory information (e.g. argument structure)
Using SYNTACTIC INFO (S-BASED methods)
- also target discontinuous and syntactically flexible WoCs
- abstracting away from information such as linear order, morphosyntactic features etc.
- no information about how exactly words combine
- cannot distinguish frequent but productive combinations from idiomatic ones with the very same syntactic structure
Castagnoli et al. 2015; Lenci et al. 2014, 2015
AUTOMATIC EXTRACTION OF CANDIDATE WoCs - DATA
• La Repubblica corpus (Baroni et al. 2004)
  • approx. 380M tokens, POS-tagged and dependency-parsed
  • “clean” corpus, but only newspaper language
• POS-based extraction:
  • 122 POS sequences deemed representative of Italian WoCs, in 3 subsets (nominal, verbal, prepositional WoCs)
  • independent extraction rounds, using the EXTra tool
  • contiguous sequences, no optional slots, log-likelihood (LL) ranking, freq > 5
• Syntax-based extraction:
  • distributional profiles, containing the syntactic slots (subject, complements, modifiers, etc.) and the combinations of slots (frames) with which words co-occur, abstracted away from their surface morphosyntactic patterns
  • each slot is associated with lexical sets formed by its most prototypical fillers
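The POS-based settings listed above (contiguous sequences, no optional slots, a frequency cut-off) can be sketched as follows. This is not the EXTra tool: the function `extract_pos_candidates`, the toy corpus and the tag names are assumptions made for illustration. An optional slot such as (ADJ) is handled here by listing two separate fixed patterns.

```python
from collections import Counter

# Toy corpus: each sentence is a list of (lemma, POS) pairs; tags are illustrative.
corpus = [
    [("costruire", "VERB"), ("un", "DET"), ("piccolo", "ADJ"), ("impero", "NOUN")],
    [("costruire", "VERB"), ("un", "DET"), ("piccolo", "ADJ"), ("impero", "NOUN")],
    [("perdere", "VERB"), ("il", "DET"), ("vista", "NOUN")],
]

# No optional slots: VERB DET (ADJ) NOUN is spelled out as two fixed patterns.
PATTERNS = [
    ("VERB", "DET", "NOUN"),
    ("VERB", "DET", "ADJ", "NOUN"),
]


def extract_pos_candidates(corpus, patterns, threshold):
    """Count contiguous lemma sequences matching a POS pattern; keep freq > threshold."""
    counts = Counter()
    for sentence in corpus:
        tags = [pos for _, pos in sentence]
        for pattern in patterns:
            n = len(pattern)
            for i in range(len(sentence) - n + 1):
                if tuple(tags[i:i + n]) == pattern:
                    lemmas = tuple(lemma for lemma, _ in sentence[i:i + n])
                    counts[(pattern, lemmas)] += 1
    return {key: freq for key, freq in counts.items() if freq > threshold}


# The slides use freq > 5 on a 380M-token corpus; the toy data needs a lower cut-off.
candidates = extract_pos_candidates(corpus, PATTERNS, threshold=1)
for (pattern, lemmas), freq in candidates.items():
    print(" ".join(pattern), "|", " ".join(lemmas), "|", freq)
```

Matching on exact tag windows is what makes the approach work well for adjacent sequences and miss anything interrupted by intervening material.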
1) All sequences corresponding to the mentioned patterns are extracted from the corpus.
2) Lists of candidate WoCs are filtered to extract lines containing specific Target Lemmas (TLs, i.e. future headwords)
  • Headwords: the 2,100 “fundamental” words from the Senso Comune lexicon (http://www.sensocomune.it/)
  • Nouns, Verbs, Adjectives
3) Lexicographers are provided with structured lists:
  • lemmatised candidate WoCs for a given TL
  • ranked according to their LL score
  • raw frequency of each combination in the corpus
  • underlying POS pattern or syntactic relation
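Steps 2 and 3 can be sketched for the simplest case of two-word candidates. The slides do not say how the LL score is computed for longer sequences, so this example assumes Dunning's log-likelihood ratio on a 2x2 contingency table; the candidate records, their counts and the function names are hypothetical.

```python
import math


def log_likelihood(o11, o12, o21, o22):
    """Dunning's G2 for a 2x2 contingency table of observed counts."""
    n = o11 + o12 + o21 + o22
    observed = ((o11, o12), (o21, o22))
    row_totals = (o11 + o12, o21 + o22)
    col_totals = (o11 + o21, o12 + o22)
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Invented candidates. table = (pair together, first word only, second word only, neither).
candidates = [
    {"lemmas": ("guerra", "scoppiare"), "pattern": "SUBJ-VERB", "table": (30, 70, 20, 9880)},
    {"lemmas": ("guerra", "mondiale"), "pattern": "NOUN ADJ", "table": (50, 50, 10, 9890)},
    {"lemmas": ("perdere", "vista"), "pattern": "VERB-OBJ", "table": (12, 188, 8, 9792)},
]


def rank_for_target(candidates, target_lemma):
    """Keep candidates containing the target lemma and sort them by LL, highest first."""
    rows = []
    for cand in candidates:
        if target_lemma in cand["lemmas"]:
            o11, o12, o21, o22 = cand["table"]
            rows.append({
                "woc": " ".join(cand["lemmas"]),
                "ll": log_likelihood(o11, o12, o21, o22),
                "freq": o11,
                "pattern": cand["pattern"],
            })
    return sorted(rows, key=lambda row: row["ll"], reverse=True)


for row in rank_for_target(candidates, "guerra"):
    print(row["woc"], round(row["ll"], 2), row["freq"], row["pattern"])
```

Each output row carries the four pieces of information listed in step 3: the lemmatised combination, its LL score, its raw frequency and the underlying pattern or relation.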
[Figure: POS-based data (two slides)]
[Figure: syntax-based data]
LEXICOGRAPHERS’ USE OF DATA
• Candidate lists for each TL are imported into a spreadsheet.
• As our current lexicographic layout groups WoCs on the basis of their function and syntactic configuration, lexicographers can scroll candidate lists or filter them to observe and evaluate only candidate WoCs corresponding to specific POS patterns and/or syntactic relations.
• Candidates considered valid WoCs are manually selected and edited before being recorded in the relevant part of the lexicographic record.
LEXICOGRAPHERS’ EVALUATION - 1
(“highly impressionistic feedback from our lexicographers”)
• LL ranking is generally helpful, as most higher-ranking candidates represent (or contain, or suggest) proper WoCs which deserve inclusion in the dictionary.
  • However, it is difficult to set thresholds, since WoCs that lexicographers would intuitively include in the entry also appear in the middle and lower parts of the ranking.
• POS-based data are more useful for compiling the entries for nominal and adjectival TLs, whereas syntax-based data would be more helpful for verbal TLs.
  • No systematic evidence provided.
AUTOMATIC EVALUATION - 1
• We tested and compared the performance of the two extraction methods using an existing Italian combinatory dictionary as a benchmark (25 TLs).
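One way to picture this kind of benchmark comparison is to score each ranked candidate list against the set of combinations recorded in the dictionary. The metrics below (precision and recall at a rank cut-off) are an assumption chosen for illustration; the entries, candidate lists and the function name `precision_recall_at_k` are invented and do not reproduce the project's data or results.

```python
def precision_recall_at_k(ranked, gold, k):
    """Precision and recall of the top-k ranked candidates against a gold set."""
    top = ranked[:k]
    hits = sum(1 for cand in top if cand in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical benchmark entry for one TL, plus two hypothetical ranked lists.
gold = {"guerra mondiale", "guerra civile", "dichiarare guerra"}
pos_ranked = ["guerra mondiale", "guerra civile", "la guerra", "guerra in"]
syn_ranked = ["dichiarare guerra", "guerra scoppiare", "guerra mondiale", "fare guerra"]

for name, ranked in (("POS-based", pos_ranked), ("syntax-based", syn_ranked)):
    for k in (2, 4):
        precision, recall = precision_recall_at_k(ranked, gold, k)
        print(name, k, round(precision, 2), round(recall, 2))
```

Computing the scores at several cut-offs also shows how much is lost by truncating a list, which bears on the lexicographers' remark that valid WoCs turn up in the middle and lower parts of the ranking.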