Collocations and Terminology Vasileios Hatzivassiloglou University of Texas at Dallas
Jan 05, 2016

Page 1: Collocations and Terminology

Collocations and Terminology

Vasileios Hatzivassiloglou

University of Texas at Dallas

Page 2: Collocations and Terminology

Collocations

• Frank Smadja, “Retrieving Collocations from Text”, Computational Linguistics, 1993

• Recurrent combinations of words that co-occur more often than chance, often with non-compositional meaning

• Technical and non-technical

Page 3: Collocations and Terminology

Examples of collocations

• The Dow Jones average of industrials

• The Dow average

• The Dow industrials

• *The Jones industrials

• The Dow Jones industrial

• *The industrial Dow

• *The Dow industrial

Page 4: Collocations and Terminology

Collocation properties

• Arbitrary (dialect dependent)
  – ride a bike, set the table

• Domain dependent
  – dry suit, wet suit

• Recurrent

• Cohesive
  – Part of a collocation primes for the rest

Page 5: Collocations and Terminology

Applications

• Lexicography

• Grammatical restrictions (compare with/to but associate with)

• Generation

• Translation

Page 6: Collocations and Terminology

Types of collocations

• Predicative relations
  – make a decision, hostile takeover
  – flexible (syntactic variability, intervening words)

• Rigid word groups
  – over the counter market

• Phrases with open slots
  – fluency in a domain

Page 7: Collocations and Terminology

Issues in finding collocations

• Possibly more than two words
  – Need measure that extends beyond the binary case

• Possibly intervening words

• Possibly morphological and syntactic variation

• Semantic constraints (cf. doctors-dentists and doctors-hospitals)

Page 8: Collocations and Terminology

Xtract stage one

• For a given word, find all collocates at positions -5 to +5

• Three criteria:
  – strength (normalized frequency); 95% rejection vs. expected 68% under normal distribution
  – position histogram must not be flat
  – select peak from histogram
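
The three stage-one criteria can be sketched as follows. This is a minimal illustration and not Smadja's implementation: the function name `stage_one`, the threshold names `k0` (strength) and `u0` (histogram spread), and their default values are assumptions, and peak selection is simplified to taking the most frequent offset.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev, pvariance

def stage_one(sentences, target, window=5, k0=1.0, u0=10.0):
    """Return (collocate, peak_offset) pairs for `target`.

    sentences: list of token lists. k0 and u0 are illustrative thresholds.
    """
    # collocate -> histogram of relative positions (-window..+window, not 0)
    hist = defaultdict(Counter)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            for off in range(-window, window + 1):
                j = i + off
                if off != 0 and 0 <= j < len(tokens):
                    hist[tokens[j]][off] += 1

    freqs = {w: sum(c.values()) for w, c in hist.items()}
    if len(freqs) < 2:
        return []
    totals = list(freqs.values())
    mu, sigma = mean(totals), pstdev(totals)
    if sigma == 0:
        return []  # all collocates equally frequent: nothing stands out

    offsets = [o for o in range(-window, window + 1) if o != 0]
    results = []
    for w, counts in hist.items():
        strength = (freqs[w] - mu) / sigma                # normalized frequency
        spread = pvariance([counts[o] for o in offsets])  # flat histogram -> small spread
        if strength >= k0 and spread >= u0:
            peak = max(offsets, key=lambda o: counts[o])  # simplified peak selection
            results.append((w, peak))
    return results
```

For example, with twelve sentences containing "dow jones" and a handful of one-off neighbors of "jones", only "dow" at offset -1 passes both filters.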

Page 9: Collocations and Terminology

Xtract stage two

• Start from word pairs

• Look at each position in between, to the left, and to the right

• Keep words that appear very often

• If that fails, keep parts of speech that satisfy this criterion
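
A minimal sketch of the stage-two idea, assuming the concordance lines for a word pair have already been aligned so that column i holds the same relative position in every line. The function name `stage_two`, the `(word, tag)` input format, the `threshold` value, and the `"*"` wildcard are all assumptions made for illustration.

```python
from collections import Counter

def stage_two(contexts, threshold=0.75):
    """Build a pattern from aligned contexts of a word pair.

    contexts: equally long lists of (word, pos_tag) tuples.
    At each position keep the word if it is frequent enough, otherwise
    fall back to the part of speech, otherwise emit a wildcard.
    """
    n = len(contexts)
    pattern = []
    for column in zip(*contexts):
        word, word_count = Counter(w for w, _ in column).most_common(1)[0]
        tag, tag_count = Counter(t for _, t in column).most_common(1)[0]
        if word_count / n >= threshold:
            pattern.append(word)
        elif tag_count / n >= threshold:
            pattern.append(tag)
        else:
            pattern.append("*")
    return pattern
```

Given four contexts of "dow jones average" followed by different past-tense verbs, the last slot would be filled by the verb tag instead of any single verb.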

Page 10: Collocations and Terminology

Xtract stage three

• Applied to pairs of words

• Requires (partial) parsing

• Examines the syntactic relationship between words and keeps those pairs with consistent relationships (e.g., verb-object)
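
The consistency filter of stage three can be sketched as below, assuming a parser has already labeled each occurrence of a pair with a relation. The input format, the function name `stage_three`, the relation labels, and the `threshold` value are hypothetical; the parsing step itself is not shown.

```python
from collections import Counter, defaultdict

def stage_three(observations, threshold=0.75):
    """Keep word pairs whose occurrences mostly share one syntactic relation.

    observations: iterable of (word1, word2, relation) triples, one per
    parsed occurrence of the pair.
    Returns {(word1, word2): dominant_relation} for the pairs that pass.
    """
    by_pair = defaultdict(Counter)
    for w1, w2, rel in observations:
        by_pair[(w1, w2)][rel] += 1

    kept = {}
    for pair, rels in by_pair.items():
        rel, count = rels.most_common(1)[0]
        if count / sum(rels.values()) >= threshold:
            kept[pair] = rel
    return kept
```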

Page 11: Collocations and Terminology

Evaluation

• Ask lexicographer to evaluate output

• 40% precision after stages one and two

• 80% precision after stage three

• 94% conditional recall

Page 12: Collocations and Terminology

Terminology

• Béatrice Daille, “Study and Implementation of Combined Techniques for Automatic Extraction of Terminology”, ACL Balancing Act workshop, 1994

• Terms refer to concepts

• Terms key for populating a domain ontology

• Terms are typically nominal compounds of certain structure, e.g., NN, N of N

Page 13: Collocations and Terminology

Defining terms

• Unique reference

• Unique translation

• Term extension by
  – modification (e.g., addition of an adjective)
  – substitution
  – extension of structure
  – coordination

Page 14: Collocations and Terminology

Algorithm

• Apply syntactic constraints to match pairs of words in a candidate term

• Filter by application of an association measure

• Measures examined: pointwise mutual information, Φ² (chi-square), log-likelihood ratio
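
For reference, the three measures can be computed from a 2×2 contingency table of a candidate pair (w1, w2). The sketch below uses the standard textbook definitions, with the log-likelihood ratio in Dunning's G² form (including the factor of 2), which may differ by a constant from the formulation in the paper. The function name and the cell names `a`, `b`, `c`, `d` are assumptions, and the code assumes no row or column total is zero.

```python
import math

def association_scores(a, b, c, d):
    """Association measures for a word pair from a 2x2 contingency table.

    a: count of (w1, w2)        b: count of (w1, not w2)
    c: count of (not w1, w2)    d: count of (not w1, not w2)
    """
    n = a + b + c + d
    r1, r2 = a + b, c + d   # row totals
    c1, c2 = a + c, b + d   # column totals

    # Pointwise mutual information: log2 of observed vs. expected joint probability
    pmi = math.log2(a * n / (r1 * c1)) if a > 0 else float("-inf")

    # Phi-squared coefficient (chi-square statistic divided by n)
    phi2 = (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)

    # Log-likelihood ratio: 2 * sum of observed * ln(observed / expected)
    def term(obs, exp):
        return obs * math.log(obs / exp) if obs > 0 else 0.0

    llr = 2 * (term(a, r1 * c1 / n) + term(b, r1 * c2 / n)
               + term(c, r2 * c1 / n) + term(d, r2 * c2 / n))
    return {"pmi": pmi, "phi2": phi2, "llr": llr}
```

A table with independent words (all four cells equal) scores zero on all three measures, while a perfectly associated pair (b = c = 0) reaches Φ² = 1.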

Page 15: Collocations and Terminology

Observations

• Compare with reference list

• Frequency a strong predictor

• Log-likelihood ratio works best

• Additional criteria:
  – diversity of the distribution of each word
  – distance between the two words (determines flexibility but not term status)

Page 16: Collocations and Terminology

Justeson and Katz

• Justeson and Katz, “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text”, Natural Language Engineering, 1995.

Page 17: Collocations and Terminology

Analysis

• Examined association measures

• Well-known problems:
  – eliminating general-language constructs (e.g., collocations)
  – what to do with single word terms?

Page 18: Collocations and Terminology

Observations

• Frequency works well

• But a stronger predictor is P(k>1) compared to P(k≥1) in the same document, where k is the number of times the candidate occurs in that document

• Use syntactic patterns to propose terms, then check if they reappear in the same document

• Require this across multiple documents
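
The propose-then-check procedure can be sketched as follows. The tag pattern is the noun-phrase filter usually quoted from Justeson and Katz, ((A|N)+ | (A|N)*(N P)?(A|N)*) N, over single-letter tags (A = adjective, N = noun, P = preposition). The input format, the function names, the `max_len` cap, and the `min_docs` value are assumptions for illustration.

```python
import re
from collections import Counter

# One letter per token: A = adjective, N = noun, P = preposition, O = other
PATTERN = re.compile(r"((A|N)+|(A|N)*(NP)?(A|N)*)N")

def candidate_terms(doc, max_len=4):
    """Count multi-word spans in one document whose tag sequence fits PATTERN.

    doc: list of (word, tag) tuples with single-letter tags.
    """
    counts = Counter()
    for i in range(len(doc)):
        for j in range(i + 2, min(i + max_len, len(doc)) + 1):
            span = doc[i:j]
            tags = "".join(t for _, t in span)
            if PATTERN.fullmatch(tags):
                counts[" ".join(w for w, _ in span)] += 1
    return counts

def repeated_terms(docs, min_docs=2):
    """Keep candidates that occur more than once (k > 1) in at least min_docs documents."""
    doc_hits = Counter()
    for doc in docs:
        for term, k in candidate_terms(doc).items():
            if k > 1:
                doc_hits[term] += 1
    return {t for t, n in doc_hits.items() if n >= min_docs}
```

Note that nested candidates (for example both "processing unit" and "central processing unit") are all returned; choosing among them is left out of this sketch.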

Page 19: Collocations and Terminology

Term Expansion

• Jacquemin, Klavans, and Tzoukermann, “Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax”, ACL 1997.

• Need to expand a given list of terms, especially for scientific domains

Page 20: Collocations and Terminology

Term variation

• Syntactic (same words, different structure)

• Morphosyntactic (derivational forms of words)

• Semantic (synonyms are used)

• In IR, normalization through stemming and removal of stop words
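
To illustrate the IR-style normalization mentioned in the last bullet, here is a toy sketch that maps two syntactic variants of a term to the same key. The stop-word list and the suffix stripper are deliberately crude stand-ins (assumptions, not a real stemmer such as Porter's), so they will mishandle many words.

```python
STOP_WORDS = {"of", "the", "a", "an", "in", "for", "and"}

def crude_stem(word):
    """Strip one common suffix; a toy substitute for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(term):
    """Lowercase, drop stop words, stem, and ignore word order."""
    return frozenset(
        crude_stem(w) for w in term.lower().split() if w not in STOP_WORDS
    )
```

Under this normalization "expansion of terms" and "term expansion" share one key, which is why such variants conflate in retrieval, whereas morphosyntactic and semantic variants generally need more than stemming.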

Page 21: Collocations and Terminology

Approach

• Process corpus matching new candidate terms to old ones via unification

• Matching based on
  – inflectional morphology (transducer)
  – derivational morphology (rule-based)
  – syntactic transformations
  – additions of words

Page 22: Collocations and Terminology

Results

• Manual inspection of several thousand proposed terms

• Precision of 89%

• Effectiveness in indexing increases by a factor of three when using the variants (precision/recall from 99.7/72 to 97/93)