Jerid Francom (Wake Forest University) Adam Ussishkin (University of Arizona)

How specialized are specialized corpora?

Behavioral evaluation of corpus representativeness for Maltese

Jerid Francom (Wake Forest University)

Adam Ussishkin (University of Arizona)

Amy LaCross (University of Arizona)

19 May 2010: O7 (Evaluation of Methodologies), 14.45-15.05

LREC 2010, Mediterranean Conference Center

Valletta, Malta


Acknowledgements

Generous contribution of data to this project by Dr. Albert Gatt (Univ. of Malta)

Statistical expertise from Jeff Berry (Univ. of Arizona)

Funding from the United States National Science Foundation (BCS-0715500) to Adam Ussishkin


Goals

Issue: For many languages, the quality of available textual data is less than ideal for corpus creation in light of standard sampling practices.

Propose: Behavioral data can provide a valuable metric to evaluate corpus resources otherwise considered ‘specialized’.

Case: PsyCoL Maltese Lexical Corpus

Contribute: Novel, cross-discipline metric for evaluating the quality of language resources


Sparse coverage

Most of the world’s 5,000-7,000 languages have no corpus resources

Efforts to fill the gap often exploit the availability of language data on the web

An Crúbadán project, 446 languages (Scannell, 2007)

McEnery et al. (2006) survey of recent work


Sparse coverage

Low-density languages (Borin, 2009)

Languages for which resources exist, but in limited quantity/quality

Limited access to print and/or electronic data

Available primary data may be less-than-representative

Weakens assurance that results from low-density language resources are credible


Corpus representativeness

What is a ‘representative corpus’?

An externally valid sample of language use

A sample that approximates what the language is.

Full range of structural types (language units)

What are the characteristics of such a sample?

Genre/register

Modality


An issue for low-density languages

Standard practice to achieve representativeness

Apply rigorous sampling methods

Collect large amounts of data

Problematic for low-density languages: a representativeness bottleneck

Lack large amounts of data

Available data is often limited in register, modality, etc.

Corpus resources are typically specialized


Assessing representativeness

How do we know whether we have a ‘representative’ sample?

We don’t, in an absolute sense.

Faith in survey sampling practices

Casting the net far and wide

Can we be assured we don’t have a representative sample?

Not exactly.

It is logically possible that smaller, less diverse samples are externally valid for linguistic units that appear in the collection.


Proposal

Need for an external metric.

The current proposal suggests that findings from behavioral experimentation can provide a valuable metric to evaluate corpus resources.

Exploit the correlation between derived frequency counts and elicited behavioral reactions

Behavioral data and adjusted frequency (Gries 2008; 2009)

Of particular importance for specialized corpora


Behavioral findings

Well-known robust effects for relative frequency in language processing

Word naming RTs (e.g., Forster & Chambers, 1973)

Lexical decision RTs (e.g., Carroll & White, 1973)

Sentence reading RTs (e.g., MacDonald, 1994)

Word familiarity ratings (e.g., Gernsbacher 1984)

Log frequency is a good predictor of behavior.


Approach

Evaluating corpus representativeness through behavioral assessment

1. Derive frequency counts from a specialized corpus

2. Elicit behavioral response of participants from target population

3. Assess correlation strength: how well do behavioral responses correlate with corpus measures?


Case study and predictions

Case study

Calculate: log frequency of subset of items in a Maltese lexical corpus

Measure: subjective word familiarity ratings of native speakers of Maltese

Assess: relative distribution of the measures

Prediction

Congruence between relative distributions indicates a representative sample of the language

Mismatches underscore potential sampling issues


The specialized corpus

PsyCoL Maltese Lexical Corpus (PMLC) (Francom, Ussishkin, and Woudstra, 2009)

http://psycol.sbs.arizona.edu/resources/

Online Maltese newspapers, 1998-1999; 2005-2007

Data from the PsyCoL lab (59.8%) and Dr. Albert Gatt (40.2%)

3,323,325 total tokens (53,000 unique)

Type/token ratio of 1.6%

Typical for low-density languages

A large corpus, but still relatively small (cf. British National Corpus, 100+ million; Corpus of Contemporary American English, 400+ million)

Limited in register, modality
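The ratio reported on this slide can be checked from the two counts given with it. A minimal sketch, using only the figures shown here (53,000 unique forms, 3,323,325 tokens); the function name is illustrative and not part of the PMLC tooling:

```python
def type_token_ratio(n_types: int, n_tokens: int) -> float:
    """Return unique word forms (types) divided by running words (tokens)."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return n_types / n_tokens


# Counts as reported for the PMLC on this slide.
pmlc_ttr = type_token_ratio(53_000, 3_323_325)
print(f"{pmlc_ttr:.1%}")  # prints 1.6%
```

The 1.6% figure is therefore the number of types divided by the number of tokens.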


Linguistic variable to quantify

Because there is little previous quantitative research on Maltese, the empirical focus of this investigation was narrowed to:

Semitic-origin verbs/binyanim (also known as forms)

Semitic-origin verbs in Maltese conform to the classical Semitic binyan system (categories based on morphosyntactic and phonological properties)

Question: How does frequency as measured in our corpus correlate with behavior?

Can the binyan categories be exploited to provide correlations?


Maltese binyanim

Binyan  Function                                   Prosodic shape  Example
1       basic active (transitive or intransitive)  CVCVC           kiser ‘he broke’
2       intensive of 1, transitive of 1            CVCCVC          kisser ‘he smashed’
3       transitive of 1                            CV:CVC          bi:rek ‘he blessed’
5       passive of 2, reflexive of 2               tCVCCVC         tkisser ‘it got smashed’
6       passive of 2, reflexive of 3               tCV:CVC         tki:teb ‘he corresponded’
7       passive of 1, reflexive of 1               nCVCVC          nkiser ‘it got broken’
8       passive of 1, reflexive of 1               CtVCVC          ftakar ‘he remembered’
9       inchoative, acquisition of a quality       CCV:C           hma:r ‘he blushed’
10      originally inchoative                      stVCCVC         stenbah ‘to wake’
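The prosodic shapes in the table can be read as patterns over consonants (C) and vowels (V), with t, n, and st as literal affixes. A minimal sketch of that reading, assuming the simplified transcription used on this slide (long vowels written with ":") and a rough split of Latin letters into consonants and vowels. The names `TEMPLATES` and `candidate_binyanim` are illustrative only; real Maltese orthography (digraphs such as għ and ie, letters such as ħ and ż) would need a richer alphabet than the one assumed here.

```python
import re

# Assumed, simplified letter classes; not a full Maltese alphabet.
CONSONANT = "[bcdfghjklmnpqrstvwxz]"
VOWEL = "[aeiou]"

# Prosodic shapes copied from the table, keyed by binyan number.
TEMPLATES = {
    1: "CVCVC", 2: "CVCCVC", 3: "CV:CVC", 5: "tCVCCVC", 6: "tCV:CVC",
    7: "nCVCVC", 8: "CtVCVC", 9: "CCV:C", 10: "stVCCVC",
}


def template_to_regex(template):
    """Turn a shape such as 'tCVCCVC' into a compiled regular expression."""
    parts = []
    for symbol in template:
        if symbol == "C":
            parts.append(CONSONANT)
        elif symbol == "V":
            parts.append(VOWEL)
        else:  # literal affix letter or the length mark ':'
            parts.append(re.escape(symbol))
    return re.compile("".join(parts))


PATTERNS = {binyan: template_to_regex(t) for binyan, t in TEMPLATES.items()}


def candidate_binyanim(form):
    """Return every binyan whose prosodic shape matches the whole form."""
    return [b for b, pattern in PATTERNS.items() if pattern.fullmatch(form)]


for form in ["kiser", "kisser", "tkisser", "ftakar"]:
    print(form, candidate_binyanim(form))
```

The function returns a list rather than a single label because shape alone can be ambiguous: a string beginning with "nt" can fit both the binyan 7 and the binyan 8 templates, so a lexical source is still needed to settle membership.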


A behavioral task: word familiarity

• We devised three tests to measure corpus representativeness

• Each test measured a different aspect of our corpus counts and our behavioral task.

• The behavioral task involved native speakers of Maltese, who gave subjective word familiarity ratings for all Semitic-origin Maltese verbs taken from Aquilina (2000) (n=1536).

Scale from very unfamiliar to very familiar

Shown to be a reliable predictor of lexical processing (Connine et al. 1990)


Word familiarity experiment

Participants

107 native speakers of Maltese

Task

Subjective word familiarity task, online


Measuring frequency in the corpus

• We then used the PMLC to calculate word frequency measures for the same set of verbs.

• Using regular expression-enabled searching, we counted token frequency for all verbs occurring in the PMLC (n=447).

• Frequency was then encoded as a log-based measure.
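The counting and log-encoding steps can be sketched as follows. This is a simplified illustration, not the search procedure actually used on the PMLC: it counts exact whole-word matches only, whereas the regular expressions used for the corpus may also have covered inflected forms. The base of the logarithm is not stated on the slide, so base 10 is an assumption, and the toy text is invented from forms in the binyan table.

```python
import math
import re
from collections import Counter


def count_tokens(text, targets):
    """Count whole-word occurrences of each (lowercase) target form in text."""
    counts = Counter()
    for match in re.finditer(r"\w+", text.lower()):
        word = match.group()
        if word in targets:
            counts[word] += 1
    return counts


def log_frequency(counts):
    """Map each attested form to log10 of its token count (base 10 is assumed)."""
    return {word: math.log10(n) for word, n in counts.items()}


# Invented toy text; not real corpus data.
toy_text = "kiser kisser kiser nkiser kiser kisser"
counts = count_tokens(toy_text, {"kiser", "kisser", "nkiser", "ftakar"})
log_freq = log_frequency(counts)
print(counts)
print(log_freq)
```

Forms that never occur (here "ftakar") receive no entry, which mirrors the restriction of the frequency measure to verbs attested in the corpus. In Python 3, `\w` matches Unicode letters, so characters such as ħ and ż stay inside a token, but hyphens and apostrophes split tokens.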


Three tests

• Next, we conducted three distinct statistical analyses to assess the correlation between these corpus measures and the results of our word familiarity experiment:

• 1. Statistical regression between corpus log frequency and behavioral data.

• 2. Binning items by frequency to determine whether any correlation is found.

• 3. Binning items by binyan to determine whether any correlation is found.


1. Statistical regression

• We found a weak correlation (r=.14). At best, this shows a trend toward correlation, and it suggests that, on these data, familiarity ratings likely do not predict word frequency.
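For readers who want to see what the reported coefficient measures, here is a minimal sketch of a correlation between log frequency and familiarity. It assumes the reported r is a Pearson coefficient, which the slide does not state; the data values are invented for illustration and are not the study's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented values: one log frequency and one mean familiarity rating per verb.
log_freq = [0.0, 0.3, 0.5, 1.2, 2.0]
familiarity = [4.1, 3.8, 4.5, 4.4, 4.9]
r = pearson_r(log_freq, familiarity)
print(round(r, 2))

# Share of variance accounted for by the coefficient reported on the slide.
print(round(0.14 ** 2, 4))  # prints 0.0196
```

Under that assumption, r=.14 corresponds to roughly 2% of shared variance, which is why the result reads as a trend at best.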


2. Binning by frequency

• Binning into two bands shows a correlation.

• Binning into three bands also shows a correlation.


2. Binning by frequency

• An LMER (linear mixed-effects regression) analysis of each binning (2 groups and 3 groups) shows significance:

• All contrasts for two-bin intervals (High/Low=4.2, t=2.0) and three-bin intervals (High/Mid=7.1, t=3.9; Mid/Low=7.0, t=2.2) were significant.

• These results support the hypothesis that behavior and corpus measures are correlated.
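The binning step itself can be illustrated with a short sketch. It shows only the descriptive part (splitting items into frequency bands and averaging familiarity within each band), not the LMER significance test reported above. Equal-sized bands are an assumption, since the slides do not say how the bands were defined, and the data are invented.

```python
from statistics import mean


def mean_familiarity_by_band(items, n_bands):
    """Sort (log_freq, familiarity) pairs by log_freq, split them into
    n_bands nearly equal bands (low to high), and return each band's mean
    familiarity."""
    if n_bands < 1 or n_bands > len(items):
        raise ValueError("n_bands must be between 1 and the number of items")
    ranked = sorted(items, key=lambda pair: pair[0])
    size, extra = divmod(len(ranked), n_bands)
    means = []
    start = 0
    for band in range(n_bands):
        end = start + size + (1 if band < extra else 0)
        means.append(mean([fam for _, fam in ranked[start:end]]))
        start = end
    return means


# Invented (log frequency, familiarity) pairs; not the study's data.
items = [(0.0, 3.0), (0.3, 3.4), (0.5, 3.6), (1.2, 4.0), (1.8, 4.4), (2.0, 4.6)]
two_bands = mean_familiarity_by_band(items, 2)    # [low, high]
three_bands = mean_familiarity_by_band(items, 3)  # [low, mid, high]
print(two_bands)
print(three_bands)
```

If band means rise from low to high, as in this toy data, the pattern is in the direction the hypothesis predicts; whether the differences are reliable is what the mixed-effects model tests.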


3. Binning by binyan

• Earlier and ongoing work (Frost et al. 1997, 1998, 2000; Ussishkin et al. in progress) shows binyan effects in Hebrew in both visual and auditory modalities, so Maltese could be expected to show similar effects.

• Our goal here is to measure whether verbs, when grouped by binyan, show a correlation between word frequency measures and word familiarity ratings.


3. Binning by binyan

• Only binyanim 1, 2, 5, and 7 were analyzed; binyanim 3, 6, 8, 9, and 10 were excluded from the analyses because they are too sparsely populated.


3. Binning by binyan

• Word frequency results: significant contrasts found between Binyanim 7 and 2 (β=.54, t=6.0) and between Binyanim 7 and 5 (β=1.15, t=-2.2).

• Word familiarity results: no significant contrasts found.

Figure panels: Binyan by word frequency; Binyan by word familiarity


General assessment

• The results show that verb frequency distributions in the PMLC pattern to some degree with the psychological representations of native speakers (the representative population)

• On the surface, this suggests the PMLC is on the right track, but it also underscores the specialized nature of the corpus

• However, a response bias in the word familiarity task may play a part in the mismatches

• A ceiling effect may have contributed to lower correlation scores


General assessment

• Reasons to be optimistic about the verb distributions in the PMLC:

• Distribution of verb count/frequency (Zipf, 1949)

• Distribution of word length/frequency (Li, 1992)

• Both measures trend as expected for representative samples
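One common way to inspect a Zipf-like distribution is to fit a line to log frequency against log rank; an idealized Zipfian distribution gives a slope near -1. The slides do not say how the distributions were examined, so the sketch below is an assumed procedure, shown on invented, idealized counts rather than PMLC data.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) on log(rank), with ranks
    assigned by descending frequency."""
    ranked = sorted(frequencies, reverse=True)
    if len(ranked) < 2 or ranked[-1] <= 0:
        raise ValueError("need at least two strictly positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Idealized counts following frequency = 1000 / rank (invented).
ideal_counts = [1000 / rank for rank in range(1, 51)]
slope = loglog_slope(ideal_counts)
print(round(slope, 3))  # prints -1.0
```

A slope far from -1 on real counts would not by itself show that a corpus is unrepresentative, but it is a quick sanity check on the shape of the distribution.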


Conclusion

• Novel methodology: direct comparison between corpus resource and behavior.

• Highlighting a robust effect from psycholinguistics (frequency of linguistic units predicts behavior).

• We predicted that the reverse could also hold (behavior predicting corpus frequency); this provides a way to validate low-density language (LDL) resources.

• This approach encourages cross-discipline endeavors for resource development and theoretical investigation.


• Thank you very much!

• Grazzi ħafna!
