REFERENTIAL CHOICE: FACTORS AND MODELING Andrej A. Kibrik, Mariya V. Khudyakova, Grigoriy B. Dobrov, and Anastasia S. Linnik [email protected] Night Whites.

1

REFERENTIAL CHOICE:

FACTORS AND MODELING

Andrej A. Kibrik, Mariya V. Khudyakova, Grigoriy B. Dobrov, and Anastasia S. Linnik

[email protected]

Night Whites SPb, February 28, 2014


2

Referential choice in discourse

When a speaker needs to mention (or refer to) a specific, definite referent, s/he chooses between several options, including:

Full noun phrase
• Proper name (e.g. Peter)
• Description = common noun (with or without modifiers) (e.g. the tzar)
• Mix: Peter the Great

Reduced NP, particularly a third person pronoun (e.g. he)

3

Example

The Victorian house that Ms. Johnson is inspecting has been deemed unsafe by town officials. But she asks a workman toting the bricks from the lawn to give her a boost through an open first-floor window. Once inside, she spends nearly four hours Ø measuring and diagramming each room in the 80-year-old house, Ø gathering enough information to Ø estimate what it would cost to rebuild it. She snaps photos of the buckled floors and the plaster that has fallen away from the walls.

Types of referential expressions marked in the example: description, proper name, pronoun, zero (Ø)

4

Research question

How is referential choice made?

5

Why is this question important?

Reference is among the most basic cognitive operations performed by language users

Reference constitutes a lion’s share of all information in natural communication

Consider text manipulation according to the method of Biber et al. 1999: 230-232

6

Referential expressions marked in green

The Victorian house that Ms. Johnson is inspecting has been deemed unsafe by town officials. But she asks a workman toting the bricks from the lawn to give her a boost through an open first-floor window.

7

Referential expressions removed

The Victorian house that Ms. Johnson is inspecting has been deemed unsafe by town officials. But she asks a workman toting the bricks from the lawn to give her a boost through an open first-floor window.

8

Referential expressions kept

The Victorian house that Ms. Johnson is inspecting has been deemed unsafe by town officials. But she asks a workman toting the bricks from the lawn to give her a boost through an open first-floor window.

9

Types of referential devices: levels of granularity

We mostly concentrate on the two upper levels in this hierarchy

REG tradition: most attention to varieties of descriptive full NPs

10

Multi-factorial character of referential choice

Multiple factors of referential choice

Properties of the discourse context
• Distance to antecedent: along the linear discourse structure (Givón); along the hierarchical discourse structure (Fox, Kibrik)
• Antecedent role (Centering theory)

Properties of the referent
• Referent animacy (Dahl)
• Protagonisthood (Grimes)

...

11

Cognitive multi-factorial model of referential choice

Factors of referential choice (discourse context, referent’s properties) → Referent activation in working memory → Referential choice

12

Rhetorical distance

Distance along the hierarchical discourse structure between the current point in discourse, where referential choice is to be made, and the antecedent

Measured in elementary discourse units, roughly equaling clauses

Rhetorical structure theory by Mann and Thompson (RST)

Very important factor

RST Discourse Treebank corpus (Marcu et al.)
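As a rough illustration of why hierarchical distance can differ from linear distance, the sketch below counts steps between two elementary discourse units (EDUs) in a toy structure. The parent map, the unit names, and the edge-counting definition are assumptions made for illustration only; the measure actually used with the RST Discourse Treebank may be computed differently.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def rhetorical_distance(anaphor_edu, antecedent_edu, parent):
    """Count edges on the tree path between two EDUs (illustrative definition)."""
    depth_of = {n: d for d, n in enumerate(path_to_root(antecedent_edu, parent))}
    for steps_up, node in enumerate(path_to_root(anaphor_edu, parent)):
        if node in depth_of:
            return steps_up + depth_of[node]
    raise ValueError("units are not in the same tree")


# Hypothetical toy structure: each EDU points to the unit it attaches to.
# edu2 and edu3 form a digression; edu4 returns to edu1.
parent = {"edu2": "edu1", "edu3": "edu2", "edu4": "edu1"}

linear_distance = 4 - 1  # edu4 follows edu1 by three units in the text
hierarchical_distance = rhetorical_distance("edu4", "edu1", parent)
print(linear_distance, hierarchical_distance)
```

In this toy case the antecedent is three units back in linear order but only one step away in the hierarchy, which is the kind of situation where a reduced form may remain available.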

13

Example of a rhetorical graph from RST Discourse Treebank

14

RefRhet and MoRA

RST Discourse Treebank + our annotation = RefRhet corpus

Subcorpus RefRhet 3 (2013-2014)

Annotation scheme MoRA (Moscow Referential Annotation)

15

RefRhet 3

64 texts
6294 markables
1852 anaphor-antecedent pairs
• 475 pronouns
• 1377 full NPs (706 descriptions, 671 proper names)

16

Candidate factors of ref. choice

Some values are drawn from MoRA annotation

Others are computed automatically

Factor-predicted variable

Discourse context

17

Windows of the MMAX2 program

18

Some properties of the MoRA scheme

Wide range of activation factors and their values
• E.g. multiple values of the “grammatical role” factor

Annotation of groups: complex markables serving as antecedents
• and-coordinate
• or-coordinate
• prepositional (children with their parents)
• discontinuous

19

A discontinuous group

20

Tasks for machine learning

Candidate factors: All potential parameters implemented in corpus annotation

Factor-predicted variable: Form of referential expression (np_form)

Two-way task: Full NP vs. pronoun

Three-way task: Definite description vs. proper name vs. pronoun

Accuracy maximization: Ratio of correct predictions to the overall number of instances
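The accuracy measure, and the majority baseline it is later compared against, can be sketched in a few lines. This is a minimal illustration rather than the pipeline used in the study; the label strings `full_np` and `pronoun` are hypothetical values of `np_form`, and the counts are the RefRhet 3 figures for the two-way task (1377 full NPs, 475 pronouns).

```python
from collections import Counter


def accuracy(predicted, actual):
    """Ratio of correct predictions to the overall number of instances."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def majority_baseline(actual):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_label, count = Counter(actual).most_common(1)[0]
    return most_common_label, count / len(actual)


# Two-way task with the RefRhet 3 counts (label names are hypothetical).
np_form = ["full_np"] * 1377 + ["pronoun"] * 475
label, baseline = majority_baseline(np_form)
print(label, round(baseline, 3))
```

Any model worth reporting on the two-way task has to beat the share of the most frequent option, here full NPs.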

21

Machine learning methods (Weka, a data mining system)

Logical algorithms
• Decision trees (C4.5)
• Decision rules (JRip)

Logistic regression

Compositions
• Boosting
• Bagging

Quality control – the cross-validation method
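The study itself relied on Weka; the following standard-library sketch only illustrates the idea of k-fold cross-validation. The function names, the toy data, and the trivial majority-class learner (a stand-in, not C4.5, JRip, or logistic regression) are all assumptions for illustration.

```python
import random
from collections import Counter


def k_fold_indices(n_instances, k, seed=0):
    """Shuffle instance indices and split them into k nearly equal folds."""
    if not 1 < k <= n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(instances, labels, train, predict, k=10, seed=0):
    """Return the mean held-out accuracy over k folds."""
    scores = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held_out_set]
        model = train([instances[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(predict(model, instances[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


# Stand-in learner: always predicts the most frequent training label.
def train_majority(train_instances, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, instance):
    return model


# Toy data with hypothetical labels; instances carry no features here.
toy_labels = ["full_np"] * 30 + ["pronoun"] * 10
toy_instances = [None] * len(toy_labels)
score = cross_validate(toy_instances, toy_labels, train_majority, predict_majority)
print(score)
```

Each instance is held out exactly once, so the reported accuracy is always measured on data the model did not see during training.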

22

Results of machine learning on RefRhet 3 and MoRA

Algorithm                                             Accuracy, two-way   Accuracy, two-way (2012)   Accuracy, three-way
Baseline (frequency of the most common ref. option)   74.4%               74.4%                      37.9%
Logistic regression                                   87.2%               –                          71.3%
Decision tree algorithm                               93.7%               86.1%                      74.0%
Bagging                                               89.4%               88.0%                      76.1%
Boosting                                              89.5%               86.2%                      74.0%

23

Non-categorical referential choice (Kibrik 1999)

Cognitive plane: graded variable (referent activation, from min to max)

Linguistic plane: binary variable (full NP, e.g. Peter, vs. pronoun, e.g. he)

24

Non-categorical referential choice

In many instances, more than one referential option can be used

Referential choice is less than fully categorical (cf. Belz & Varges 2007, van Deemter et al. 2012: 173–179)

In the intermediate activation instances, both the original text author and the algorithm:
• more or less randomly make a categorical decision at the linguistic plane
• those decisions do not always have to coincide

Therefore, no model can predict the actual referential choice with 100% accuracy

25

Experiment: Understanding (allegedly non-categorical) referential expressions

9 texts in which the algorithms diverged in their prediction from the original referential choice

9 original texts (proper name) and 9 altered texts (pronoun), distributed between 2 experimental lists

60 participants

1 experimental question + 2 control questions

If the instances of divergence are explained by intermediate referent activation, the accuracy in experimental questions should not be lower than the accuracy in control questions


26

Experiment: results

Control questions – 84%
Questions to proper names – 84%
Questions to pronouns – 75%

If we exclude questions #2 and #5, then the accuracy for questions to pronouns is 80%, not differing significantly from control and proper name (PN) questions

In general, the algorithm diverges from the original in the places where that is acceptable, that is, where referent activation is intermediate


27

Non-categorical referential choice

Sometimes referential choice allows more than one option

A proper model of referential choice must account for this property of human speakers

Our modeling procedures actually conform to this requirement

28

Further studies

• Explore logistic regression’s ability to evaluate the certainty of prediction, and attempt to correlate that with human assessment of non-categorical referential choice as well as with the theoretical notion of intermediate referent activation
• Cheap data modeling
• Secondary referential options, such as demonstrative descriptions
• Genres and referential choice
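One way the first item could be approached is to read a logistic model's output probability as a proxy for certainty, treating values near 0.5 as candidates for intermediate referent activation. The sketch below assumes this reading; the feature names, the weights, and the 0.3–0.7 band are hypothetical and are not estimated from RefRhet.

```python
import math


def pronoun_probability(features, weights, bias):
    """Logistic model: P(pronoun) = sigmoid(bias + sum of weight * feature)."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def describe_choice(p_pronoun, band=(0.3, 0.7)):
    """Map a probability to a categorical choice plus an 'intermediate' flag."""
    choice = "pronoun" if p_pronoun >= 0.5 else "full_np"
    is_intermediate = band[0] <= p_pronoun <= band[1]
    return choice, is_intermediate


# Hypothetical weights chosen by hand for illustration.
weights = {"rhetorical_distance": -1.2, "antecedent_is_subject": 0.8}
bias = 1.0

near = {"rhetorical_distance": 1, "antecedent_is_subject": 1}
far = {"rhetorical_distance": 4, "antecedent_is_subject": 0}

for name, feats in (("near", near), ("far", far)):
    p = pronoun_probability(feats, weights, bias)
    print(name, round(p, 3), describe_choice(p))
```

Instances flagged as intermediate would be the natural ones to compare against human judgments of whether both a pronoun and a full NP are acceptable.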

29

Conclusions

• Multi-factorial approach
• Corpus large enough for machine-learning modeling
• Results of prediction close to theoretical maximum
• Account of the non-deterministic character of referential choice
• This approach can be applied to a wide range of other linguistic choices

30

Thank you for your attention
