A Statistical Approach to Anaphora Resolution


8/6/2019 A Statistical Approach to Anaphora Resolution

http://slidepdf.com/reader/full/a-statistical-approach-to-anaphora-resolution 1/10

A Statistical Approach to Anaphora Resolution

Niyu Ge, John Hale and Eugene Charniak

Dept. of Computer Science, Brown University,

[nge|jth|ec]@cs.brown.edu

Abstract

This paper presents an algorithm for identifying pronominal anaphora and two experiments based upon this algorithm. We incorporate multiple anaphora resolution factors into a statistical framework, specifically the distance between the pronoun and the proposed antecedent, gender/number/animaticity of the proposed antecedent, governing head information and noun phrase repetition. We combine them into a single probability that enables us to identify the referent. Our first experiment shows the relative contribution of each source of information and demonstrates a success rate of 82.9% for all sources combined. The second experiment investigates a method for unsupervised learning of gender/number/animaticity information. We present some experiments illustrating the accuracy of the method and note that with this information added, our pronoun resolution method achieves 84.2% accuracy.

1 Introduction

We present a statistical method for determining pronoun anaphora. This program differs from earlier work in its almost complete lack of hand-crafting, relying instead on a very small corpus of Penn Wall Street Journal Tree-bank text (Marcus et al., 1993) that has been marked with co-reference information. The first sections of this paper describe this program: the probabilistic model behind it, its implementation, and its performance.

The second half of the paper describes a method for using (portions of) the aforementioned program to learn automatically the typical gender of English words, information that is itself used in the pronoun resolution program. In particular, the scheme infers the gender of a referent from the gender of the pronouns that refer to it and selects referents using the pronoun anaphora program. We present some typical results as well as the more rigorous results of a blind evaluation of its output.

2 A Probabilistic Model

There are many factors, both syntactic and semantic, upon which a pronoun resolution system relies. (Mitkov (1997) does a detailed study on factors in anaphora resolution.) We first discuss the training features we use and then derive the probability equations from them.

The first piece of useful information we consider is the distance between the pronoun and the candidate antecedent. Obviously, the greater the distance, the lower the probability. Secondly, we look at the syntactic situation in which the pronoun finds itself. The most well-studied constraints are those involving reflexive pronouns. One classical approach to resolving pronouns in text that takes some syntactic factors into consideration is that of Hobbs (1976). This algorithm searches the parse tree in a left-to-right, breadth-first fashion that obeys the major reflexive pronoun constraints while giving a preference to antecedents that are closer to the pronoun. In resolving inter-sentential pronouns, the algorithm searches the previous sentence, again in left-to-right, breadth-first order. This implements the observed preference for subject position antecedents.
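The traversal order described above can be illustrated with a small sketch. This is not the Hobbs algorithm itself (which starts from the pronoun, walks up the tree and applies reflexive constraints); it only shows why a left-to-right, breadth-first search of a sentence proposes the subject noun phrase first. The tuple-based tree representation and the function names are assumptions made for the illustration.

```python
from collections import deque

# Toy tree encoding (an assumption): an internal node is (label, [children]),
# a leaf is (tag, "word").

def np_candidates_bfs(tree):
    """Return NP subtrees in left-to-right, breadth-first order."""
    found = []
    queue = deque([tree])
    while queue:
        label, children = queue.popleft()
        if label == "NP":
            found.append((label, children))
        if not isinstance(children, str):
            queue.extend(children)  # children are enqueued left to right
    return found

def words(tree):
    """Flatten a subtree into its list of words."""
    _, children = tree
    if isinstance(children, str):
        return [children]
    return [w for child in children for w in words(child)]

sentence = ("S", [
    ("NP", [("NNP", "Marie"), ("NNP", "Giraud")]),
    ("VP", [("VBZ", "carries"),
            ("NP", [("JJ", "historical"), ("NN", "significance")])]),
])

candidates = [" ".join(words(np)) for np in np_candidates_bfs(sentence)]
```

Here the subject "Marie Giraud" is proposed before the object "historical significance", because the subject NP sits higher and further left in the tree.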

Next, the actual words in a proposed noun-phrase antecedent give us information regarding the gender, number, and animaticity of the proposed referent. For example:

Marie Giraud carries historical significance as one of the last women to be executed in France. She became an abortionist because it enabled her to buy jam, cocoa and other war-rationed goodies.

Here it is helpful to recognize that "Marie" is probably female and thus is unlikely to be referred to by "he" or "it". Given the words in the proposed antecedent we want to find the probability that it is the referent of the pronoun in question. We collect these probabilities on the training data, which are marked with reference links. The words in the antecedent sometimes also let us test for number agreement. Generally, a singular pronoun cannot refer to a plural noun phrase, so that in resolving such a pronoun any plural candidates should be ruled out. However, a singular noun phrase can be the referent of a plural pronoun, as illustrated by the following example:

"I think if I tell Viacom I need more time, they will take 'Cosby' across the street," says the general manager of a network affiliate.

It is also useful to note the interaction between the head constituent of the pronoun p and the antecedent. For example:

A Japanese company might make television picture tubes in Japan, assemble the TV sets in Malaysia and export them to Indonesia.

Here we would compare the degree to which each possible candidate antecedent (A Japanese company, television picture tubes, Japan, TV sets, and Malaysia in this example) could serve as the direct object of "export". These probabilities give us a way to implement selectional restriction. A canonical example of selectional restriction is that of the verb "eat", which selects food as its direct object. In the case of "export" the restriction is not as clear-cut. Nevertheless it can still give us guidance on which candidates are more probable than others.

The last factor we consider is referents' mention count. Noun phrases that are mentioned repeatedly are preferred. The training corpus is marked with the number of times a referent has been mentioned up to that point in the story. Here we are concerned with the probability that a proposed antecedent is correct given that it has been repeated a certain number of times. In effect, we use this probability information to identify the topic of the segment with the belief that the topic is more likely to be referred to by a pronoun. The idea is similar to that used in the centering approach (Brennan et al., 1987) where a continued topic is the highest-ranked candidate for pronominalization.

Given the above possible sources of information, we arrive at the following equation, where $F(p)$ denotes a function from pronouns to their antecedents:

$$F(p) = \arg\max_a P(A(p) = a \mid p, h, \vec{W}, t, l, s_p, \vec{d}, \vec{M})$$

where $A(p)$ is a random variable denoting the referent of the pronoun $p$ and $a$ is a proposed antecedent. In the conditioning events, $h$ is the head constituent above $p$, $\vec{W}$ is the list of candidate antecedents to be considered, $t$ is the type of phrase of the proposed antecedent (always a noun-phrase in this study), $l$ is the type of the head constituent, $s_p$ describes the syntactic structure in which $p$ appears, $\vec{d}$ specifies the distance of each antecedent from $p$ and $\vec{M}$ is the number of times the referent is mentioned. Note that $\vec{W}$, $\vec{d}$ and $\vec{M}$ are vector quantities in which each entry corresponds to a possible antecedent. When viewed in this way, $a$ can be regarded as an index into these vectors that specifies which value is relevant to the particular choice of antecedent.

This equation is decomposed into pieces that correspond to all the above factors but are more statistically manageable. The decomposition makes use of Bayes' theorem and is based on certain independence assumptions discussed below.

$$
\begin{aligned}
P(A(p) = a \mid p, h, \vec{W}, t, l, s_p, \vec{d}, \vec{M})
&= \frac{P(a \mid \vec{M})\, P(p, h, \vec{W}, t, l, s_p, \vec{d} \mid a, \vec{M})}{P(p, h, \vec{W}, t, l, s_p, \vec{d} \mid \vec{M})} && (1) \\
&\propto P(a \mid \vec{M})\, P(p, h, \vec{W}, t, l, s_p, \vec{d} \mid a, \vec{M}) && (2) \\
&= P(a \mid \vec{M})\, P(s_p, \vec{d} \mid a, \vec{M})\, P(p, h, \vec{W}, t, l \mid a, \vec{M}, s_p, \vec{d}) && (3) \\
&= P(a \mid \vec{M})\, P(s_p, \vec{d} \mid a, \vec{M})\, P(h, t, l \mid a, \vec{M}, s_p, \vec{d})\, P(p, \vec{W} \mid a, \vec{M}, s_p, \vec{d}, h, t, l) && (4) \\
&\propto P(a \mid \vec{M})\, P(s_p, \vec{d} \mid a, \vec{M})\, P(p, \vec{W} \mid a, \vec{M}, s_p, \vec{d}, h, t, l) && (5) \\
&= P(a \mid \vec{M})\, P(s_p, \vec{d} \mid a, \vec{M})\, P(\vec{W} \mid a, \vec{M}, s_p, \vec{d}, h, t, l)\, P(p \mid a, \vec{M}, s_p, \vec{d}, h, t, l, \vec{W}) && (6) \\
&\propto P(a \mid m_a)\, P(d_H \mid a)\, P(\vec{W} \mid h, t, l, a)\, P(p \mid w_a) && (7)
\end{aligned}
$$

Equation (1) is simply an application of Bayes' rule. The denominator is eliminated in the usual fashion, resulting in equation (2). Selectively applying the chain rule results in equations (3) and (4). In equation (4), the term $P(h, t, l \mid a, \vec{M}, s_p, \vec{d})$ is the same for every antecedent and is thus removed. Equation (6) follows when we break the last component of (5) into two probability distributions. In equation (7) we make the following independence assumptions:

• Given a particular choice of the antecedent candidates, the distance is independent of distances of candidates other than the antecedent (and the distance to non-referents can be ignored):

$$P(s_p, \vec{d} \mid a, \vec{M}) \propto P(s_p, d_a \mid a, \vec{M})$$

• The syntactic structure $s_p$ and the distance from the pronoun $d_a$ are independent of the number of times the referent is mentioned. Thus

$$P(s_p, d_a \mid a, \vec{M}) = P(s_p, d_a \mid a)$$

Then we combine $s_p$ and $d_a$ into one variable $d_H$, Hobbs distance, since the Hobbs algorithm takes both the syntax and distance into account.

• The words in the antecedent depend only on the parent constituent $h$, the type of the words $t$, and the type of the parent $l$. Hence

$$P(\vec{W} \mid a, \vec{M}, s_p, \vec{d}, h, t, l) = P(\vec{W} \mid h, t, l, a)$$

• The choice of pronoun depends only on the words in the antecedent, i.e.

$$P(p \mid a, \vec{M}, s_p, \vec{d}, h, t, l, \vec{W}) = P(p \mid a, \vec{W})$$

• If we treat $a$ as an index into the vector $\vec{W}$, then $(a, \vec{W})$ is simply the $a$th candidate in the list $\vec{W}$. We assume the selection of the pronoun is independent of the candidates other than the antecedent. Hence

$$P(p \mid a, \vec{W}) = P(p \mid w_a)$$

Since $\vec{W}$ is a vector, we need to normalize $P(\vec{W} \mid h, t, l, a)$ to obtain the probability of each element in the vector. It is reasonable to assume that the antecedents in $\vec{W}$ are independent of each other; in other words, $P(w_{a+1} \mid w_a, h, t, l, a) = P(w_{a+1} \mid h, t, l, a)$. Thus,

$$P(\vec{W} \mid h, t, l, a) = \prod_{i=1}^{n} P(w_i \mid h, t, l, a)$$

where

$$P(w_i \mid h, t, l, a) = P(w_i \mid t) \quad \text{if } i \neq a$$

and

$$P(w_i \mid h, t, l, a) = P(w_a \mid h, t, l) \quad \text{if } i = a$$

Then we have,

$$P(\vec{W} \mid h, t, l, a) = P(w_1 \mid t) \cdots P(w_a \mid h, t, l) \cdots P(w_n \mid t)$$

To get the probability for each candidate, we divide the above product by $P(w_1 \mid t) \cdots P(w_a \mid t) \cdots P(w_n \mid t)$, which does not depend on the choice of $a$:

$$P(\vec{W} \mid h, t, l, a) \propto \frac{P(w_1 \mid t) \cdots P(w_a \mid h, t, l) \cdots P(w_n \mid t)}{P(w_1 \mid t) \cdots P(w_a \mid t) \cdots P(w_n \mid t)} = \frac{P(w_a \mid h, t, l)}{P(w_a \mid t)}$$

Now we arrive at the final equation for computing the probability of each proposed antecedent:

$$P(A(p) = w_a) \propto P(d_H \mid a)\, P(p \mid w_a)\, \frac{P(w_a \mid h, t, l)}{P(w_a \mid t, l)}\, P(a \mid m_a) \qquad (8)$$
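To make the final product concrete, the following sketch scores two candidates with the four factors of equation (8) and picks the maximum. It is only an illustration under stated assumptions: the table layout, the field names (`head_word`, `hobbs_distance`, `mention_bucket`) and every probability value are invented for the example, and the head-related tables are assumed to be already conditioned on the pronoun's fixed $h$, $t$ and $l$.

```python
def antecedent_score(cand, pronoun_class, tables):
    """Product of the four factors of equation (8) for one candidate."""
    w = cand["head_word"]
    return (tables["hobbs"][cand["hobbs_distance"]]            # P(d_H | a)
            * tables["pronoun_given_word"][(pronoun_class, w)]  # P(p | w_a)
            * tables["word_given_head"][w]                      # P(w_a | h, t, l)
            / tables["word_given_type"][w]                      # P(w_a | t, l)
            * tables["mention"][cand["mention_bucket"]])        # P(a | m_a)

def resolve(candidates, pronoun_class, tables):
    """Return the candidate that maximizes the equation (8) product."""
    return max(candidates,
               key=lambda c: antecedent_score(c, pronoun_class, tables))

# All numbers below are made up purely for illustration.
tables = {
    "hobbs": {1: 0.5, 2: 0.2},
    "pronoun_given_word": {("SHE", "Giraud"): 0.8, ("SHE", "France"): 0.05},
    "word_given_head": {"Giraud": 0.01, "France": 0.02},
    "word_given_type": {"Giraud": 0.01, "France": 0.02},
    "mention": {1: 0.2, 2: 0.5},
}
candidates = [
    {"head_word": "France", "hobbs_distance": 1, "mention_bucket": 1},
    {"head_word": "Giraud", "hobbs_distance": 2, "mention_bucket": 2},
]
best = resolve(candidates, "SHE", tables)
```

With these invented values the closer candidate "France" loses to "Giraud", because the gender and mention-count factors outweigh the distance factor.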

We obtain $P(d_H \mid a)$ by running the Hobbs algorithm on the training data. Since the training corpus is tagged with reference information, the probability $P(p \mid w_a)$ is easily obtained. In building a statistical parser for the Penn Tree-bank various statistics have been collected (Charniak, 1997), two of which are $P(w_a \mid h, t, l)$ and $P(w_a \mid t, l)$. To avoid the sparse-data problem, the heads $h$ are clustered according to how they behave in $P(w_a \mid h, t, l)$. The probability of $w_a$ is then computed on the basis of $h$'s cluster $c(h)$. Our corpus also contains referents' repetition information, from which we can directly compute $P(a \mid m_a)$. The four components in equation (8) can be estimated in a reasonable fashion. The system computes this product and returns the antecedent $w_a$ for a pronoun $p$ that maximizes this probability. More formally, we want the program to return our antecedent function $F(p)$, where

$$
\begin{aligned}
F(p) &= \arg\max_a P(A(p) = a \mid p, h, \vec{W}, t, l, s_p, \vec{d}, \vec{M}) \\
&= \arg\max_{w_a} P(d_H \mid a)\, P(p \mid w_a)\, \frac{P(w_a \mid h, t, l)}{P(w_a \mid t, l)}\, P(a \mid m_a)
\end{aligned}
$$

3 The Implementation

We use a small portion of the Penn Wall Street Journal Tree-bank as our training corpus. From this data, we collect the three statistics detailed in the following subsections.

3.0.1 The Hobbs algorithm

The Hobbs algorithm makes a few assumptions about the syntactic trees upon which it operates that are not satisfied by the tree-bank trees that form the substrate for our algorithm. Most notably, the Hobbs algorithm depends on the existence of an $\bar{N}$ parse-tree node that is absent from the Penn Tree-bank trees. We have implemented a slightly modified version of the Hobbs algorithm for the Tree-bank parse trees. We also transform our trees under certain conditions to meet Hobbs' assumptions as much as possible. We have not, however, been able to duplicate exactly the syntactic structures assumed by Hobbs.

Once we have the trees in the proper form (to the degree this is possible) we run Hobbs' algorithm repeatedly for each pronoun until it has proposed $n$ (= 15 in our experiment) candidates. The $i$th candidate is regarded as occurring at "Hobbs distance" $d_H = i$. Then the probability $P(d_H = i \mid a)$ is simply:

$$P(d_H = i \mid a) = \frac{|\,\text{correct antecedent at Hobbs distance } i\,|}{|\,\text{correct antecedents}\,|}$$

We use $|x|$ to denote the number of times $x$ is observed in our training set.
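The counting estimate above amounts to a relative frequency over the Hobbs distances at which the correct antecedents were found. A minimal sketch, assuming the training data has already been reduced to a list with one Hobbs distance per resolved pronoun (the function name and the sample list are illustrative, not from the paper):

```python
from collections import Counter

def hobbs_distance_probs(correct_distances, n=15):
    """Estimate P(d_H = i | a) for i = 1..n by relative frequency.

    correct_distances holds, for each training pronoun, the Hobbs distance
    at which its correct antecedent was proposed.
    """
    counts = Counter(correct_distances)
    total = len(correct_distances)
    return {i: counts[i] / total for i in range(1, n + 1)}

# Invented sample: eight training pronouns.
probs = hobbs_distance_probs([1, 1, 2, 1, 3, 2, 1, 1])
```

In this invented sample, five of the eight correct antecedents sit at distance 1, so `probs[1]` is 5/8; distances never observed receive probability zero, which is one place where smoothing could be considered.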

3.1 The gender/animaticity statistics

After we have identified the correct antecedents it is a simple counting procedure to compute $P(p \mid w_a)$ where $w_a$ is in the correct antecedent for the pronoun $p$ (note that the pronouns are grouped by their gender):

$$P(p \mid w_a) = \frac{|\,w_a \text{ in the antecedent for } p\,|}{|\,w_a \text{ in the antecedent for any pronoun}\,|}$$

When there are multiple relevant words in the antecedent we apply the likelihood test designed by Dunning (1993) on all the words in the candidate NP. Given our limited data, the Dunning test tells which word is the most informative, call it $w_i$, and we then use $P(p \mid w_i)$.

3.1.1 The mention count statistics

The referents range from being mentioned only once to being mentioned 120 times in the training examples. Instead of computing the probability for each one of them we group them into "buckets", so that $m_a$ is the bucket for the number of times that $a$ is mentioned. We also observe that the position of a pronoun in a story influences the mention count of its referent. In other words, the nearer the end of the story a pronoun occurs, the more probable it is that its referent has been mentioned several times. We measure position by the sentence number, $j$. The method to compute this probability is:

$$P(a \mid m_a, j) = \frac{|\,a \text{ is antecedent}, m_a, j\,|}{|\,m_a, j\,|}$$

(We omitted $j$ from equations (1-7) to reduce the notational load.)
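The bucketing idea can be sketched as follows. The text does not give the bucket boundaries, so the `edges` used here are purely hypothetical, as are the function names and the sample observations; the sketch only shows how a raw mention count is mapped to a bucket and how $P(a \mid m_a, j)$ is then a ratio of counts.

```python
from bisect import bisect_right
from collections import Counter

def bucket(mention_count, edges=(1, 2, 4, 8)):
    """Map a raw mention count to a bucket index (edges are hypothetical)."""
    return bisect_right(edges, mention_count)

def mention_probs(observations):
    """Estimate P(a | m_a, j) by counting.

    observations: (mention_count, sentence_number, is_antecedent) triples,
    one per candidate seen in training.
    """
    seen, correct = Counter(), Counter()
    for mentions, j, is_antecedent in observations:
        key = (bucket(mentions), j)
        seen[key] += 1
        if is_antecedent:
            correct[key] += 1
    return {key: correct[key] / seen[key] for key in seen}

# Invented observations for illustration only.
observations = [
    (1, 1, False), (1, 1, False), (1, 1, True),
    (5, 1, True), (6, 1, True), (7, 1, False),
    (120, 2, True),
]
probs = mention_probs(observations)
```

Grouping counts such as 5, 6 and 7 into one bucket pools their evidence, which is the point of bucketing when individual counts are too sparse to estimate separately.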

3.2 Resolving Pronouns

After collecting the statistics on the training examples, we run the program on the test data. For any pronoun we collect $n$ (= 15 in the experiment) candidate antecedents proposed by Hobbs' algorithm. It is quite possible that a word appears in the test data that the program never saw in the training data and for which it hence has no $P(p \mid w_a)$ probability. In this case


erroneous candidates. Sparse data also causes a problem in this statistic. Consequently, we observe a relatively small enhancement to the system.

The mention information gives the system some idea of the story's focus. The more frequently an entity is repeated, the more likely it is to be the topic of the story and thus to be a candidate for pronominalization. Our results show that this is indeed the case. References by pronouns are closely related to the topic or the center of the discourse. NP repetition is one simple way of approximately identifying the topic. The more accurately the topic of a segment can be identified, the higher the success rate we expect an anaphora resolution system can achieve.

5 Unsupervised Learning of Gender Information

The importance of gender information as revealed in the previous experiments caused us to consider automatic methods for estimating the probability that nouns occurring in a large corpus of English text denote inanimate, masculine or feminine things. The method described here is based on simply counting co-occurrences of pronouns and noun phrases, and thus can employ any method of analysis of the text stream that results in referent/pronoun pairs (cf. (Hatzivassiloglou and McKeown, 1997) for another application in which no explicit indicators are available in the stream). We present two very simple methods for finding referent/pronoun pairs, and also give an application of a salience statistic that can indicate how confident we should be about the predictions the method makes. Following this, we show the results of applying this method to the 21-million-word 1987 Wall Street Journal corpus using two different pronoun reference strategies of varying sophistication, and evaluate their performance using honorifics as reliable gender indicators.

The method is a very simple mechanism for harvesting the kind of gender information present in discourse fragments like "Kim slept. She slept for a long time." Even if Kim's gender was unknown before seeing the first sentence, after the second sentence, it is known.

The probability that a referent is in a particular gender class is just the relative frequency with which that referent is referred to by a pronoun $p$ that is part of that gender class. That is, the probability of a referent $ref$ being in gender class $gc_i$ is

$$P(ref \in gc_i) = \frac{|\,\text{refs to } ref \text{ with } p \in gc_i\,|}{\sum_j |\,\text{refs to } ref \text{ with } p \in gc_j\,|} \qquad (9)$$

In this work we have considered only three gender classes, masculine, feminine and inanimate, which are indicated by their typical pronouns, HE, SHE, and IT. However, a variety of pronouns indicate the same class:

pronoun                    gender class
he, himself, him, his      HE
she, herself, her, hers    SHE
it, itself, its            IT

Plural pronouns like "they" and "us" reveal no gender information about their referent and consequently aren't useful, although this might be a way to learn pluralization in an unsupervised manner.
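The relative-frequency estimate of equation (9), together with the pronoun-to-class table, can be sketched in a few lines. The input format (a list of referent/pronoun pairs) and the function name are assumptions for the illustration; the pairs themselves are invented.

```python
from collections import Counter

# Pronoun-to-class mapping taken from the table above.
GENDER_CLASS = {
    "he": "HE", "himself": "HE", "him": "HE", "his": "HE",
    "she": "SHE", "herself": "SHE", "her": "SHE", "hers": "SHE",
    "it": "IT", "itself": "IT", "its": "IT",
}

def gender_probs(pairs):
    """Equation (9): referent -> {gender class: relative frequency}."""
    counts = {}
    for referent, pronoun in pairs:
        gender = GENDER_CLASS.get(pronoun.lower())
        if gender is None:  # plural pronouns carry no gender information
            continue
        counts.setdefault(referent, Counter())[gender] += 1
    return {ref: {g: n / sum(c.values()) for g, n in c.items()}
            for ref, c in counts.items()}

# Invented referent/pronoun pairs, including one erroneous "he" for Kim.
pairs = [("Kim", "She"), ("Kim", "her"), ("Kim", "he"),
         ("Kim", "they"), ("company", "it")]
probs = gender_probs(pairs)
most_likely = max(probs["Kim"], key=probs["Kim"].get)
```

Even with one wrong pairing, "Kim" comes out as most probably SHE, which is the behaviour the method relies on when the pairs come from an imperfect resolution strategy.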

In order to gather statistics on the gender of referents in a corpus, there must be some way of identifying the referents. In attempting to bootstrap lexical information about referents' gender, we consider two strategies, both completely blind to any kind of semantics.

One of the most naive pronoun reference strategies is the "previous noun" heuristic. On the intuition that pronouns closely follow their referents, this heuristic simply keeps track of the last noun seen and submits that noun as the referent of any pronouns following. This strategy is certainly simple-minded but, as noted earlier, it achieves an accuracy of 43%.
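A minimal sketch of the "previous noun" heuristic, assuming the input is a stream of (word, tag) pairs using Penn Treebank part-of-speech tags; the function name and the example sentence tagging are illustrative assumptions.

```python
# Penn Treebank tags for nouns and pronouns.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
PRONOUN_TAGS = {"PRP", "PRP$"}

def previous_noun_pairs(tagged_tokens):
    """Pair every pronoun with the most recently seen noun."""
    pairs = []
    last_noun = None
    for word, tag in tagged_tokens:
        if tag in NOUN_TAGS:
            last_noun = word
        elif tag in PRONOUN_TAGS and last_noun is not None:
            pairs.append((last_noun, word))
    return pairs

tokens = [("Kim", "NNP"), ("slept", "VBD"), (".", "."),
          ("She", "PRP"), ("slept", "VBD"), ("for", "IN"),
          ("a", "DT"), ("long", "JJ"), ("time", "NN"), (".", ".")]
pairs = previous_noun_pairs(tokens)
```

The resulting referent/pronoun pairs are exactly the kind of input the gender-counting step needs, which is why only noun and pronoun tags matter to this strategy.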

In the present system, a statistical parser is used (see (Charniak, 1997)) simply as a tagger. This apparent parser overkill is a control to ensure that the part-of-speech tags assigned to words are the same when we use the previous noun heuristic and the Hobbs algorithm, to which we wish to compare the previous noun method. In fact, the only part-of-speech tags necessary are those indicating nouns and pronouns.

Obviously a much superior strategy would be to apply the anaphora-resolution strategy from previous sections to finding putative referents. However, we chose to use only the Hobbs distance portion thereof. We do not use the "mention" probabilities $P(a \mid m_a)$, as they are not given in the unmarked text. Nor do we use the gender/animaticity information gathered from the much smaller hand-marked text, both because we were interested in seeing what unsupervised learning could accomplish, and because we were concerned with inheriting strong biases from the limited hand-marked data. Thus our second method of finding the pronoun/noun co-occurrences is simply to parse the text and then assume that the noun-phrase at Hobbs distance one is the antecedent.

Given a pronoun resolution method and a corpus, the result is a set of pronoun/referent pairs. By collating by referent and abstracting away to the gender classes of pronouns, rather than individual pronouns, we have the relative frequencies with which a given referent is referred to by pronouns of each gender class. We will say that the gender class for which this relative frequency is the highest is the gender class to which the referent most probably belongs.

However, any syntax-only pronoun resolution strategy will be wrong some of the time; these methods know nothing about discourse boundaries, intentions, or real-world knowledge. We would like to know, therefore, whether the pattern of pronoun references that we observe for a given referent is the result of our supposed "hypothesis about pronoun reference" (that is, the pronoun reference strategy we have provisionally adopted in order to gather statistics) or whether it is the result of some other unidentified process.

This decision is made by ranking the referents by log-likelihood ratio, termed salience, for each referent. The likelihood ratio is adapted from Dunning (1993, page 66) and uses the raw frequencies of each pronoun class in the corpus as the null hypothesis, Pr(gc0_i), as well as Pr(ref ∈ gc_i) from equation 9.

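\[
\mathrm{salience}(\mathit{ref}) = -2 \log \frac{\prod_i \Pr(gc0_i)^{n_i}}{\prod_i \Pr(\mathit{ref} \in gc_i)^{n_i}}
\]

where n_i is the number of observed references to ref by pronouns of gender class gc_i.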

Making the unrealistic simplifying assumption that references of one gender class are completely independent of references for another class¹, the likelihood function in this case is just the product over all classes of the probabilities of each class of reference to the power of the number of observations of this class.
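The product-form likelihood described above can be turned into a small computation. The sketch below is an illustration under stated assumptions, not the authors' code: it takes the ratio as null likelihood over referent-specific likelihood, and it assumes that Pr(ref ∈ gc_i) is estimated by the referent's relative class frequency (equation 9 is not reproduced in this section). The function name and the numbers are invented.

```python
import math

def salience(observed_counts, null_probs):
    """-2 log of (likelihood under the corpus-wide null probabilities) over
    (likelihood under the referent's own relative frequencies).

    Each likelihood is a product over classes of p_i ** n_i, so the log is
    a sum of n_i * log(p_i). Assumption: the referent-specific probability
    of a class is its relative frequency n_i / total.
    """
    total = sum(observed_counts.values())
    log_null = 0.0
    log_alt = 0.0
    for gender_class, n in observed_counts.items():
        if n == 0:
            continue  # p ** 0 == 1, so unseen classes contribute nothing
        log_null += n * math.log(null_probs[gender_class])
        log_alt += n * math.log(n / total)
    return -2.0 * (log_null - log_alt)

# Invented corpus-wide class frequencies, used as the null hypothesis.
null = {"HE": 0.3, "SHE": 0.1, "IT": 0.6}

skewed = salience({"HE": 9, "SHE": 0, "IT": 1}, null)    # far from the null
matching = salience({"HE": 3, "SHE": 1, "IT": 6}, null)  # same shape as the null
```

A referent whose pronoun pattern mirrors the corpus-wide distribution scores near zero, while a strongly skewed pattern scores high, so ranking by this value favors referents whose gender evidence is unlikely to be an accident of the resolution strategy.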

6 Evaluation

We ran the program on 21 million words of Wall Street Journal text. One can judge the program informally by simply examining the results and determining if the program's gender decisions are correct (occasionally looking at the text for difficult cases). Figure 1 shows the 43 noun phrases with the highest salience figures (run using the Hobbs algorithm). An examination of these shows that all but three are correct. (The three mistakes are "husband," "wife," and "years." We return to the significance of these mistakes later.)

As a measure of the utility of these results, we also ran our pronoun-anaphora program with these statistics added. This achieved an accuracy rate of 84.2%. This is only a small improvement over the 82.9% achieved without the data. We believe, however, that there are ways to improve the accuracy of the learning method and thus increase its influence on pronoun anaphora resolution.

Finally we attempted a fully automatic direct test of the accuracy of both pronoun methods for gender determination. To that end, we devised a more objective test, useful only for scoring the subset of referents that are names of people. In particular, we assume that any noun-phrase with the honorifics "Mr.", "Mrs." or "Ms." may be confidently assigned to gender classes HE, SHE, and SHE, respectively. Thus we compute precision as follows:

\[
\text{precision} = \frac{|\, r \text{ attrib. as HE} \wedge \text{Mr.} \in r \,| + |\, r \text{ attrib. as SHE} \wedge \text{Mrs. or Ms.} \in r \,|}{|\, \text{Mr., Mrs., or Ms.} \in r \,|}
\]

Here r varies over referent types, not tokens.
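The honorific scoring scheme can be sketched in a few lines. This is a minimal illustration under assumptions: the input is taken to be a mapping from referent types (distinct noun-phrase strings) to the gender class a learning method assigned them, and the function name and sample data are invented.

```python
# Gold gender class implied by each honorific, as stated in the text.
HONORIFIC_CLASS = {"Mr.": "HE", "Mrs.": "SHE", "Ms.": "SHE"}

def honorific_precision(assigned):
    """Precision over referent types that contain a target honorific.

    `assigned` maps each referent type to its assigned gender class.
    Dictionary keys are unique, so each referent type is counted once
    (types, not tokens). Referents without an honorific are not scored.
    """
    scored = 0
    correct = 0
    for referent, gender_class in assigned.items():
        gold = None
        for token in referent.split():
            # capitalize() maps "MR." and "mr." to "Mr." before lookup
            gold = HONORIFIC_CLASS.get(token.capitalize())
            if gold is not None:
                break
        if gold is None:
            continue
        scored += 1
        if gender_class == gold:
            correct += 1
    return correct / scored if scored else None

# Invented assignments for illustration.
assigned = {
    "Mr. Reagan": "HE",
    "MRS. THATCHER": "SHE",
    "Ms. Smith": "HE",   # counted as an error
    "IBM": "IT",         # no honorific, so not scored
}
precision = honorific_precision(assigned)
```

Because only honorific-bearing referents are scored, the measure is fully automatic but says nothing about referents such as "IBM" or "husband".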

The precision score computed over all phrases containing any of the target honorifics is 66.0%

¹ In effect, this is the same as admitting that a referent can be in different gender classes across different observations.


Word                 Count  Salience  p(he)     p(she)  p(it)
COMPANY              7052   1629.39   0.0764    0.0060  0.9174
WOMAN                250    368.267   0.172     0.708   0.12
PRESIDENT            931    356.539   0.8206    0.0139  0.1654
GROUP                1096   287.319   0.0602    0.0054  0.9343
MR. REAGAN           534    270.8     0.882022  0.0037  0.1142
MAN                  441    202.102   0.8480    0.0385  0.1133
PRESIDENT REAGAN     455    194.928   0.8439    0.0043  0.1516
GOVERNMENT           1220   194.187   0.1172    0.0122  0.8704
U.S.                 969    188.468   0.1021    0.0041  0.8937
BANK                 816    161.23    0.0955    0.0073  0.8970
MOTHER               113    161.204   0.3008    0.6548  0.0442
COL. NORTH           258    158.692   0.9263    0.0077  0.0658
MOODY                383    152.405   0.0078    0.0052  0.9869
SPOKESWOMAN          139    145.627   0.1223    0.5827  0.2949
MRS. AQUINO          73     142.223   0.0958    0.8356  0.0684
MRS. THATCHER        68     128.306   0.0735    0.8235  0.1029
GM                   513    119.664   0.0779    0.0038  0.9181
PLAN                 514    111.134   0.0856    0.0058  0.9085
MR. GORBACHEV        205    108.776   0.8926    0.0048  0.1024
JUDGE BORK           212    108.746   0.8820    0       0.1179
HUSBAND              91     107.438   0.3626    0.5714  0.0659
JAPAN                450    100.727   0.0755    0.0111  0.9133
AGENCY               476    97.4016   0.0840    0.0147  0.9012
WIFE                 153    93.7485   0.6143    0.2875  0.0980
DOLLAR               621    90.8963   0.1304    0.0096  0.8599
STANDARD POOR        200    90.1062   0         0       1
FATHER               146    89.4178   0.8082    0.1438  0.0479
UTILITY              242    87.1821   0.0247    0       0.9752
MR. TRUMP            129    86.5345   0.9457    0.0077  0.0465
MR. BAKER            187    84.2796   0.8556    0.0053  0.1390
IBM                  316    82.4361   0.0696    0       0.9303
MAKER                224    82.252    0.0223    0       0.9776
YEARS                1055   82.1632   0.5298    0.0815  0.3886
MR. MEESE            166    82.1007   0.8734    0       0.1265
BRAZIL               285    79.7311   0.0596    0       0.9403
SPOKESMAN            665    78.3441   0.6075    0.0045  0.3879
MR. SIMON            105    72.6446   0.9523    0       0.0476
DAUGHTER             47     71.3863   0.2340    0.7021  0.0638
FORD                 249    71.3603   0.0562    0       0.9437
MR. GREENSPAN        120    68.7807   0.9083    0       0.0916
AT&T                 198    67.9668   0.0252    0.0050  0.9696
MINISTER             125    67.7475   0.864     0.064   0.072
JUDGE                239    67.5899   0.7154    0.0836  0.2008

Figure 1: Top 43 noun phrases according to salience


[Plot: precision (vertical axis) against number of references (horizontal axis, logarithmic scale).]

Figure 2: Precision using honorific scoring scheme with syntactic Hobbs algorithm

for the last-noun method and 70.3% for the Hobbs method.

There are several things to note about these results. First, as one might expect given the already noted superior performance of the Hobbs scheme over last-noun, Hobbs also performs better at determining gender. Secondly, at first glance, the 70.3% accuracy of the Hobbs method is disappointing, only slightly superior to the 65.3% accuracy of Hobbs at finding correct referents. It might have been hoped that the statistics would make things considerably more accurate.

In fact, the statistics do make things considerably more accurate. Figure 2 shows average accuracy as a function of number of references for a given referent. It can be seen that there is a significant improvement with increased referent count. The reason that the average over all referents is so low is that the counts on referents obey Zipf's law, so that the mode of the distribution on counts is one. Thus the 70.3% overall accuracy is a mix of relatively high accuracy for referents with counts greater than one, and relatively low accuracy for referents with counts of exactly one.
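The effect of the count distribution on the overall figure can be illustrated by breaking precision down by reference count. The sketch below is illustrative only: the function name and the records are invented and do not reproduce the paper's data.

```python
from collections import defaultdict

def precision_by_count(records):
    """Group referent types by their number of references and return the
    precision within each group.

    `records` is an iterable of (reference_count, is_correct) pairs, one
    per scored referent type.
    """
    buckets = defaultdict(lambda: [0, 0])  # count -> [hits, total]
    for count, is_correct in records:
        bucket = buckets[count]
        bucket[0] += int(is_correct)
        bucket[1] += 1
    return {count: hits / total
            for count, (hits, total) in sorted(buckets.items())}

# Invented records: most referent types are seen once, as Zipf's law suggests.
records = [
    (1, True), (1, False), (1, False), (1, True),
    (5, True), (5, True),
    (20, True),
]
by_count = precision_by_count(records)
overall = sum(ok for _, ok in records) / len(records)
```

In this toy data the singly-referenced types are right only half the time, and because they are the most numerous they pull the overall figure well below the precision seen for frequently referenced types.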

7 Previous Work

The literature on pronoun anaphora is too extensive to summarize, so we concentrate here on corpus-based anaphora research.

Aone and Bennett (1996) present an approach to an automatically trainable anaphora resolution system. They use Japanese newspaper articles tagged with discourse information as training examples for a machine-learning algorithm which is the C4.5 decision-tree algorithm by Quinlan (1993). They train their decision tree using (anaphora, antecedent) pairs together with a set of feature vectors. Among the 66 features are lexical, syntactic, semantic, and positional features. Their Machine Learning-based Resolver (MLR) is trained using decision trees with 1971 anaphoras (excluding those referring to multiple discontinuous antecedents) and they report an average success rate of 74.8%.

Mitkov (1997) describes an approach that uses a set of factors as constraints and preferences. The constraints rule out implausible candidates and the preferences emphasize the selection of the most likely antecedent. The system is not entirely "statistical" in that it consists of various types of rule-based knowledge: syntactic, semantic, domain, discourse, and heuristic. A statistical approach is present in the discourse module only where it is used to determine the probability that a noun (verb) phrase is the center of a sentence. The system also contains domain knowledge including the domain concepts, specific list of subjects and verbs, and topic headings. The evaluation was conducted on 133 paragraphs of annotated Computer Science text. The results show an accuracy of 83% for the 512 occurrences of it.

Lappin and Leass (1994) report on an (essentially non-statistical) approach that relies on salience measures derived from syntactic structure and a dynamic model of attentional state. The system employs various constraints for NP-pronoun non-coreference within a sentence. It also uses person, number, and gender features for ruling out anaphoric dependence of a pronoun on an NP. The algorithm has a sophisticated mechanism for assigning values to several salience parameters and for computing global salience values. A blind test was conducted on manual text containing 360 pronoun occurrences; the algorithm successfully identified the antecedent of the pronoun in 86% of these pronoun occurrences. The addition of a module that contributes statistically measured lexical preferences to the range of factors the algorithm considers improved the performance by 2%.

8 Conclusion and Future Research

We have presented a statistical method for pronominal anaphora that achieves an accuracy of 84.2%. The main advantage of the method is its essential simplicity. Except for implementing the Hobbs referent-ordering algorithm, all other system knowledge is embedded in tables giving the various component probabilities used in the probability model.

We believe that this simplicity of method will translate into comparative simplicity as we improve the method. Since carrying out the research described herein, we have thought of other influences on anaphora resolution and their statistical correlates. We hope to include some of them in future work.

Also, as indicated by the work on unsupervised learning of gender information, there is a growing arsenal of learning techniques to be applied to statistical problems. Consider again the three high-salience words to which our unsupervised learning program assigned incorrect gender: "husband", "wife", and "years." We suspect that had our pronoun-assignment method been able to use the topic information used in the complete method, these might well have been decided correctly. That is, we suspect that "husband", for example, was decided incorrectly because the topic of the article was the woman, there was a mention of her "husband," but the article kept on talking about the woman and used the pronoun "she." While our simple program got confused, a program using better statistics might not have. This too is a topic for future research.

9 Acknowledgements

The authors would like to thank Mark Johnson and other members of the Brown NLP group for many useful ideas and NSF and ONR for support (NSF grants IRI-9319516 and SBR-9720368, ONR grant N0014-96-1-0549).


References

Chinatsu Aone and Scott William Bennett. 1996. Evaluating Automated and Manual Acquisition of Anaphora Resolution Strategies, pages 302-315. Springer.

Susan E. Brennan, Marilyn Walker Friedman, and Carl J. Pollard. 1987. A centering approach to pronouns. In Proc. 25th Annual Meeting of the ACL, pages 155-162. Association of Computational Linguistics.

Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the 14th National Conference on Artificial Intelligence, Menlo Park, CA. AAAI Press/MIT Press.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), March.

Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1997. Predicting the semantic orientation of adjectives. In Proc. 35th Annual Meeting of the ACL, pages 174-181. Association of Computational Linguistics.

Jerry R. Hobbs. 1976. Pronoun resolution. Technical Report 76-1, City College, New York.

Shalom Lappin and Herbert J. Leass. 1994. An algorithm for pronominal anaphora resolution. Computational Linguistics, pages 535-561.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313-330.

Ruslan Mitkov. 1997. Factors in anaphora resolution: they are not the only things that matter, a case study based on two different approaches. In Proceedings of the ACL '97/EACL '97 Workshop on Operational Factors in Practical, Robust Anaphora Resolution.

J. Ross Quinlan. 1993. C4.5 Programs for Machine Learning. Morgan Kaufmann Publishers.
