IE: Named Entity Recognition (NER)
Marina Santini
Posted: Jan 20, 2017

Transcript
Page 1: IE: Named Entity Recognition (NER)

Semantic Analysis in Language Technology
http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm

Information Extraction (I)
Named Entity Recognition (NER)

Marina Santini
[email protected]

Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden

Spring 2016

Page 2: IE: Named Entity Recognition (NER)

Previous Lecture: Distributional Semantics

•  Starting from Shakespeare and IR (the term-document matrix)…
•  Moving to context "windows" taken from the Brown corpus…
•  Ending up with PPMI to weight word distributions…
•  Mentioning the cosine metric to compare vectors…

Page 3: IE: Named Entity Recognition (NER)

IR: Term-document matrix

            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle            1                1                8           15
soldier           2                2               12           36
fool             37               58                1            5
clown             6              117                0            0

•  Each cell: count of term t in a document d, N_{t,d} (the term frequency of t in d)
•  Each document is a count vector in ℕ^|V|: a column in the table above

Page 4: IE: Named Entity Recognition (NER)

Document similarity: Term-document matrix

•  Two documents are similar if their vectors are similar

            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle            1                1                8           15
soldier           2                2               12           36
fool             37               58                1            5
clown             6              117                0            0

Page 5: IE: Named Entity Recognition (NER)

The words in a term-document matrix

•  Two words are similar if their vectors are similar

            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle            1                1                8           15
soldier           2                2               12           36
fool             37               58                1            5
clown             6              117                0            0

Page 6: IE: Named Entity Recognition (NER)

Term-context matrix for word similarity

•  Two words are similar in meaning if their context vectors are similar

              aardvark   computer   data   pinch   result   sugar   …
apricot           0          0        0      1        0       1
pineapple         0          0        0      1        0       1
digital           0          2        1      0        1       0
information       0          1        6      0        4       0

Page 7: IE: Named Entity Recognition (NER)

Computing PPMI on a term-context matrix

•  Matrix F with W rows (words) and C columns (contexts)
•  f_{ij} is the number of times word w_i occurs in context c_j

p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}

p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}

p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}

pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\, p_{*j}}

ppmi_{ij} = \begin{cases} pmi_{ij} & \text{if } pmi_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}

•  Numerator of p_{i*}: the count of all the contexts in which the word appears
•  Numerator of p_{*j}: the count of all the words that occur in that context
•  Denominator: the sum of all words in all contexts = all the numbers in the matrix
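The PPMI computation above can be sketched in a few lines of NumPy, using the term-context counts from the earlier slide (a minimal sketch: the `ppmi` helper is mine, and the all-zero "aardvark" column is dropped so no marginal probability is zero):

```python
import numpy as np

# Term-context counts from the slide (rows: words, columns: contexts).
words = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]
F = np.array([
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [2, 1, 0, 1, 0],
    [1, 6, 0, 4, 0],
], dtype=float)

def ppmi(F):
    total = F.sum()                      # all the numbers in the matrix
    P = F / total                        # p_ij
    p_i = P.sum(axis=1, keepdims=True)   # p_i* (row marginals)
    p_j = P.sum(axis=0, keepdims=True)   # p_*j (column marginals)
    with np.errstate(divide="ignore"):   # log2(0) = -inf is clipped below
        pmi = np.log2(P / (p_i * p_j))
    return np.maximum(pmi, 0)            # keep only positive PMI values

M = ppmi(F)
print(round(M[words.index("information"), contexts.index("data")], 2))  # 0.57
```

PPMI(information, data) ≈ 0.57, matching the worked example in Jurafsky & Martin's chapter on vector semantics.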

Page 8: IE: Named Entity Recognition (NER)

Summation: Sigma Notation (i)

It means: sum whatever appears after the Sigma, so here we sum n. What are the values of n? They are shown below and above the Sigma: below is the index variable and its starting value (e.g. start from n = 1); above is the upper limit of the sum (e.g. up to 4). In this case n goes from 1 to 4, that is 1, 2, 3 and 4 (http://www.mathsisfun.com/algebra/sigma-notation.html)

Applied to our matrix, the denominator reads "sum f_{ij} over all rows i and all columns j":

p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}

Note: we cannot delete f_{ij}! It is the quantity being summed, not part of the range.

Page 9: IE: Named Entity Recognition (NER)

Summation: Sigma Notation (ii)

•  Additional examples
•  Sums can be nested
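A nested sum is just a nested loop. As a sketch (the toy matrix here is illustrative, not from the slides), the double summation \sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij} corresponds to:

```python
# Nested sigma notation as code: sum f[i][j] over all rows i and columns j.
f = [[1, 2, 3],
     [4, 5, 6]]  # W = 2 rows, C = 3 columns

total = 0
for i in range(len(f)):         # outer sum: i = 1 .. W
    for j in range(len(f[i])):  # inner sum: j = 1 .. C
        total += f[i][j]

print(total)  # 21
```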

Page 10: IE: Named Entity Recognition (NER)

Alternative notations… (Levy, 2012)

•  When the range of the sum can be understood from context, it can be left out;
•  or we may want to be vague about the precise range of the sum. For example, suppose that there are n variables, x_1 through x_n.
•  In order to say that the sum of all n variables is equal to 1, we might simply write:

\sum_i x_i = 1

Page 11: IE: Named Entity Recognition (NER)

Formulas: Sigma Notation

p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}

p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}

p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}

•  Numerator of p_{ij}: f_{ij} = a single cell
•  Denominators: sum the cells over all the words and all the contexts
•  Numerator of p_{i*}: sum the cells over all contexts (all the columns)
•  Numerator of p_{*j}: sum the cells over all the words (all the rows)

Page 12: IE: Named Entity Recognition (NER)

Living lexicon: built upon an underlying continuously updated corpus

Drawbacks: updated but unstable & incomplete: missing words, missing linguistic information, etc.
Multilinguality, function words, etc.

Page 13: IE: Named Entity Recognition (NER)

Similarity:
•  Given the underlying statistical model, these words are similar

Fredrik Olsson

Page 14: IE: Named Entity Recognition (NER)

Gavagai blog
•  Further reading (Magnus Sahlgren):
https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/

Page 15: IE: Named Entity Recognition (NER)

End  of  previous  lecture  


Page 16: IE: Named Entity Recognition (NER)

Acknowledgements

Most slides borrowed or adapted from:
Dan Jurafsky and Christopher Manning, Coursera
Dan Jurafsky and James H. Martin

J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/

Page 17: IE: Named Entity Recognition (NER)

Preliminary: What's Information Extraction (IE)?

•  IE = text analytics = text mining = e-discovery, etc.
•  The ultimate goal is to convert unstructured text into structured information (so information of interest can easily be picked up).
•  Unstructured data/text: email, PDF files, social media posts, tweets, text messages, blogs, basically any running text...
•  Structured data/text: databases (XML, SQL, etc.), ontologies, dictionaries, etc.

Page 18: IE: Named Entity Recognition (NER)

Information Extraction and Named Entity Recognition

Introducing the tasks: Getting simple structured information out of text

Page 19: IE: Named Entity Recognition (NER)

Information Extraction

•  Information extraction (IE) systems
   •  Find and understand limited relevant parts of texts
   •  Gather information from many pieces of text
   •  Produce a structured representation of the relevant information:
      •  relations (in the database sense), a.k.a.
      •  a knowledge base
•  Goals:
   1.  Organize information so that it is useful to people
   2.  Put information in a semantically precise form that allows further inferences to be made by computer algorithms

Page 20: IE: Named Entity Recognition (NER)

Information Extraction: factual info

•  IE systems extract clear, factual information
   •  Roughly: Who did what to whom, when?
•  E.g.:
   •  Gathering earnings, profits, board members, headquarters, etc. from company reports
      •  "The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia."
      •  headquarters("BHP Billiton Limited", "Melbourne, Australia")
   •  Learning drug-gene product interactions from the medical research literature

Page 21: IE: Named Entity Recognition (NER)

Low-level information extraction

•  Is now available, and I think popular, in applications like Apple or Google mail, and web indexing
•  Often seems to be based on regular expressions and name lists
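A toy recognizer in that spirit can be sketched with regular expressions plus a small name list (everything here, from the sample sentence to the patterns and the gazetteer, is illustrative, not taken from any real mail client):

```python
import re

text = "Meet Anna Lindberg at 14:30 on 2016-03-01, or email [email protected]."

# Regular expressions for easy, well-formatted entity types.
patterns = {
    "DATE":  r"\b\d{4}-\d{2}-\d{2}\b",
    "TIME":  r"\b\d{1,2}:\d{2}\b",
    "EMAIL": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
}
# A tiny name list (gazetteer) for person names.
name_list = {"Anna Lindberg"}

entities = [(label, m.group())
            for label, pat in patterns.items()
            for m in re.finditer(pat, text)]
entities += [("PERS", n) for n in name_list if n in text]

print(sorted(entities))
```

This captures why such systems are cheap but brittle: anything not matching a pattern or listed in the gazetteer is simply missed.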

Page 22: IE: Named Entity Recognition (NER)

Low-level information extraction

Page 23: IE: Named Entity Recognition (NER)

Named Entity Recognition (NER)

•  A very important sub-task: find and classify names in text.
•  An entity is a discrete thing like "IBM Corporation"
•  "Named" means called "IBM" or "Big Blue", not "it" or "the company"
•  In practice the task is often extended to times, dates, proteins, instances of products and chemical/biological substances, which aren't really entities but are easy-to-recognize semantic classes

Page 24: IE: Named Entity Recognition (NER)

Named Entity Recognition (NER)

•  A very important sub-task: find and classify names in text, for example:

•  The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.

You have a text, and you want to:
1.  find the things that are names: European Commission, John Lloyd Jones, etc.
2.  give them labels: ORG, PERS, etc.

Page 25: IE: Named Entity Recognition (NER)

Named Entity Recognition (NER)

•  A very important sub-task: find and classify names in text, for example:

•  The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.

Labels used on the slide: Person, Date, Location, Organization

Page 26: IE: Named Entity Recognition (NER)

Named Entity Recognition (NER)

•  The uses:
   •  Named entities can be indexed, linked off, etc.
   •  Sentiment can be attributed to companies or products
   •  A lot of IE relations are associations between named entities
   •  For question answering, answers are often named entities.
•  Concretely:
   •  Many web pages tag various entities, with links to bio or topic pages, etc.
      •  Reuters' OpenCalais, Evri, AlchemyAPI, Yahoo's Term Extraction, …
   •  Apple/Google/Microsoft/… smart recognizers for document content

Page 27: IE: Named Entity Recognition (NER)

Summary: Getting simple structured information out of text

Page 28: IE: Named Entity Recognition (NER)

Evaluation of Named Entity Recognition

The extension of Precision, Recall, and the F measure to sequences

Page 29: IE: Named Entity Recognition (NER)

The Named Entity Recognition Task

Task: Predict entities in a text

Foreign     ORG
Ministry    ORG
spokesman   O
Shen        PER
Guofang     PER
told        O
Reuters     ORG
:           :

Standard evaluation is per entity, not per token

Page 30: IE: Named Entity Recognition (NER)

P/R

P = TP/(TP+FP);  R = TP/(TP+FN)
FP = false alarm (it is not a NE, but it has been classified as one)
FN = it really is a NE, but the system failed to recognise it
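The per-entity scoring described above can be sketched as set operations over (label, start, end) spans; the gold and predicted spans here are hypothetical, invented only for the illustration:

```python
# Entity-level evaluation: an entity counts as correct only if both its
# label and its span match the gold annotation exactly.
gold = {("ORG", 0, 2), ("PER", 3, 5), ("ORG", 6, 7)}
pred = {("PER", 3, 5), ("ORG", 6, 7), ("LOC", 9, 10)}

tp = len(gold & pred)   # correctly predicted entities
fp = len(pred - gold)   # false alarms
fn = len(gold - pred)   # missed entities

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # each is 2/3 here
```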

Page 31: IE: Named Entity Recognition (NER)

Precision/Recall/F1 for IE/NER

•  Recall and precision are straightforward for tasks like IR and text categorization, where there is only one grain size (documents)
•  The measures behave a bit oddly for IE/NER when there are boundary errors (which are common):
   •  First Bank of Chicago announced earnings …
   •  A partial match (wrong boundaries) counts as both a false positive and a false negative
   •  Selecting nothing would have been better
   •  Some other metrics (e.g., the MUC scorer) give partial credit (according to complex rules)

Page 32: IE: Named Entity Recognition (NER)

Summary: Be careful when interpreting the P/R/F1 measures

Page 33: IE: Named Entity Recognition (NER)

Sequence Models for Named Entity Recognition

Page 34: IE: Named Entity Recognition (NER)

The ML sequence model approach to NER

Training:
1.  Collect a set of representative training documents
2.  Label each token with its entity class, or other (O)
3.  Design feature extractors appropriate to the text and classes
4.  Train a sequence classifier to predict the labels from the data

Testing:
1.  Receive a set of testing documents
2.  Run sequence model inference to label each token
3.  Appropriately output the recognized entities

Page 35: IE: Named Entity Recognition (NER)

NER pipeline

Representative documents → Human annotation → Annotated documents → Feature extraction → Training data → Sequence classifiers → NER system

Page 36: IE: Named Entity Recognition (NER)

Encoding classes for sequence labeling

            IO        IOB
Fred        PER       B-PER
showed      O         O
Sue         PER       B-PER
Mengqiu     PER       B-PER
Huang       PER       I-PER
's          O         O
new         O         O
painting    O         O
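The difference between the two encodings can be sketched with a small converter (a sketch of mine, not from the slides). Note the caveat it exposes: IO cannot represent a boundary between adjacent entities, so a mechanical IO-to-IOB conversion would wrongly merge "Sue" and "Mengqiu Huang" into one entity, which is exactly why IOB is annotated directly:

```python
# Convert IO tags to IOB: a B- tag begins wherever the class changes.
def io_to_iob(tags):
    iob = []
    prev = "O"
    for t in tags:
        if t == "O":
            iob.append("O")
        elif t != prev:
            iob.append("B-" + t)   # start of a new entity
        else:
            iob.append("I-" + t)   # continuation of the current entity
        prev = t
    return iob

print(io_to_iob(["PER", "O", "PER", "PER", "PER", "O"]))
# ['B-PER', 'O', 'B-PER', 'I-PER', 'I-PER', 'O']
```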

Page 37: IE: Named Entity Recognition (NER)

Features for sequence labeling

•  Words
   •  Current word (essentially like a learned dictionary)
   •  Previous/next word (context)
•  Other kinds of inferred linguistic classification
   •  Part-of-speech tags
•  Label context
   •  Previous (and perhaps next) label
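A minimal feature extractor covering the feature types listed above might look like this (a sketch: the function and feature names are mine, and real systems add many more features):

```python
# Extract features for the token at position i, given the previous label.
def token_features(tokens, i, prev_label):
    w = tokens[i]
    return {
        "word": w.lower(),                                         # current word
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",    # left context
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
        "is_capitalized": w[0].isupper(),                          # shape-like cue
        "prev_label": prev_label,                                  # label context
    }

feats = token_features(["Shen", "Guofang", "told", "Reuters"], 0, "O")
print(feats["is_capitalized"], feats["prev_word"])  # True <S>
```

Dictionaries like this are what a sequence classifier (MEMM, CRF, etc.) consumes, one per token.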

Page 38: IE: Named Entity Recognition (NER)

Features: Word substrings

•  Example words: Cotrimoxazole (a drug name), Wethersfield (a place name), Alien Fury: Countdown to Invasion (a movie title)
•  Word substrings are strong cues to the entity class: in the slide's counts, "oxa" occurs almost exclusively in drug names, ":" mostly in movie titles, and "field" mostly in place names

[Slide figure: pie charts of class counts (drug, company, movie, place, person) for words containing each substring]

Page 39: IE: Named Entity Recognition (NER)

Features: Word shapes

•  Word shapes: map words to a simplified representation that encodes attributes such as length, capitalization, numerals, Greek letters, internal punctuation, etc.

Varicella-zoster    Xx-xxx
mRNA                xXXX
CPA1                XXXd
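A simplified word-shape mapper can be sketched as follows (a sketch of mine: real systems use more elaborate collapsing schemes, e.g. keeping word prefixes and suffixes intact, which is how the slide gets "Xx-xxx" for Varicella-zoster):

```python
# Map each character to X (uppercase), x (lowercase), d (digit), keeping
# punctuation; then truncate runs longer than max_run identical characters.
def word_shape(w, max_run=2):
    s = "".join("X" if c.isupper() else
                "x" if c.islower() else
                "d" if c.isdigit() else c
                for c in w)
    out = []
    for c in s:
        if len(out) >= max_run and all(p == c for p in out[-max_run:]):
            continue  # skip characters that extend an already-long run
        out.append(c)
    return "".join(out)

print(word_shape("mRNA", max_run=3))  # xXXX
print(word_shape("CPA1", max_run=3))  # XXXd
```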

Page 40: IE: Named Entity Recognition (NER)

Sequence models

•  Once you have designed the features, apply a sequence classifier (cf. PoS tagging), such as:
   •  Maximum Entropy Markov Models
   •  Conditional Random Fields
   •  etc.

Page 41: IE: Named Entity Recognition (NER)

The end