SMT Part 7: Evaluation. Laura Jehl (substitute lecturer), most slides taken from http://www.statmt.org/book/

Mar 18, 2023



Human evaluation

• low cost? (✗): usually turkers or researchers

• tunable? ✗

• meaningful? ✔

• consistent? ✗

• correct? ✔

BLEU = brevity penalty × geometric mean of n-gram precisions
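
The two components named above, the brevity penalty and the geometric mean of n-gram precisions, can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: it assumes whitespace tokenization, a single reference, uniform n-gram weights up to 4-grams, and no smoothing, and it scores one sentence, whereas BLEU is normally computed over a whole test set. The function names `ngrams` and `bleu` are made up for this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU against one reference, without smoothing (assumption)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it occurs
        # in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # the geometric mean is zero if any precision is zero
        log_precision_sum += math.log(matches / total)
    geometric_mean = math.exp(log_precision_sum / max_n)
    # Brevity penalty: penalize hypotheses that are shorter than the reference.
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * geometric_mean
```

Because the precisions are combined by a geometric mean, a single n-gram order without any match drives the unsmoothed score to zero, which is one reason sentence-level BLEU is usually smoothed in practice.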

Automatic evaluation

                   Human    Automatic

• low cost?        (✗)      ✔

• tunable?         ✗        ✔

• meaningful?      ✔        ✗

• consistent?      ✗        ✔

• correct?         ✔        (✗)

Core concepts

• Null hypothesis: assumption that there is no real difference between the systems

• p-level (p-value): probability of seeing the observed or a more extreme result if the null hypothesis is true

  – p-level < 0.01: in more than 99% of cases we expect to see a less extreme result if the null hypothesis is true

  – at a p-level ≤ 0.05 we normally say that there is a significant difference

Testing for significance

• Idea: If systems A and B are not different, then randomly swapping translations between them should produce a similar score difference.

Systems A and B: |S(A) - S(B)| = 0.6

After swapping, systems A’ and B’: |S(A’) - S(B’)| < 0.6 ?

Testing for significance

• Repeat this many times and count the number of times that |S(A’) - S(B’)| > |S(A) - S(B)|

Testing for significance

• This test is called the approximate randomization test

• Usually run for several thousand iterations

• The percentage of times that |S(A’) - S(B’)| > |S(A) - S(B)| is an approximation of the p-level

• Rule of thumb: A BLEU difference of 1.0 or more is significant
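
The procedure described on these slides can be sketched as follows. This is a minimal sketch under stated assumptions, not the course's implementation: each system is assumed to have one score per test sentence and the system score S is taken to be the mean of those scores, whereas for a corpus-level metric such as BLEU the metric would have to be recomputed on the shuffled outputs. The function name `approximate_randomization` and its parameters are hypothetical. The sketch also counts ties (`>=` rather than the strict `>` on the slide), a common variant that avoids reporting a p-level of 0 when the two systems are identical.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=5000, seed=0):
    """Estimate the p-level for the score difference between systems A and B.

    scores_a[i] and scores_b[i] are the (assumed) per-sentence scores of the
    two systems on test sentence i; the system score S is their mean.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n  # |S(A) - S(B)|
    at_least_as_extreme = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0
        for a, b in zip(scores_a, scores_b):
            # Swap the two translations of this sentence with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            sum_a += a
            sum_b += b
        shuffled = abs(sum_a - sum_b) / n  # |S(A') - S(B')|
        if shuffled >= observed:
            at_least_as_extreme += 1
    # The fraction of shuffles with a difference at least as large as the
    # observed one approximates the p-level.
    return at_least_as_extreme / trials
```

A returned value of at most 0.05 would, by the convention above, be read as a significant difference between the two systems.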

Source: Stefan Riezler, SMT course notes (2012)

Games with a purpose

• B: The trees are on the verge of its greenery or drop in the sky and clouds gather to draw.

Games with a purpose

• A: The trees are bare or shortly before throw their leaves and draw storm clouds on the sky.

Summary

• Machine translation evaluation is hard!

• Human evaluation is meaningful and correct, but not tunable or consistent

• Several automatic evaluation measures are available which are low-cost, tunable and consistent, but not meaningful

• Correctness of automatic measures can be evaluated by correlation with human judgments

• Significance tests should be used to determine if two systems are really different
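
The correlation with human judgments mentioned in the summary is commonly measured with Pearson's correlation coefficient. The following is a small sketch of that computation; the function name `pearson` is made up for this example, and the score lists are hypothetical numbers invented purely for illustration, not results from any evaluation.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if one of the lists is constant.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-system scores, made up for illustration only.
human_scores = [2.1, 3.0, 3.4, 4.2]       # e.g. average human judgment per system
metric_scores = [18.5, 22.0, 24.1, 27.3]  # e.g. automatic metric score per system
print(pearson(human_scores, metric_scores))
```

A value close to 1 would indicate that the automatic measure ranks and spaces the systems much as the human judges do; a value near 0 would indicate no linear relationship.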

Important concepts

• Adequacy, Fluency

• Kappa-value

• Human vs. automatic evaluation

• BLEU-Score

• Pearson’s Correlation

• Approximate randomization test

• Task-based evaluation