Machine Reading the Web Estevam R. Hruschka Jr. Federal University of São Carlos
Transcript
Page 1: Machine Reading the Web

Machine Reading the Web

Estevam R. Hruschka Jr., Federal University of São Carlos

Page 2: Machine Reading the Web

Disclaimers
• A previous version of this tutorial was presented at IBERAMIA 2012 (http://iberamia2012.dsic.upv.es/tutorials/).
• Feel free to e-mail me ([email protected]) with questions about this tutorial or any feedback, suggestions, or criticism. Your feedback can help improve the quality of these slides and is very welcome.
• Like many tutorial slides, these were prepared to be presented and later studied, so they are meant to be more self-contained than the slides for a paper presentation.

Page 3: Machine Reading the Web

Disclaimers
• Due to time constraints, I do not intend to cover all the algorithms and publications related to YAGO, KnowItAll and NELL. Instead, I intend to give an overview of the three projects and of the main approach each one uses to “Read the Web”.
• YAGO, KnowItAll and NELL are not the only research efforts focusing on “Reading the Web”. They were selected for this tutorial because they show three different and very relevant approaches to the problem; this does not mean they are the best ones.

Page 4: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 5: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 6: Machine Reading the Web

Picture  taken  from  [Fern,  2008]    

Page 7: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 8: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 9: Machine Reading the Web

Picture  taken  from  [DARPA,  2012]    

Page 10: Machine Reading the Web

Picture  taken  from  [DARPA,  2012]    

Page 11: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 12: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 13: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 14: Machine Reading the Web

The YAGO-NAGA Project: Harvesting, Searching, and Ranking Knowledge from the Web

Page 15: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 16: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 17: Machine Reading the Web

KnowItAll  

Page 18: Machine Reading the Web

KnowItAll: Open Information Extraction

Page 19: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 20: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 21: Machine Reading the Web

NELL  

Page 22: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 23: Machine Reading the Web

Machine  Learning  

• What is Machine Learning? The field of Machine Learning seeks to answer the question “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?” [Mitchell, 2006]

Page 24: Machine Reading the Web

Machine  Learning  

• What is Machine Learning? A machine learns with respect to a particular
  - task T
  - performance metric P
  - type of experience E
  if the system reliably improves its performance P at task T, following experience E. [Mitchell, 1997]

Page 25: Machine Reading the Web

Machine  Learning  

• Examples of Machine Learning approaches for different tasks (T), performance metrics (P) and experiences (E):
  - data mining
  - autonomous discovery
  - database updating
  - programming by example
  - pattern recognition

Page 26: Machine Reading the Web

Machine  Learning  

• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning

Page 27: Machine Reading the Web

Supervised  Learning  

Page 28: Machine Reading the Web

Supervised  Learning  

[Figure: plot of two data series (Series1, Series2); both axes run from 0 to 25]

Page 29: Machine Reading the Web

Supervised  Learning  

[Figure: plot of two data series (Series1, Series2); both axes run from 0 to 25]

Page 30: Machine Reading the Web

Supervised  Learning  

[Figure: plot of two data series (Series1, Series2); both axes run from 0 to 25]

Page 31: Machine Reading the Web

Supervised  Learning  

[Figure: plot of two data series (Series1, Series2); both axes run from 0 to 25]

Page 32: Machine Reading the Web

Supervised  Learning  

[Figure: plot of two data series (Series1, Series2); both axes run from 0 to 25]

Page 33: Machine Reading the Web

Supervised  Learning  

[Figure: plot of two data series (Series1, Series2); both axes run from 0 to 25]

Page 34: Machine Reading the Web

Supervised  Learning  

[Figure: plot of two data series (Series1, Series2); both axes run from 0 to 25]

Page 35: Machine Reading the Web

Supervised  Learning  

[Figure: plot of two data series (Series1, Series2); both axes run from 0 to 25]

Page 36: Machine Reading the Web

Supervised  Learning  

[Figure: plot of two data series (Series1, Series2); both axes run from 0 to 25]

Page 37: Machine Reading the Web

Supervised  Learning  

[Figure: plot of two data series (Series1, Series2); both axes run from 0 to 25]

Page 38: Machine Reading the Web

Supervised  Learning  

[Figure: plot of two data series (Series1, Series2); both axes run from 0 to 25]

Page 39: Machine Reading the Web

Supervised  Learning  

[Figure: plot of two data series (Series1, Series2); both axes run from 0 to 25]

Page 40: Machine Reading the Web

Unsupervised  Learning  

[Figure: plot without a series legend; both axes run from 0 to 25]

Page 41: Machine Reading the Web

Unsupervised  Learning  

[Figure: plot without a series legend; both axes run from 0 to 25]

Page 42: Machine Reading the Web

Unsupervised  Learning  

[Figure: plot without a series legend; both axes run from 0 to 25]

Page 43: Machine Reading the Web

Unsupervised  Learning  

[Figure: plot without a series legend; both axes run from 0 to 25]

Page 44: Machine Reading the Web

Semi-­‐supervised  Learning  

[Figure: plot of three data series (Series1, Series2, Unlabeled); both axes run from 0 to 25]

Page 45: Machine Reading the Web

Semi-­‐supervised  Learning  

[Figure: plot of three data series (Series1, Series2, Unlabeled); both axes run from 0 to 25]

Page 46: Machine Reading the Web

Semi-­‐supervised  Learning  

[Figure: plot of three data series (Series1, Series2, Unlabeled); both axes run from 0 to 25]

Page 47: Machine Reading the Web

Semi-­‐supervised  Learning  

[Figure: plot of three data series (Series1, Series2, Unlabeled); both axes run from 0 to 25]

Page 48: Machine Reading the Web

Semi-­‐supervised  Learning  

[Figure: plot of three data series (Series1, Series2, Unlabeled); both axes run from 0 to 25]

Page 49: Machine Reading the Web

Semi-­‐supervised  Learning  

[Figure: plot of three data series (Series1, Series2, Unlabeled); both axes run from 0 to 25]

Page 50: Machine Reading the Web

Semi-­‐supervised  Learning  

[Figure: plot of three data series (Series1, Series2, Unlabeled); both axes run from 0 to 25]

Page 51: Machine Reading the Web

Semi-­‐supervised  Learning  

[Figure: plot of three data series (Series1, Series2, Unlabeled); both axes run from 0 to 25]

Page 52: Machine Reading the Web

Semi-­‐supervised  Learning  

[Figure: plot of three data series (Series1, Series2, Unlabeled); both axes run from 0 to 25]

Page 53: Machine Reading the Web

Semi-­‐supervised  Learning  

[Figure: plot of three data series (Series1, Series2, Unlabeled); both axes run from 0 to 25]

Page 54: Machine Reading the Web

Semi-­‐supervised  Learning  

[Figure: plot of three data series (Series1, Series2, Unlabeled); both axes run from 0 to 25]

Page 55: Machine Reading the Web

Semi-­‐supervised  Learning  

[Figure: plot of three data series (Series1, Series2, Unlabeled); both axes run from 0 to 25]

Page 56: Machine Reading the Web

Semi-­‐supervised  Learning  

[Figure: plot of three data series (Series1, Series2, Unlabeled); both axes run from 0 to 25]

Page 57: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 58: Machine Reading the Web

Machine  Reading  

• “The autonomous understanding of text” [Etzioni et al., 2007]

• “One of the most important methods by which human beings learn is by reading” [Clark et al., 2007]. So why not build machines capable of learning by reading?

Page 59: Machine Reading the Web

Machine  Reading  

• “The problem of deciding what was implied by a written text, of reading between the lines is the problem of inference.” [Norvig, 2007]

• Typically, Machine Reading is different from Natural Language Processing alone

Page 60: Machine Reading the Web

Machine  Reading  

Page 61: Machine Reading the Web

Machine  Reading  

Page 62: Machine Reading the Web

Machine  Reading  

Page 63: Machine Reading the Web

Machine  Reading  

Page 64: Machine Reading the Web

Machine  Reading  

• One important approach to machine reading is to extract facts from text and store them in a structured form.

• Facts can be seen as entities and their relations.

• An ontology is one of the most common representations for the extracted facts.

 

Page 65: Machine Reading the Web

It’s about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder. Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomes acquainted with the members of the extended Vanger family, most of whom resent his presence. He does, however, start a short lived affair with Cecilia, the niece of Henrik. After discovering that Salander has hacked into his computer, he persuades her to assist him with research. They eventually become lovers, but Blomkvist has trouble getting close to Lisbeth who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer. A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep background investigations for Dragan Armansky, who, in turn, worries that Lisbeth Salander is “the perfect victim for anyone who wished her ill."

Machine Reading

This  slide  was  adapted  from  [Hady  et  al.,  2011]    

Page 66: Machine Reading the Web

Machine Reading

[Figure: annotated passage; recoverable label: “same”]

This  slide  was  adapted  from  [Hady  et  al.,  2011]    


Page 67: Machine Reading the Web

Machine Reading

[Figure: the passage from page 65, shown again with annotations; recoverable label: “same” (six occurrences)]

This  slide  was  adapted  from  [Hady  et  al.,  2011]    

Page 68: Machine Reading the Web

Machine Reading

[Figure: the passage from page 65, shown again with annotations; recoverable labels: “same” (six occurrences), uncleOf, owns, hires, headOf]

This  slide  was  adapted  from  [Hady  et  al.,  2011]    

Page 69: Machine Reading the Web

Machine Reading

[Figure: the passage from page 65, shown again with annotations; recoverable labels: “same” (six occurrences), uncleOf, owns, hires, headOf, affairWith (two occurrences), enemyOf]

This  slide  was  adapted  from  [Hady  et  al.,  2011]    

Page 70: Machine Reading the Web

Machine  Reading  

• Ontology Representation

• Named Entity Resolution/Extraction

• Relation Extraction

Page 71: Machine Reading the Web

Machine  Reading  

• Ontology Representation

Facts (RDF triples):
1: (Jim, hasAdvisor, Mike)
2: (Surajit, hasAdvisor, Jeff)
3: (Madonna, marriedTo, GuyRitchie)
4: (Nicolas, marriedTo, Carla)
5: (ManchesterU, wonCup, ChampionsLeague)

Reification (“facts about facts”):
6: (1, inYear, 1968)
7: (2, inYear, 2006)
8: (3, validFrom, 22-Dec-2000)
9: (3, validUntil, Nov-2008)
10: (4, validFrom, 2-Feb-2008)
11: (2, source, SigmodRecord)
12: (5, inYear, 1999)
13: (5, location, CampNou)
14: (5, source, Wikipedia)
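The numbered facts and the facts about them can be modelled with two small tables. The following Python sketch is only an illustration of the idea of reification, not the storage format YAGO uses; the names `facts`, `meta` and `facts_about` are hypothetical.

```python
# Base facts keyed by identifier: id -> (subject, predicate, object).
facts = {
    1: ("Jim", "hasAdvisor", "Mike"),
    2: ("Surajit", "hasAdvisor", "Jeff"),
    3: ("Madonna", "marriedTo", "GuyRitchie"),
    4: ("Nicolas", "marriedTo", "Carla"),
    5: ("ManchesterU", "wonCup", "ChampionsLeague"),
}

# Reified facts: the subject is the identifier of another fact.
meta = {
    6: (1, "inYear", "1968"),
    7: (2, "inYear", "2006"),
    8: (3, "validFrom", "22-Dec-2000"),
    9: (3, "validUntil", "Nov-2008"),
    10: (4, "validFrom", "2-Feb-2008"),
    11: (2, "source", "SigmodRecord"),
    12: (5, "inYear", "1999"),
    13: (5, "location", "CampNou"),
    14: (5, "source", "Wikipedia"),
}


def facts_about(fact_id):
    """Collect every (predicate, object) pair that annotates the given fact."""
    return {pred: obj for subj, pred, obj in meta.values() if subj == fact_id}
```

Calling `facts_about(5)` gathers the year, location and source recorded for the ManchesterU fact.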

Page 72: Machine Reading the Web

Machine  Reading  

• Named Entity Resolution [Theobald & Weikum, 2012]
  – Which individual entities belong to which classes?
    • instanceOf (Surajit Chaudhuri, computer scientists)
    • instanceOf (Barbara Liskov, computer scientists)
    • instanceOf (Barbara Liskov, female humans)
    • …
  – Which names denote which entities?
    • means (“Lady Di”, Diana Spencer)
    • means (“Diana Frances Mountbatten-Windsor”, Diana Spencer)
    • …
    • means (“Madonna”, Madonna Louise Ciccone)
    • means (“Madonna”, Madonna (painting by Edvard Munch))
    • …
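The `means` examples show that one name can denote several entities and one entity can have several names. A minimal Python sketch of such a name index follows; `MEANS`, `build_name_index` and `candidates` are hypothetical names, and choosing among the candidates (disambiguation) is deliberately left out.

```python
from collections import defaultdict

# (name, entity) pairs taken from the `means` examples on the slide.
MEANS = [
    ("Lady Di", "Diana Spencer"),
    ("Diana Frances Mountbatten-Windsor", "Diana Spencer"),
    ("Madonna", "Madonna Louise Ciccone"),
    ("Madonna", "Madonna (painting by Edvard Munch)"),
]


def build_name_index(pairs):
    """Map each surface name to the set of entities it may denote."""
    index = defaultdict(set)
    for name, entity in pairs:
        index[name].add(entity)
    return index


def candidates(index, name):
    """Return the candidate entities for a name, sorted for stable output."""
    return sorted(index.get(name, ()))


index = build_name_index(MEANS)
```

A name with more than one candidate, such as “Madonna”, is exactly the case where context is needed to pick the right entity.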

Page 73: Machine Reading the Web

Machine  Reading  

• Relation Extraction [Theobald & Weikum, 2012]
  – Which instances (pairs of individual entities) are there for given binary relations with specific type signatures?
    • hasAdvisor (JimGray, MikeHarrison)
    • hasAdvisor (HectorGarcia-Molina, Gio Wiederhold)
    • hasAdvisor (Susan Davidson, Hector Garcia-Molina)
    • graduatedAt (JimGray, Berkeley)
    • graduatedAt (HectorGarcia-Molina, Stanford)
    • hasWonPrize (JimGray, TuringAward)
    • bornOn (JohnLennon, 9Oct1940)
    • diedOn (JohnLennon, 8Dec1980)
    • marriedTo (JohnLennon, YokoOno)
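A type signature restricts which pairs of entities may instantiate a relation. The sketch below illustrates that check in Python under stated assumptions: the signatures and the class assignments in `SIGNATURES` and `TYPES` are invented for illustration and are not taken from any of the systems discussed.

```python
# Hypothetical type signatures: relation -> (domain class, range class).
SIGNATURES = {
    "hasAdvisor": ("person", "person"),
    "graduatedAt": ("person", "university"),
    "bornOn": ("person", "date"),
}

# Hypothetical class assignments for a few entities from the slide.
TYPES = {
    "JimGray": "person",
    "MikeHarrison": "person",
    "JohnLennon": "person",
    "Berkeley": "university",
    "9Oct1940": "date",
}


def matches_signature(relation, subject, obj):
    """True if both arguments have the classes the relation expects."""
    if relation not in SIGNATURES:
        return False
    domain, range_ = SIGNATURES[relation]
    return TYPES.get(subject) == domain and TYPES.get(obj) == range_
```

A candidate such as graduatedAt (JimGray, MikeHarrison) would be rejected here because the second argument is not a university.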

Page 74: Machine Reading the Web

Machine  Reading  

• Relation Discovery
  – Which new relations are there for a given pair of entities?
    • hasAdvisor (JimGray, MikeHarrison)

Page 75: Machine Reading the Web

Machine  Reading  

• Relation Discovery
  – Which new relations are there for a given pair of entities?
    • hasAdvisor (JimGray, MikeHarrison)
    • hasCoAuthor (HectorGarcia-Molina, Gio Wiederhold)

Page 76: Machine Reading the Web

Machine  Reading  

• Relation Discovery
  – Which new relations are there for a given pair of entities?
    • hasAdvisor (JimGray, MikeHarrison)
    • hasCoAuthor (HectorGarcia-Molina, Gio Wiederhold)
    • graduatedAt (JimGray, Berkeley)

Page 77: Machine Reading the Web

Machine  Reading  

• Relation Discovery
  – Which new relations are there for a given pair of entities?
    • hasAdvisor (JimGray, MikeHarrison)
    • hasCoAuthor (HectorGarcia-Molina, Gio Wiederhold)
    • graduatedAt (JimGray, Berkeley)
    • studiedAt (HectorGarcia-Molina, Stanford)
    • bornOn (JohnLennon, 9Oct1940)
    • releasedAlbum (JohnLennon, 10Dec1965)

Page 78: Machine Reading the Web

Machine Reading
• Named Entity Resolution/Extraction and Relation Extraction
  – Semi-structured data: the “Low-Hanging Fruit”
    • Wikipedia infoboxes & categories
    • HTML lists & tables, etc.
  – Free text
    • Hearst patterns; clustering by verbal phrases
    • Natural-language processing
    • Advanced patterns & iterative bootstrapping (“Dual Iterative Pattern Relation Expansion”)
  – POS tagging and NP chunking:
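To make the free-text case concrete, here is a minimal Python sketch of one well-known Hearst pattern, “<class> such as <instance>, <instance> and <instance>”. It is an illustration only: capitalised words stand in for noun phrases, whereas a real extractor would rely on POS tagging and NP chunking, and the names `SUCH_AS` and `hearst_such_as` are hypothetical.

```python
import re

# "<class> such as <Instance>, <Instance> and <Instance>".
# A lowercase word is taken as the class; capitalised words as instances.
SUCH_AS = re.compile(
    r"(?P<cls>[a-z]+) such as (?P<items>[A-Z]\w+(?:(?:, | and | or )[A-Z]\w+)*)"
)


def hearst_such_as(sentence):
    """Return (instance, class) pairs found by the "such as" pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        # Split the enumerated instances on the same separators the pattern allows.
        for item in re.split(r", | and | or ", match.group("items")):
            pairs.append((item, match.group("cls")))
    return pairs
```

For the sentence “He admires poets such as Cohen, Atwood and Neruda.” the function yields one (instance, class) pair per name, each with the class `poets`.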

Page 79: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 80: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 81: Machine Reading the Web

The YAGO-NAGA Project: Harvesting, Searching, and Ranking Knowledge from the Web

Page 82: Machine Reading the Web

The YAGO-NAGA Project: Harvesting, Searching, and Ranking Knowledge from the Web

Page 83: Machine Reading the Web

YAGO  

• Yet Another Great Ontology (YAGO)
• Main goal: building a conveniently searchable, large-scale, highly accurate knowledge base of common facts in a machine-processable representation

Page 84: Machine Reading the Web

YAGO  

• Turn the Web into a Knowledge Base [Weikum et al., 2009]
  – Building a comprehensive knowledge base of human knowledge
  – Knowledge from Wikipedia and WordNet
  – The ontology checks itself for precision

Page 85: Machine Reading the Web

YAGO  

• The knowledge base is automatically constructed from Wikipedia.

• Each article in Wikipedia becomes an entity in the knowledge base (e.g., since Leonard Cohen has an article in Wikipedia, LeonardCohen becomes an entity in YAGO).

Page 86: Machine Reading the Web

YAGO  

Page 87: Machine Reading the Web

YAGO  Free  Text  

Page 88: Machine Reading the Web

YAGO  Free  Text  

Page 89: Machine Reading the Web

YAGO  Free  Text  

InfoBox  

Page 90: Machine Reading the Web

YAGO  Wikipedia  InfoBox  

Page 91: Machine Reading the Web

YAGO  Wikipedia  InfoBox  

Semi-structured data: the “Low-Hanging Fruit”

Page 92: Machine Reading the Web

YAGO  Wikipedia  InfoBox  

Semi-structured data: the “Low-Hanging Fruit”

Page 93: Machine Reading the Web

YAGO  

• Certain categories are exploited to deliver type information (e.g., the article about Leonard Cohen is in the category Canadian poets, so he becomes a Canadian poet).

Page 94: Machine Reading the Web

YAGO  

Page 95: Machine Reading the Web

YAGO  

Page 96: Machine Reading the Web

YAGO
• For each category of a page [Hoffart et al., 2012]:
  – Using shallow parsing, determine the head word of the category name. In the example of Canadian poets, the head word is poets.
  – If the head word is plural, propose the category as a class and the article entity as an instance.
  – Link the class to the WordNet taxonomy (using the most frequent sense of the head word in WordNet).
• Only countable nouns can appear in plural form.
• Only countable nouns can be ontological classes.
• Thematic categories (such as Canadian poetry) are different from conceptual categories.

Page 97: Machine Reading the Web

YAGO  

• Head words that are not conceptual even though they appear in plural (such as stubs in Canadian poetry stubs) are in the first list of exceptions.

• Words that map not to their most frequent sense but to a different one are in the second exception list.
  – For example, the word capital refers to the main city of a country in the majority of cases and not to the financial amount, which is the most frequent sense in WordNet.
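The category heuristic and its two exception lists can be sketched in a few lines of Python. This is a simplified illustration, not YAGO's implementation: taking the last word as the head word replaces shallow parsing, the plural test is crude, and the contents of `NON_CONCEPTUAL_HEADS` and `SENSE_OVERRIDES` are hypothetical stand-ins for the real exception lists.

```python
# Hypothetical exception lists; the real YAGO lists are not reproduced here.
NON_CONCEPTUAL_HEADS = {"stubs"}  # plural head words that are not classes
SENSE_OVERRIDES = {"capital": "capital (main city of a country)"}


def head_word(category):
    """Naive stand-in for shallow parsing: take the last word of the name.

    This fails for names such as "Presidents of France", where the head
    word is not the last word.
    """
    return category.split()[-1].lower()


def is_plural(word):
    """Crude plural test; a real system would use a morphological analyser."""
    return word.endswith("s") and not word.endswith("ss")


def class_for_category(category):
    """Return the category as a class name if it looks conceptual, else None."""
    head = head_word(category)
    if not is_plural(head) or head in NON_CONCEPTUAL_HEADS:
        return None
    return category


def wordnet_sense(head_singular):
    """Use the overridden sense if one is listed, else the most frequent sense."""
    return SENSE_OVERRIDES.get(head_singular, f"{head_singular} (most frequent sense)")
```

With these definitions, “Canadian poets” is proposed as a class, while “Canadian poetry” and “Canadian poetry stubs” are both rejected.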

Page 98: Machine Reading the Web

YAGO
• About 100 manually defined relations
  – wasBornOnDate
  – locatedIn
  – hasPopulation
• Categories and infoboxes are exploited to deliver facts (instances of relations).
• Manually defined patterns map categories and infobox attributes to fact templates.
  – For example, the infobox attribute born=Montreal yields wasBornIn(LeonardCohen, Montreal).
• Pattern-based extraction resulted in 2 million extracted entities and 20 million facts.
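The mapping from infobox attributes to fact templates can be pictured as a lookup table. The Python sketch below assumes a hypothetical table `ATTRIBUTE_TO_RELATION` with two entries; the real patterns are manually defined and far more numerous.

```python
# Hypothetical pattern table: infobox attribute -> relation name.
ATTRIBUTE_TO_RELATION = {
    "born": "wasBornIn",
    "population": "hasPopulation",
}


def facts_from_infobox(entity, infobox):
    """Turn infobox attributes that match a pattern into (relation, entity, value) facts."""
    facts = []
    for attribute, value in infobox.items():
        relation = ATTRIBUTE_TO_RELATION.get(attribute)
        if relation is not None:  # attributes without a pattern are ignored
            facts.append((relation, entity, value))
    return facts
```

Applied to the entity `LeonardCohen` with the infobox `{"born": "Montreal"}`, the function produces the fact wasBornIn(LeonardCohen, Montreal).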

Page 99: Machine Reading the Web

YAGO
• Based on declarative rules (stored in text files)
• The rules take the form of subject-predicate-object triples, so they are basically additional facts
• There are different types of rules

Page 100: Machine Reading the Web

YAGO
• Factual rules: definitions of all relations, their domains and ranges, and of the classes that make up the YAGO hierarchy of literal types.
• Implication rules: express that if certain facts appear in the knowledge base, then another fact shall be added. These are Horn clause rules.
• Replacement rules: for interpreting micro-formats, cleaning up HTML tags, and normalizing numbers.
• Extraction rules: apply primarily to patterns found in the Wikipedia infoboxes, but also to Wikipedia categories, article titles, and even other regular elements in the source such as headings, links, or references.
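Implication rules can be read as forward chaining over the fact set: keep applying rules until nothing new is derived. The Python sketch below shows that loop with two example Horn clauses. Both rules (symmetry of marriedTo and transitivity of locatedIn) are assumptions chosen for illustration, not rules quoted from YAGO.

```python
def symmetric_marriage(facts):
    """marriedTo(x, y) => marriedTo(y, x)"""
    return {(y, "marriedTo", x) for (x, p, y) in facts if p == "marriedTo"}


def located_in_transitive(facts):
    """locatedIn(x, y) and locatedIn(y, z) => locatedIn(x, z)"""
    located = {(x, y) for (x, p, y) in facts if p == "locatedIn"}
    return {
        (x, "locatedIn", z)
        for (x, y) in located
        for (y2, z) in located
        if y == y2
    }


def forward_chain(facts, rules):
    """Apply every rule repeatedly until no rule adds a new fact."""
    known = set(facts)
    while True:
        new = set().union(*(rule(known) for rule in rules)) - known
        if not new:
            return known
        known |= new
```

Starting from locatedIn(Montreal, Quebec) and locatedIn(Quebec, Canada), the loop adds locatedIn(Montreal, Canada) and then stops because a further pass derives nothing new.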

Page 101: Machine Reading the Web

YAGO
• Automatically verifies consistency
  – Check uniqueness of functional arguments
    • spouse(x,y) ∧ diff(y,z) ⇒ ¬spouse(x,z)
  – Check domains and ranges of relations
    • spouse(x,y) ⇒ female(x)
    • spouse(x,y) ⇒ male(y)
    • spouse(x,y) ⇒ (f(x)∧m(y)) ∨ (m(x)∧f(y))
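The uniqueness check for a functional relation, spouse(x,y) ∧ diff(y,z) ⇒ ¬spouse(x,z), amounts to finding subjects with more than one distinct object. A minimal Python sketch follows; the function name `functional_violations` and the sample facts in the usage note are hypothetical.

```python
from collections import defaultdict


def functional_violations(facts, relation):
    """Return subjects with more than one distinct object for a functional relation."""
    objects = defaultdict(set)
    for subject, predicate, obj in facts:
        if predicate == relation:
            objects[subject].add(obj)
    # Only subjects with conflicting objects are reported.
    return {s: sorted(o) for s, o in objects.items() if len(o) > 1}
```

Given the facts spouse(A, B), spouse(A, C) and spouse(D, E), only A is reported, together with its two conflicting objects.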

 

Page 102: Machine Reading the Web

YAGO
• Automatically verifies consistency
  – Hard constraint
    • hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
  – Soft constraint
    • firstPaper(x,p) ∧ firstPaper(y,q) ∧ author(p,x) ∧ author(p,y) ∧ inYear(p) > inYear(q) + 5years ⇒ hasAdvisor(x,y) [0.6]
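The difference between the two kinds of constraint can be sketched in Python: a hard constraint filters candidates out, while a soft constraint only contributes a weight. The function names, the treatment of missing years, and the reduction of the author(p,x) ∧ author(p,y) premises to a single `coauthored` flag are simplifying assumptions, not the mechanism the system uses.

```python
def violates_advisor_order(advisee_year, advisor_year):
    """Hard constraint: the advisor must have graduated before the advisee (s < t)."""
    return not (advisor_year < advisee_year)


def filter_candidates(candidates, graduation_year):
    """Drop hasAdvisor(x, y) candidates that break the hard constraint.

    Candidates with an unknown graduation year are kept, since the
    constraint cannot be evaluated for them.
    """
    kept = []
    for advisee, advisor in candidates:
        t = graduation_year.get(advisee)
        s = graduation_year.get(advisor)
        if t is None or s is None or not violates_advisor_order(t, s):
            kept.append((advisee, advisor))
    return kept


def soft_advisor_score(first_paper_year_x, first_paper_year_y, coauthored, weight=0.6):
    """Soft constraint: return the rule weight if its premises hold, else 0.0."""
    if coauthored and first_paper_year_x > first_paper_year_y + 5:
        return weight
    return 0.0
```

A candidate whose advisor graduated after the advisee is removed outright, whereas the soft rule merely adds 0.6 of evidence for hasAdvisor(x,y) when its premises hold.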

 

Page 103: Machine Reading the Web

YAGO  

• Ontology Representation
  – Entities and relations of public interest
  – Formats: TSV, RDF, XML, N3, Web interface
  – Learns
    • instances and patterns from Wikipedia;
    • taxonomy from WordNet;
    • geotag information from GeoNames.

Page 104: Machine Reading the Web

YAGO  

• Named Entity Resolution/Extraction [Theobald & Weikum, 2012]
  – Based on rules and patterns extracted from Wikipedia
  – Disambiguation is a relevant issue
  – Semi-structured data: the “Low-Hanging Fruit”
    • Wikipedia infoboxes & categories
    • HTML lists & tables, etc.

Page 105: Machine Reading the Web

Machine Reading

[Figure: the passage from page 65, shown again]

This  slide  was  adapted  from  [Hady  et  al.,  2011]    

Page 106: Machine Reading the Web

It’s about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder. Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomes acquainted with the members of the extended Vanger family, most of whom resent his presence. He does, however, start a short lived affair with Cecilia, the niece of Henrik. After discovering that Salander has hacked into his computer, he persuades her to assist him with research. They eventually become lovers, but Blomkvist has trouble getting close to Lisbeth who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer. A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep background investigations for Dragan Armansky, who, in turn, worries that Lisbeth Salander is “the perfect victim for anyone who wished her ill."

Machine Reading

This slide was adapted from [Hady et al., 2011]

Page 107: Machine Reading the Web

YAGO  

• Relation Extraction [Theobald & Weikum, 2012]
  – Based on rules and patterns extracted from Wikipedia
  – Semi-structured data: The “Low-Hanging Fruit”
    • Wikipedia infoboxes & categories
    • HTML lists & tables, etc.

Page 108: Machine Reading the Web


Machine Reading

This slide was adapted from [Hady et al., 2011]

Page 109: Machine Reading the Web

Machine Reading

[Figure labels: same]

This slide was adapted from [Hady et al., 2011]


Page 110: Machine Reading the Web


Machine Reading

[Figure labels: same (×6)]

This slide was adapted from [Hady et al., 2011]

Page 111: Machine Reading the Web


Machine Reading

[Figure labels: same (×6); uncleOf; owns; hires; headOf]

This slide was adapted from [Hady et al., 2011]

Page 112: Machine Reading the Web

YAGO  

• YAGO2: Exploring and Querying World Knowledge in Time, Space, Context, and Many Languages
  – New relations specifically designed to cover time, space and context
  – Wikipedia translated pages as sources for other languages

Page 113: Machine Reading the Web

YAGO  

• More on YAGO:
  – Very nice tutorials:
    • "Semantic Knowledge Bases from Web Sources" at IJCAI 2011, Barcelona, July 2011
    • "Harvesting Knowledge from Web Data and Text" at CIKM 2010, Toronto, October 2010
    • "From Information to Knowledge: Harvesting Entities and Relationships from Web Sources" at PODS 2010, Indianapolis, June 2010
  – Project website:
    • http://www.mpi-inf.mpg.de/yago-naga/

Page 114: Machine Reading the Web

YAGO

• More on YAGO (http://www.mpi-inf.mpg.de/yago-naga/)

Page 115: Machine Reading the Web

YAGO

• More on YAGO (http://www.mpi-inf.mpg.de/yago-naga/)

Page 116: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 117: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 118: Machine Reading the Web

KnowItAll  

Page 119: Machine Reading the Web

KnowItAll: Open Information Extraction

Page 120: Machine Reading the Web

KnowItAll: Open Information Extraction

Page 121: Machine Reading the Web

KnowItAll  

• Motivation: New Paradigm for Search [Etzioni, 2008]
  – The future of Web Search
  – Read the Web instead of retrieving Web pages to perform Web Search

Page 122: Machine Reading the Web

KnowItAll  

• Information Extraction (IE) + tractable inference
  – IE(sentence) = who did what?
    • speaker(P. Smith, ECMLPKDD2012)
  – Inference = uncover implicit information
    • Will the Pittsburgh Steelers be champions again?
• Open Information Extraction [Banko et al., 2007]

Page 123: Machine Reading the Web

Open Information Extraction [Banko et al., 2007]

• Open IE systems avoid specific nouns and verbs
• Extractors are unlexicalized, formulated only in terms of:
  – syntactic tokens (e.g., part-of-speech tags)
  – closed-word classes (e.g., of, in, such as)
• Open IE extractors focus on generic ways in which relationships are expressed in English
  – naturally generalizing across domains

Page 124: Machine Reading the Web

Open Information Extraction

• Open IE systems are traditionally based on three steps [Etzioni et al., 2011]:
  – 1. Label: sentences are automatically labeled with extractions using heuristics or distant supervision.
  – 2. Learn: a relation phrase extractor is learned using a sequence-labeling graphical model (e.g., CRF).
  – 3. Extract: given a sentence as input, the system identifies a candidate pair of NP arguments (Arg1, Arg2) from the sentence, and then uses the learned extractor to label each word between the two arguments as part of the relation phrase or not.

Page 125: Machine Reading the Web

Open Information Extraction

• TextRunner [Banko et al., 2007] was the first OIE system;
• OIE became the main focus of the KnowItAll project;
• Two main problems:
  – incoherent extractions;
  – uninformative relations

Page 126: Machine Reading the Web

Open Information Extraction

• incoherent extractions

Page 127: Machine Reading the Web

Open Information Extraction

• uninformative relations

Page 128: Machine Reading the Web

Open Information Extraction

• TextRunner was based on

Page 129: Machine Reading the Web

OIE: the second generation

• New syntactic constraint based on POS tag patterns. A relation phrase must be one of:
  • a simple verb phrase (e.g., invented)
  • a verb phrase followed immediately by a preposition or particle (e.g., located in)
  • a verb phrase followed by a simple noun phrase and ending in a preposition or particle (e.g., has atomic weight of)
• If there are multiple possible matches, then the longest possible match is chosen.
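The syntactic constraint above can be sketched as a regular expression over coarse POS classes. This is a minimal illustration, not ReVerb's actual implementation: the tag table `TAG_CLASS`, the function name `relation_phrases`, and the simplified pattern `V+(W*P)?` are assumptions made for the example. Greedy regex matching yields the longest possible match.

```python
import re

# Hypothetical mapping from Penn Treebank tags to one-letter classes:
# V = verb, W = noun/adjective/adverb/pronoun/determiner, P = preposition/particle.
# Any other tag (e.g., proper nouns, numbers) becomes "O" and breaks a match.
TAG_CLASS = {
    "VB": "V", "VBD": "V", "VBZ": "V", "VBN": "V", "VBG": "V", "VBP": "V",
    "NN": "W", "NNS": "W", "JJ": "W", "RB": "W", "PRP": "W", "DT": "W",
    "IN": "P", "RP": "P", "TO": "P",
}

# verb(s), optionally followed by words that end in a preposition or particle
PATTERN = re.compile(r"V+(?:W*P)?")

def relation_phrases(tagged):
    """tagged: list of (word, pos_tag) pairs for one sentence."""
    # One character per token, so regex offsets are token offsets.
    classes = "".join(TAG_CLASS.get(tag, "O") for _, tag in tagged)
    phrases = []
    for match in PATTERN.finditer(classes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases

sentence = [("Hydrogen", "NNP"), ("has", "VBZ"), ("atomic", "JJ"),
            ("weight", "NN"), ("of", "IN"), ("1.008", "CD")]
print(relation_phrases(sentence))
```

On the tagged sentence above, the match starts at "has" and extends through "of", because the longer alternative is preferred over the bare verb.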

Page 130: Machine Reading the Web

OIE: the second generation

• New lexical constraint to separate valid relation phrases from over-specified relation phrases
• The lexical constraint is based on the intuition that a valid relation phrase should take many distinct arguments in a large corpus.
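The intuition behind the lexical constraint can be sketched as a count of distinct argument pairs per relation phrase. The function name `lexically_valid`, the threshold `min_distinct_args`, and the toy triples are assumptions for illustration; the slide does not state the threshold used in practice.

```python
from collections import defaultdict

def lexically_valid(triples, min_distinct_args=2):
    """Keep relation phrases that occur with enough distinct (arg1, arg2) pairs.

    triples: iterable of (arg1, relation_phrase, arg2).
    min_distinct_args: hypothetical threshold, chosen only for this toy example.
    """
    args_by_phrase = defaultdict(set)
    for arg1, phrase, arg2 in triples:
        args_by_phrase[phrase].add((arg1, arg2))
    return {phrase for phrase, args in args_by_phrase.items()
            if len(args) >= min_distinct_args}

# Made-up extractions: "invented" takes several argument pairs,
# while the long, over-specified phrase occurs with only one.
toy_triples = [
    ("Edison", "invented", "the phonograph"),
    ("Bell", "invented", "the telephone"),
    ("The company", "is offering only modest discounts at", "the fair"),
]
print(lexically_valid(toy_triples))
```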

Page 131: Machine Reading the Web

OIE: the second generation

• New OIE System: ReVerb [Fader et al., 2011]
  – Input: a POS-tagged and NP-chunked sentence s
  – Output: a set of (x, r, y) extraction triples
  – Based on two extraction steps:
    • 1. Relation Extraction: based on the new constraints
    • 2. Argument Extraction: for each relation phrase r identified in Step 1, find the nearest noun phrase x to the left and the nearest noun phrase y to the right of r in s.
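Step 2 (argument extraction) can be sketched as a nearest-chunk lookup on either side of the relation phrase. This is an illustrative sketch only: the span representation (token offsets, end exclusive) and the function name `extract_arguments` are assumptions, not part of ReVerb's published interface.

```python
def extract_arguments(np_chunks, rel_start, rel_end):
    """Return (x, y): the nearest NP left of and right of the relation phrase.

    np_chunks: list of (start, end, text) noun-phrase spans (end exclusive).
    rel_start, rel_end: token span of the relation phrase r (end exclusive).
    Returns None when either side has no noun phrase.
    """
    left = [c for c in np_chunks if c[1] <= rel_start]
    right = [c for c in np_chunks if c[0] >= rel_end]
    if not left or not right:
        return None
    x = max(left, key=lambda c: c[1])   # chunk ending closest to r
    y = min(right, key=lambda c: c[0])  # chunk starting closest to r
    return x[2], y[2]

# "Blomkvist visits Henrik Vanger at his estate", relation phrase "visits" = tokens [1, 2)
chunks = [(0, 1, "Blomkvist"), (2, 4, "Henrik Vanger"), (5, 7, "his estate")]
print(extract_arguments(chunks, 1, 2))
```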

Page 132: Machine Reading the Web

OIE: the second generation

• New OIE System: ReVerb [Fader et al., 2011]

Page 133: Machine Reading the Web

OIE: the second generation

Page 134: Machine Reading the Web

OIE: the second generation

Table extracted from [Etzioni et al., 2011]

Page 135: Machine Reading the Web

OIE: the second generation

• New OIE System: ArgLearner [Etzioni et al., 2011]

Page 136: Machine Reading the Web

OIE: the second generation

• New OIE System:
• ReVerb + ArgLearner = R2A2 [Etzioni et al., 2011]

Page 137: Machine Reading the Web

OIE: the second generation

• New OIE System:
• ReVerb + ArgLearner = R2A2 [Etzioni et al., 2011]

Free text
Hearst patterns; clustering by verbal phrases
Natural-language processing
Advanced patterns & iterative bootstrapping (“Dual Iterative Pattern Relation Extraction”)

POS tagging and NP chunking:

Page 138: Machine Reading the Web

It’s about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder. Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomes acquainted with the members of the extended Vanger family, most of whom resent his presence. He does, however, start a short lived affair with Cecilia, the niece of Henrik. After discovering that Salander has hacked into his computer, he persuades her to assist him with research. They eventually become lovers, but Blomkvist has trouble getting close to Lisbeth who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer. A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep background investigations for Dragan Armansky, who, in turn, worries that Lisbeth Salander is “the perfect victim for anyone who wished her ill."

Machine Reading with OIE

This slide was adapted from [Hady et al., 2011]

Page 139: Machine Reading the Web


Machine Reading with OIE

[Figure labels: same (×6)]

This slide was adapted from [Hady et al., 2011]

Page 140: Machine Reading the Web

Machine Reading with OIE

[Figure labels: same (×6)]

This slide was adapted from [Hady et al., 2011]


Page 141: Machine Reading the Web


Machine Reading with OIE

[Figure labels: same (×6); uncleOf; owns; hires; headOf]

This slide was adapted from [Hady et al., 2011]

Page 142: Machine Reading the Web


Machine Reading with OIE

[Figure labels: same (×6); uncleOf; owns; hires; headOf; affairWith (×2); enemyOf]

This slide was adapted from [Hady et al., 2011]

Page 143: Machine Reading the Web

More  on  KnowItAll  

WWW2013 | Machine Reading the Web | Estevam R. Hruschka Jr.

• http://homes.cs.washington.edu/~etzioni/index.html

Page 144: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 145: Machine Reading the Web

Outline  

• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL

Page 146: Machine Reading the Web

Never-Ending Language Learning

Page 147: Machine Reading the Web
Page 148: Machine Reading the Web

Never-Ending Learning

• Main task: acquire a growing competence without asymptote
  • over years
  • multiple functions
  • where learning one thing improves ability to learn the next
  • acquiring data from humans, environment
• Many candidate domains:
  • Robots
  • Softbots
  • Game players

Page 149: Machine Reading the Web

NELL: Never-Ending Language Learner

Inputs:
• initial ontology
• handful of examples of each predicate in ontology
• the web
• occasional interaction with human trainers

The task:
• run 24x7, forever
• each day:
  1. extract more facts from the web to populate the initial ontology
  2. learn to read (perform #1) better than yesterday

Page 150: Machine Reading the Web

NELL: Never-Ending Language Learner

Goal:
• run 24x7, forever
• each day:
  1. extract more facts from the web to populate given ontology
  2. learn to read better than yesterday

Today...
Running 24x7, since January, 2010

Input:
• ontology defining ~800 categories and relations
• 10-20 seed examples of each
• 1 billion web pages (ClueWeb – Jamie Callan)

Result:
• continuously growing KB with more than 1,400,000 extracted beliefs

Page 151: Machine Reading the Web

http://rtw.ml.cmu.edu

Page 152: Machine Reading the Web

NELL: Never-Ending Language Learner

Page 153: Machine Reading the Web

The Problem with Semi-Supervised Bootstrap Learning

Paris, Pittsburgh, Seattle, Cupertino

Page 154: Machine Reading the Web

The Problem with Semi-Supervised Bootstrap Learning

Paris, Pittsburgh, Seattle, Cupertino

mayor of arg1; live in arg1

Page 155: Machine Reading the Web

The Problem with Semi-Supervised Bootstrap Learning

Paris, Pittsburgh, Seattle, Cupertino

mayor of arg1; live in arg1

San Francisco, Austin, denial

Page 156: Machine Reading the Web

The Problem with Semi-Supervised Bootstrap Learning

Paris, Pittsburgh, Seattle, Cupertino

mayor of arg1; live in arg1

San Francisco, Austin, denial

arg1 is home of; traits such as arg1

Page 157: Machine Reading the Web

The Problem with Semi-Supervised Bootstrap Learning

Paris, Pittsburgh, Seattle, Cupertino

mayor of arg1; live in arg1

…

San Francisco, Austin, denial

arg1 is home of; traits such as arg1

It’s underconstrained!!
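The drift shown on these slides can be reproduced with a toy bootstrapping loop. This is a sketch under stated assumptions: the `occurrences` list is made-up data that mirrors the slide's examples, and `bootstrap` is a hypothetical helper, not NELL code. Each round promotes every context seen with a known instance, then every noun phrase seen with a promoted context, with no other constraint.

```python
def bootstrap(seeds, occurrences, rounds=2):
    """Alternate between learning contexts from instances and instances from contexts."""
    instances = set(seeds)
    patterns = set()
    for _ in range(rounds):
        patterns |= {ctx for ctx, np in occurrences if np in instances}
        instances |= {np for ctx, np in occurrences if ctx in patterns}
    return instances, patterns

# Made-up (context, noun phrase) co-occurrences for illustration.
occurrences = [
    ("mayor of arg1", "Paris"),
    ("live in arg1", "Pittsburgh"),
    ("mayor of arg1", "San Francisco"),
    ("live in arg1", "Austin"),
    ("live in arg1", "denial"),            # "live in denial" is not a city
    ("arg1 is home of", "San Francisco"),
    ("traits such as arg1", "denial"),
    ("traits such as arg1", "honesty"),
]

instances, patterns = bootstrap({"Paris", "Pittsburgh", "Seattle", "Cupertino"}, occurrences)
print(sorted(instances))
print(sorted(patterns))
```

After two rounds the "city" set contains "denial" and "honesty", and "traits such as arg1" has become a "city" pattern: a single ambiguous context is enough to derail an unconstrained learner.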

Page 158: Machine Reading the Web

Key Idea 1: Coupled semi-supervised training of many functions

Page 159: Machine Reading the Web

Coupled Training Type 1: Co-training, Multiview, Co-regularization

Page 160: Machine Reading the Web

Coupled Training Type 1: Co-training, Multiview, Co-regularization

Page 161: Machine Reading the Web

Coupled Training Type 1: Co-training, Multiview, Co-regularization

Page 162: Machine Reading the Web

Type 1 Coupling Constraints in NELL

Page 163: Machine Reading the Web

Type 1 Coupling Constraints in NELL

Semi-structured data: The “Low-Hanging Fruit”

Page 164: Machine Reading the Web

Type 1 Coupling Constraints in NELL

Semi-structured data: The “Low-Hanging Fruit”

Free text
Hearst patterns; clustering by verbal phrases
Natural-language processing
Advanced patterns & iterative bootstrapping (“Dual Iterative Pattern Relation Extraction”)

POS tagging and NP chunking:

Page 165: Machine Reading the Web

Coupled Training Type 2: Structured Outputs, Multitask, Posterior Regularization, Multilabel

Learn functions with the same input, different outputs, where we know some constraint

Page 166: Machine Reading the Web

Coupled Training Type 2: Structured Outputs, Multitask, Posterior Regularization, Multilabel

Learn functions with the same input, different outputs, where we know some constraint

Page 167: Machine Reading the Web

Coupled Training Type 2: Structured Outputs, Multitask, Posterior Regularization, Multilabel

Learn functions with the same input, different outputs, where we know some constraint
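One way to picture "same input, different outputs, with a known constraint" is a consistency check over the category labels proposed for a single noun phrase. The constraints below (a subset relation and two mutual-exclusion pairs) and the helper `is_consistent` are hypothetical examples chosen for illustration; the slide does not list the actual constraints.

```python
# Hypothetical constraints, for illustration only.
SUBSET_OF = {"athlete": "person"}                     # athlete(x) implies person(x)
MUTUALLY_EXCLUSIVE = [{"person", "city"}, {"city", "sport"}]

def is_consistent(labels):
    """Check one noun phrase's proposed category labels against the constraints."""
    labels = set(labels)
    for child, parent in SUBSET_OF.items():
        if child in labels and parent not in labels:
            return False
    return not any(pair <= labels for pair in MUTUALLY_EXCLUSIVE)

print(is_consistent({"athlete", "person"}))  # True
print(is_consistent({"person", "city"}))     # False
```

A learner coupled in this way can reject a candidate label for one function because of what another function says about the same noun phrase.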

Page 168: Machine Reading the Web

Type 2 Coupling Constraints in NELL

Page 169: Machine Reading the Web

Multi-view, Multi-Task Coupling

C categories, V views: CV ≈ 250 × 3 = 750 coupled functions

pairwise constraints on functions ≈ 10^5
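As a quick sanity check of the order of magnitude, the number of unordered pairs among 750 functions can be computed directly. Treating every pair as potentially constrained is an assumption made here to obtain an upper bound; the slide only gives the approximate total.

```python
import math

categories = 250
views = 3
functions = categories * views          # 750 coupled functions
all_pairs = math.comb(functions, 2)     # upper bound on pairwise constraints

print(functions)   # 750
print(all_pairs)   # 280875, i.e. on the order of 10^5
```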

Page 170: Machine Reading the Web

Learning Relations between NP’s

Page 171: Machine Reading the Web

Learning Relations between NP’s

Page 172: Machine Reading the Web

Type  3  Coupling:  Argument  Types

Page 173: Machine Reading the Web

Pure EM Approach to Coupled Training

E: jointly estimate latent labels for each function of each unlabeled example
M: retrain all functions, based on these probabilistic labels

Scaling problem:
• E step: 20M NP’s, 10^14 NP pairs to label
• M step: 50M text contexts to consider for each function → 10^10 parameters to retrain
• even more URL-HTML contexts...
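The two magnitudes on this slide can be checked with simple arithmetic. Two assumptions are made here that the slide does not state: NP pairs are counted as ordered pairs, and the number of functions is taken as roughly 250 (the category count C from the earlier coupling slide).

```python
NUM_NPS = 20_000_000                 # 20M noun phrases (from the slide)
CONTEXTS_PER_FUNCTION = 50_000_000   # 50M text contexts (from the slide)
NUM_FUNCTIONS = 250                  # assumption: one function per category

np_pairs = NUM_NPS ** 2
parameters = CONTEXTS_PER_FUNCTION * NUM_FUNCTIONS

print(np_pairs)    # 400000000000000   (4 x 10^14)
print(parameters)  # 12500000000       (about 10^10)
```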

Page 174: Machine Reading the Web

NELL’s Approximation to EM

E’ step:
• Consider only a growing subset of the latent variable assignments
  – category variables: up to 250 NP’s per category per iteration
  – relation variables: add only if confident and args of correct type
  – this set of explicit latent assignments *IS* the knowledge base

M’ step:
• Each view-based learner retrains itself from the updated KB
• “context” methods create growing subsets of contexts
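The category part of the E’ step can be sketched as a capped, confidence-filtered promotion into the KB. This is a simplified illustration: the function name `e_prime_step`, the data layout, and the `min_conf` threshold are assumptions; only the cap of 250 noun phrases per category per iteration comes from the slide.

```python
def e_prime_step(kb, candidates, max_per_category=250, min_conf=0.9):
    """Promote the most confident new noun phrases into the KB.

    kb: dict mapping category -> set of noun phrases (the explicit assignments).
    candidates: dict mapping category -> list of (noun_phrase, confidence).
    min_conf: hypothetical confidence threshold, not taken from the slide.
    """
    for category, scored in candidates.items():
        known = kb.setdefault(category, set())
        confident = [(np, c) for np, c in scored if c >= min_conf and np not in known]
        confident.sort(key=lambda pair: pair[1], reverse=True)
        for np, _ in confident[:max_per_category]:
            known.add(np)
    return kb

kb = {"city": {"Paris"}}
candidates = {"city": [("Austin", 0.95), ("denial", 0.40),
                       ("Boston", 0.99), ("Denver", 0.91)]}
# Cap lowered to 2 so the effect is visible on toy data.
print(e_prime_step(kb, candidates, max_per_category=2))
```

The M’ step would then retrain each learner from the updated `kb`; it is omitted here because it depends on the individual learners.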

Page 175: Machine Reading the Web
Page 176: Machine Reading the Web

Key Idea 2: Discover New Coupling Constraints

• first-order, probabilistic Horn clause constraints

0.93 athletePlaysSport(?x,?y) :- athletePlaysForTeam(?x,?z), teamPlaysSport(?z,?y)

  – connects previously uncoupled relation predicates
  – infers new beliefs for KB
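Applying the clause above amounts to a join on the shared variable ?z. The sketch below hard-codes that one rule; the KB layout (a dict of relation name to a set of argument pairs), the function name, and the toy facts are assumptions for illustration, and attaching the rule weight to each inferred belief is a simplification.

```python
def infer_athlete_plays_sport(kb, confidence=0.93):
    """athletePlaysSport(?x,?y) :- athletePlaysForTeam(?x,?z), teamPlaysSport(?z,?y)"""
    inferred = set()
    for athlete, team in kb.get("athletePlaysForTeam", set()):
        for team2, sport in kb.get("teamPlaysSport", set()):
            if team == team2:  # join on ?z
                inferred.add((athlete, sport, confidence))
    return inferred

# Toy KB with placeholder entities.
toy_kb = {
    "athletePlaysForTeam": {("athlete_a", "team_t")},
    "teamPlaysSport": {("team_t", "hockey"), ("team_u", "soccer")},
}
print(infer_athlete_plays_sport(toy_kb))
```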

Page 177: Machine Reading the Web

Example Learned Horn Clauses

0.95 athletePlaysSport(?x,basketball) :- athleteInLeague(?x,NBA)
0.93 athletePlaysSport(?x,?y) :- athletePlaysForTeam(?x,?z), teamPlaysSport(?z,?y)
0.91 teamPlaysInLeague(?x,NHL) :- teamWonTrophy(?x,Stanley_Cup)
0.90 athleteInLeague(?x,?y) :- athletePlaysForTeam(?x,?z), teamPlaysInLeague(?z,?y)
0.88 cityInState(?x,?y) :- cityCapitalOfState(?x,?y), cityInCountry(?y,USA)
0.62* newspaperInCity(?x,New_York) :- companyEconomicSector(?x,media), generalizations(?x,blog)

Page 178: Machine Reading the Web

Learned Probabilistic Horn Clause Rules

Page 179: Machine Reading the Web

Learned Probabilistic Horn Clause Rules

Page 180: Machine Reading the Web
Page 181: Machine Reading the Web

Ontology Extension (1)

Page 182: Machine Reading the Web

OntExt (Ontology Extension)

Everything

Person Company City Sport

WorksFor   PlayedIn  

Page 183: Machine Reading the Web

OntExt (Ontology Extension)

Everything

Person Company City Sport

WorksFor   PlayedIn  Plays  

Page 184: Machine Reading the Web

OntExt (Ontology Extension)

Everything

Person Company City Sport

WorksFor   PlayedIn  

LocatedIn  

Plays  

Page 185: Machine Reading the Web

Ontology Extension (1) [Mohamed & Hruschka, 2011]

Goal:
• Discover frequently stated relations among ontology categories

Approach:
• For each pair of categories C1, C2,
• co-cluster pairs of known instances, and text contexts that connect them

* additional experiments with Etzioni & Soderland using TextRunner

Page 186: Machine Reading the Web
Page 187: Machine Reading the Web

Prophet  

• Mining the graph representing NELL’s KB to:
  1. Extend the KB by predicting new relations (edges) that might exist between pairs of nodes;
  2. Induce inference rules;
  3. Identify misplaced edges which can be used by NELL as hints to identify wrong connections between nodes (wrong facts).

Appel  &  Hruschka,  2012  

Page 188: Machine Reading the Web

Prophet  

•  Find  open  triangles  in  the  Graph  

Appel  &  Hruschka  

Page 189: Machine Reading the Web

Prophet  

sport   sportsLeague  

sportsTeam  

Appel  &  Hruschka  

Page 190: Machine Reading the Web

Prophet  

• If … > ξ then create the new relation
• ξ = 10 (empirically)

sport   sportsLeague  

sportsTeam  

Appel  &  Hruschka  

Page 191: Machine Reading the Web

Prophet  

• If … > ξ then create the new relation
• ξ = 10 (empirically)
• Name the new relation based on ReVerb

sport   sportsLeague  

sportsTeam  

isPlayedIn  

Appel  &  Hruschka  
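The open-triangle idea can be sketched on a small undirected graph. The quantity compared with ξ did not survive in the slide text, so this sketch assumes it is the number of open triangles (shared neighbors) between two unconnected nodes; that choice, the function names, and the toy graph are all assumptions, not Prophet's actual definition.

```python
from collections import Counter
from itertools import combinations

def open_triangles(edges):
    """Count, for each unconnected node pair, how many neighbors they share."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    counts = Counter()
    for hub, adjacent in neighbors.items():
        for u, v in combinations(sorted(adjacent), 2):
            if v not in neighbors[u]:      # u - hub - v with no u - v edge
                counts[(u, v)] += 1
    return counts

def candidate_edges(edges, xi=10):
    """Propose a new edge when the open-triangle count exceeds xi (assumed test)."""
    return [pair for pair, n in open_triangles(edges).items() if n > xi]

# Toy graph: 11 teams, each linked to the sport "hockey" and the league "NHL".
teams = [f"team_{i}" for i in range(11)]
toy_edges = [("hockey", t) for t in teams] + [("NHL", t) for t in teams]
print(candidate_edges(toy_edges))
```

With 11 shared teams the pair ("NHL", "hockey") passes the threshold of 10, while any two teams share only two neighbors and are not proposed.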

Page 192: Machine Reading the Web

Conversing  Learning  Pedro  &  Hruschka  

Page 193: Machine Reading the Web

Conversing  Learning  

• Help to supervise NELL by automatically asking questions on Web communities

Pedro  &  Hruschka  

Page 194: Machine Reading the Web

Conversing  Learning  

• Help to supervise NELL by automatically asking questions on Web communities
• Currently: validate first-order rules coming from the Rule Learner

Pedro  &  Hruschka  

Page 195: Machine Reading the Web


Page 196: Machine Reading the Web


Page 197: Machine Reading the Web

Conversing  Learning  

• Uses an agent (SS-Crowd) capable of:
– building questions;
– posting questions in Web communities;
– fetching answers;
– understanding the answers;
– deciding on the truth of the first-order rule.

Pedro  &  Hruschka  
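The agent's steps can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and does not describe SS-Crowd's actual logic: the question template, the keyword lists, the `min_votes` cutoff and the majority vote are hypothetical, and posting to and fetching from a Web community are left out.

```python
import re

# Crude keyword heuristic (an assumption): real answers need proper
# language understanding, and negation is only handled by abstaining.
YES_WORDS = {"yes", "yeah", "yep", "true", "correct"}
NO_WORDS = {"no", "nope", "not", "false", "incorrect", "wrong"}


def build_question(rule_body, rule_head):
    """Turn a first-order rule (body implies head) into a yes/no question."""
    return f"Is it true that if {rule_body} then {rule_head}?"


def understand(answer):
    """Map a free-text answer to True, False, or None when unclear."""
    words = set(re.findall(r"[a-z]+", answer.lower()))
    has_yes = bool(words & YES_WORDS)
    has_no = bool(words & NO_WORDS)
    if has_yes == has_no:
        return None  # no signal, or conflicting signals
    return has_yes


def decide(answers, min_votes=3):
    """Majority vote over understood answers; None if undecided."""
    votes = [v for v in map(understand, answers) if v is not None]
    if len(votes) < min_votes:
        return None
    yes = sum(votes)
    if 2 * yes == len(votes):
        return None  # tie
    return 2 * yes > len(votes)
```

A rule validated this way could be promoted or discarded by the rule learner; returning `None` lets the agent ask again instead of committing to a decision on thin evidence.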

Page 198: Machine Reading the Web

Conversing  Learning  Pedro  &  Hruschka  

Page 199: Machine Reading the Web

It’s about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder. Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomes acquainted with the members of the extended Vanger family, most of whom resent his presence. He does, however, start a short lived affair with Cecilia, the niece of Henrik. After discovering that Salander has hacked into his computer, he persuades her to assist him with research. They eventually become lovers, but Blomkvist has trouble getting close to Lisbeth who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer. A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep background investigations for Dragan Armansky, who, in turn, worries that Lisbeth Salander is “the perfect victim for anyone who wished her ill."

Machine Reading with NELL

This slide was adapted from [Hady et al., 2011]

Page 200: Machine Reading the Web

Machine Reading with NELL

[Figure: mentions in the passage linked by an edge labeled "same"]

This slide was adapted from [Hady et al., 2011]

Page 201: Machine Reading the Web

Machine Reading with NELL

[Figure: mentions in the passage linked by six edges labeled "same"]

This slide was adapted from [Hady et al., 2011]

Page 202: Machine Reading the Web

Machine Reading with NELL

[Figure: mentions in the passage linked by six edges labeled "same", plus relation edges labeled uncleOf, owns, hires and headOf]

This slide was adapted from [Hady et al., 2011]

Page 203: Machine Reading the Web

Machine Reading with NELL

[Figure: mentions in the passage linked by six edges labeled "same", plus relation edges labeled uncleOf, owns, hires, headOf, affairWith (twice) and enemyOf]

This slide was adapted from [Hady et al., 2011]

Page 204: Machine Reading the Web

More on NELL
• http://rtw.ml.cmu.edu/rtw/publications

WWW2013 - Machine Reading the Web - Estevam R. Hruschka Jr.

Page 205: Machine Reading the Web

[email protected]

Thank you very much! Thanks also to everyone from the NELL, KnowItAll and YAGO projects for very nice discussions and suggestions for this tutorial.

Page 206: Machine Reading the Web

References
• [Fern, 2008] Xiaoli Z. Fern, CS 434: Machine Learning and Data Mining, School of Electrical Engineering and Computer Science, Oregon State University, Fall 2008.
• [DARPA, 2012] DARPA Machine Reading Program, http://www.darpa.mil/Our_Work/I2O/Programs/Machine_Reading.aspx.
• [Mitchell, 2006] Tom M. Mitchell, The Discipline of Machine Learning, my perspective on this research field, July 2006 (http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf).
• [Mitchell, 1997] Tom M. Mitchell, Machine Learning. McGraw-Hill, 1997.
• [Etzioni et al., 2007] Oren Etzioni, Michele Banko, and Michael J. Cafarella, Machine Reading. The 2007 AAAI Spring Symposium. Published by The AAAI Press, Menlo Park, California, 2007.
• [Clark et al., 2007] Peter Clark, Phil Harrison, John Thompson, Rick Wojcik, Tom Jenkins, David Israel, Reading to Learn: An Investigation into Language Understanding. The 2007 AAAI Spring Symposium. Published by The AAAI Press, Menlo Park, California, 2007.
• [Norvig, 2007] Peter Norvig, Inference in Text Understanding. The 2007 AAAI Spring Symposium. Published by The AAAI Press, Menlo Park, California, 2007.
• [Wang & Cohen, 2007] Richard C. Wang and William W. Cohen: Language-Independent Set Expansion of Named Entities using the Web. In Proceedings of IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA. 2007.
• [Etzioni, 2008] Oren Etzioni. 2008. Machine reading at web scale. In Proceedings of the international conference on Web search and web data mining (WSDM '08). ACM, New York, NY, USA, 2-2.
• [Banko et al., 2007] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. IJCAI 2007: 2670-2676

IBERAMIA2012 - Machine Learning, Machine Reading and the Web - Estevam R. Hruschka Jr.

Page 207: Machine Reading the Web

References
• [Weikum et al., 2009] G. Weikum, G. Kasneci, M. Ramanath, F. Suchanek. DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009.
• [Theobald & Weikum, 2012] Martin Theobald and Gerhard Weikum. From Information to Knowledge: Harvesting Entities and Relationships from Web Sources. Tutorial at PODS 2012
• [Hoffart et al., 2012] Johannes Hoffart, Fabian Suchanek, Klaus Berberich, Gerhard Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Special issue of the Artificial Intelligence Journal, 2012
• [Etzioni et al., 2011] Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam. "Open Information Extraction: the Second Generation". Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011).
• [Hady et al., 2011] Hady W. Lauw, Ralf Schenkel, Fabian Suchanek, Martin Theobald, and Gerhard Weikum, "Semantic Knowledge Bases from Web Sources" at IJCAI 2011, Barcelona, July 2011
• [Fader et al., 2011] Anthony Fader, Stephen Soderland, and Oren Etzioni. "Identifying Relations for Open Information Extraction". Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011)
• Settles, B.: Closing the loop: Fast, interactive semi-supervised annotation with queries on features and instances. In: Proc. of the EMNLP'11, Edinburgh, ACL (2011) 1467–1478
• Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E.R., Mitchell, T.M.: Toward an architecture for never-ending language learning. In: Proceedings of the Twenty-Fourth Conference on Artificial Intelligence (AAAI 2010).
• Pedro, S.D.S., Hruschka Jr., E.R.: Collective intelligence as a source for machine learning self-supervision. In: Proc. of the 4th International Workshop on Web Intelligence and Communities. WIC12, NY, USA, ACM (2012) 5:1–5:9


Page 208: Machine Reading the Web

References
• [Appel & Hruschka Jr., 2011] Appel, A.P., Hruschka Jr., E.R.: Prophet – a link-predictor to learn new rules on Nell. In: Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops. pp. 917–924. ICDMW '11, IEEE Computer Society, Washington, DC, USA (2011)
• [Mohamed et al., 2011] Mohamed, T.P., Hruschka Jr., E.R., Mitchell, T.M.: Discovering relations between noun categories. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 1447–1455. EMNLP '11, Association for Computational Linguistics, Stroudsburg, PA, USA (2011)
• [Pedro & Hruschka Jr., 2012] Saulo D.S. Pedro and Estevam R. Hruschka Jr., Conversing Learning: active learning and active social interaction for human supervision in never-ending learning systems. XIII Ibero-American Conference on Artificial Intelligence, IBERAMIA 2012, 2012.
• Krishnamurthy, J., Mitchell, T.M.: Which noun phrases denote which concepts. In: Proceedings of the Forty Ninth Annual Meeting of the Association for Computational Linguistics (2011)
• Lao, N., Mitchell, T., Cohen, W.W.: Random walk inference and learning in a large scale knowledge base. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. pp. 529–539. Association for Computational Linguistics, Edinburgh, Scotland, UK. (July 2011), http://www.aclweb.org/anthology/D11-1049
• E. R. Hruschka Jr. and M. C. Duarte and M. C. Nicoletti. Coupling as Strategy for Reducing Concept-Drift in Never-ending Learning Environments. Fundamenta Informaticae, IOS Press, 2012.
