SYMPOSIUM ON BIAS AND DIVERSITY IN IR
A TESTBED FOR DIVERSIFICATION IN SEARCH

Koblenz, August 31, 2011
Michael Matthews, Barcelona Media/Yahoo! Research
ESSIR 2011, essir.uni-koblenz.de/slides/div_part3.pdf

OVERVIEW

• Introduction to LivingKnowledge Testbed – The Diversity Engine
• Getting started – Our first application!
• Adding text analysis
• Adding multimedia analysis
• Evaluation
• Indexing and search
• Developing applications
• Future work

DIVERSITY ENGINE

• Provides collections, annotation tools and an evaluation framework to allow for collaborative and comparable research
• Supports indexing and searching on a wide variety of document annotations including entities, bias, trust, polarity, and multimedia features
• Supports development of bias and diversity aware applications

ARCHITECTURE

[Architecture diagram. Components shown: Document Collections (labels: Yahoo! News, ARC Crawls, NYT), Analysis Pipeline, Index/Search, Application Development, Evaluation Framework]

• Prediction of Community Acceptance
• Sentiment in Comments ↔ Comment Ratings
• Polarizing Videos ↔ Distribution of Ratings
• Topic of Videos ↔ Distribution of Ratings

DESIGN DECISIONS

• Use Open Source tools when available
• Programming Language - Java 1.6
• Data format – LK XML
• Analysis tools: Operating System – Linux (any software language)
• Indexing/Search - Solr
• GUI – JSP, HTML, JavaScript, CSS

LK-XML format.

DOCUMENT COLLECTIONS

• Supported Formats - ARC (Internet Memory Crawls), Text, HTML, Kyoto, BBN, NYT
• Collections
  – Testing Examples included with Diversity Engine
  – Large ARCs available from Internet Memory
  – Converters provided for other collections (MPQA, BBN, NYT) that have licensing restrictions

ANALYSIS MODULES

[Diagram of analysis modules, arranged in four groups: Image Processing, Image Annotation Processing, Text Processing, Text Annotation Processing. Module labels: Face Detection, Naturalness, Colourfulness, SIFT Features, City/Landscape, Tone, Photomontage, Face Tampering, Photo/Cartoon/CG, Annotations, Sentiment, Histogram, Sentence Subjectivity, Syntax & Semantics, POS, OpenNLP Entities, SuperSense Tagger, Vector Quantisation, Dictionary, Phrases, Quotes, Disambiguated Entities, Document Layout, RDFa Injection, Readability4J, TimeML, Statements, Subjective Expressions, URLs, Wikipedia People, Wikipedia Places, EXIF, Image Clustering]

INDEXING/SEARCH

• Solr
  – Enterprise search platform built on top of Lucene
  – XML input and output allows for easy integration with Diversity Engine
  – Plug-in framework allows customization
  – Built-in facet capabilities support indexing and searching on annotations
• Integration
  – Converter from LK XML to Solr XML
  – Plug-in for facet ranking and speed improvements

APPLICATION DEVELOPMENT

• Basis for LivingKnowledge Applications
  – Future Predictor
  – Media Content Analysis
• Support development – coding required!
• Real World Problems
  – HTML Extraction
  – Scaling to Large Collections
  – Provenance
  – Some pluggable GUI components
  – Examples to ease learning curve

EVALUATION FRAMEWORK

• Framework for the evaluation of analysis tools
• Evaluates any possible annotation pipeline
• Measures correctness and quality
• Outputs Precision + Recall
• Compares annotation output of pipeline with ground truth data

OUR FIRST APPLICATION

• Download Diversity Engine release from SourceForge
• tar xzvf [release file]
• cd testbed
• ant build
• apps/testbed conf/testbed/tutorial-application.xml
• What happened?
  – 197 text files and 127 image files converted from ARC format to LK XML and stored in devapps/example/data/lkxml
  – 2 annotators were run over the collection
    • OpenNLP for tokenization, sentence splitting, POS tags
    • SST named entity recognizer
    • Results stored in devapps/example/data/lkxml
  – Files were converted to Solr XML format and indexed using Solr
    • Solr XML stored to devapps/example/data/solr
  – HTML Visualization Files stored in devapps/example/data/html
• ant deploy-testbed
  – Solr running at http://localhost:8983/solr/
  – Example app running at http://localhost:8983/testbed/

EXAMPLE SOLR OUTPUT

http://localhost:8983/solr/select/?q=putin
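The query URL on this slide can also be assembled in code. The following is a minimal sketch, not part of the testbed: the class and method names are hypothetical, and it assumes the Solr instance accepts the standard `facet` and `facet.field` request parameters for the `per` and `loc` fields defined in the tutorial configuration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

public class SolrQueryUrl {
    // Builds a select URL for the given query, optionally asking Solr for facet counts.
    static String build(String solrBase, String query, List<String> facetFields) {
        StringBuilder url = new StringBuilder(solrBase).append("select/?q=").append(encode(query));
        if (!facetFields.isEmpty()) {
            url.append("&facet=true");
            for (String field : facetFields) {
                url.append("&facet.field=").append(encode(field));
            }
        }
        return url.toString();
    }

    // URL-encodes a single parameter value as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Base URL taken from the slide; facet fields taken from the tutorial configuration.
        System.out.println(build("http://localhost:8983/solr/", "putin", Arrays.asList("per", "loc")));
    }
}
```

Passing an empty facet list reproduces the plain query shown on the slide.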

EXAMPLE APPLICATION

http://localhost:8983/testbed/results.jsp?query=putin

EXAMPLE DOCUMENT

CONFIGURATION FILE

<lk-application logDir="log" appDir="devapps/example">
  <corpus dir="corpora/examples/smallarc" format="arc"/>
  <image-pipeline>
    <annotators>
    </annotators>
  </image-pipeline>
  <pipeline>
    <annotators>
      <annotator exec="./opennlp"/>
      <annotator exec="./sst"/>
    </annotators>
  </pipeline>
  <visualize/>
  <indexer solrHomeDir="solr/solr" solrDataDir="solr/solr/data" converter="conf/testbed/tutorial-lk2solr.xml"/>
  <searcher appTitle="LivingKnowledge - Example Application"
            appShortTitle="Example Application"
            appUrl="http://localhost:8983/solr/">
    <facets>
      <facet field="per" description="Person"/>
      <facet field="loc" description="Location"/>
    </facets>
  </searcher>
</lk-application>

TEXT ANALYSIS

<pipeline>
  <annotators>
    <annotator exec="./opennlp"/>
    <annotator exec="./sst"/>
  </annotators>
</pipeline>

<pipeline>
  <annotators>
    <annotator exec="./opennlp"/>
    <annotator exec="./sst"/>
    <annotator exec="./facts"/>
    <annotator exec="./unitn_tagger"/>
    <annotator exec="./unitn_subjexpr"/>
  </annotators>
</pipeline>

apps/testbed –run pipeline conf/testbed/tutorial-application.xml
apps/testbed –run visualization conf/testbed/tutorial-application.xml

TEXT ANALYSIS - FACTS

devapps/example/data/lkxml/EA-EUElections2009-euobserver-0729-20090729085530-00000.arc.15521713.facts.xml

TEXT ANALYSIS - FACTS

devapps/example/data/html/EA-EUElections2009-euobserver-0729-20090729085530-00000.arc.15521713.html

IMAGE ANALYSIS

<pipeline>
  <annotators>
    <annotator exec="./opennlp"/>
    <annotator exec="./sst"/>
    <annotator exec="./facts"/>
    <annotator exec="./unitn_tagger"/>
    <annotator exec="./unitn_subjexpr"/>
    <annotator exec="./imageannots"/>
  </annotators>
</pipeline>

<image-pipeline>
  <annotators>
    <annotator exec="./soton_haarfacedetector"/>
  </annotators>
</image-pipeline>

apps/testbed –run pipeline,image-pipeline –pipeline imageannots conf/testbed/tutorial-application.xml
ls devapps/example/data/lkxml/img/*

ANALYSIS API

• Documents in LK XML format
• Annotators passed a single document directory
  – They should add annotations for each document in directory
• Files will have consistent naming convention
  – LkText file = id + ".lktext.xml"
  – LkMedia = id + ".lkmedia.xml"
  – LkAnnotation = id + "." + annotatorId + ".xml"
• Annotators will be processed sequentially in the order listed in the XML file
• Annotators can be written in any language but must run on Linux
  – Helper classes will exist for Java, but there is no obligation to use them.
• Add application calling your new annotator to apps directory
• Add your application to the configuration file as before
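The naming convention above can be written down as three small helpers. This is only an illustrative sketch: the class name, method names and the document id `doc42` are hypothetical, and the testbed's own Java helper classes may expose this differently.

```java
public class LkFileNames {
    // LkText file = id + ".lktext.xml"
    static String textFile(String id) {
        return id + ".lktext.xml";
    }

    // LkMedia file = id + ".lkmedia.xml"
    static String mediaFile(String id) {
        return id + ".lkmedia.xml";
    }

    // LkAnnotation file = id + "." + annotatorId + ".xml"
    static String annotationFile(String id, String annotatorId) {
        return id + "." + annotatorId + ".xml";
    }

    public static void main(String[] args) {
        String id = "doc42"; // hypothetical document id
        System.out.println(textFile(id));
        System.out.println(mediaFile(id));
        System.out.println(annotationFile(id, "facts"));
    }
}
```

Because every annotator writes to a file keyed by its own id, annotators that run later in the pipeline can locate the output of earlier ones by name alone.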

ANALYSIS API – JAVA

• Extend class org.diversityengine.annotator.AbstractAnnotator
• Implement Methods
  – getName()
  – getType() - TEXT OR IMAGE
• For Image Analysis implement
  – LkAnnotation getLkAnnotation(ImageDocument document)
• For Text Analysis implement
  – LkAnnotation getLkAnnotation(TextDocument document)
• In main, instantiate and call annotator
  – NewAnnotator annotator = new NewAnnotator();
  – annotator.processDirectory(args[0]);
• Add application calling your new annotator to apps directory
• Add your application to the configuration file as before
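To make the shape of a text annotator concrete, here is a self-contained sketch. The nested `TextDocument`, `LkAnnotation`, `Type` and `AbstractAnnotator` types are simplified stand-ins written for this example, not the real org.diversityengine classes, whose fields and constructors are not shown on the slides. The real base class also supplies `processDirectory`, which this sketch skips by calling `getLkAnnotation` directly.

```java
import java.util.ArrayList;
import java.util.List;

public class NewAnnotatorSketch {
    // Hypothetical stand-ins for the testbed types; the real API may differ.
    enum Type { TEXT, IMAGE }

    static class TextDocument {
        final String id;
        final String text;
        TextDocument(String id, String text) { this.id = id; this.text = text; }
    }

    static class LkAnnotation {
        final String docId;
        final String annotatorId;
        final List<String> values = new ArrayList<String>();
        LkAnnotation(String docId, String annotatorId) { this.docId = docId; this.annotatorId = annotatorId; }
    }

    static abstract class AbstractAnnotator {
        abstract String getName();
        abstract Type getType();
        abstract LkAnnotation getLkAnnotation(TextDocument document);
    }

    // Toy annotator: records every capitalised token as a candidate entity.
    static class NewAnnotator extends AbstractAnnotator {
        String getName() { return "capitalised"; }
        Type getType() { return Type.TEXT; }
        LkAnnotation getLkAnnotation(TextDocument document) {
            LkAnnotation annotation = new LkAnnotation(document.id, getName());
            for (String token : document.text.split("\\s+")) {
                if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                    annotation.values.add(token);
                }
            }
            return annotation;
        }
    }

    public static void main(String[] args) {
        TextDocument doc = new TextDocument("doc1", "Putin met Merkel in Berlin");
        LkAnnotation annotation = new NewAnnotator().getLkAnnotation(doc);
        System.out.println(annotation.annotatorId + " -> " + annotation.values);
    }
}
```

In the testbed itself, `main` would instead call `processDirectory(args[0])` so that the framework handles reading documents and writing annotation files.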

EVALUATION

Evaluation works with the same configuration file. Simply add an evaluation element.

<lk-application logDir="log" appDir="devapps/evaluation">
  <corpus dir="corpora/evaluation/sst/text/" format="bbn"/>
  <pipeline>
    <annotators>
      <annotator exec="./sst"/>
    </annotators>
  </pipeline>
  <evaluation evalDir="evaluation/sst/">
    <evaluator provides="ENTITIES"
               goldDir="corpora/evaluation/sst/gold/"
               goldAnnotator="sstgold"
               annotator="sst"/>
  </evaluation>
</lk-application>

apps/testbed conf/evaluation/sst.xml

EVALUATION RESULTS

cat evaluation/sst/sst.ENTITIES.xml

<evaluation goldDir="/home/mikemat/code/livingknowledge/WP6/testbed/corpora/evaluation/sst/gold/"
            lkDir="/home/mikemat/code/livingknowledge/WP6/testbed/devapps/evaluation/data/lkxml"
            annotation="sst" goldAnnotation="sstgold" provides="ENTITIES">
  <docs>
    <doc id="WSJ0375" N="19" tp="18" fp="1" fn="1"/>
    <doc id="WSJ0380" N="19" tp="15" fp="4" fn="1"/>
    <doc id="WSJ0376" N="72" tp="61" fp="11" fn="7"/>
    <doc id="WSJ0377" N="26" tp="17" fp="9" fn="6"/>
    <doc id="WSJ0378" N="10" tp="10" fp="0" fn="0"/>
    <doc id="WSJ0379" N="24" tp="19" fp="5" fn="2"/>
  </docs>
  <totals N="170" tp="140" fp="30" fn="17" p="0.8235294117647058" r="0.89171974522293" f="0.8562691131498471"/>
</evaluation>
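The totals in this output follow from the per-document counts using the usual definitions: precision = tp / (tp + fp), recall = tp / (tp + fn), and F as their harmonic mean. The sketch below recomputes them from the six `<doc>` rows; the class and method names are hypothetical and this is not the testbed's evaluator code.

```java
import java.util.Locale;

public class EntityEvalTotals {
    static double precision(int tp, int fp) {
        return tp / (double) (tp + fp);
    }

    static double recall(int tp, int fn) {
        return tp / (double) (tp + fn);
    }

    // Harmonic mean of precision and recall.
    static double f1(double p, double r) {
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // {tp, fp, fn} for each <doc> element in the evaluation output.
        int[][] docs = { {18, 1, 1}, {15, 4, 1}, {61, 11, 7}, {17, 9, 6}, {10, 0, 0}, {19, 5, 2} };
        int tp = 0, fp = 0, fn = 0;
        for (int[] d : docs) {
            tp += d[0];
            fp += d[1];
            fn += d[2];
        }
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        // prints: tp=140 fp=30 fn=17 p=0.8235 r=0.8917 f=0.8563
        System.out.println(String.format(Locale.ROOT, "tp=%d fp=%d fn=%d p=%.4f r=%.4f f=%.4f",
                tp, fp, fn, p, r, f1(p, r)));
    }
}
```

In every row of the output, N equals tp + fp, which suggests N counts the annotations produced by the pipeline rather than those in the gold standard.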

INDEXING AND SEARCH

• Search Engines - Traditional
  – Bag-of-words representation
  – Inverted index (words -> documents) for efficiency
  – 10 docs ranked according to tf-idf similarity with query
• Search Engines – Today
  – Much metadata associated with documents
  – Ranking based on 100s of features (date, location, pagerank, click data, personalization, etc.)
  – Richer display
    • Facets for exploratory search
    • Answers when appropriate
    • etc.
  – Many open source options - Lucene/Solr most widely used
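The traditional approach can be illustrated in a few lines. The following toy sketch builds an inverted index and scores documents by summing tf × idf over the query terms, with idf = ln(N / df). It is a deliberate simplification (no length normalisation, no cosine similarity) and is not how Lucene computes its scores; all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class TinyInvertedIndex {
    // term -> (document id -> term frequency in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private int docCount = 0;

    // Indexes one document; each document id is assumed to be added once.
    void add(int docId, String text) {
        docCount++;
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, k -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    // Returns a score per matching document: sum over query terms of tf * ln(N / df).
    Map<Integer, Double> score(String query) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : query.toLowerCase().split("\\W+")) {
            Map<Integer, Integer> docs = postings.get(term);
            if (docs == null) continue;
            double idf = Math.log((double) docCount / docs.size());
            for (Map.Entry<Integer, Integer> e : docs.entrySet()) {
                scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Putin visits Berlin");
        index.add(2, "Berlin hosts summit");
        index.add(3, "Putin Putin speech");
        // Document 3 mentions the query term twice, so it scores higher than document 1.
        System.out.println(index.score("putin"));
    }
}
```

Only the postings for the query terms are touched at query time, which is the efficiency gain the inverted index provides over scanning every document.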

APACHE LUCENE/SOLR

FACETED SEARCH

Diagram by Yonik Seeley

FACETED SEARCH

• Summarize query results by aggregating properties of returned pages
  – price ranges for product query
  – related people or locations for news query
• Exploratory Search
  – Show documents matching the query term and a selected facet
  – Make inferences not clear from simple document list
• Living Knowledge Analysis is modeled very well by facets
  – Topics as determined by entity and fact extraction
  – Location and Time diversity dimensions
  – Opinions as determined by opinion extraction
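Both ideas, aggregating a property over the result set and narrowing by a selected facet value, fit in a short sketch. The code below works on in-memory maps rather than on Solr, and the document contents and all names are invented for illustration; `per` and `loc` are the facet fields used in the tutorial configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FacetCounts {
    // Builds a toy document with comma-separated person and location values.
    static Map<String, List<String>> doc(String per, String loc) {
        Map<String, List<String>> d = new HashMap<>();
        d.put("per", Arrays.asList(per.split(",")));
        d.put("loc", Arrays.asList(loc.split(",")));
        return d;
    }

    // Counts how many times each value of the field occurs across the results.
    static Map<String, Integer> count(List<Map<String, List<String>>> results, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, List<String>> d : results) {
            List<String> values = d.get(field);
            if (values == null) continue;
            for (String v : values) counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    // Keeps only the results whose field contains the selected facet value.
    static List<Map<String, List<String>>> drillDown(List<Map<String, List<String>>> results,
                                                     String field, String value) {
        List<Map<String, List<String>>> kept = new ArrayList<>();
        for (Map<String, List<String>> d : results) {
            List<String> values = d.get(field);
            if (values != null && values.contains(value)) kept.add(d);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Map<String, List<String>>> results = Arrays.asList(
                doc("Putin,Merkel", "Berlin"), doc("Putin", "Moscow"), doc("Obama", "Berlin"));
        System.out.println(count(results, "per"));
        System.out.println(count(drillDown(results, "loc", "Berlin"), "per"));
    }
}
```

The second call shows the exploratory step: after selecting the location facet "Berlin", the person counts are recomputed over the narrowed result set.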

LK XML TO SOLR

• Solr has well defined XML input format for adding new documents
• Diversity Engine provides a simple language to map LK XML to Solr XML
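For orientation, Solr's XML update message wraps each document in `<add><doc>` and lists its values as `<field name="...">` elements, repeating the element for multi-valued fields. The sketch below emits such a message from a map; it is not the testbed's converter, the class is hypothetical, and the `id` field name is an assumption about the schema.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SolrAddXml {
    // Escapes the characters that are significant in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Renders one document as a Solr <add> message; multi-valued fields repeat the element.
    static String addDoc(Map<String, List<String>> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            for (String value : field.getValue()) {
                xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
                   .append(escape(value)).append("</field>");
            }
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("id", Arrays.asList("doc1")); // assumed unique key field
        fields.put("per", Arrays.asList("Putin", "Merkel"));
        fields.put("loc", Arrays.asList("Berlin"));
        System.out.println(addDoc(fields));
    }
}
```

The mapping language described on this slide decides which LK XML annotations end up in which of these fields.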

LK2SOLR CONVERSION

<lktosolr>
  <field solr="per" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.PerValueFilter"/>
  <field solr="loc" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.LocValueFilter"/>
  <field solr="keywords" annotation="TOP_ENTITIES" value="$text"/>
  <field solr="pubdate" annotation="metainfo:lktext" value="date" type="date"/>
</lktosolr>

solr – Name of the field in Solr
annotation – Name of the LK XML annotation
value – Value of annotation
filter – Allows post processing on annotation
type – Only Date supported currently

<indexer solrHomeDir="solr/solr" solrDataDir="solr/solr/data" converter="conf/testbed/tutorial-lk2solr.xml"/>

ADDING FACTS TO INDEX

<lktosolr>
  <field solr="per" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.PerValueFilter"/>
  <field solr="loc" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.LocValueFilter"/>
  <field solr="keywords" annotation="TOP_ENTITIES" value="$text"/>
  <field solr="pubdate" annotation="metainfo:lktext" value="date" type="date"/>
  <field solr="yago" annotation="yago-entities" value="$text"/>
  <field solr="yago-country" annotation="facts" value="xpath:/entity-information[facts/type/text()= 'wordnet_country_108544813']/id/text()"/>
</lktosolr>

apps/testbed –run convert-solr conf/testbed/tutorial-application.xml
ls devapps/example/data/solr/*
apps/testbed –run index conf/testbed/tutorial-application.xml

FACTS TO SOLR

<field solr="yago" annotation="yago-entities" value="$text"/>

FACTS TO SOLR

<field solr="yago-country" annotation="facts" value="xpath:/entity-information[facts/type/text()= 'wordnet_country_108544813']/id/text()"/>

ADDING IMAGES TO INDEX

<lktosolr>
  <field solr="per" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.PerValueFilter"/>
  <field solr="loc" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.LocValueFilter"/>
  <field solr="keywords" annotation="TOP_ENTITIES" value="$text"/>
  <field solr="yago" annotation="yago-entities" value="$text"/>
  <field solr="yago-country" annotation="facts" value="xpath:/entity-information[facts/type/text()= 'wordnet_country_108544813']/id/text()"/>
  <field solr="pubdate" annotation="metainfo:lktext" value="date" type="date"/>
  <field solr="image" annotation="IMAGE_ANNOTS" value="$text"/>
  <field solr="bestimage" annotation="BEST_IMAGES" value="$text"/>
</lktosolr>

apps/testbed –run convert-solr conf/testbed/tutorial-application.xml
ls devapps/example/data/solr/*
apps/testbed –run index conf/testbed/tutorial-application.xml

APPLICATION DEVELOPMENT

• Examples
• HTML Extraction
• Scaling to Large Collections
• Provenance
• Some pluggable GUI components

FACT/IMAGE APPLICATION

<searcher appTitle="LivingKnowledge - Example Application"
          appShortTitle="Example Application"
          appUrl="http://localhost:8983/solr/">
  <facets>
    <facet field="yago" description="Yago"/>
    <facet field="yago-country" description="Country"/>
    <facet field="per" description="Person"/>
    <facet field="loc" description="Location"/>
    <facet field="image" description="Images"/>
  </facets>
</searcher>

ant deploy-testbed

FACT/IMAGE APPLICATION

http://localhost:8983/testbed/results.jsp?query=putin

OPINION APPLICATION

Opinions are at sentence level, not document level – same analysis, but different indexing

cat conf/testbed/tutorial-lk2solr-sentence.xml

<lktosolr solrDoc="SENTENCES" contextSize="1">
  <field solr="per" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.PerValueFilter" source="solrdoc"/>
  <field solr="loc" annotation="ENTITIES_CLEAN" value="$text" filter="org.diversityengine.solr.converter.filters.LocValueFilter" source="solrdoc"/>
  <field solr="keywords" annotation="TOP_ENTITIES" value="$text"/>
  <field solr="yago" annotation="yago-entities" value="$text" source="solrdoc"/>
  <field solr="image" annotation="IMAGE_ANNOTS" value="$text"/>
  <field solr="bestimage" annotation="BEST_IMAGES" value="$text"/>
  <field solr="pubdate" annotation="metainfo:lktext" value="date" type="date"/>
  <field solr="polarity" annotation="MPQA-expressive-subjectivity,MPQA-direct-subjective" value="xpath:/node()[@pol]/@pol" source="solrdoc" filter="org.diversityengine.solr.converter.filters.PolarityValueFilter"/>
  <field solr="pol-int" annotation="MPQA-expressive-subjectivity,MPQA-direct-subjective" value="xpath:concat(/node()[@pol and @int]/@pol,/node()[@int and @pol]/@int)" source="solrdoc"/>
</lktosolr>

apps/testbed –run convert-solr,index conf/testbed/tutorial-application-sentence.xml
ls devapps/example/data/solr/*

SOLR XML – SENTENCE

OPINION APPLICATION

modify webapp\WEB-INF\web.xml

<web-app xmlns="http://java.sun.com/xml/ns/javaee"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd"
         version="2.5">
  <description>LivingKnowledge Testbed Example Application</description>
  <display-name>Testbed Examples</display-name>
  <context-param>
    <param-name>applicationDef</param-name>
    <param-value>conf/testbed/tutorial-application-sentence.xml</param-value>
    <description>The Living Knowledge application description XML file</description>
  </context-param>
</web-app>

ant deploy-testbed

OPINION APPLICATION

http://localhost:8983/testbed/results.jsp?query=putin

HTML EXTRACTION

[Screenshot of a news page with labelled regions: Headline, Main Article, Other Stuff]

HTML EXTRACTION

• Boilerplate can lead to false positive results and inaccurate facet aggregation
  – Real example – before extraction was developed, the most common person for most queries was the one named in a top story title (shown on all pages) on the day of the crawl!
• Titles, Authors and Dates are important for bias and diversity aware search

PROVENANCE

• How an annotation is derived is often as important as the annotation itself
  – Users want to verify results
  – Developers need to validate results
• Open Provenance provides an open source solution
• Testbed annotations can be extended with Open Provenance chains

Provenance Diagram

SCALING TO LARGE COLLECTIONS

• In the real world, even "small" datasets have millions of documents
• NLP/Image processing is expensive
  – 1 doc/sec ≈ 11.6 days for 1 million docs!
• Hadoop Mapper allows for scaling
  – scales linearly with number of machines
• ZipCollection writer allows partitioning data into subsets for processing
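The arithmetic behind the processing-time figure is easy to check: one million documents at one document per second is one million seconds, and a day has 86,400 seconds. The sketch below also applies the linear-scaling claim, which is an idealisation that ignores job startup and data transfer costs; the class and method names are hypothetical.

```java
import java.util.Locale;

public class ProcessingTime {
    private static final double SECONDS_PER_DAY = 86400.0;

    // Days needed to process the collection, assuming throughput scales linearly with machines.
    static double days(long docs, double docsPerSecondPerMachine, int machines) {
        return docs / (docsPerSecondPerMachine * machines) / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long docs = 1000000L;
        for (int machines : new int[] {1, 10, 100}) {
            System.out.println(String.format(Locale.ROOT, "%d machine(s): %.2f days",
                    machines, days(docs, 1.0, machines)));
        }
    }
}
```

With a single machine this gives roughly 11.57 days, which matches the figure quoted on the slide; ten machines bring it down to a little over one day under the same assumption.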

COMPONENTS - OPINIONS

COMPONENTS - TIME

COMPONENTS - GEO

FUTURE WORK

• More components
• Maven to manage dependencies
• Better integration of Timeline and Geo visualization components
• Integration of ranking algorithms
• Better Documentation :)

Thanks!

• LivingKnowledge Partners!
• You for coming!!
• Questions?