© FINDWISE 2012 Introducing Hydra An Open Source Document Processing Framework Joel Westberg

Introducing Hydra – An Open Source Document Processing Framework

May 31, 2015



Presented by Joel Westberg, Findwise AB - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

This presentation details Hydra, the document-processing framework developed by Findwise. It describes the framework and the problem it aims to solve. We first discuss the need for scalable document processing and argue that the open source tool chain lacks a link that bridges the gap between the source system and the search engine. We then describe the design goals of Hydra and how it has been implemented to meet the demands for flexibility, robustness and ease of use. The session ends by discussing some of the possibilities this new pipeline framework offers, such as seamlessly scaling up the solution during peak loads, metadata enrichment, and proposed integration with Hadoop for Map/Reduce tasks such as PageRank calculations.
Transcript
Page 1: Introducing Hydra – An Open Source Document Processing Framework

© FINDWISE 2012

Introducing Hydra
An Open Source Document Processing Framework

Joel Westberg

Page 2: Introducing Hydra – An Open Source Document Processing Framework


About Findwise

• Founded in 2005

• Offices in Sweden, Denmark, Norway and Poland

• 80 employees (April 2012)

• Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value

Page 3: Introducing Hydra – An Open Source Document Processing Framework


Page 4: Introducing Hydra – An Open Source Document Processing Framework


Technology independent

Creating search-driven Findability solutions based on market-leading commercial and open source search technology platforms:

✓ Autonomy IDOL

✓ Microsoft (SharePoint and FAST Search products)

✓ Google GSA

✓ IBM ICA/OmniFind

✓ LucidWorks

✓ Apache Lucene/Solr

✓ Elastic Search

✓ and more…

Page 5: Introducing Hydra – An Open Source Document Processing Framework

Generic Search Architecture

Page 6: Introducing Hydra – An Open Source Document Processing Framework

Connecting source to search

Garbage in, garbage out. But what about unstructured data in?

• Flat data is richer than it appears

• Don’t discard information too soon!

The unstructured structured data paradox

Example: News articles

Plain text that contains invaluable metadata for search, such as:

• Title

• Author byline

• Lead paragraph
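As a rough illustration of how such metadata could be recovered from flat text, here is a minimal Java sketch. It is not taken from Hydra: the class name `NewsArticleFields`, the output field names, and the layout assumptions (title in the first paragraph, an optional byline starting with "By ", then the lead paragraph) are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hydra: splits a plain-text news article
// into title, author and lead paragraph using simple layout assumptions.
class NewsArticleFields {

    // Assumed byline convention: a single line of the form "By <name>".
    private static final Pattern BYLINE = Pattern.compile("^By\\s+(.+)$");

    static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return fields; // nothing to extract
        }
        // Paragraphs are assumed to be separated by at least one blank line.
        String[] blocks = trimmed.split("\\R\\s*\\R");
        int next = 0;

        // Assumption: the first paragraph is the title.
        fields.put("title", blocks[next++].trim());

        // Optional byline directly after the title.
        if (next < blocks.length) {
            Matcher m = BYLINE.matcher(blocks[next].trim());
            if (m.matches()) {
                fields.put("author", m.group(1).trim());
                next++;
            }
        }

        // The following paragraph is treated as the lead.
        if (next < blocks.length) {
            fields.put("lead", blocks[next].trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String article = "Hydra released as open source\n\n"
                + "By Jane Doe\n\n"
                + "Findwise has published its document processing framework.\n\n"
                + "More details follow in the body text.";
        System.out.println(extract(article));
    }
}
```

Real articles rarely follow one layout this strictly, so rules like these would normally be tuned per source; the point is only that the structure is present in the flat text and can be kept as separate fields.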

Page 7: Introducing Hydra – An Open Source Document Processing Framework

Enrichment and structuring possibilities

• Enrich your documents with metadata, to power your search
    • Language detection
    • Sentiment analysis
    • Headline extraction
    • Regular expression matching and extraction

• Filter out unwanted documents

• Collect statistics

• Export to staging environments
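Two of the items above, regular expression extraction and filtering, can be sketched in a few lines of Java. This is an illustration only and does not use Hydra's API: a document is modelled as a plain `Map`, and the class name `RegexEnricher`, the field names `text` and `emails`, and the filter rule are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical enrichment step, not Hydra's API: a document is modelled as a
// plain Map from field name to value.
class RegexEnricher {

    // Deliberately simple e-mail pattern for illustration; not RFC-complete.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Adds an "emails" field listing every match found in the "text" field.
    static void enrich(Map<String, Object> doc) {
        List<String> emails = new ArrayList<>();
        Object text = doc.get("text");
        if (text instanceof String) {
            Matcher m = EMAIL.matcher((String) text);
            while (m.find()) {
                emails.add(m.group());
            }
        }
        doc.put("emails", emails);
    }

    // Filter rule (an assumption): keep only documents with non-blank text.
    static boolean shouldKeep(Map<String, Object> doc) {
        Object text = doc.get("text");
        return text instanceof String && !((String) text).trim().isEmpty();
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("text", "Contact alice@example.com or bob@example.org.");
        if (shouldKeep(doc)) {
            enrich(doc);
        }
        System.out.println(doc.get("emails"));
    }
}
```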

Page 8: Introducing Hydra – An Open Source Document Processing Framework

Classic Pipeline

Page 9: Introducing Hydra – An Open Source Document Processing Framework

Classic Architecture

Page 10: Introducing Hydra – An Open Source Document Processing Framework

The Hydra Architecture

Page 11: Introducing Hydra – An Open Source Document Processing Framework

Main Design Objectives

Scalability

• Horizontally scalable central repository

• Independent processing nodes

Failure tolerant

• Failure of a stage affects only a single document

• Failure of a node affects at most n documents

• Failures can be automatically detected

Robustness

• Independent stages

Development ease

• Debug stages from IDE against actual data

• Allow test-driven pipeline development
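The failure-tolerance goal, that a failing stage affects only the document it was processing, can be sketched as follows. This is a simplified, single-process illustration with hypothetical names (`FailureIsolationSketch`, `Stage`, `runStage`, the `error` field); it does not show Hydra's central repository or how Hydra itself records failures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins, not Hydra's classes: documents are plain Maps and a
// stage is any operation that may fail on a single document.
class FailureIsolationSketch {

    interface Stage {
        void process(Map<String, Object> doc) throws Exception;
    }

    // Example stage: upper-cases the "title" field and fails if it is missing.
    static final Stage UPPERCASE_TITLE = doc -> {
        Object title = doc.get("title");
        if (!(title instanceof String)) {
            throw new IllegalStateException("title is missing");
        }
        doc.put("title", ((String) title).toUpperCase());
    };

    // Runs one stage over a batch. A failure is recorded on the failing
    // document only; the remaining documents are still processed.
    static List<Map<String, Object>> runStage(Stage stage,
                                              List<Map<String, Object>> docs) {
        List<Map<String, Object>> failed = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            try {
                stage.process(doc);
            } catch (Exception e) {
                doc.put("error", e.getMessage());
                failed.add(doc);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        Map<String, Object> good = new HashMap<>();
        good.put("title", "hydra");
        Map<String, Object> bad = new HashMap<>();
        bad.put("body", "no title here");

        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(bad);
        docs.add(good);

        List<Map<String, Object>> failed = runStage(UPPERCASE_TITLE, docs);
        System.out.println("failed: " + failed.size() + ", good doc: " + good);
    }
}
```

Because each document is handled inside its own try/catch, the same stage can also be exercised from a unit test or an IDE against a handful of sample documents, which is in the spirit of the development-ease goals listed above.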

Page 12: Introducing Hydra – An Open Source Document Processing Framework

The Hydra Architecture

Page 13: Introducing Hydra – An Open Source Document Processing Framework

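Writing a Stage - Example

// Imports of the Hydra classes are omitted, as on the original slide.
@Stage(description = "This is a Simple Writer")
public class SimpleWriter extends AbstractProcessStage {

    // Stage parameters, set from the stage configuration.
    @Parameter(description = "Name of field to write value to")
    private String field;

    @Parameter(description = "Value to write")
    private Object value;

    // Writes the configured value into the configured field of the document.
    @Override
    public void process(LocalDocument doc) throws ProcessException {
        doc.putContentField(field, value);
    }

    // Validates the configuration: the "field" parameter is required.
    @Override
    public void init() throws RequiredArgumentMissingException {
        if (field == null) {
            throw new RequiredArgumentMissingException("field is missing");
        }
    }
}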

Page 14: Introducing Hydra – An Open Source Document Processing Framework

Hadoop/Big Data integration

Use cases for document enrichment

• PageRank

• Analytics

Hadoop & Map/Reduce advantages

• Huge scalability

• Ability to work on entire document set at once

Hadoop & Map/Reduce drawbacks

• Batch processing

• Time-to-index
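To see why PageRank needs the entire document set at once, consider this small in-memory sketch of the standard power-iteration form of the algorithm. It is plain Java, not Hadoop or Map/Reduce code, and the class name `PageRankSketch`, the graph representation and the parameter values are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// In-memory illustration only; a real deployment would distribute this work.
class PageRankSketch {

    // links maps each document id to the ids it links to.
    static Map<String, Double> pageRank(Map<String, List<String>> links,
                                        int iterations, double damping) {
        // Collect every document that appears as a source or a target.
        Set<String> nodes = new TreeSet<>(links.keySet());
        for (List<String> targets : links.values()) {
            nodes.addAll(targets);
        }
        int n = nodes.size();

        Map<String, Double> rank = new HashMap<>();
        for (String node : nodes) {
            rank.put(node, 1.0 / n);
        }

        for (int i = 0; i < iterations; i++) {
            // Rank held by documents without outgoing links is spread evenly.
            double dangling = 0.0;
            for (String node : nodes) {
                if (links.getOrDefault(node, Collections.emptyList()).isEmpty()) {
                    dangling += rank.get(node);
                }
            }
            double base = (1.0 - damping) / n + damping * dangling / n;

            Map<String, Double> next = new HashMap<>();
            for (String node : nodes) {
                next.put(node, base);
            }
            // Each document passes a damped share of its rank to its targets.
            for (String node : nodes) {
                List<String> out = links.getOrDefault(node, Collections.emptyList());
                for (String target : out) {
                    next.merge(target, damping * rank.get(node) / out.size(), Double::sum);
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b"));
        links.put("b", Arrays.asList("a"));
        links.put("c", Arrays.asList("a"));
        System.out.println(pageRank(links, 100, 0.85));
    }
}
```

Every iteration reads the rank of every document that links to a given document, so the score of one document cannot be computed while looking at that document alone. This is the kind of whole-collection work that suits a batch job, at the cost of the batch-processing and time-to-index drawbacks listed above.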

Page 15: Introducing Hydra – An Open Source Document Processing Framework

Hadoop/Big Data integration

Blue – First round of indexing only
Red – Second round of indexing
Purple – All documents

Page 16: Introducing Hydra – An Open Source Document Processing Framework

Future Configuration UI

Page 17: Introducing Hydra – An Open Source Document Processing Framework

Open Source initiative

• Other committers

• The role of Findwise

For more information:

• http://www.findwise.com/hydra

• http://findwise.github.com/Hydra

• Email: [email protected]

Page 18: Introducing Hydra – An Open Source Document Processing Framework

Questions?

Page 19: Introducing Hydra – An Open Source Document Processing Framework

Joel Westberg [email protected]

Thank you!