How to Access Your Library Book Collections Using Solr

Post on 04-Dec-2014

Category:

Technology

DESCRIPTION

Presented by Engy Ali | The Library of Alexandria. Conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Do you have a large collection of text content that you want to search? Are you struggling to facet after performing a full-text search across metadata and content? Do you want to use Solr with personalization? Bibliotheca Alexandrina provides public access to a digitized collection of more than 220,000 books through a web-based search and browsing facility. The facility is built entirely on Solr and supports five languages. The website provides full-text morphological search within the books’ metadata and content, with result highlighting. Personalization features such as annotation tools and tagging are also implemented using Solr. This presentation covers how Bibliotheca Alexandrina uses Solr to implement full-text indexing and searching across the entire collection, faceting, searching within the content of a book with result highlighting, and the techniques used for personalization.

Transcript

5/14/12   http://dar.bibalex.org

Accessing Your Library Book Collections Using Solr

By: Engy Morsy Software project manager, Bibliotheca Alexandrina

engy.morsy@bibalex.org  

 

BA & Solr

http://bibalex.org

http://wamcp.bibalex.org

http://ssc.bibalex.org

http://dar.bibalex.org

Introductory  Video  

Agenda  

• Brief introduction to DAR architecture
• Indexing the book collection
• Searching across metadata and content
• Faceting
• Searching book content
• Solr with personalization
• Future
• Q&A

About  1.5  Million  books  

Digital  Assets  Repository  

Book  site  

• Approximately 260,000 books
• Nearly 220,000 books published online
• About 1.5 TB of content
• Average book size 6 MB
• Daily indexing rate is about 150 books.

What  do  we  want…?  

•  Allow  simple  and  advanced  search  across  metadata  and  content  in  5  languages  

Simple  Search  

What  do  we  want…?  

•  Allow  simple  and  advanced  search  across  metadata  and  content  in  5  languages  

• Faceting

What  do  we  want…?  

•  Allow  simple  and  advanced  search  across  metadata  and  content  in  5  languages  

• Faceting
• Annotations

Text  Underlining  

Text Highlighting

Adding Sticky Notes

What  do  we  want…?  

•  Allow  simple  and  advanced  search  across  metadata  and  content  in  5  languages  

• Faceting
• Annotations
• Personalization

Arranging  Books  in  Bookshelves  

Submitting Comments

Rating

Embedding  

Sharing  the  book  link  in  other  social  networks  

What  lies  beneath!!  

Book  site  indices  

AR  Index  

EN    Index  

FR  Index  

IT  Index  

SP  Index  

Query  

Indexing Book Collection

• Index per language
• A document in the content index corresponds to a page in a book
• Maintain a field to distinguish between metadata records and content records (e.g. SolrType)
• Use static fields for all content indexes (e.g. PageID, etc.)
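
The record layout described above can be sketched as follows. This is only an illustration under assumptions: `SolrType` and `PageID` come from the slides, while `id`, `BookID`, `Text` and the helper name `build_book_documents` are hypothetical names chosen for the example.

```python
from typing import Dict, List


def build_book_documents(book_id: str, metadata: Dict[str, str], pages: List[str]) -> List[dict]:
    """Return one metadata record plus one content record per page.

    Field names other than SolrType and PageID are assumptions for this sketch.
    """
    # The metadata record: one per book, tagged with SolrType = "Meta".
    docs = [{"id": f"{book_id}-meta", "SolrType": "Meta", "BookID": book_id, **metadata}]
    # The content records: one per page, tagged with SolrType = "Content".
    for page_number, text in enumerate(pages, start=1):
        docs.append({
            "id": f"{book_id}-p{page_number}",
            "SolrType": "Content",
            "BookID": book_id,
            "PageID": page_number,
            "Text": text,
        })
    return docs


docs = build_book_documents("b42", {"Title": "Mobile Technology"}, ["intro", "cloud computing"])
```

A two-page book therefore yields three records, and a filter on `SolrType` separates the metadata record from the page records.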

What is the problem with this solution?

Problem  for  content  search  

Example: Advanced Search
Search for Title: Mobile Technology AND Content: “cloud computing”

SolrType        Content  

SolrType      Meta  

Proposed solution

Title:  Mobile  Technology  

Content: “cloud computing”

.. index

.. index

Get intersection

Result  IDs  

Facet  result  

Final  result  

Parent  Book  IDs  

..  index  
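
The flow in this diagram can be sketched in memory. The sketch makes several assumptions: plain substring matching stands in for the real Solr queries, and the field names `BookID`, `Title`, `Text` and `Category` as well as the function name are hypothetical.

```python
from collections import Counter


def advanced_search(meta_docs, content_docs, title_term, content_phrase, facet_field):
    """Two-index search: intersect IDs, then go back to the metadata for facets."""
    # Step 1: metadata records whose title matches.
    title_ids = {d["BookID"] for d in meta_docs
                 if title_term.lower() in d["Title"].lower()}
    # Step 2: page records that contain the phrase, reduced to parent book IDs.
    parent_ids = {d["BookID"] for d in content_docs
                  if content_phrase.lower() in d["Text"].lower()}
    # Step 3: the intersection is the set of matching books.
    result_ids = title_ids & parent_ids
    # Step 4: a second pass over the metadata is needed to compute the facets.
    facets = Counter(d[facet_field] for d in meta_docs if d["BookID"] in result_ids)
    return sorted(result_ids), dict(facets)


meta = [
    {"BookID": "b1", "Title": "Mobile Technology", "Category": "Computing"},
    {"BookID": "b2", "Title": "Mobile Technology Today", "Category": "Telecom"},
    {"BookID": "b3", "Title": "Gardening", "Category": "Hobby"},
]
content = [
    {"BookID": "b1", "PageID": 1, "Text": "An intro to cloud computing"},
    {"BookID": "b3", "PageID": 4, "Text": "Cloud computing for gardeners"},
    {"BookID": "b2", "PageID": 1, "Text": "Radio basics"},
]
ids, facets = advanced_search(meta, content, "Mobile Technology", "cloud computing", "Category")
```

Step 4 is the extra round trip: the facet counts cannot be produced until the intersection is known.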

The  problem  is…  

• Can’t get the faceting result directly from the content index
• Need to query the metadata index in order to get the final facet result

Processing time!!!

Solution…!

• Metadata denormalization
  – Denormalize metadata into content index
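
A minimal sketch of what denormalization means here, assuming hypothetical field names (`BookID`, `Title`, `Category`) and helper names: each page record receives a copy of its book's metadata, so facets can be computed from page records alone, but a metadata change touches every page of the book.

```python
def denormalize(meta_by_book, page_docs, fields=("Title", "Category")):
    """Copy the selected metadata fields of each book into all of its page records."""
    merged_docs = []
    for page in page_docs:
        meta = meta_by_book[page["BookID"]]
        merged = dict(page)  # leave the input records untouched
        merged.update({field: meta[field] for field in fields})
        merged_docs.append(merged)
    return merged_docs


def pages_to_reindex(book_id, page_docs):
    """Page records that must be rewritten when one book's metadata changes."""
    return [p["PageID"] for p in page_docs if p["BookID"] == book_id]


meta_by_book = {"b1": {"Title": "Mobile Technology", "Category": "Computing"}}
pages = [
    {"BookID": "b1", "PageID": 1, "Text": "intro"},
    {"BookID": "b1", "PageID": 2, "Text": "cloud computing"},
]
flat = denormalize(meta_by_book, pages)
```

In this toy data, correcting the title of book `b1` means rewriting both of its page records.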

SolrType        Content  

SolrType      Meta  

Proposed solution

Title:  Mobile  Technology  

Content: “cloud computing”

.. index

.. index

Get intersection

Result  IDs  

Facet  result  

Final  result  

 Problem  for  content  search  

• Metadata denormalization…

Worst choice!
• Re-indexing for changes in metadata
• Data processing is required.

 

New Solution

Indexing  Metadata    

• Index per language
• Separate content and metadata index
• Text field holds the whole book content in the metadata index
  – The maxFieldLength has been set to maximum (e.g. 2147483647).
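
As a rough sketch of this layout, the metadata record can carry the concatenated text of all pages in a single field. The value 2147483647 equals 2^31 - 1, the largest 32-bit signed integer, so using it as the limit effectively disables truncation. The field names `BookID` and `Text` and the helper name are assumptions for the example.

```python
MAX_FIELD_LENGTH = 2**31 - 1  # 2147483647, the value quoted on the slide


def build_metadata_document(book_id, metadata, pages):
    """Build one metadata record whose Text field holds the whole book."""
    doc = {"BookID": book_id, "SolrType": "Meta", **metadata}
    # Join all pages so that a single query can match title and content together.
    doc["Text"] = "\n".join(pages)
    return doc


book = build_metadata_document(
    "b1",
    {"Title": "Mobile Technology", "Category": "Computing"},
    ["Chapter 1: intro", "Chapter 2: cloud computing"],
)
```

With this record, a title condition and a content condition can both be evaluated against the same document.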

Back  to  the  example  

Example: Advanced Search
Search for Title: Mobile Technology AND Content: “cloud computing”

Solution

Title:  Mobile  Technology  

Content: “cloud computing”

Meta  index  

Facet  result  
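
With the full text stored in the metadata index, the example search and its facets can be requested in one call. The sketch below only builds the request parameters. `q`, `rows`, `wt`, `facet` and `facet.field` are standard Solr request parameters; the field names `Title`, `Text`, `Category` and `Language`, and the choice to treat the title as a phrase, are assumptions.

```python
from urllib.parse import urlencode, parse_qs


def build_search_params(title, phrase, facet_fields, rows=10):
    """Build a query string for one combined title + content query with facets."""
    query = f'Title:"{title}" AND Text:"{phrase}"'
    params = [("q", query), ("rows", str(rows)), ("wt", "json"), ("facet", "true")]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


query_string = build_search_params("Mobile Technology", "cloud computing", ["Category", "Language"])
decoded = parse_qs(query_string)
```

The resulting string would be appended to the select URL of the metadata index for the chosen language.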

Solution

Title:  Mobile  Technology  

Content: “cloud computing”

Meta index

Content index

Get intersection

Meta  index  

Facet  result  

Separate indexes vs. all in one

• Separate indexes
+ Indexing time
+ Index size
- Processing results (facets, etc.)
- Scoring

• One index
- Index size
- Indexing time
+ Scoring
+ Processing time

Book  content  index  

AR  Index  

EN    Index  

FR  Index  

IT  Index  

SP  Index  

Searching  

• Simple and advanced search
  – Cache only the resulting IDs
• Highlighting search results
  – Get the full search result and highlight the results per page
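
One way to read "cache the IDs only" is sketched below: the ID list of a query is kept in memory, and full records are fetched just for the page being displayed. The class, its two callbacks and the page size are hypothetical; the slides do not describe the actual cache.

```python
class ResultIdCache:
    """Cache result IDs per query; fetch full records one result page at a time."""

    def __init__(self, search_ids, fetch_docs):
        self._search_ids = search_ids    # callable: query -> list of IDs
        self._fetch_docs = fetch_docs    # callable: list of IDs -> list of records
        self._ids_by_query = {}

    def page(self, query, page_number, page_size=10):
        # Run the search only the first time a query is seen.
        if query not in self._ids_by_query:
            self._ids_by_query[query] = self._search_ids(query)
        ids = self._ids_by_query[query]
        start = (page_number - 1) * page_size
        return self._fetch_docs(ids[start:start + page_size])


search_calls = []


def fake_search(query):
    search_calls.append(query)
    return list(range(25))


def fake_fetch(ids):
    return [{"id": i} for i in ids]


cache = ResultIdCache(fake_search, fake_fetch)
first_page = cache.page("solr", 1)
third_page = cache.page("solr", 3)
```

Paging through the results reuses the cached ID list, so the search itself runs once per query.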

 

 

Book  Content  Search  

• Search using
  – Search query
  – Book ID
  – List of pages’ IDs
• Highlights
• Annotations
  – Currently saved in DB
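
A search inside one book can be expressed as a query restricted by filters, with highlighting switched on. In the sketch below, `q`, `fq`, `hl` and `hl.fl` are standard Solr request parameters, while the field names `Text`, `BookID` and `PageID` follow the earlier slides only loosely and are assumptions. IDs are assumed to be simple alphanumeric values that need no escaping.

```python
from urllib.parse import urlencode, parse_qs


def build_book_content_params(query, book_id, page_ids):
    """Build a query string for searching a phrase inside selected pages of one book."""
    page_filter = " OR ".join(str(page_id) for page_id in page_ids)
    params = [
        ("q", f'Text:"{query}"'),
        ("fq", f"BookID:{book_id}"),         # restrict to a single book
        ("fq", f"PageID:({page_filter})"),   # restrict to the listed pages
        ("hl", "true"),                      # turn highlighting on
        ("hl.fl", "Text"),                   # highlight matches in the page text
    ]
    return urlencode(params)


book_query = build_book_content_params("cloud computing", "b42", [3, 4])
book_params = parse_qs(book_query)
```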

Faceting

• Fixed facet fields
  – Category, sub-category, language, etc.
  – Stored, indexed, exact fields
• Process facets from different indices
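
Processing facets from different indices can be pictured as summing the per-index counts field by field. The sketch assumes each index returns its facets as a mapping from field name to value counts; that shape and the function name are assumptions for the example.

```python
from collections import Counter


def merge_facets(per_index_facets):
    """Sum facet counts returned by several indexes, field by field."""
    merged = {}
    for facets in per_index_facets:
        for field, counts in facets.items():
            merged.setdefault(field, Counter()).update(counts)
    return {field: dict(counts) for field, counts in merged.items()}


english = {"Category": {"History": 3, "Science": 1}}
arabic = {"Category": {"History": 2}, "Language": {"ar": 2}}
combined = merge_facets([english, arabic])
```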

Personalization

• Using a separate index for personalization
  – Different Solr fields for different languages.
  – Search across all fields.
• Saving in both Solr and DB
• Indexing tags, ratings and comments using a type field
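
A personalization record along these lines might look like the sketch below: a type field tells tags, ratings and comments apart, and free text goes into a field specific to its language. All field names (`UserID`, `BookID`, `Type`, `Rating`, `Text_<language>`) and the function name are hypothetical.

```python
def build_personalization_doc(user_id, book_id, kind, value, language="en"):
    """Build one record for the personalization index."""
    if kind not in {"tag", "rating", "comment"}:
        raise ValueError(f"unknown personalization type: {kind}")
    doc = {"UserID": user_id, "BookID": book_id, "Type": kind}
    if kind == "rating":
        doc["Rating"] = int(value)
    else:
        # One text field per language, so each can be analyzed differently.
        doc[f"Text_{language}"] = value
    return doc


tag_doc = build_personalization_doc("u1", "b42", "tag", "cloud", language="en")
rating_doc = build_personalization_doc("u1", "b42", "rating", 4)
```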

 

Future  

• Book mobile application using Solr
• Using Hadoop
• Indexing other digital media (maps, audio, video)

Contact    

   

engy.morsy@bibalex.org

Library website: http://bibalex.org

Digital Asset Repository: http://dar.bibalex.org

Thank  you…  
