Page 1: Title

CS 5604 Information Storage and Retrieval
Solr Team Final Presentation

Presenters: Liuqing Li, Ye Wang, Anusha Pillai, Ke Tian
{liuqing, yewang16, anusha89, ketian}@vt.edu

Instructor: Dr. Edward A. Fox

Virginia Polytechnic Institute and State University
Blacksburg, VA, 24061

December 6, 2016

Page 2: Outline

• Background
• Implementation
• Problems Faced
• Lessons Learned
• Future Work
• Acknowledgement

Page 3: Background - Overview

Page 4: Background - Updates

                        Spring 2016                             Fall 2016
schema.xml              Coarse-grained                          Fine-grained
                        No copy fields                          Copy fields for all-fields search
                        Create stopwords.txt & profanity.txt    Update the two files
morphlines.conf         Two field types: string and text        Multiple field types
                        Field “time” => string                  Field “time” => datetime
                        No multiple-valued fields               Multiple-valued field parser
Basic Indexing          Small collection                        1.2 billion tweets dataset
Incremental Indexing    Virtual Cloudera (VC)                   VC & Hadoop Cluster (HC)
Recommendation          Brief description                       Implemented in VC & HC
Custom Ranking          Brief description                       Implemented in VC & HC
Solr Admin UI           Brief description                       Detailed description
                        Limited faceted search                  Detailed faceted search

Page 5: Implementation - Basic Indexing

• Live Mode
  • Continuous stream of HBase cell updates into live search indexes
  • Simple and efficient
  • Cannot handle big data
• Batch Mode
  • Batch index tables in HBase by using MapReduce jobs
  • Write index files into HDFS (/user/cs5604f16_solr/…)
  • Can handle big data

Page 6: Implementation - Basic Indexing

• schema.xml: fields configuration
  • field (e.g., ideal-cs5604f16-fake)
    • # of fields: 30
    • Types: string (22), text_general (2), int (2), float (2), long (1), date (1)
    • Stored: True (17), False (13)
  • dynamicField: matching multiple fields, using wildcard
  • copyField
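The dynamicField and copyField ideas on this slide can be illustrated with a small Python sketch. This is not Solr code and not the team's schema: the patterns (`*_s`, `*_i`, `*_t`), the field `clean_text_t`, and the catch-all destination `text` are hypothetical, and taking the first matching pattern is a simplification of how Solr resolves dynamic fields.

```python
from fnmatch import fnmatchcase

# Hypothetical dynamicField patterns mapped to field types (not the team's schema.xml).
DYNAMIC_FIELDS = {"*_s": "string", "*_i": "int", "*_t": "text_general"}

# Hypothetical copyField rules: (source pattern, destination catch-all field).
COPY_FIELDS = [("*_s", "text"), ("*_t", "text")]


def resolve_type(field_name):
    """Return the type of the first dynamic pattern that matches, or None."""
    for pattern, field_type in DYNAMIC_FIELDS.items():
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


def apply_copy_fields(doc):
    """Return a copy of doc with matching source values appended to each destination."""
    result = dict(doc)
    for name, value in doc.items():
        for pattern, dest in COPY_FIELDS:
            if fnmatchcase(name, pattern):
                result.setdefault(dest, []).append(value)
    return result


doc = {"topic_label_s": "social", "t_month_i": 11, "clean_text_t": "hello world"}
types = {name: resolve_type(name) for name in doc}
indexed = apply_copy_fields(doc)
```

Here `indexed["text"]` collects the string and text values, which is the kind of single field an all-fields search could query.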

Page 7: Implementation - Basic Indexing

• stopword.txt and profanity.txt
  • stopword.txt: tf-idf values are not calculated for the listed words
  • profanity.txt: quick response for search queries that contain the listed words
  • Solr loads the two files while reading schema.xml

Source:
https://pypi.python.org/pypi/many-stop-words
http://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/
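A minimal Python sketch of what the two word lists do at query time. The word sets are placeholders rather than the contents of the real files, and treating "quick response" as an immediate, empty answer is an assumption about the intended behavior.

```python
# Placeholder word lists; the real ones would be read from stopword.txt and profanity.txt.
STOPWORDS = {"the", "a", "of", "and"}
PROFANITY = {"badword"}


def analyze_query(query):
    """Lowercase and split a query, flag profanity, and drop stop words."""
    tokens = query.lower().split()
    if any(token in PROFANITY for token in tokens):
        # Assumed behavior: answer immediately without searching the index.
        return {"blocked": True, "terms": []}
    return {"blocked": False, "terms": [t for t in tokens if t not in STOPWORDS]}
```

For example, `analyze_query("The history of Twitter")` keeps only `history` and `twitter` as searchable terms.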

Page 8: Implementation - Basic Indexing

• morphlines.conf: mapping and parsing
  • Mapping data from HBase to Solr
  • Split multiple values into list, e.g., "topic_label_s": "twitter;social;media;text"
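The multi-value split can be pictured with the following Python sketch. morphlines.conf does this through its own configuration commands; the code only mirrors the transformation, and the helper name `split_multivalued` is made up for illustration.

```python
def split_multivalued(record, field, separator=";"):
    """Return a copy of record with the named field split into a list of values."""
    parsed = dict(record)
    raw = parsed.get(field, "")
    parsed[field] = [value.strip() for value in raw.split(separator) if value.strip()]
    return parsed


row = {"topic_label_s": "twitter;social;media;text"}
parsed_row = split_multivalued(row, "topic_label_s")
```

After the split, `topic_label_s` holds four separate values, so each label can be matched or faceted on its own.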

Page 9: Implementation - Basic Indexing

• Index the big dataset

                    ideal-cs5604f16                 ideal-cs5604f16-1204
Dataset             All collections (raw tweets)    All collections (raw tweets + processed data)
Indexing
  # of DataNode     18                              17
  Space Cost        392.33 GB                       399.21 GB
Time Cost
  Mapping           1h21m                           1h45m
  Reducing          5h11m                           5h13m
  Merging           3h18m                           3h10m
  Total             9h50m                           10h8m
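As a quick arithmetic check, the totals in the table are the sums of the mapping, reducing, and merging times. The short Python sketch below recomputes them from the figures on the slide; the helper names are illustrative only.

```python
def to_minutes(duration):
    """Convert a string such as '1h21m' to minutes."""
    hours, rest = duration.split("h")
    return int(hours) * 60 + int(rest.rstrip("m"))


def to_duration(minutes):
    """Convert minutes back to the 'XhYm' notation used in the table."""
    return f"{minutes // 60}h{minutes % 60}m"


# Mapping, reducing, and merging times per collection, copied from the table.
phases = {
    "ideal-cs5604f16": ["1h21m", "5h11m", "3h18m"],
    "ideal-cs5604f16-1204": ["1h45m", "5h13m", "3h10m"],
}
totals = {name: to_duration(sum(map(to_minutes, times))) for name, times in phases.items()}
```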

Page 10: Implementation - Incremental Indexing

• Purpose
  • Process a continuous stream of HBase cell updates into live search indexes (Near Real-Time, NRT Indexing)
  • Solve the problem of frequent inserts, deletes and updates
• How does it work?
  • Enabling HBase replication (column family)
  • Pointing an NRT Indexer Service at an HBase table
  • Starting an NRT Indexer Service
• Our work

Source: http://www.cloudera.com/documentation/enterprise/5-6-x/topics/search_config_hbase_indexer_for_search.html

Page 11: Implementation - Incremental Indexing

Create and check the NRT indexer

Page 12: Implementation - Incremental Indexing

Restart the HBase Solr Indexer service

• Restart the service in VC
• Restart the service in HC

Page 13: Implementation - Incremental Indexing

• Create and check the NRT indexer
• Check the results in HBase and Solr Admin UI

Page 14: Implementation - Recommendation

• Types
  • Textual similarity based
  • Collaborative filtering
• MoreLikeThis Component
  • Identifies similar documents to search result documents
  • Can be configured as a request handler or search component
  • Uses term vectors to compute similarity
    • Term vector can be calculated during query runtime or precomputed during indexing
  • Extracts highest matching terms based on tf-idf similarity
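To make the tf-idf idea concrete, here is a toy Python sketch of textual-similarity recommendation. It is a simplification and not Solr's MoreLikeThis implementation: it compares documents by cosine similarity over tf-idf vectors, whereas MoreLikeThis selects high-scoring terms and searches with them. The three documents are invented.

```python
import math
from collections import Counter

# Toy corpus; the documents are invented for illustration.
DOCS = {
    "d1": "solr indexes tweets about hurricane relief",
    "d2": "hurricane relief tweets flood the timeline",
    "d3": "hbase stores tweets for the solr team",
}


def tfidf_vector(doc_id):
    """Return {term: tf * idf} for one document of the toy corpus."""
    tf = Counter(DOCS[doc_id].split())
    n_docs = len(DOCS)
    vector = {}
    for term, count in tf.items():
        df = sum(1 for text in DOCS.values() if term in text.split())
        vector[term] = count * math.log(n_docs / df)
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def more_like_this(doc_id):
    """Rank the other documents by similarity to doc_id, most similar first."""
    source = tfidf_vector(doc_id)
    scores = {other: cosine(source, tfidf_vector(other)) for other in DOCS if other != doc_id}
    return sorted(scores, key=scores.get, reverse=True)
```

A term such as `tweets`, which occurs in every document, gets an idf of zero and so contributes nothing to the similarity, which is the same reason very common words make poor recommendation signals.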

Page 15: Implementation - Recommendation

• schema.xml
  • Set stored=true
  • Set termVectors=true (for calculating tf-idf)
  • After making changes, reindexing is mandatory
• solrconfig.xml
  • Enable mlt
  • Define other configuration parameters
    • e.g., mlt.fl, mlt.mintf, mlt.mindf, mlt.maxdf, mlt.qf
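The mlt.* parameters can also be passed per request. The Python sketch below only assembles such a request URL and sends nothing. The host and port, the field `clean_text_t`, and the document id are assumptions for illustration; the collection name is the one used elsewhere in the slides.

```python
from urllib.parse import urlencode

# Hypothetical host and port; the collection name appears in the slides.
SOLR_BASE = "http://localhost:8983/solr/ideal-cs5604f16-fake"


def build_mlt_url(query, fields, min_tf=1, min_df=1, count=5):
    """Build a /select URL that asks the MoreLikeThis search component for similar docs."""
    params = {
        "q": query,
        "mlt": "true",               # switch the component on for this request
        "mlt.fl": ",".join(fields),  # fields used to judge similarity
        "mlt.mintf": min_tf,         # minimum term frequency in the source document
        "mlt.mindf": min_df,         # minimum number of documents containing the term
        "mlt.count": count,          # similar documents returned per result
        "wt": "json",
    }
    return f"{SOLR_BASE}/select?{urlencode(params)}"


# The id value and the field name are hypothetical.
url = build_mlt_url("id:12345", ["clean_text_t"])
```

Raising `mlt.mintf` or `mlt.mindf` narrows the set of terms considered, which usually trades recall for more focused recommendations.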

Page 16: Implementation - Recommendation

• Request Handler

Link: https://drive.google.com/open?id=0B2iasHDgHqGyYUk0R3RkVktkM2M

Page 17: Implementation - Recommendation

• Search Component

Link: https://drive.google.com/open?id=0B2iasHDgHqGyU0doVEpidlh3c2c

Page 18: Implementation - Custom Ranking

• Purpose
  • Customize and optimize the ranked results
• How does it work?
  • Search Component
    • prepare(): pre-processing, invoked before the query is executed
    • process(): post-processing, invoked after all the results are fetched
  • Custom Scoring
  • Re-ranking

\[
\text{Score} = \text{Doc}_{\text{score},\,\text{Solr}} + \text{Doc}_{\text{importance}} + W_{\text{topic}} \times \text{Doc}_{\text{score},\,\text{topic}} + W_{\text{cluster}} \times \text{Doc}_{\text{score},\,\text{cluster}}
\]
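The re-ranking step can be sketched in Python as follows. The formula adds the Solr score, the document importance, and the weighted topic and cluster scores. The weights and sample values are invented, and the team's actual component is a Java SearchComponent, so this only illustrates the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    solr_score: float     # relevance score returned by Solr
    importance: float     # per-document importance value
    topic_score: float    # score from topic analysis
    cluster_score: float  # score from clustering


def custom_score(doc, w_topic, w_cluster):
    """Solr score + importance + weighted topic score + weighted cluster score."""
    return (doc.solr_score + doc.importance
            + w_topic * doc.topic_score
            + w_cluster * doc.cluster_score)


def rerank(docs, w_topic=0.5, w_cluster=0.5):
    """Return docs sorted by the combined score, highest first."""
    return sorted(docs, key=lambda d: custom_score(d, w_topic, w_cluster), reverse=True)


# Invented sample values and weights.
results = [
    Doc("a", solr_score=2.0, importance=0.1, topic_score=0.2, cluster_score=0.2),
    Doc("b", solr_score=1.5, importance=0.3, topic_score=0.9, cluster_score=0.8),
]
order = [d.doc_id for d in rerank(results)]
```

In this example document `b` overtakes `a` despite a lower Solr score, because its topic and cluster scores add more than the gap.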

Page 19: Implementation - Custom Ranking

Build and copy jar file into Hadoop Cluster

Page 20: Implementation - Custom Ranking

• Build and copy jar file into Hadoop Cluster
• Modify the solrconfig.xml

Page 21: Implementation - Custom Ranking

• Update the instanceDir
• Reload the collection
• Check the results in Solr Admin UI

Page 22: Implementation - Solr Admin UI

Choose ideal-cs5604f16-fake for querying

• Dashboard: provides basic functions for users to choose (e.g., Logging, to check Solr logs for debugging)
• Core Selector: select the core (dataset) for queries
• Solr instance information: current versions, JVM information

Page 23: Implementation - Solr Admin UI

• The request handler: /select
• The query event: q
• Parameters for query: fq (filter queries), sort (descending or ascending)
• Execute query
• Results outputs: json format

Labels in the results view: field name, result statistics
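The query form in the Admin UI corresponds to an HTTP request against the /select handler. The Python sketch below builds such a request URL without sending it. The host and port and the example filter values are assumptions; `q`, `fq`, `sort`, `rows`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical host and port; the collection name appears in the slides.
SOLR_BASE = "http://localhost:8983/solr/ideal-cs5604f16-fake"


def build_select_url(q, fq=None, sort=None, rows=10):
    """Build a /select URL with optional filter queries and a sort clause."""
    params = [("q", q), ("rows", rows), ("wt", "json")]
    for filter_query in fq or []:
        params.append(("fq", filter_query))  # each fq narrows the result set
    if sort:
        params.append(("sort", sort))        # e.g. "field desc" or "field asc"
    return f"{SOLR_BASE}/select?{urlencode(params)}"


# Example values are invented; t_month_i is the field used in the faceting slide.
url = build_select_url("*:*", fq=["t_month_i:11"], sort="t_month_i desc")
parsed = parse_qs(urlparse(url).query)
```

Keeping restrictions in `fq` rather than in `q` leaves the main query free to drive the relevance score.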

Page 24: Implementation - Solr Admin UI

• The faceted search query: range
• Faceted search field: t_month_i
• Parameters, true when enabled
• Search results: counts
• Search results: details
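A range facet over t_month_i can be requested with Solr's facet.range parameters. The Python sketch below builds the parameters and shows one way to read the counts. The range bounds assume the field holds month numbers 1 to 12, and the flat label/count list is an invented sample of how the JSON response is assumed to present the buckets.

```python
from urllib.parse import urlencode


def build_range_facet_params(field, start, end, gap):
    """Return Solr parameters for a range facet over a numeric field."""
    return {
        "q": "*:*",
        "rows": 0,                 # only the facet counts are needed
        "facet": "true",           # faceting is off unless enabled
        "facet.range": field,
        "facet.range.start": start,
        "facet.range.end": end,
        "facet.range.gap": gap,
        "wt": "json",
    }


def pair_counts(flat_counts):
    """Turn a flat [label, count, label, count, ...] list into a dict."""
    return dict(zip(flat_counts[0::2], flat_counts[1::2]))


# Bounds assume t_month_i stores month numbers 1 to 12.
params = build_range_facet_params("t_month_i", 1, 13, 1)
query_string = urlencode(params)

# Invented sample of a flat label/count list.
counts = pair_counts(["1", 40, "2", 25, "3", 0])
```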

Page 25: Problems Faced

Cloudera and OS
  • Virtual Cloudera is slow and often crashes because of limited memory
  • Not familiar with the whole architecture at the beginning
  • Versions of Cloudera and Solr

Data
  • Consistency check
  • Not enough real data available to perform tests
  • Not much information available regarding logs to perform collaborative filtering

Collaboration
  • Communication and modification

Page 26: Lessons Learned

• Solr
• HBase
• HDFS
• Patience
• Carefulness
• Team Collaboration

Page 27: Future Work

Search
  • Customize more request handlers
  • Deal with the profanity issue

Custom Ranking
  • Customize more search components

Recommendation
  • Create a custom recommendation component (Probabilities - CTA team)
  • Implement the collaborative filtering (Log files - FE team)

Solr
  • Figure out SolrCloud, multiple Solr nodes in Cloudera Search

Page 28: Acknowledgement

Projects
  • NSF IIS-1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL)
  • NSF IIS-1619028, III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR)

Teams
  • CMT, CMW, CLA, CTA, FE teams

Persons
  • Instructor: Dr. Edward A. Fox
  • GRA: Sunshin Lee

Page 29: Thank you!

Questions?