Lucene Tutorial - Klinton Bicknell

Lucene Tutorial

borrowing from: Chris Manning and Pandu Nayak

Introduction to Information Retrieval

Open source IR systems
▪ Widely used academic systems
  ▪ Terrier (Java, U. Glasgow) http://terrier.org
  ▪ Indri/Galago/Lemur (C++ (& Java), U. Mass & CMU)
  ▪ Tail of others (Zettair, …)
▪ Widely used non-academic open source systems
  ▪ Lucene
    ▪ Things built on it: Solr, ElasticSearch
  ▪ A few others (Xapian, …)



Lucene
▪ Open source Java library for indexing and searching
  ▪ Lets you add search to your application
  ▪ Not a complete search system by itself
  ▪ Written by Doug Cutting
▪ Used by: Twitter, LinkedIn, Zappos, CiteSeer, Eclipse, …
  ▪ … and many more (see http://wiki.apache.org/lucene-java/PoweredBy)
▪ Ports/integrations to other languages
  ▪ C/C++, C#, Ruby, Perl, Python, PHP, …



Based on “Lucene in Action” by Michael McCandless, Erik Hatcher, Otis Gospodnetic

Covers Lucene 3.0.1. It’s now up to 5.3.1


Resources
▪ Lucene: http://lucene.apache.org
▪ Lucene in Action: http://www.manning.com/hatcher3/
  ▪ Code samples available for download
▪ Ant: http://ant.apache.org/
  ▪ Java build system used by “Lucene in Action” code



Lucene in a search system

[Diagram]
Indexing: Raw Content → Acquire content → Build document → Analyze document → Index document → Index
Searching: Users → Search UI → Build query → Run query (against the Index) → Render results


Lucene demos
▪ Source files in lia2e/src/lia/meetlucene/
▪ Actual sources use Lucene 3.0.1
  ▪ Code in these slides upgraded to Lucene 5
▪ Command line Indexer
  ▪ lia.meetlucene.Indexer
▪ Command line Searcher
  ▪ lia.meetlucene.Searcher


Core indexing classes
▪ IndexWriter
  ▪ Central component that allows you to create a new index, open an existing one, and add, remove, or update documents in an index
  ▪ Built on an IndexWriterConfig and a Directory
▪ Directory
  ▪ Abstract class that represents the location of an index
▪ Analyzer
  ▪ Extracts tokens from a text stream


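Creating an IndexWriter

    import java.io.IOException;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    ...

    private IndexWriter writer;

    public Indexer(String dir) throws IOException {
      // Lucene 5 opens a directory from a java.nio.file.Path, not a File
      Directory indexDir = FSDirectory.open(Paths.get(dir));
      Analyzer analyzer = new StandardAnalyzer();
      IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
      cfg.setOpenMode(OpenMode.CREATE);  // create a new index, overwriting any existing one
      writer = new IndexWriter(indexDir, cfg);
    }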


Core indexing classes (contd.)
▪ Document
  ▪ Represents a collection of named Fields. Text in these Fields is indexed.
▪ Field
  ▪ Note: Lucene Fields can represent both “fields” and “zones” as described in the textbook
    ▪ Or even other things like numbers.
  ▪ StringFields are indexed but not tokenized
  ▪ TextFields are indexed and tokenized


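A Document contains Fields

    import java.io.File;
    import java.io.FileReader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    ...
    protected Document getDocument(File f) throws Exception {
      Document doc = new Document();
      // tokenized and indexed from the file contents; not stored
      doc.add(new TextField("contents", new FileReader(f)));
      // indexed as a single token and stored so it can be displayed
      doc.add(new StringField("filename", f.getName(), Field.Store.YES));
      doc.add(new StringField("fullpath", f.getCanonicalPath(), Field.Store.YES));
      return doc;
    }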


Index a Document with IndexWriter

    private IndexWriter writer;
    ...
    private void indexFile(File f) throws Exception {
      Document doc = getDocument(f);
      writer.addDocument(doc);
    }


Indexing a directory

    private IndexWriter writer;
    ...
    public int index(String dataDir, FileFilter filter) throws Exception {
      File[] files = new File(dataDir).listFiles();
      for (File f : files) {
        if (... && (filter == null || filter.accept(f))) {
          indexFile(f);
        }
      }
      return writer.numDocs();
    }


Closing the IndexWriter

    private IndexWriter writer;
    ...
    public void close() throws IOException {
      writer.close();
    }


The Index
▪ The Index is the kind of inverted index we know and love
▪ The default Lucene50 codec is:
  ▪ variable-byte and fixed-width encoding of delta values
  ▪ multi-level skip lists
  ▪ natural ordering of docIDs
  ▪ encodes both term frequencies and positional information
▪ APIs to customize the codec
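
To make "variable-byte encoding of delta values" concrete, here is a minimal sketch in plain Java. It is not the Lucene50 codec's actual on-disk format. The class name VByteSketch and the byte layout are assumptions chosen for illustration: 7 payload bits per byte, low-order bits first, high bit set when more bytes follow. The sketch stores the gap between consecutive docIDs instead of the docIDs themselves, so small gaps fit in a single byte.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Toy postings compression: delta-encode sorted docIDs, then variable-byte encode each gap. */
final class VByteSketch {

    /** Encodes strictly increasing, non-negative docIDs as variable-byte gaps. */
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            int gap = docId - previous;   // delta to the previous docID
            previous = docId;
            // Emit 7 bits at a time, low-order first; a set high bit means "more bytes follow".
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    /** Reverses encode(): rebuilds each gap and sums the gaps back into docIDs. */
    static List<Integer> decode(byte[] bytes) {
        List<Integer> docIds = new ArrayList<>();
        int previous = 0;
        int gap = 0;
        int shift = 0;
        for (byte b : bytes) {
            gap |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;               // continuation bit set: keep reading this gap
            } else {
                previous += gap;          // gap complete: turn it back into a docID
                docIds.add(previous);
                gap = 0;
                shift = 0;
            }
        }
        return docIds;
    }
}
```

For the docIDs 3, 130, 131, 1000 the gaps are 3, 127, 1, 869. The first three take one byte each and the last takes two, so four docIDs fit in five bytes instead of sixteen.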


Core searching classes
▪ IndexSearcher
  ▪ Central class that exposes several search methods on an index
  ▪ Accessed via an IndexReader
▪ Query
  ▪ Abstract query class. Concrete subclasses represent specific types of queries, e.g., matching terms in fields, boolean queries, phrase queries, …
▪ QueryParser
  ▪ Parses a textual representation of a query into a Query instance


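IndexSearcher

[Diagram: a Query goes into the IndexSearcher, which returns TopDocs. The IndexSearcher reads the index through an IndexReader, which reads from a Directory.]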


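Creating an IndexSearcher

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;
    ...
    public static void search(String indexDir, String q)
        throws IOException, ParseException {
      // Lucene 5 takes a java.nio.file.Path here, not a File
      IndexReader rdr = DirectoryReader.open(FSDirectory.open(Paths.get(indexDir)));
      IndexSearcher is = new IndexSearcher(rdr);
      ...
    }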


Query and QueryParser

    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    ...
    public static void search(String indexDir, String q)
        throws IOException, ParseException {
      ...
      // "contents" is the default field for query terms without a field prefix
      QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
      Query query = parser.parse(q);
      ...
    }


Core searching classes (contd.)
▪ TopDocs
  ▪ Contains references to the top documents returned by a search
▪ ScoreDoc
  ▪ Represents a single search result


search() returns TopDocs

    import org.apache.lucene.search.TopDocs;
    ...
    public static void search(String indexDir, String q)
        throws IOException, ParseException {
      ...
      IndexSearcher is = ...;
      ...
      Query query = ...;
      ...
      TopDocs hits = is.search(query, 10);  // ask for the top 10 hits
    }


TopDocs contain ScoreDocs

    import org.apache.lucene.search.ScoreDoc;
    ...
    public static void search(String indexDir, String q)
        throws IOException, ParseException {
      ...
      IndexSearcher is = ...;
      ...
      TopDocs hits = ...;
      ...
      for (ScoreDoc scoreDoc : hits.scoreDocs) {
        Document doc = is.doc(scoreDoc.doc);  // fetch stored fields by docID
        System.out.println(doc.get("fullpath"));
      }
    }


Closing IndexSearcher

    public static void search(String indexDir, String q)
        throws IOException, ParseException {
      ...
      IndexSearcher is = ...;
      ...
      // IndexSearcher has no close() in Lucene 5; close its IndexReader instead
      is.getIndexReader().close();
    }


How Lucene models content
▪ A Document is the atomic unit of indexing and searching
  ▪ A Document contains Fields
▪ Fields have a name and a value
  ▪ You have to translate raw content into Fields
  ▪ Examples: Title, author, date, abstract, body, URL, keywords, ...
▪ Different documents can have different fields
▪ Search a field using name:term, e.g., title:lucene
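
The document/field model can be pictured with a small in-memory index in which postings are keyed by field name plus term. This is only a toy sketch in plain Java, not how Lucene stores its index. The class FieldedIndexSketch and its methods are hypothetical names, and field values are assumed to be already tokenized.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model of fielded documents: postings are keyed by "field:term". */
final class FieldedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Adds a document given as field name -> list of already-tokenized terms; returns its docID. */
    int addDocument(Map<String, List<String>> fields) {
        int docId = nextDocId++;
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            for (String term : field.getValue()) {
                String key = field.getKey() + ":" + term;
                postings.computeIfAbsent(key, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Looks up a query of the form name:term, e.g. title:lucene. */
    SortedSet<Integer> search(String fieldColonTerm) {
        return postings.getOrDefault(fieldColonTerm, Collections.emptySortedSet());
    }
}
```

Because the field name is part of the key, a document with "lucene" in its title and another with "lucene" in its author field end up in different postings lists. Documents are also free to carry different sets of fields.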


Fields
▪ Fields may
  ▪ Be indexed or not
    ▪ Indexed fields may or may not be analyzed (i.e., tokenized with an Analyzer)
    ▪ Non-analyzed fields view the entire value as a single token (useful for URLs, paths, dates, social security numbers, ...)
  ▪ Be stored or not
    ▪ Useful for fields that you’d like to display to users


Field construction
Lots of different constructors

    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;

    Field(String name, String value, FieldType type);

value can also be specified with a Reader, a TokenStream, or a byte[].

FieldType specifies field properties.

Can also directly use sub-classes like TextField, StringField, …


Using Field properties

Index          Store   Example usage
NOT_ANALYZED   YES     Identifiers, telephone/SSNs, URLs, dates, ...
ANALYZED       YES     Title, abstract
ANALYZED       NO      Body
NO             YES     Document type, DB keys (if not used for searching)
NOT_ANALYZED   NO      Hidden keywords


Analyzer
▪ Tokenizes the input text
▪ Common Analyzers
  ▪ WhitespaceAnalyzer: splits tokens on whitespace
  ▪ SimpleAnalyzer: splits tokens on non-letters, and then lowercases
  ▪ StopAnalyzer: same as SimpleAnalyzer, but also removes stop words
  ▪ StandardAnalyzer: most sophisticated analyzer that knows about certain token types, lowercases, removes stop words, ...


Analysis example
▪ “The quick brown fox jumped over the lazy dog”
▪ WhitespaceAnalyzer
  ▪ [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
▪ SimpleAnalyzer
  ▪ [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
▪ StopAnalyzer
  ▪ [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
▪ StandardAnalyzer
  ▪ [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
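
The first three behaviours can be imitated in a few lines of plain Java. This is a rough sketch, not Lucene's implementation. The class AnalyzerSketch, its method names, and the abbreviated stop list are all assumptions made for illustration. StandardAnalyzer is left out because its token-type rules are more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy stand-ins for three analyzers; not Lucene's implementations. */
final class AnalyzerSketch {
    // Hypothetical, abbreviated stop list for illustration only.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    /** Splits on runs of whitespace and leaves case untouched. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Splits on runs of non-letters, then lowercases. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    /** Same as simple(), but drops tokens found in STOP_WORDS. */
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }
}
```

Calling the three methods on "The quick brown fox jumped over the lazy dog" reproduces the first three token lists above: whitespace() keeps the capitalized "The", simple() lowercases it, and stop() removes both occurrences of "the".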


Another analysis example
▪ “XY&Z Corporation - xyz@example.com”
▪ WhitespaceAnalyzer
  ▪ [XY&Z] [Corporation] [-] [xyz@example.com]
▪ SimpleAnalyzer
  ▪ [xy] [z] [corporation] [xyz] [example] [com]
▪ StopAnalyzer
  ▪ [xy] [z] [corporation] [xyz] [example] [com]
▪ StandardAnalyzer
  ▪ [xy&z] [corporation] [xyz@example.com]


What’s inside an Analyzer?
▪ Analyzers need to return a TokenStream

    public TokenStream tokenStream(String fieldName, Reader reader)

[Diagram: a TokenStream is either a Tokenizer or a TokenFilter. An analysis chain runs Reader → Tokenizer → TokenFilter → TokenFilter.]


Tokenizers and TokenFilters
▪ Tokenizer
  ▪ WhitespaceTokenizer
  ▪ KeywordTokenizer
  ▪ LetterTokenizer
  ▪ StandardTokenizer
  ▪ ...
▪ TokenFilter
  ▪ LowerCaseFilter
  ▪ StopFilter
  ▪ PorterStemFilter
  ▪ ASCIIFoldingFilter
  ▪ StandardFilter
  ▪ ...
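
The division of labour between a tokenizer and its filters can be sketched with ordinary Java streams: one tokenizer turns text into tokens, and each filter wraps the stream produced by the previous stage. The code below does not use Lucene's TokenStream API. AnalysisChainSketch and its members are hypothetical names that only mimic the roles of WhitespaceTokenizer, LowerCaseFilter and StopFilter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Toy analysis chain: one tokenizer followed by any number of token filters. */
final class AnalysisChainSketch {
    /** Splits on whitespace (the role a whitespace tokenizer plays). */
    static final Function<String, Stream<String>> WHITESPACE_TOKENIZER =
        text -> Arrays.stream(text.trim().split("\\s+")).filter(t -> !t.isEmpty());

    /** Lowercases every token (the role of a lowercase filter). */
    static final UnaryOperator<Stream<String>> LOWER_CASE_FILTER =
        tokens -> tokens.map(t -> t.toLowerCase(Locale.ROOT));

    /** Builds a filter that drops the given stop words (the role of a stop filter). */
    static UnaryOperator<Stream<String>> stopFilter(Set<String> stopWords) {
        return tokens -> tokens.filter(t -> !stopWords.contains(t));
    }

    /** Runs the tokenizer, then applies each filter in the order given. */
    @SafeVarargs
    static List<String> analyze(String text,
                                Function<String, Stream<String>> tokenizer,
                                UnaryOperator<Stream<String>>... filters) {
        Stream<String> tokens = tokenizer.apply(text);
        for (UnaryOperator<Stream<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens.collect(Collectors.toList());
    }
}
```

Filter order matters in a chain like this. With the stop word "the", lowercasing before the stop filter removes "The" from the input, while running the stop filter first lets "The" through because it has not been lowercased yet.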


Adding/deleting Documents to/from an IndexWriter

    void addDocument(Iterable<IndexableField> d);

IndexWriter’s Analyzer is used to analyze the document.
Important: Need to ensure that Analyzers used at indexing time are consistent with Analyzers used at searching time

    // deletes docs containing terms or matching
    // queries. The term version is useful for
    // deleting one document.
    void deleteDocuments(Term... terms);
    void deleteDocuments(Query... queries);


Index format
▪ Each Lucene index consists of one or more segments
  ▪ A segment is a standalone index for a subset of documents
  ▪ All segments are searched
  ▪ A segment is created whenever IndexWriter flushes adds/deletes
▪ Periodically, IndexWriter will merge a set of segments into a single segment
  ▪ Policy specified by a MergePolicy
▪ You can explicitly invoke forceMerge() to merge segments


Basic merge policy
▪ Segments are grouped into levels
▪ Segments within a level are roughly equal size (in log space)
▪ Once a level has enough segments, they are merged into a segment at the next level up
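
A small simulation shows how such a level-based scheme behaves as segments are flushed. It is a sketch under stated assumptions and not the code of any Lucene MergePolicy. The class MergeSketch and the parameter mergeFactor (how many same-level segments trigger a merge) are hypothetical, a segment's size is its document count, and its level is taken to be floor(log_mergeFactor(size)).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy simulation of a level-based merge policy; sizes are document counts. */
final class MergeSketch {
    private final int mergeFactor;                              // same-level segments that trigger a merge
    private final List<Integer> segments = new ArrayList<>();   // oldest segment first

    MergeSketch(int mergeFactor) {
        if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor must be >= 2");
        this.mergeFactor = mergeFactor;
    }

    /** Level = floor(log_mergeFactor(size)), so sizes within a level are similar in log space. */
    int level(int size) {
        int level = 0;
        while (size >= mergeFactor) {
            size /= mergeFactor;
            level++;
        }
        return level;
    }

    /** A flush adds a segment; merges then cascade while the newest mergeFactor segments share a level. */
    void flush(int docCount) {
        segments.add(docCount);
        while (tailSharesLevel()) {
            int merged = 0;
            for (int i = 0; i < mergeFactor; i++) {
                merged += segments.remove(segments.size() - 1);
            }
            segments.add(merged);
        }
    }

    /** True when the newest mergeFactor segments all sit at the same level. */
    private boolean tailSharesLevel() {
        int n = segments.size();
        if (n < mergeFactor) return false;
        int lvl = level(segments.get(n - 1));
        for (int i = n - mergeFactor; i < n - 1; i++) {
            if (level(segments.get(i)) != lvl) return false;
        }
        return true;
    }

    /** Returns a copy of the current segment sizes, oldest first. */
    List<Integer> segmentSizes() {
        return new ArrayList<>(segments);
    }
}
```

With a merge factor of 3 and one document per flush, nine flushes cascade into a single segment of 9 documents, and four more flushes leave segments of sizes 9, 3 and 1. The number of segments grows only logarithmically with the number of flushes.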


Searching a changing index

    Directory dir = FSDirectory.open(...);
    DirectoryReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);

The reader above does not reflect changes to the index unless you reopen it. Reopening is more resource efficient than opening a brand new reader.

    DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
    if (newReader != null) {
      reader.close();
      reader = newReader;
      searcher = new IndexSearcher(reader);
    }


Near-real-time search

    IndexWriter writer = ...;
    DirectoryReader reader = DirectoryReader.open(writer, true);
    IndexSearcher searcher = new IndexSearcher(reader);

    // Now let us say there's a change to the index using writer
    writer.addDocument(newDoc);

    DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);
    if (newReader != null) {
      reader.close();
      reader = newReader;
      searcher = new IndexSearcher(reader);
    }


QueryParser
▪ Constructor
  ▪ QueryParser(String defaultField, Analyzer analyzer);
▪ Parsing methods
  ▪ Query parse(String query) throws ParseException;
  ▪ ... and many more


QueryParser syntax examples

Query expression                    Document matches if…
java                                Contains the term java in the default field
java junit                          Contains the term java or junit or both in the default field
java OR junit                         (the default operator can be changed to AND)
+java +junit                        Contains both java and junit in the default field
java AND junit
title:ant                           Contains the term ant in the title field
title:extreme -subject:sports       Contains extreme in the title and not sports in subject
(agile OR extreme) AND java         Boolean expression matches
title:"junit in action"             Phrase matches in title
title:"junit action"~5              Proximity matches (within 5) in title
java*                               Wildcard matches
java~                               Fuzzy matches
lastmodified:[1/1/09 TO 12/31/09]   Range matches


IndexSearcher
▪ Methods
  ▪ TopDocs search(Query q, int n);
  ▪ Document doc(int docID);


TopDocs and ScoreDoc
▪ TopDocs members
  ▪ totalHits: number of documents that matched the search
  ▪ scoreDocs: array of ScoreDoc instances containing results
  ▪ getMaxScore(): returns best score of all matches
▪ ScoreDoc members
  ▪ doc: document id
  ▪ score: document score
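
A search that asks for the top n results does not need to sort every match. One common way to get there, sketched below in plain Java, is a min-heap that never holds more than n hits. This is an illustration of the idea, not Lucene's collector code. TopNSketch, Hit and Top are hypothetical names that only mirror the doc, score, totalHits and scoreDocs members listed above, and the tie-break on smaller docID is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Toy top-n selection in the spirit of TopDocs and ScoreDoc. */
final class TopNSketch {
    /** One hit: a document id and its score. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /** Result holder: total number of matches plus the best n hits, best first. */
    static final class Top {
        final int totalHits;
        final List<Hit> hits;
        Top(int totalHits, List<Hit> hits) { this.totalHits = totalHits; this.hits = hits; }
    }

    /** Keeps only the n best-scoring hits using a min-heap of size at most n. */
    static Top topN(List<Hit> matches, int n) {
        // Weakest hit first: lower score is weaker; on equal scores the larger docID is weaker.
        Comparator<Hit> weakestFirst = (a, b) -> a.score != b.score
            ? Float.compare(a.score, b.score)
            : Integer.compare(b.doc, a.doc);
        PriorityQueue<Hit> heap = new PriorityQueue<>(weakestFirst);
        for (Hit hit : matches) {
            heap.add(hit);
            if (heap.size() > n) {
                heap.poll();              // evict the current weakest hit
            }
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort(weakestFirst.reversed());   // strongest hit first
        return new Top(matches.size(), best);
    }
}
```

The heap keeps memory proportional to n rather than to the number of matches, while totalHits still reports how many documents matched overall.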


Scoring
▪ Original scoring function uses basic tf-idf scoring with
  ▪ Programmable boost values for certain fields in documents
  ▪ Length normalization
  ▪ Boosts for documents containing more of the query terms
▪ IndexSearcher provides an explain() method that explains the scoring of a document


Lucene 5.0 Scoring
▪ As well as traditional tf.idf vector space model, Lucene 5.0 has:
  ▪ BM25
  ▪ DFR (divergence from randomness)
  ▪ IB (information (theory)-based similarity)

    indexSearcher.setSimilarity(new BM25Similarity());

    // the constructor takes floats: k1, b
    BM25Similarity custom = new BM25Similarity(1.2f, 0.75f);
    indexSearcher.setSimilarity(custom);
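
To see what the two parameters k1 and b control, here is a sketch of the textbook BM25 term score in plain Java. It is a formula illustration, not the code inside BM25Similarity. The class Bm25Sketch is a hypothetical name, and the idf variant ln(1 + (N - df + 0.5) / (df + 0.5)) is one common choice assumed here.

```java
/** Toy BM25 term scoring; a formula sketch, not Lucene's BM25Similarity code. */
final class Bm25Sketch {
    final double k1;  // controls how quickly repeated terms stop adding to the score
    final double b;   // strength of document-length normalization, between 0 and 1

    Bm25Sketch(double k1, double b) {
        this.k1 = k1;
        this.b = b;
    }

    /** idf = ln(1 + (N - df + 0.5) / (df + 0.5)); positive whenever df <= N. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Score contribution of one query term in one document. */
    double score(double tf, double docLen, double avgDocLen, long docCount, long docFreq) {
        // Documents longer than average are penalized in proportion to b.
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return idf(docCount, docFreq) * (tf * (k1 + 1.0)) / (tf + k1 * lengthNorm);
    }
}
```

With k1 = 1.2 and b = 0.75, a term that occurs once in a document of average length scores exactly its idf. Raising tf increases the score but never beyond idf × (k1 + 1), and setting b = 0 switches length normalization off entirely.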