Lucene Tutorial borrowing from: Chris Manning and Pandu Nayak
Introduction to Information Retrieval

Open source IR systems
▪ Widely used academic systems
  ▪ Terrier (Java, U. Glasgow) http://terrier.org
  ▪ Indri/Galago/Lemur (C++ (& Java), U. Mass & CMU)
  ▪ Tail of others (Zettair, …)
▪ Widely used non-academic open source systems
  ▪ Lucene
    ▪ Things built on it: Solr, ElasticSearch
  ▪ A few others (Xapian, …)

Lucene
▪ Open source Java library for indexing and searching
  ▪ Lets you add search to your application
  ▪ Not a complete search system by itself
  ▪ Written by Doug Cutting
▪ Used by: Twitter, LinkedIn, Zappos, CiteSeer, Eclipse, …
  ▪ … and many more (see http://wiki.apache.org/lucene-java/PoweredBy)
▪ Ports/integrations to other languages
  ▪ C/C++, C#, Ruby, Perl, Python, PHP, …

Based on “Lucene in Action” by Michael McCandless, Erik Hatcher, Otis Gospodnetic
Covers Lucene 3.0.1. It’s now up to 5.3.1.

Resources
▪ Lucene: http://lucene.apache.org
▪ Lucene in Action: http://www.manning.com/hatcher3/
  ▪ Code samples available for download
▪ Ant: http://ant.apache.org/
  ▪ Java build system used by “Lucene in Action” code

Lucene in a search system
[Figure: components of a search application. Indexing side: Raw Content, Acquire content, Build document, Analyze document, Index document, Index. Search side: Users, Search UI, Build query, Run query, Render results.]

Lucene demos
▪ Source files in lia2e/src/lia/meetlucene/
  ▪ Actual sources use Lucene 3.0.1
  ▪ Code in these slides upgraded to Lucene 5
▪ Command line Indexer
  ▪ lia.meetlucene.Indexer
▪ Command line Searcher
  ▪ lia.meetlucene.Searcher

Core indexing classes
▪ IndexWriter
  ▪ Central component that allows you to create a new index, open an existing one, and add, remove, or update documents in an index
  ▪ Built on an IndexWriterConfig and a Directory
▪ Directory
  ▪ Abstract class that represents the location of an index
▪ Analyzer
  ▪ Extracts tokens from a text stream
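
Creating an IndexWriter

    import java.io.IOException;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    ...

    private IndexWriter writer;

    public Indexer(String dir) throws IOException {
        // Lucene 5 expects a java.nio.file.Path here, not a java.io.File
        Directory indexDir = FSDirectory.open(Paths.get(dir));
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
        cfg.setOpenMode(OpenMode.CREATE);  // replace any existing index
        writer = new IndexWriter(indexDir, cfg);
    }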

Core indexing classes (contd.)
▪ Document
  ▪ Represents a collection of named Fields. Text in these Fields is indexed.
▪ Field
  ▪ Note: Lucene Fields can represent both “fields” and “zones” as described in the textbook
    ▪ Or even other things like numbers.
  ▪ StringFields are indexed but not tokenized
  ▪ TextFields are indexed and tokenized
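
A Document contains Fields

    import java.io.File;
    import java.io.FileReader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    ...
    protected Document getDocument(File f) throws Exception {
        Document doc = new Document();
        // Tokenized body text, read from the file; not stored
        doc.add(new TextField("contents", new FileReader(f)));
        // Untokenized, stored values that can be shown in results
        doc.add(new StringField("filename", f.getName(), Field.Store.YES));
        doc.add(new StringField("fullpath", f.getCanonicalPath(), Field.Store.YES));
        return doc;
    }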

Index a Document with IndexWriter

    private IndexWriter writer;
    ...
    private void indexFile(File f) throws Exception {
        Document doc = getDocument(f);
        writer.addDocument(doc);
    }

Indexing a directory

    private IndexWriter writer;
    ...
    public int index(String dataDir, FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        for (File f : files) {
            if (... && (filter == null || filter.accept(f))) {
                indexFile(f);
            }
        }
        return writer.numDocs();
    }

Closing the IndexWriter

    private IndexWriter writer;
    ...
    public void close() throws IOException {
        writer.close();
    }

The Index
▪ The Index is the kind of inverted index we know and love
▪ The default Lucene50 codec uses:
  ▪ variable-byte and fixed-width encoding of delta values
  ▪ multi-level skip lists
  ▪ natural ordering of docIDs
  ▪ encodes both term frequencies and positional information
▪ APIs to customize the codec
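
The "variable-byte encoding of delta values" bullet above can be made concrete with a small sketch. This is plain JDK code, not Lucene's codec: the class name VByteSketch and the sample postings are made up for illustration, and the byte layout (7 data bits per byte, low-order group first, high bit meaning "more bytes follow") is an assumption chosen for simplicity.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: delta (gap) encoding of sorted doc IDs with variable-byte integers.
final class VByteSketch {

    // Store each doc ID as the gap from the previous one, 7 bits per byte.
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            int gap = docId - previous;
            previous = docId;
            while ((gap & ~0x7F) != 0) {          // more than 7 bits left
                out.write((gap & 0x7F) | 0x80);   // set the continuation bit
                gap >>>= 7;
            }
            out.write(gap);                        // final byte, high bit clear
        }
        return out.toByteArray();
    }

    // Reverse the process: rebuild gaps, then running sums give the doc IDs.
    static List<Integer> decode(byte[] bytes) {
        List<Integer> docIds = new ArrayList<>();
        int previous = 0;
        int gap = 0;
        int shift = 0;
        for (byte b : bytes) {
            gap |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;                        // value continues in the next byte
            } else {
                previous += gap;
                docIds.add(previous);
                gap = 0;
                shift = 0;
            }
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] postings = {3, 7, 130, 131, 20000};
        byte[] encoded = encode(postings);
        System.out.println(encoded.length + " bytes for " + decode(encoded));
    }
}
```

Small gaps fit in a single byte, which is why sorting postings by docID and storing differences tends to pay off for frequent terms.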

Core searching classes
▪ IndexSearcher
  ▪ Central class that exposes several search methods on an index
  ▪ Accessed via an IndexReader
▪ Query
  ▪ Abstract query class. Concrete subclasses represent specific types of queries, e.g., matching terms in fields, boolean queries, phrase queries, …
▪ QueryParser
  ▪ Parses a textual representation of a query into a Query instance

IndexSearcher
[Figure: a Query goes into the IndexSearcher, which returns TopDocs; the IndexSearcher sits on an IndexReader, which reads from a Directory.]

Creating an IndexSearcher

    import java.nio.file.Paths;
    import org.apache.lucene.search.IndexSearcher;
    ...
    public static void search(String indexDir, String q)
            throws IOException, ParseException {
        // Lucene 5 expects a java.nio.file.Path here, not a java.io.File
        IndexReader rdr = DirectoryReader.open(FSDirectory.open(Paths.get(indexDir)));
        IndexSearcher is = new IndexSearcher(rdr);
        ...
    }

Query and QueryParser

    import org.apache.lucene.queryparser.classic.QueryParser;  // package name since Lucene 4
    import org.apache.lucene.search.Query;
    ...
    public static void search(String indexDir, String q)
            throws IOException, ParseException {
        ...
        // "contents" is the default field for terms without a field prefix
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query query = parser.parse(q);
        ...
    }

Core searching classes (contd.)
▪ TopDocs
  ▪ Contains references to the top documents returned by a search
▪ ScoreDoc
  ▪ Represents a single search result

search() returns TopDocs

    import org.apache.lucene.search.TopDocs;
    ...
    public static void search(String indexDir, String q)
            throws IOException, ParseException {
        ...
        IndexSearcher is = ...;
        ...
        Query query = ...;
        ...
        TopDocs hits = is.search(query, 10);  // top 10 hits
    }

TopDocs contain ScoreDocs

    import org.apache.lucene.search.ScoreDoc;
    ...
    public static void search(String indexDir, String q)
            throws IOException, ParseException {
        ...
        IndexSearcher is = ...;
        ...
        TopDocs hits = ...;
        ...
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            // Fetch the stored fields of the matching document
            Document doc = is.doc(scoreDoc.doc);
            System.out.println(doc.get("fullpath"));
        }
    }

Closing IndexSearcher

    public static void search(String indexDir, String q)
            throws IOException, ParseException {
        ...
        IndexSearcher is = ...;
        ...
        // IndexSearcher has no close() since Lucene 4; close its IndexReader instead
        is.getIndexReader().close();
    }

How Lucene models content
▪ A Document is the atomic unit of indexing and searching
  ▪ A Document contains Fields
▪ Fields have a name and a value
  ▪ You have to translate raw content into Fields
  ▪ Examples: Title, author, date, abstract, body, URL, keywords, ...
  ▪ Different documents can have different fields
  ▪ Search a field using name:term, e.g., title:lucene
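
To make the Document/Field model above tangible, here is a toy fielded inverted index in plain JDK code. It is only a sketch of the idea, not how Lucene stores anything: the class FieldedIndexSketch, its methods, and the whitespace-plus-lowercase tokenization are all assumptions made for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: postings are keyed by "field:term", mirroring the name:term query form.
final class FieldedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // A "document" is a map from field name to text; documents may have different fields.
    int addDocument(Map<String, String> fields) {
        int docId = nextDocId++;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String token : field.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                String key = field.getKey() + ":" + token;
                postings.computeIfAbsent(key, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Doc IDs whose given field contains the given term.
    SortedSet<Integer> search(String field, String term) {
        SortedSet<Integer> hits = postings.get(field + ":" + term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptySortedSet() : hits;
    }

    public static void main(String[] args) {
        FieldedIndexSketch index = new FieldedIndexSketch();
        Map<String, String> book = new HashMap<>();
        book.put("title", "Lucene in Action");
        book.put("author", "McCandless Hatcher Gospodnetic");
        index.addDocument(book);
        index.addDocument(Collections.singletonMap("body", "lucene is a search library"));
        System.out.println(index.search("title", "lucene"));
        System.out.println(index.search("body", "lucene"));
    }
}
```

The same term in different fields lands in different posting lists, which is why title:lucene and body:lucene can return different documents.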

Fields
▪ Fields may
  ▪ Be indexed or not
    ▪ Indexed fields may or may not be analyzed (i.e., tokenized with an Analyzer)
    ▪ Non-analyzed fields view the entire value as a single token (useful for URLs, paths, dates, social security numbers, ...)
  ▪ Be stored or not
    ▪ Useful for fields that you’d like to display to users

Field construction
Lots of different constructors

    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;

    Field(String name, String value, FieldType type);

value can also be specified with a Reader, a TokenStream, or a byte[].
FieldType specifies field properties.
Can also directly use sub-classes like TextField, StringField, …

Using Field properties

    Index          Store   Example usage
    NOT_ANALYZED   YES     Identifiers, telephone/SSNs, URLs, dates, ...
    ANALYZED       YES     Title, abstract
    ANALYZED       NO      Body
    NO             YES     Document type, DB keys (if not used for searching)
    NOT_ANALYZED   NO      Hidden keywords

Analyzer
▪ Tokenizes the input text
▪ Common Analyzers
  ▪ WhitespaceAnalyzer: splits tokens on whitespace
  ▪ SimpleAnalyzer: splits tokens on non-letters, and then lowercases
  ▪ StopAnalyzer: same as SimpleAnalyzer, but also removes stop words
  ▪ StandardAnalyzer: most sophisticated analyzer that knows about certain token types, lowercases, removes stop words, ...

Analysis example
▪ “The quick brown fox jumped over the lazy dog”
▪ WhitespaceAnalyzer
  ▪ [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
▪ SimpleAnalyzer
  ▪ [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
▪ StopAnalyzer
  ▪ [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
▪ StandardAnalyzer
  ▪ [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
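
The first three behaviours in the example above can be imitated in a few lines of plain Java. This is a rough sketch of the splitting rules as the slides describe them, not Lucene's analyzers: the class AnalyzerSketch is hypothetical, and its STOP_WORDS set is a tiny made-up list rather than Lucene's real stop set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: mimics the described behaviour of three analyzers.
final class AnalyzerSketch {
    // Hypothetical stop list for this sketch; Lucene's own list differs.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Like WhitespaceAnalyzer: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Like SimpleAnalyzer: split on non-letters, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Like StopAnalyzer: the simple rules plus stop word removal.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dog";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(stop(text));
    }
}
```

StandardAnalyzer is left out on purpose, since its grammar-based tokenization is not something a regular expression split reproduces faithfully.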

Another analysis example
▪ “XY&Z Corporation – [email protected]”
▪ WhitespaceAnalyzer
  ▪ [XY&Z] [Corporation] [-] [[email protected]]
▪ SimpleAnalyzer
  ▪ [xy] [z] [corporation] [xyz] [example] [com]
▪ StopAnalyzer
  ▪ [xy] [z] [corporation] [xyz] [example] [com]
▪ StandardAnalyzer
  ▪ [xy&z] [corporation] [[email protected]]

What’s inside an Analyzer?
▪ Analyzers need to return a TokenStream

    public TokenStream tokenStream(String fieldName, Reader reader)

[Figure: Tokenizer and TokenFilter are both kinds of TokenStream. An analysis chain reads Reader → Tokenizer → TokenFilter → TokenFilter.]

Tokenizers and TokenFilters
▪ Tokenizer
  ▪ WhitespaceTokenizer
  ▪ KeywordTokenizer
  ▪ LetterTokenizer
  ▪ StandardTokenizer
  ▪ ...
▪ TokenFilter
  ▪ LowerCaseFilter
  ▪ StopFilter
  ▪ PorterStemFilter
  ▪ ASCIIFoldingFilter
  ▪ StandardFilter
  ▪ ...

Adding/deleting Documents to/from an IndexWriter

    void addDocument(Iterable<IndexableField> d);

IndexWriter’s Analyzer is used to analyze the document.
Important: make sure the Analyzers used at indexing time are consistent with the Analyzers used at search time.

    // Deletes docs containing terms or matching queries.
    // The term version is useful for deleting one document.
    void deleteDocuments(Term... terms);
    void deleteDocuments(Query... queries);
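
The warning above about keeping index-time and search-time Analyzers consistent is easy to demonstrate without Lucene. In this sketch (all names are hypothetical), documents are indexed with a lowercasing analysis; a query analyzed the same way matches, while a query analyzed without lowercasing silently misses.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: shows why the same analysis must be applied on both sides.
final class AnalyzerMismatchSketch {

    // Analysis A: split on whitespace and lowercase.
    static Set<String> lowercasing(String text) {
        Set<String> terms = new HashSet<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Analysis B: split on whitespace only, case preserved.
    static Set<String> caseSensitive(String text) {
        Set<String> terms = new HashSet<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // A query matches if at least one of its terms appears among the indexed terms.
    static boolean matches(Set<String> indexedTerms, Set<String> queryTerms) {
        return !Collections.disjoint(indexedTerms, queryTerms);
    }

    public static void main(String[] args) {
        Set<String> indexed = lowercasing("Lucene in Action");
        System.out.println(matches(indexed, lowercasing("Lucene")));    // same analysis on both sides
        System.out.println(matches(indexed, caseSensitive("Lucene")));  // mismatched analysis
    }
}
```

The index holds the term "lucene", so a query term "Lucene" that skipped lowercasing has nothing to match against.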

Index format
▪ Each Lucene index consists of one or more segments
  ▪ A segment is a standalone index for a subset of documents
  ▪ All segments are searched
  ▪ A segment is created whenever IndexWriter flushes adds/deletes
▪ Periodically, IndexWriter will merge a set of segments into a single segment
  ▪ Policy specified by a MergePolicy
▪ You can explicitly invoke forceMerge() to merge segments

Basic merge policy
▪ Segments are grouped into levels
▪ Segments within a level are roughly equal size (in log space)
▪ Once a level has enough segments, they are merged into a segment at the next level up
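
The level-based policy above behaves much like counting in base mergeFactor. The following simulation is a simplified model under stated assumptions, not Lucene's MergePolicy code: every flush is assumed to produce one level-0 segment, a level is assumed to be "full" at exactly mergeFactor segments, and the class name LogMergeSketch is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: tracks how many segments sit at each level.
final class LogMergeSketch {
    private final int mergeFactor;
    // segmentsPerLevel.get(i) is the number of segments currently at level i.
    private final List<Integer> segmentsPerLevel = new ArrayList<>();

    LogMergeSketch(int mergeFactor) {
        this.mergeFactor = mergeFactor;
    }

    // Each flush adds one new segment at the lowest level.
    void flush() {
        addSegment(0);
    }

    private void addSegment(int level) {
        while (segmentsPerLevel.size() <= level) {
            segmentsPerLevel.add(0);
        }
        int count = segmentsPerLevel.get(level) + 1;
        if (count == mergeFactor) {
            // Level is full: merge its segments into one segment a level up.
            segmentsPerLevel.set(level, 0);
            addSegment(level + 1);
        } else {
            segmentsPerLevel.set(level, count);
        }
    }

    List<Integer> levels() {
        return new ArrayList<>(segmentsPerLevel);
    }

    int segmentCount() {
        int total = 0;
        for (int n : segmentsPerLevel) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        LogMergeSketch policy = new LogMergeSketch(3);  // hypothetical merge factor
        for (int i = 0; i < 10; i++) {
            policy.flush();
        }
        System.out.println(policy.levels() + " -> " + policy.segmentCount() + " segments");
    }
}
```

Under this model the number of live segments grows only logarithmically with the number of flushes, which keeps the per-query cost of searching all segments bounded.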

Searching a changing index

    Directory dir = FSDirectory.open(...);
    DirectoryReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);

The reader above does not reflect changes to the index unless you reopen it. Reopening is more resource efficient than opening a brand new reader.

    // openIfChanged returns null when the index has not changed
    DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
    if (newReader != null) {
        reader.close();
        reader = newReader;
        searcher = new IndexSearcher(reader);
    }

Near-real-time search

    IndexWriter writer = ...;
    DirectoryReader reader = DirectoryReader.open(writer, true);
    IndexSearcher searcher = new IndexSearcher(reader);

    // Now let us say there’s a change to the index using writer
    writer.addDocument(newDoc);

    DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);
    if (newReader != null) {
        reader.close();
        reader = newReader;
        searcher = new IndexSearcher(reader);
    }

QueryParser
▪ Constructor
  ▪ QueryParser(String defaultField, Analyzer analyzer);
▪ Parsing methods
  ▪ Query parse(String query) throws ParseException;
  ▪ ... and many more

QueryParser syntax examples

    Query expression                     Document matches if…
    java                                 Contains the term java in the default field
    java junit                           Contains the term java or junit or both in the default field
    java OR junit                          (the default operator can be changed to AND)
    +java +junit                         Contains both java and junit in the default field
    java AND junit
    title:ant                            Contains the term ant in the title field
    title:extreme -subject:sports        Contains extreme in the title and not sports in subject
    (agile OR extreme) AND java          Boolean expression matches
    title:"junit in action"              Phrase matches in title
    title:"junit action"~5               Proximity matches (within 5) in title
    java*                                Wildcard matches
    java~                                Fuzzy matches
    lastmodified:[1/1/09 TO 12/31/09]    Range matches

IndexSearcher
▪ Methods
  ▪ TopDocs search(Query q, int n);
  ▪ Document doc(int docID);

TopDocs and ScoreDoc
▪ TopDocs methods
  ▪ totalHits: number of documents that matched the search
  ▪ scoreDocs: array of ScoreDoc instances containing results
  ▪ getMaxScore(): returns best score of all matches
▪ ScoreDoc methods
  ▪ doc: document id
  ▪ score: document score

Scoring
▪ Original scoring function uses basic tf-idf scoring with
  ▪ Programmable boost values for certain fields in documents
  ▪ Length normalization
  ▪ Boosts for documents containing more of the query terms
▪ IndexSearcher provides an explain() method that explains the scoring of a document
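
The ingredients listed above (tf-idf, field boosts, length normalization, a bonus for matching more query terms) can be combined in a small sketch. The slides do not give the formulas, so the particular choices here are assumptions: tf as the square root of the term frequency, idf as 1 + ln(N / (df + 1)), length normalization as 1 / sqrt(field length), and the bonus as the fraction of query terms matched. The class TfIdfSketch is hypothetical and omits query normalization.

```java
// Illustrative only: a simplified tf-idf score under the assumptions stated above.
final class TfIdfSketch {

    // Dampened term frequency.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Rarer terms (small docFreq relative to numDocs) get a larger weight.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Shorter fields score higher for the same number of occurrences.
    static double lengthNorm(int fieldLength) {
        return 1.0 / Math.sqrt(fieldLength);
    }

    // Contribution of one query term found in one field of one document.
    static double termScore(int freq, int docFreq, int numDocs, int fieldLength, double boost) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * boost * lengthNorm(fieldLength);
    }

    // Rewards documents that contain more of the query terms.
    static double coord(int matchedTerms, int totalQueryTerms) {
        return (double) matchedTerms / totalQueryTerms;
    }

    public static void main(String[] args) {
        // Hypothetical statistics: 1000 docs, term in 10 of them, 4 occurrences in a 100-term field.
        double plain = termScore(4, 10, 1000, 100, 1.0);
        double boosted = termScore(4, 10, 1000, 100, 2.0);
        System.out.println("plain=" + plain + " boosted=" + boosted
                + " with coord=" + coord(1, 2) * plain);
    }
}
```

On a real index, the explain() method mentioned above is the way to see which factors Lucene actually applied to a given hit.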

Lucene 5.0 Scoring
▪ As well as traditional tf.idf vector space model, Lucene 5.0 has:
  ▪ BM25
  ▪ dfr (divergence from randomness)
  ▪ ib (information (theory)-based similarity)

    indexSearcher.setSimilarity(new BM25Similarity());

    // The constructor takes float arguments, so the literals need the f suffix
    BM25Similarity custom = new BM25Similarity(1.2f, 0.75f);  // k1, b
    indexSearcher.setSimilarity(custom);
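
To show what the k1 and b parameters above control, here is the textbook BM25 term weight in plain Java. It is a sketch of the standard formula, not a copy of BM25Similarity: the idf variant ln(1 + (N - n + 0.5) / (n + 0.5)) is an assumption, the class Bm25Sketch is hypothetical, and Lucene's implementation differs in details such as how document lengths are stored.

```java
// Illustrative only: textbook BM25 weight for one query term in one document.
final class Bm25Sketch {

    // Inverse document frequency; always positive with this variant.
    static double idf(long docFreq, long numDocs) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // k1 controls term frequency saturation; b controls length normalization (0 = off, 1 = full).
    static double score(double freq, long docFreq, long numDocs,
                        double docLength, double avgDocLength, double k1, double b) {
        double lengthFactor = 1.0 - b + b * (docLength / avgDocLength);
        double tfPart = (freq * (k1 + 1.0)) / (freq + k1 * lengthFactor);
        return idf(docFreq, numDocs) * tfPart;
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        double b = 0.75;
        // Hypothetical statistics: 1000 docs, term in 10 of them, average length 100 terms.
        for (int freq : new int[] {1, 2, 5, 50}) {
            System.out.println("freq=" + freq + " score="
                    + score(freq, 10, 1000, 100, 100, k1, b));
        }
    }
}
```

Unlike the square-root tf of classic tf-idf, the tf part here saturates: however often the term occurs, it never exceeds k1 + 1 times the idf.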