Top Banner
The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens
14

The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

Dec 19, 2015

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

The Lucene Search Engine

Kira Radinsky

Based on the material from: Thomas Paul and Steven J. Owens

Page 2: The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

What is Lucene?

• Doug Cutting’s grandmother’s middle name• A open source set of Java Classses– Search Engine/Document Classifier/Indexer– Developed by Doug Cutting (1996)

• Xerox/Apple/Excite/Nutch/Yahoo/Cloudera• Hadoop founder, Board of directors of the Apache Software

• Jakarta Apache Product. Strong open source community support.

• High-performance, full-featured text search engine library

• Easy to use yet powerful API

Page 3: The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

Use the Source, Luke

• Document • Field

– Represents a section of a Document: name for the section + the actual data.• Analyzer

– Abstract class (to provide interface)– Document -> tokens (for later indexing)– StandardAnalyzer class.

• IndexWriter – Creates and maintains indexes.

• IndexSearcher – Searches through an index.

• QueryParser – Builds a parser that can search through an index.

• Query – Abstract class that contains the search criteria created by the QueryParser.

• Hits – Contains the Document objects that are returned by running the Query object against the

index.

Page 4: The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

Indexing a Document

Page 5: The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

Document from an articleprivate Document createDocument(String article, String

author, String title, String topic, String url,

Date dateWritten) {

Document document = new Document(); document.add(Field.Text("author", author)); document.add(Field.Text("title", title)); document.add(Field.Text("topic", topic)); document.add(Field.UnIndexed("url", url)); document.add(Field.Keyword("date", dateWritten)); document.add(Field.UnStored("article", article)); return document; }

Page 6: The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

The Field Object

Factory Method Tokenized Indexed Stored Use forField.Text(String name, String value) Yes Yes Yes contents you want

storedField.Text(String name, Reader value) Yes Yes No contents you don't

want storedField.Keyword(String name, String value) No Yes Yes values you don't want

broken downField.UnIndexed(String name, String value) No No Yes values you don't want

indexedField.UnStored(String name, String value) Yes Yes No values you don't want

stored

Page 7: The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

Store a Document in the indexString indexDirectory = "lucene-index";private void indexDocument(Document document)

throws Exception

{Analyzer analyzer = new StandardAnalyzer();IndexWriter writer = new IndexWriter(

indexDirectory, analyzer, false);

writer.addDocument(document);writer.optimize();writer.close();

}

Page 8: The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

Analyzers and TokenizersSimpleAnalyzer SimpleAnalyzer seems to just use a Tokenizer that converts all

of the input to lower case.

StopAnalyzer StopAnalyzer includes the lower-case filter, and also has a filter that drops out any "stop words", words like articles (a, an, the, etc) that occur so commonly in english that they might as well be noise for searching purposes. StopAnalyzer comes with a set of stop words, but you can instantiate it with your own array of stop words.

StandardAnalyzer StandardAnalyzer does both lower-case and stop-word filtering, and in addition tries to do some basic clean-up of words, for example taking out apostrophes ( ' ) and removing periods from acronyms (i.e. "T.L.A." becomes "TLA").

Lucene Sandbox Here you can find analyzers in your own language

Page 9: The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

Adding to an Indexpublic void indexArticle(

String article, String author,String title, String topic,String url, Date dateWritten)

throws Exception { Document document = createDocument

(article, author, title, topic, url, dateWritten);

indexDocument(document);}

Page 10: The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

Searching the Index

Page 11: The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

Searching

IndexSearcher is = new IndexSearcher(indexDirectory);

Analyzer analyzer = new StandardAnalyzer();QueryParser parser = new QueryParser("article",

analyzer);Query query = parser.parse(searchCriteria);Hits hits = is.search(query);

Page 12: The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

Extracting Document objects

for (int i=0; i<hits.length(); i++) {

Document doc = hits.doc(i); // display the articles that were found to the user

}

Page 13: The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

Search Criteria

Supports several searches: AND OR and NOT, fuzzy, proximity searches, wildcard searches, and range searches– author:Henry relativity AND "quantum physics“– "string theory" NOT Einstein– "Galileo Kepler"~5– author:Johnson date:[01/01/2004 TO 01/31/2004]

Page 14: The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.

Thread Safety

• Indexing and searching are not only thread safe, but process safe. What this means is that:– Multiple index searchers can read the lucene index files

at the same time.– An index writer or reader can edit the lucene index files

while searches are ongoing– Multiple index writers or readers can try to edit the

lucene index files at the same time (it's important for the index writer/reader to be closed so it will release the file lock).

• The query parser is not thread safe, • The index writer however, is thread safe,