EPL 660: Lab 1
General Info, Exercise 1, B-Trees, Apache Lucene
Andreas Kamilaris
University of Cyprus, Department of Computer Science
Created by Andreas Kamilaris for EPL660

Research on the Web of Things

General info
• Every Friday 18:00-19:30.
• Check the course Web site for the schedule.
• Lab content: exercises, general questions, tutorials, tool demonstrations.
• Exercise deadlines: 23:59 on the delivery day.
• Email submission: [email protected]

Tutorials info
• Review of tools for Information Retrieval.
• Every lab session introduces a tool.
• A variety of libraries and tools:
  – Apache Lucene
  – Apache Solr
  – Apache Tika
  – Hadoop
  – Nutch

Program info
Date    Topic                     Description
28/01   Apache Lucene             Introduction to Apache Lucene; background information for B-Trees
4/02    Apache Lucene             Getting started with Apache Lucene; demonstration of a simple scenario
11/02   Apache Solr               Introduction to Apache Solr; demonstration of a simple scenario
18/02   Apache Tika               Introduction to Apache Tika; demonstration of a simple scenario
25/02   No Tutorial               Absence of Assistant
4/03    Hadoop                    Background information about MapReduce; introduction to Hadoop
11/03   Hadoop                    Getting started with Hadoop; demonstration of a simple scenario
18/03   Nutch                     Background information about crawling; introduction to Nutch
25/03   No Tutorial               Public Holiday
01/04   No Tutorial               Public Holiday
8/04    Nutch                     Getting started with Nutch
15/04   Projects' Presentations   Presentation of the students' final project

1st Programming Exercise
• Create a doc-based inverted index.
• Records have the format:
    term | Frequency | Positional Posting List
• Include stemming using the Porter Stemmer algorithm.
• Include detection of stop-words.
• Search terms using B-Trees.
• The B-Tree must be of order 4.
• Add skip pointers to the inverted index for performance reasons.
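
To make the record format and the skip pointers concrete, here is a minimal Python sketch, not a reference solution. It assumes that "Frequency" means the number of documents containing the term, that positions are token offsets within each document, and that skip pointers are placed every √n entries of a posting list; the stop-word set and all function names are illustrative. Porter stemming is left out to keep the sketch short.

```python
import math
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}  # illustrative only


def build_index(docs):
    """Map term -> {doc_id: [positions]}, skipping stop words."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for pos, token in enumerate(tokens):
            if token not in STOP_WORDS:
                index[token].setdefault(doc_id, []).append(pos)
    return index


def record(index, term):
    """Return (term, frequency, positional posting list) for one term.

    Frequency is taken to be the document frequency (an assumption).
    """
    postings = sorted(index.get(term, {}).items())
    return term, len(postings), postings


def intersect_with_skips(p1, p2):
    """Intersect two sorted doc-id lists using evenly spaced skip pointers."""
    s1 = max(1, math.isqrt(len(p1)))  # skip length for p1
    s2 = max(1, math.isqrt(len(p2)))  # skip length for p2
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # A skip pointer lives at positions 0, s1, 2*s1, ...; follow it
            # only if its target does not overshoot the other list's entry.
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer
```

For example, `record(build_index({1: "The star wars saga"}), "star")` yields the term, the number of documents containing it, and a list of `(doc_id, positions)` pairs.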

1st Programming Exercise
• Deadline is 8th February 2011.
• You need to include:
  – Source code with comments.
  – Executable files.
  – Brief documentation.
• E-mail submission including a zip attachment.

Introduction to B-Trees
• A B-Tree of order m is an m-way tree (a tree where each node may have up to m children) in which:
  1. the number of keys in each non-leaf node is one less than the number of its children, and these keys partition the keys in the children in the fashion of a search tree.
  2. all leaves are on the same level.
  3. all non-leaf nodes except the root have at least ⌈m / 2⌉ children.
  4. the root is either a leaf node, or it has from two to m children.
  5. a leaf node contains no more than m - 1 keys.
• B-trees are always balanced!
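
The five conditions can be read as a checklist. The sketch below is one possible way to test them in Python, assuming a node is represented as a `(keys, children)` pair with an empty `children` list for a leaf; this representation and the helper names `leaf`, `check` and `is_btree` are illustrative choices, not part of the lab material.

```python
import math


def leaf(*keys):
    """Build a leaf node: a list of keys and no children."""
    return (list(keys), [])


def check(node, m, lo=-math.inf, hi=math.inf, is_root=True):
    """Return the leaf depth below `node`; raise ValueError on a violation."""
    keys, children = node
    if keys != sorted(keys) or not all(lo < k < hi for k in keys):
        raise ValueError(f"keys {keys} break the search-tree ordering")
    if not children:
        if len(keys) > m - 1:                        # condition 5
            raise ValueError(f"leaf {keys} holds more than {m - 1} keys")
        return 0
    if len(keys) != len(children) - 1:               # condition 1 (key count)
        raise ValueError(f"node {keys} has the wrong number of children")
    fewest = 2 if is_root else math.ceil(m / 2)      # conditions 4 and 3
    if not fewest <= len(children) <= m:
        raise ValueError(f"node {keys} has {len(children)} children")
    bounds = [lo, *keys, hi]                         # condition 1 (partition)
    depths = {check(child, m, bounds[i], bounds[i + 1], False)
              for i, child in enumerate(children)}
    if len(depths) != 1:                             # condition 2
        raise ValueError("leaves are not all on the same level")
    return depths.pop() + 1


def is_btree(root, m):
    """True if the tree rooted at `root` satisfies all five conditions."""
    try:
        check(root, m)
        return True
    except ValueError:
        return False
```

With this representation, `is_btree(([8], [leaf(1, 2), leaf(12, 25)]), 5)` holds, while a tree whose leaves sit on different levels is rejected.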

Why use B-Trees
• It is difficult to access large amounts of data in secondary memory efficiently.
• Many algorithms have been introduced to make search faster by optimizing access to the required data in secondary memory.
• B-Trees keep the number of secondary-memory accesses per search small, which makes them effective and fast.
• B-Trees are used in many database management systems.

An example B-Tree
A B-tree of order 5 containing 26 items:

                          [26]
          [6 12]                          [42 51 62]
[1 2 4] [7 8] [13 15 18 25]    [27 29] [45 46 48] [53 55 60] [64 70 90]

Note that all the leaves are at the same level.

Searching a B-Tree
Search for the item 48:

                          [26]
          [6 12]                          [42 51 62]
[1 2 4] [7 8] [13 15 18 25]    [27 29] [45 46 48] [53 55 60] [64 70 90]

48 > 26, so follow the right child of the root; 42 < 48 < 51, so follow the second child of [42 51 62]; 48 is found in the leaf [45 46 48].
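
The search can be sketched in a few lines of Python. This is a minimal illustration that assumes each node is a `(keys, children)` pair with an empty `children` list for a leaf; the names `leaf`, `EXAMPLE` and `search` are made up for the sketch, and the tree literal encodes the example tree shown on the slide.

```python
import bisect


def leaf(*keys):
    """Build a leaf node: a list of keys and no children."""
    return (list(keys), [])


# The example tree from the slide, level by level.
EXAMPLE = ([26], [
    ([6, 12], [leaf(1, 2, 4), leaf(7, 8), leaf(13, 15, 18, 25)]),
    ([42, 51, 62], [leaf(27, 29), leaf(45, 46, 48),
                    leaf(53, 55, 60), leaf(64, 70, 90)]),
])


def search(node, key):
    """Return the key lists of the nodes visited, or None if key is absent."""
    path = []
    while True:
        keys, children = node
        path.append(keys)
        i = bisect.bisect_left(keys, key)   # first position with keys[i] >= key
        if i < len(keys) and keys[i] == key:
            return path
        if not children:                    # reached a leaf without a match
            return None
        node = children[i]                  # descend into the i-th subtree
```

Calling `search(EXAMPLE, 48)` visits one node per level, so the work done is bounded by the height of the tree.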

Constructing a B-Tree
• Suppose we start with an empty B-tree and keys arrive in the following order:
    1 12 8 2 25 6 14 28 17 7 52 16 48 68 3 26 29 53 55 45
• We want to construct a B-tree of order 5.
• The first four items go into the root:

    [1 2 8 12]

• Putting the fifth item in the root would violate condition 5.
• Therefore, when 25 arrives, pick the middle key to make a new root.

Constructing a B-Tree

          [8]
    [1 2]     [12 25]

6, 14, 28 get added to the leaf nodes:

            [8]
    [1 2 6]     [12 14 25 28]

Constructing a B-Tree
Adding 17 to the right leaf node would over-fill it, so we take the middle key, promote it (to the root) and split the leaf:

              [8 17]
    [1 2 6]  [12 14]  [25 28]

7, 52, 16, 48 get added to the leaf nodes:

                   [8 17]
    [1 2 6 7]  [12 14 16]  [25 28 48 52]

Constructing a B-Tree
Adding 68 causes us to split the rightmost leaf, promoting 48 to the root, and adding 3 causes us to split the leftmost leaf, promoting 3 to the root; 26, 29, 53, 55 then go into the leaves:

                        [3 8 17 48]
    [1 2]  [6 7]  [12 14 16]  [25 26 28 29]  [52 53 55 68]

Adding 45 causes a split of [25 26 28 29], and promoting 28 to the root then causes the root to split.

Constructing a B-Tree

                          [17]
              [3 8]                 [28 48]
    [1 2]  [6 7]  [12 14 16]    [25 26]  [29 45]  [52 53 55 68]

Guidelines for constructing a B-Tree
1. Attempt to insert the new key into a leaf by searching for the proper position.
2. If the leaf is not full, then insert the key and you are done.
3. If this would result in that leaf becoming too big, split the leaf into two, promoting the middle key to the leaf's parent.
4. If this would result in the parent becoming too big, split the parent into two, promoting the middle key.
5. This strategy might have to be repeated all the way to the top.
6. If necessary, the root is split in two and the middle key is promoted to a new root, making the tree one level higher.
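
The guidelines above can be sketched as a short Python class. This is a minimal illustration rather than a complete B-tree: it assumes distinct keys, supports insertion only, and the names `Node`, `BTree` and `levels` are chosen for the sketch. The usage note replays the order-5 worked example with the keys that appear in its diagrams.

```python
import bisect


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []      # stays empty for a leaf


class BTree:
    def __init__(self, order):
        self.order = order                  # maximum number of children
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:               # guideline 6: grow a new root
            middle, right = split
            self.root = Node([middle], [self.root, right])

    def _insert(self, node, key):
        i = bisect.bisect_left(node.keys, key)
        if not node.children:               # guidelines 1 and 2: add to leaf
            node.keys.insert(i, key)
        else:
            split = self._insert(node.children[i], key)
            if split is not None:           # a child split: adopt its middle key
                middle, right = split
                node.keys.insert(i, middle)
                node.children.insert(i + 1, right)
        if len(node.keys) > self.order - 1: # guidelines 3 to 5: too big, split
            return self._split(node)
        return None

    @staticmethod
    def _split(node):
        mid = len(node.keys) // 2
        middle = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return middle, right


def levels(tree):
    """Return the keys of every node, grouped level by level from the root."""
    out, level = [], [tree.root]
    while level:
        out.append([n.keys for n in level])
        level = [c for n in level for c in n.children]
    return out
```

Inserting 1 12 8 2 25 6 14 28 17 7 52 16 48 68 3 26 29 53 55 45 into `BTree(5)` and calling `levels` reproduces the final three-level tree of the worked example, with 17 alone in the root.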

Time complexity of a B-Tree
• Search/Insert/Delete each work along a single path from the root to a leaf.
• The number of nodes visited is therefore proportional to the height of the tree.
• The height of the tree is O(log n), where n is the number of items in the B-Tree.
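
A rough justification of the logarithmic height, given as a sketch under stated assumptions: h counts the edges from the root to a leaf (h ≥ 1), t = ⌈m/2⌉ with m ≥ 3, and every leaf holds at least one key (the five conditions do not state a minimum for leaves, so this last point is an assumption).

```latex
% The root has at least 2 children; every other non-leaf node has at least t.
\begin{align*}
\#\text{leaves} &\ge 2\,t^{\,h-1}, \qquad t = \lceil m/2 \rceil \\
n &\ge \#\text{leaves} \ge 2\,t^{\,h-1} \\
h &\le 1 + \log_t \frac{n}{2} = O(\log n)
\end{align*}
```

For the order-5 example with n = 26 items, t = 3 and the bound gives h ≤ 1 + log₃ 13, about 3.3; the example tree has h = 2.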

Tutorial 1: Apache Lucene Overview

What is Apache Lucene?
“Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.”
- from http://lucene.apache.org/

What is Apache Lucene?
• Lucene is specifically an API, not an application.
• Hard parts have been done, easy programming has been left to you.
• You can build a search application that is specifically suited to your needs.
• You can use Lucene to provide consistent full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, emails and so on).

Availability
• Freely available (no cost)
• Open Source
  – Apache License, version 2.0
    • http://www.apache.org/licenses/LICENSE-2.0
  – Download from:
    • http://www.apache.org/dyn/closer.cgi/lucene/java/

Features
• Ranked Searching
• Flexible Queries
  – Phrases, Wildcards, etc.
• Field-specific Queries
  – e.g. title, artist, album
• Sorting

Ranked Searching
1. Phrase Matching
2. Keyword Matching
  – Prefer more unique terms first
    • Lucene takes into account the uniqueness of each term when determining a document's relevance score.
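
One common way to express "uniqueness" is inverse document frequency (IDF): a term found in few documents gets a larger weight than one found in many. The Python sketch below illustrates that idea only; it is not Lucene's actual scoring formula, and the sample documents and function names are made up for the example.

```python
import math


def idf(term, docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    df = sum(1 for words in docs if term in words)
    return math.log(len(docs) / df) if df else 0.0


def score(query_terms, words, docs):
    """Sum the IDF weights of the query terms present in one document."""
    return sum(idf(t, docs) for t in query_terms if t in words)


# Each document is reduced to its set of terms (illustrative data).
DOCS = [
    {"star", "wars", "film"},
    {"star", "trek", "film"},
    {"film", "review"},
]

query = ["star", "wars"]
ranking = sorted(range(len(DOCS)),
                 key=lambda i: score(query, DOCS[i], DOCS),
                 reverse=True)
```

Here "film" occurs in every document and contributes nothing, while "wars" occurs in only one and dominates the score of the document that contains it.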

Flexible Queries
• Phrases
    "star wars"
• Wildcards
    star*
    Bra?il
• Ranges
    {star TO stun}
    [2006 TO 2007]
• Boolean Operators
    star AND wars
This is just a small subset of the query types that Lucene supports. Some query types, such as wildcard and range queries, can put a heavy load on the machine running Lucene, so Lucene makes it easy to disable certain query types while allowing all others to proceed through the system. This gives programmers better control and makes system performance more predictable.
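
To show what wildcard and range queries select, the following Python sketch applies them to a small sorted term list. It only mimics the matching semantics (`*` for any run of characters, `?` for one character, curly braces for exclusive bounds, square brackets for inclusive bounds); it does not use or reproduce Lucene's query parser, and the term list and function names are illustrative. Scanning every term, as done here, also hints at why such queries can be expensive.

```python
from fnmatch import fnmatchcase

TERMS = ["brasil", "brazil", "star", "stars", "start", "stun", "wars"]


def wildcard(pattern, terms=TERMS):
    """Terms matching a pattern where * is any run and ? is one character."""
    return [t for t in terms if fnmatchcase(t, pattern)]


def term_range(low, high, inclusive, terms=TERMS):
    """Terms between low and high in lexicographic order."""
    if inclusive:
        return [t for t in terms if low <= t <= high]
    return [t for t in terms if low < t < high]
```

For instance, `wildcard("bra?il")` matches both spellings of the country name, and `term_range("star", "stun", inclusive=False)` leaves out the two bounds themselves.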

Field-specific Queries
• For example:
    title:"star wars" AND director:"George Lucas"
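
The effect of such a query can be imitated over plain dictionaries. This Python sketch treats each document as a mapping from field name to text and keeps the documents in which every named field contains its phrase; it uses case-insensitive substring matching as a simplification of real phrase matching, and the sample data and the name `field_query` are illustrative.

```python
MOVIES = [
    {"title": "Star Wars", "director": "George Lucas"},
    {"title": "Star Wars: The Clone Wars", "director": "Dave Filoni"},
    {"title": "American Graffiti", "director": "George Lucas"},
]


def field_query(docs, **phrases):
    """Keep documents in which every named field contains its phrase."""
    def matches(doc):
        return all(phrase.lower() in doc.get(field, "").lower()
                   for field, phrase in phrases.items())
    return [doc for doc in docs if matches(doc)]


hits = field_query(MOVIES, title="star wars", director="George Lucas")
```

Both conditions must hold, which corresponds to the AND in the query above; only the first sample document satisfies both.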

Sorting
• Can sort by any field in a Document
  – For example, by Price, Release Date, Amazon Sales Rank, etc.
• By default, Lucene will sort results by their relevance score. Sorting by any other field in a Document is also supported.

Documents
• A document can represent anything textual:
  – Word Document
  – DVD (the textual metadata only)
  – Website Member (name, ID, etc.)
• A Lucene Document need not refer to an actual file on a disk; it could also represent a row in a relational database.
• Each developer is responsible for turning their own data sets into Lucene Documents.
• Lucene comes with a number of 3rd party contributions, including examples for parsing structured data files such as XML documents and Word files.

Indexes
• Lucene employs inverted indexing (like most full-text-based search engines).
• Indexes track term frequencies.
• Every term maps back to the Documents that contain it.
• This index is what allows Lucene to quickly locate every document currently associated with a given set of input search terms.

Basic Indexing
An index consists of one or more Lucene documents.
1. Create a document:
  – A document consists of one or more fields, each a name-value pair. Example: a field commonly found in applications is title. In the case of a title field, the field name is title and the value is the title of that item.
  – Add one or more fields to the document.
2. Add the document to an index:
  – Indexing involves adding documents to an IndexWriter.
3. The indexer will analyze the document:
  – We can provide specialized analyzers such as StandardAnalyzer.

Analyzing
• Analyzers control how the text is broken into terms, which are then used to index the document.
• Analyzers can be used to remove stop words and to perform stemming.
• Lucene comes with a default analyzer which works well for unstructured English text; however, it often performs incorrect normalizations on non-English texts.
• Lucene makes it easy to build custom Analyzers, and provides a number of helpful building blocks with which to build your own.
• Lucene even includes a number of stemming algorithms for various languages, which can improve document retrieval accuracy when the source language is known at indexing time.
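
What an analyzer does can be imitated in a few lines of Python: split the text into tokens, lowercase them, drop stop words, and reduce each token to a stem. The sketch below is a toy, not a Lucene Analyzer; in particular the suffix stripping is a crude stand-in and is not the Porter algorithm, and the stop-word list and names are illustrative.

```python
import re

STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})
SUFFIXES = ("ing", "ed", "es", "s")     # crude; NOT the Porter algorithm


def strip_suffix(token):
    """Remove the first listed suffix that leaves at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, lowercase, drop stop words, then strip suffixes."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [strip_suffix(t) for t in tokens if t not in STOP_WORDS]
```

Applied to "The indexed documents and the searching of terms", the sketch keeps four terms and maps "indexed" and "searching" to "index" and "search". The same analysis must be applied to queries, otherwise query terms would not match the indexed terms.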

Basic Searching
Searching requires an index to have already been built.
1. Create a Query:
  • Usually via QueryParser, which parses user input, or by constructing a query object such as MultiPhraseQuery directly.
2. Open an Index.
3. Search the Index:
  • E.g. via IndexSearcher.
  • Use an Analyzer (as before).
4. Iterate through returned Documents:
  • Extract out needed results.
  • Extract out result scores (if needed).
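
The index-then-search flow can be mimicked with an in-memory toy in Python. The class below is a stand-in for the steps only and is not the Lucene API: `ToyIndex`, `add_document` and `search` are hypothetical names, documents are plain dictionaries of fields, and the score is simply the number of query-term occurrences.

```python
import re
from collections import Counter, defaultdict


class ToyIndex:
    """In-memory stand-in for the index and search steps (not Lucene)."""

    def __init__(self):
        self.docs = []                          # stored documents (field dicts)
        self.postings = defaultdict(Counter)    # term -> {doc number: count}

    @staticmethod
    def analyze(text):
        """Same analysis at index time and at query time."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def add_document(self, fields):
        doc_no = len(self.docs)
        self.docs.append(fields)
        for value in fields.values():
            for term in self.analyze(value):
                self.postings[term][doc_no] += 1

    def search(self, query, limit=10):
        """Return (document, score) pairs, best score first."""
        scores = Counter()
        for term in self.analyze(query):
            for doc_no, count in self.postings.get(term, {}).items():
                scores[doc_no] += count
        return [(self.docs[n], s) for n, s in scores.most_common(limit)]


index = ToyIndex()
index.add_document({"title": "Star Wars",
                    "body": "A space opera about wars among the stars"})
index.add_document({"title": "Star Trek",
                    "body": "A starship explores space"})
for doc, score in index.search("star wars"):
    print(score, doc["title"])
```

The final loop corresponds to step 4: iterating over the returned documents and reading out the needed fields and scores.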

Lucene as a Web Service
1. Design an HTTP query syntax
  – GET queries
  – XML for results
2. Wrap Tomcat around the core code
  • Tomcat is an open source software implementation of the Java Servlet and JavaServer Pages technologies.
3. Write a Client Library
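
Step 1 can be prototyped before any server is involved. The Python sketch below parses a GET-style URL and renders hits as XML; the URL layout (`/search?q=...&max=...`), the element names and both function names are hypothetical choices for illustration, not something prescribed by Lucene or Tomcat.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlparse


def parse_request(url):
    """Extract the query text and the result limit from a GET URL."""
    params = parse_qs(urlparse(url).query)
    text = params.get("q", [""])[0]
    limit = int(params.get("max", ["10"])[0])
    return text, limit


def render_results(query, hits):
    """Render (title, score) pairs as a small XML document."""
    root = ET.Element("results", query=query)
    for title, score in hits:
        hit = ET.SubElement(root, "hit", score=f"{score:.2f}")
        hit.text = title
    return ET.tostring(root, encoding="unicode")


text, limit = parse_request("http://localhost:8080/search?q=star+wars&max=5")
xml_reply = render_results(text, [("Star Wars", 1.5)])
```

A client library (step 3) would then only need to build such URLs and parse the XML reply.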

Scalability Limits
• 3 main scalability factors:
  – Query Rate
  – Index Size
  – Update Rate

Query Rate Scalability
• Lucene is already fast:
  – Built-in simple cache mechanism
• Easy solution for heavy workloads:
  – Add more query servers behind a load balancer
  – Can grow as your traffic grows

Index Size Scalability
• Can easily handle millions of documents
  – Lucene is very commonly deployed into systems with tens of millions of documents.
• Although query performance can degrade as more documents are added to the index, the growth factor is very low.
• The main limits related to index size that you are likely to run into will be disk capacity and disk I/O limits.
If you need a bigger index:
• Built-in methods allow queries to span multiple remote Lucene indexes
  – Can merge multiple remote indexes at query time.
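
Merging at query time amounts to combining the ranked hit lists returned by each index and keeping the best overall. The Python sketch below shows that step in isolation; it assumes each index returns `(score, doc_id)` pairs and that scores from different indexes are directly comparable, which is an assumption made for the illustration. The function name `merge_hits` is hypothetical.

```python
import heapq
from itertools import chain


def merge_hits(per_index_hits, limit=10):
    """Merge lists of (score, doc_id) pairs and keep the `limit` best."""
    return heapq.nlargest(limit, chain.from_iterable(per_index_hits))


index_a = [(0.9, "a1"), (0.2, "a2")]
index_b = [(0.7, "b1"), (0.4, "b2")]
top = merge_hits([index_a, index_b], limit=3)
```

Only the top few hits of each index need to be transferred, since a document that is not among the best `limit` hits of its own index cannot be among the best `limit` overall.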

Lucene Installation
1. Download the latest version of Lucene (v3.0.3) from: http://www.apache.org/dyn/closer.cgi/lucene/java/
2. Add the files lucene-core-{version}.jar and lucene-demos-{version}.jar to your Java CLASSPATH.
3. Start programming!
4. (Optional step) Go to the Lucene-{version}/src/demo/org/apache/lucene/demo directory and start editing the files IndexFiles.java and SearchFiles.java.

Useful Info
• Official Apache Lucene site: http://lucene.apache.org/java/docs/
• Lucene-java Wiki: http://wiki.apache.org/lucene-java/FrontPage?action=show&redirect=FrontPageEN
• Lucene Intro (java.net): http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html
• Lucene Tutorial.com: http://www.lucenetutorial.com/