GOAT SEARCH Revorg GOAT Search Solution (Powered by Lucene)

Dec 24, 2015

Prosper Hancock
Page 1: GOAT SEARCH Revorg GOAT Search Solution (Powered by Lucene)

GOAT SEARCH
Revorg GOAT Search Solution (Powered by Lucene)

Page 2

About Me
  Grover Fields
  Revorg, LLC (Owner)
  M.S. Information System (Troy University)
  B.S. Industrial Engineering (Florida A&M University)
  Stanford Project Management Courses

Page 3

About Me
  10+ years of development, analysis, and implementation
  10+ years of ColdFusion experience
  2+ years of Java experience
  Commonspot, Strongmail, ClickFix (Developer)
  Email: [email protected]
  Web site: http://www.groverfields.com

Page 4

Agenda
  What?
    What can we do with GOAT?
  Why?
    Why do we want to use GOAT and not Verity?
  How?
    How do we do that?
  Conclusion and alternative solutions

Page 5

What
What is a Search Engine?
  Builds an index on text
  Answers queries using that index, a la Verity
  Already have a database?
  A search engine offers:
    Scalability
    Relevance ranking
    Tweaking
    Integration of different sources (email, web pages, files, DATABASES)

Page 6

What is a search engine? (cont.)
  Works on words, not on substrings
    auto != automatic, automobile
  Indexing process:
    Convert document
    Extract text and metadata
    Normalize text
    Write (inverted) index
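
The indexing steps above can be sketched in a few lines of Python. This is only a toy illustration of an inverted index, not Lucene's implementation; the function names and the sample documents are made up for the example, and the "convert document" step is assumed to have already produced plain text.

```python
import re
from collections import defaultdict

def normalize(text):
    """Lowercase the text and split it into whole-word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each normalized word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in normalize(text):
            index[word].add(doc_id)
    return index

def search(index, word):
    """Look up a single word; only whole (normalized) words match."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical documents, already converted to plain text.
documents = {
    1: "Auto parts for sale",
    2: "Automatic transmission repair",
    3: "Automobile insurance quotes",
}

index = build_inverted_index(documents)
print(search(index, "auto"))       # [1]
print(search(index, "automatic"))  # [2]
```

Because the index is keyed on whole words, a search for "auto" finds only document 1 and does not match "automatic" or "automobile".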

Page 7

Apache Lucene Overview
  Lucene Java 2.4
  A high-performance, full-featured text search engine library written entirely in Java.
  It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
  No GUI
  http://lucene.apache.org

Page 8

Apache Lucene Overview
  Java library for indexing and searching
  No dependencies
  Works with Java 1.4 or later
  Input for indexing: Document objects
    Each document: a set of Fields, each with a field name and field content
  Stores its index as files on disk or in memory
  No document converters
  No web crawler

Page 9

Lucene Java users
  HBCU.info
  LinkedIn
  IBM OmniFind Yahoo! Edition
  Technorati.com
  Eclipse
  Monster.com
  …

Page 10

Lucene Java Summary
  Java library for indexing and searching
  Lightweight / no dependencies
  Powerful and fast and tested!
  No document conversion
  No GUI

Page 11

Why?
  Cost of Enterprise Search Solution
  Need for search speed
  Java projects to work on
  Things to do

Page 12

Verity Limitations
  10,000 documents for ColdFusion Developer Edition
  125,000 documents for ColdFusion Standard Edition
  250,000 documents for ColdFusion Enterprise Edition
  What do developers do in a shared hosting environment?
  Is it possible for the hosting company to limit the number of documents per Web site?

Page 13

T-SQL Limitations?
  Search for "Yahoo" on my blog:

  SELECT entry.id
  FROM tbl_mango_entry AS entry
  INNER JOIN tbl_mango_post AS post ON entry.id = post.id
  WHERE entry.blog_id = 'default'
    AND (entry.title LIKE '%yahoo%'
         OR entry.content LIKE '%yahoo%'
         OR entry.excerpt LIKE '%yahoo%')
    AND post.posted_on <= getdate()
    AND entry.status = 'published'
  ORDER BY post.posted_on DESC

  Now multiply that by 10, 100, 500, or 1,000 users per hour.

Page 14

T-SQL Limitations?
  Full table scan = one thing: PERFORMANCE KILLER!
  No search sorting (no relevance ranking)
  An RDBMS isn't designed to do this, but allows it
  Use the right tools!
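
The full-table-scan point can be checked with a query planner. The slides refer to T-SQL on MSSQL; the sketch below swaps in SQLite, only because it ships with Python and runs anywhere, so the exact plan text will differ from what SQL Server reports. The table, index, and rows are hypothetical.

```python
import sqlite3

def query_plan(conn, sql):
    """Return the planner's description for each step of the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # the 4th column holds the detail text

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, title TEXT, content TEXT)")
conn.execute("CREATE INDEX idx_entry_title ON entry (title)")
conn.executemany(
    "INSERT INTO entry (title, content) VALUES (?, ?)",
    [("Yahoo news", "..."), ("Lucene intro", "..."), ("ColdFusion tips", "...")],
)

# A leading wildcard cannot use the index to narrow the rows it visits.
wildcard_plan = query_plan(conn, "SELECT id FROM entry WHERE title LIKE '%yahoo%'")
# An exact match on the same indexed column can.
exact_plan = query_plan(conn, "SELECT id FROM entry WHERE title = 'Yahoo news'")

print(wildcard_plan)
print(exact_plan)
```

In SQLite the first plan is reported as a scan and the second as an index search. The same reasoning applies to the LIKE '%yahoo%' conditions in the blog query: every row has to be visited on every search.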

Page 15

How? GOAT Search Solution
  Lucene 2.4.0
  ColdFusion MX 8
    MX is fine, but the GUI needs to be rolled back
  Commons IO 1.4
  Simple package of .jar files
  Simple Web-based GUI

Page 16

How?
  Macromedia JDBC Drivers
    Same drivers that ColdFusion uses
    No additional drivers to install
  Supports RDBMS ONLY
    MSSQL
    MySQL
    Oracle
  No file system support (yet)

Page 17

Basics?
  Indexing extracts both meaning and structure from unstructured information by indexing each document.
  The index contains a complete list of all the words used in a given document, along with metadata about that document.
  Lucene creates a collection that normalizes both the structured and unstructured data.
  Search requests then check these collections rather than scanning the actual documents and database fields.
  This provides a faster search of information, regardless of the file type and whether the source is structured or unstructured.

Page 18

Basics?
  Collection
    A special database created by Lucene that contains metadata that describes the documents
  Documents
    A sequence of fields
    Similar to a row in a database table
      Row 1, Row 2, etc.
  Fields
    A named sequence of terms
    Similar to a column in a table
      Primary Key, Column 1
  Terms
    A term is a string
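
The collection, document, field, and term hierarchy can be modeled roughly as follows. This is a sketch of the concepts only; the class names mirror the slide's vocabulary, not Lucene's actual Java classes, and the sample row is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Field:
    """A named sequence of terms, comparable to a column in a table."""
    name: str
    terms: list  # each term is a plain string

@dataclass
class Document:
    """A sequence of fields, comparable to a row in a table."""
    fields: list = field(default_factory=list)

    def get(self, name):
        """Return the terms of the first field with the given name, or None."""
        for f in self.fields:
            if f.name == name:
                return f.terms
        return None

@dataclass
class Collection:
    """Holds the documents that make up one searchable index."""
    documents: list = field(default_factory=list)

# Hypothetical row from a jobs table, mapped to one document.
row = {"id": "42", "title": "ColdFusion developer", "city": "Tallahassee"}
doc = Document([Field(name, value.lower().split()) for name, value in row.items()])
jobs = Collection([doc])

print(doc.get("title"))  # ['coldfusion', 'developer']
```

Each database row becomes one document, each column one field, and each word in a column value one term.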

Page 19

Knowledge?
  Index
    A special database created by Lucene that contains metadata that describes the documents
  Query Syntax
    Similar to Google's advanced search: field:value
    e.g. resume:coldfusion
    http://lucene.apache.org/java/2_4_0/queryparsersyntax.html
  Results
    Primary Key list of values
    XML based on the document
    CFX Tag integration
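
To make the field:value idea concrete, here is a toy lookup that accepts a single field:value clause and returns a list of primary keys, as the Results bullet describes. It is not Lucene's QueryParser, which supports far more syntax; the index layout, function names, and key values are assumptions made for the example.

```python
def parse_query(query):
    """Split a single 'field:value' clause into (field, value)."""
    field_name, sep, value = query.partition(":")
    if not sep or not field_name.strip() or not value.strip():
        raise ValueError("expected a query of the form field:value")
    return field_name.strip().lower(), value.strip().lower()

def search(index, query):
    """Return the sorted primary keys of documents matching the clause."""
    field_name, value = parse_query(query)
    return sorted(index.get(field_name, {}).get(value, set()))

# Hypothetical index: field name -> term -> primary keys of matching rows.
index = {
    "resume": {"coldfusion": {101, 205}, "java": {205, 310}},
    "city": {"atlanta": {101}},
}

print(search(index, "resume:coldfusion"))  # [101, 205]
print(search(index, "resume:java"))        # [205, 310]
```

The caller can then use the returned primary keys to fetch the full rows from the database.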

Page 20

Alternative Solutions for Search
  Commercial vendors:
    FAST, $100k
    Autonomy, $80k
    Google, $50k
  Commercial search engines based on Lucene
    IBM OmniFind Yahoo Edition
  RDBMS with Integrated Search
    Oracle
    MySQL
    MSSQL
    PERFORMANCE KILLERS

Page 21

Road Map
  Road map: a set of guidelines, instructions, or explanations ("wrote an ethics code as a road map for the behavior of elected officials").
  Overhaul Java programming (still novice)
  Integrate with other products
    Aperture
    Nutch
    Solr
  File system integration
    .txt, .pdf, .doc, .ppt, etc.
  Geospatial-based searches
    e.g. all jobs within a 50-mile radius
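
One common way to answer a "within 50 miles" question is to filter by great-circle distance. The sketch below uses the haversine formula; the slides do not say how GOAT plans to implement this, so the approach, the function names, and the sample coordinates (approximate city locations) are all assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, by the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def jobs_within(jobs, lat, lon, radius_miles=50):
    """Return the ids of jobs whose location lies within the radius."""
    return [job_id for job_id, (jlat, jlon) in jobs.items()
            if distance_miles(lat, lon, jlat, jlon) <= radius_miles]

# Hypothetical jobs keyed by primary key, with approximate coordinates.
jobs = {
    1: (30.44, -84.28),  # Tallahassee, FL
    2: (30.59, -84.58),  # Quincy, FL (roughly 20 miles away)
    3: (33.75, -84.39),  # Atlanta, GA (well over 200 miles away)
}

print(jobs_within(jobs, 30.44, -84.28))  # [1, 2]
```

A linear filter like this is fine for a small result set; a real geospatial search would typically narrow the candidates first (for example with a bounding box) before computing exact distances.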

Page 22

References
  Apache.org
  Adobe.com
  Ben Forta's Blog
  Slideshare.net
    Multiple authors
  Other references