
Leveraging Solr and Mahout

May 10, 2015

Grant Ingersoll

My talk from last night's Big Data Warehouse meetup in NYC on using Solr and Mahout to build next-generation data access tools.
Transcript
Page 1: Leveraging Solr and Mahout

Confidential © Copyright 2012

Leveraging Solr and Mahout for Next Gen Data Access and Insight

Grant Ingersoll
Chief Scientist

Page 2: Leveraging Solr and Mahout

Confidential and Proprietary © 2012 LucidWorks

Search is Dead, Long Live Search

Content

Users

Access

Content Relationships

• Modern Data Challenges are multi-structured

• Search is a system building block
- Text is only a part of the story

• If the algorithms fit, use them!

• Embrace fuzziness!

• Scoring features are everywhere

Page 3: Leveraging Solr and Mahout

Topics

• Intros

• Search (R)Evolution

• Apache Solr

• Apache Mahout

• Search and Machine Learning

• Scaling

Page 4: Leveraging Solr and Mahout

Grant’s Background

• Co-founder:
- LucidWorks – Chief Scientist
- Apache Mahout

• Long time Lucene/Solr committer

• Author: Taming Text
- www.manning.com/ingersoll

• Background in IR and NLP
- Built CLIR, QA and a variety of other search-based apps

Page 5: Leveraging Solr and Mahout

Search (R)evolution

• Search use leads to search abuse
- Denormalization frees your mind
- Scoring is just a sparse matrix multiply

• Lucene/Solr evolution
- Non-free text usages abound
- Many DB-like features
- NoSQL before NoSQL was cool
- Flexible indexing
- Finite State Transducers FTW!

• Scale

• “This ain’t your father’s relevance anymore”
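The "sparse matrix multiply" remark can be made concrete with a small sketch. It assumes a toy inverted index whose weights are made up for illustration; this is not Lucene's actual scoring formula, only the general shape of the computation: a sparse query vector multiplied by a sparse term-document matrix.

```python
from collections import defaultdict

# Toy term-document matrix stored sparsely: term -> {doc_id: weight}.
# An inverted index has the same shape (term -> postings list).
# All terms and weights here are hypothetical.
index = {
    "solr":   {0: 1.0, 1: 0.5},
    "mahout": {1: 1.0, 2: 0.8},
    "search": {0: 0.7, 2: 0.3},
}


def score(query_weights, index):
    """Multiply a sparse query vector by the sparse term-document matrix."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        # Only documents that contain the term contribute (sparsity).
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    return dict(scores)


def rank(query_weights, index):
    """Return doc ids ordered by descending score, ties broken by doc id."""
    scores = score(query_weights, index)
    return sorted(scores, key=lambda d: (-scores[d], d))


result = score({"solr": 1.0, "search": 2.0}, index)
ranking = rank({"solr": 1.0, "search": 2.0}, index)
```

Only the postings for the query terms are touched, which is why the multiply stays cheap even when the full matrix is very large.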

Page 6: Leveraging Solr and Mahout

Apache Solr?

• “Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Tomcat.”
- http://lucene.apache.org/solr

• Did I mention free?
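As a rough illustration of the HTTP/JSON API mentioned in the quote, the sketch below only builds a query URL for Solr's `/select` handler; it does not contact a server. The host, port, collection name and field names are assumptions made for the example, and `build_select_url` is a hypothetical helper, not part of any Solr client.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, facet_field=None, highlight_field=None):
    """Build a Solr /select URL asking for JSON, with optional facets and highlighting."""
    params = [("q", query), ("wt", "json")]
    if facet_field:
        # Faceted search: count matching documents per value of this field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    if highlight_field:
        # Hit highlighting: return snippets from this field.
        params += [("hl", "true"), ("hl.fl", highlight_field)]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical server location, collection and field names.
url = build_select_url(
    "http://localhost:8983/solr/collection1",
    "title:mahout",
    facet_field="category",
    highlight_field="title",
)
print(url)
```

Fetching the resulting URL with any HTTP client would return a JSON response from a running Solr instance that has matching fields.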

Page 7: Leveraging Solr and Mahout

Apache Mahout

• Goal: create library of scalable machine learning algorithms

• Mahout’s 3 “C”s provide tools for helping across many aspects of discovery
- Collaborative Filtering
- Classification
- Clustering

• Also:
- Collocations (Statistically Interesting Phrases)
- SVD
- Java math, primitives libraries and more
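To give a feel for what "statistically interesting phrases" means, here is a minimal sketch of the log-likelihood ratio (G²) test on a 2x2 contingency table, a statistic commonly used for collocation detection. It is written from the textbook formula and is not Mahout's code; the counts in the example are invented.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of bigram counts.

    k11: count of word A followed by word B
    k12: count of A followed by something other than B
    k21: count of something other than A followed by B
    k22: count of bigrams containing neither
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for observed, i, j in cells:
        if observed > 0:  # a zero cell contributes nothing to the sum
            expected = rows[i] * cols[j] / total
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


# Hypothetical counts: the two words co-occur far more often than chance.
strong = log_likelihood_ratio(100, 10, 10, 1000)
# Counts that match independence exactly score zero.
independent = log_likelihood_ratio(10, 10, 10, 10)
```

Higher scores indicate word pairs whose co-occurrence is harder to explain by chance, so ranking bigrams by this score surfaces candidate phrases.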

Page 8: Leveraging Solr and Mahout

Search + Machine Learning

• Search-driven applications present multiple opportunities for leveraging machine learning
- Clustering – Enhance Discovery, outlier detection
- Classification – Queries, Documents, Users
- Content Recommendation – Collab. Filtering and personalization
- NLP – phrases, named entities, co-reference, much more

• Many of these can also power faceted navigation

• Aside: Search can also often be used effectively to implement many machine learning algorithms
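One way to read the aside above: a k-nearest-neighbour classifier is essentially "run a search, then vote on the labels of the top hits". The sketch below assumes a tiny in-memory inverted index and plain term-overlap scoring; a real system would use a search engine's index and relevance ranking instead. The training documents and labels are invented.

```python
from collections import Counter, defaultdict


def build_index(labeled_docs):
    """Inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, (text, _label) in enumerate(labeled_docs):
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index


def classify(text, labeled_docs, index, k=3):
    """Label text by majority vote over the k best-matching indexed documents."""
    overlap = Counter()
    for term in set(text.lower().split()):
        for doc_id in index.get(term, ()):
            overlap[doc_id] += 1  # score = number of shared terms
    top = sorted(overlap, key=lambda d: (-overlap[d], d))[:k]
    votes = Counter(labeled_docs[d][1] for d in top)
    if not votes:
        return None  # nothing matched, so there is no basis for a guess
    # Most votes wins; ties are broken alphabetically for determinism.
    return min(votes, key=lambda label: (-votes[label], label))


# Hypothetical labeled training documents.
training = [
    ("solr faceted search server", "search"),
    ("lucene search index query", "search"),
    ("mahout clustering kmeans", "ml"),
    ("mahout classification sgd model", "ml"),
    ("search query relevance", "search"),
]
index = build_index(training)
print(classify("mahout sgd classification", training, index))
```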

Page 9: Leveraging Solr and Mahout

How and When

[Architecture diagram; labels recovered from the slide:]

• Data: LucidWorks Search connectors, Push
• Content Acquisition: ETL, batch or near real-time
• Discovery & Enrichment: Clustering, classification, NLP, topic identification, search log analysis, user behavior
• Shards: 1, 2, 3 … N
• Search View: Documents, Users, Logs
• Document Store
• Analytic Services: View into numeric/historic data
• Personalization & Machine Learning Services: Classification, Recommendation
• Classification Models
• In memory, Replicated, Multi-tenant
• Access APIs

Page 10: Leveraging Solr and Mahout

Scaling

• Search
- SolrCloud = Large scale, distributed search and faceting
» http://wiki.apache.org/solr/SolrCloud

• Machine Learning
- Mahout is built on Hadoop for most things
- SGD is sequential and really fast

• Sometimes all you can do is make an educated guess
- Storm, Kafka, etc. can help by allowing you to make estimates in near real time
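The "sequential and really fast" point about SGD can be seen in a small sketch of logistic regression trained by stochastic gradient descent: each example is visited once per pass and triggers a constant-time update per feature, with no need to hold the full dataset in a matrix. This is a generic illustration and not Mahout's implementation; the data, learning rate and epoch count are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, features):
    """Probability that the example belongs to class 1."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train_sgd(examples, n_features, learning_rate=0.1, epochs=200):
    """Fit logistic regression by updating the weights after every example."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            # Gradient of the log loss for one example is (prediction - label) * x.
            error = predict(weights, bias, features) - label
            bias -= learning_rate * error
            for i, x in enumerate(features):
                weights[i] -= learning_rate * error * x
    return weights, bias


# Hypothetical, linearly separable toy data: (features, label).
examples = [
    ([1.0, 0.0], 1),
    ([0.9, 0.1], 1),
    ([0.0, 1.0], 0),
    ([0.1, 0.9], 0),
]
weights, bias = train_sgd(examples, n_features=2)
```

Because the model is updated one example at a time, the same loop can in principle consume a stream of examples, which is one reason the approach suits near-real-time settings.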

Page 11: Leveraging Solr and Mahout

Wrap

• Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users

• LucidWorks has combined many of these things into LucidWorks Big Data
- http://www.lucidworks.com/products/lucidworks-big-data

• Design for the big picture when building search-based applications

Page 12: Leveraging Solr and Mahout

Resources

• LucidWorks
- http://www.lucidworks.com
- http://www.lucidworks.com/products/lucidworks-big-data
- @LucidImagineer

• Me
- [email protected]
- @gsingers

• Taming Text
- http://www.manning.com/ingersoll
- http://www.tamingtext.com
- @tamingtext