OCTOBER 11-14, 2016 • BOSTON, MA
Apr 16, 2017
Searching the Enterprise Data Lake with Solr - Watch us do it!
Paul Nelson
Chief Architect, Search Technologies
205+ Search Consultants Worldwide
San Diego
San Jose, CR
Cincinnati
Manila, PH Washington (HQ)
• Founded 2005 • Deep search expertise
• 900+ customers worldwide • Consistent profitability
• Search engines & Big Data • Vendor independent
London, UK
Frankfurt, DE Prague, CZ
Agenda • The Enterprise Data Lake (EDL) • Why Search the EDL? • The Process • How To: Step By Step • And then what?
In The Beginning
Application
Computer Users
Database
Dashboards
Reports
Search & Troubleshooting
Alerts
This Evolved to Data Warehouses
Many Computer Users
Dozens of Applications
Extract Transform
Load
Enterprise Data Warehouse
Dashboards
Reports
Search & Troubleshooting
Alerts
And Now the Enterprise Data Lake
Many, many, many Computer Users
Enterprise Data Lake
Dashboards
Reports
Search & Troubleshooting
Alerts
Analyze
Hundreds of Applications
Raw Data and Processed Data
What’s new about the Data Lake? • Ingest RAW DATA • Keep it FOREVER • Make it ALL AVAILABLE • Analyze it ONLY WHEN NEEDED • Do it at MASSIVE SCALE
Why the Data Lake? • You never know what’s important up front – New data mining techniques invented daily – Therefore, keep everything
• There is too much data variety – Therefore, only process what you need
• Save money by not ETL’ing useless stuff
• There are many different use cases – Shared re-use of data by anyone – Data is power! Power to the people!
But Now There’s a Problem: • Tens of thousands of databases • Billions of records
How to find the data you need?
“People today think search and big data are separate but in two or three years, everyone will wonder why we ever thought that.”
Doug Cutting, Chief Architect, Cloudera; Creator of Lucene & Hadoop
The Process
1. Ingest
2. Research the Data
3. Configure Solr
4. Parse & Index
5. Search & Analyze
6. Production
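As a rough illustration of steps 2, 4, and 5, the sketch below parses a small raw CSV sample, profiles its fields, and builds a Solr `/select` URL with a field facet. The sample data, the collection name `edl_docs`, the field names, and the base URL are all hypothetical; only the Solr query parameters (`q`, `rows`, `wt`, `facet`, `facet.field`) are standard. The talk itself uses Morphlines and the Cloudera indexer tools for parsing and indexing, so treat this as a stand-in for the idea rather than the presenter's method.

```python
import csv
import io
from urllib.parse import urlencode

# Hypothetical raw data, standing in for a file already ingested into the lake.
RAW = (
    "id,department,title\n"
    "1,finance,Q3 budget\n"
    "2,hr,Onboarding guide\n"
    "3,finance,Audit notes\n"
)


def parse_records(raw_text):
    """Step 4 (Parse): turn raw CSV text into a list of field -> value dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(raw_text))]


def profile_fields(records):
    """Step 2 (Research): count distinct values per field to guide schema design."""
    distinct = {}
    for record in records:
        for field, value in record.items():
            distinct.setdefault(field, set()).add(value)
    return {field: len(values) for field, values in distinct.items()}


def build_select_url(base_url, collection, query, facet_field):
    """Step 5 (Search & Analyze): build a Solr /select URL with one field facet."""
    params = urlencode({
        "q": query,
        "rows": 10,
        "wt": "json",
        "facet": "true",
        "facet.field": facet_field,
    })
    return f"{base_url}/{collection}/select?{params}"


records = parse_records(RAW)
profile = profile_fields(records)
url = build_select_url("http://localhost:8983/solr", "edl_docs",
                       "department:finance", "department")
```

A field with few distinct values (here `department`) is a natural facet candidate, while a field that is unique per record (here `id`) is a natural document key.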
6. Move to Production • Testing, Quality Control – Field processing – Search Features – Analytics
• Incremental Processing – Flume, Spark Streaming, Incremental Batches
• Workflow / Scheduled Jobs (Oozie) • Security Controls
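The incremental-batch idea can be sketched with a simple checkpoint: each scheduled run indexes only files modified since the last successful run. This is a minimal sketch under assumptions; the `(path, modified_time)` tuples and the function name `select_new_files` are hypothetical, and a production job would read the listing from HDFS and persist the checkpoint through the workflow scheduler.

```python
def select_new_files(files, checkpoint):
    """Pick files modified after the checkpoint, oldest first.

    files: iterable of (path, modified_time) tuples (hypothetical listing format).
    checkpoint: modified_time of the newest file already indexed.
    Returns (paths_to_index, new_checkpoint).
    """
    pending = sorted((f for f in files if f[1] > checkpoint), key=lambda f: f[1])
    # If nothing is new, the checkpoint stays where it was.
    new_checkpoint = pending[-1][1] if pending else checkpoint
    return [path for path, _ in pending], new_checkpoint


listing = [("/lake/a.csv", 100), ("/lake/b.csv", 250), ("/lake/c.csv", 200)]
batch, checkpoint = select_new_files(listing, 150)
```

Running the same call again with the returned checkpoint yields an empty batch, which is what makes the job safe to schedule repeatedly.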
Resources • HDFS File System Commands
– https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html
• solrctl Reference Guide – https://www.cloudera.com/documentation/enterprise/5-7-x/topics/search_solrctl_ref.html
• Morphlines Reference Guide – http://kitesdk.org/docs/1.1.0/morphlines/morphlines-reference-guide.html – https://github.com/typesafehub/config/blob/master/HOCON.md
• MapReduce Indexer Tool – https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr
• Crunch Indexer – https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-crunch
• Lily HBase Indexer – http://www.cloudera.com/documentation/enterprise/latest/topics/search_hbase_batch_indexer.html
What’s Next • Explore other analytic interfaces
– Banana, Zoom Data
• Spark
– Streaming Data – Complex Analytics → Store results in Solr → More analytics!
• Index Many More Collections – Create a Process: Data research → Data Model Design → Implement
• Self-Service Ingestion – Document processes for others to use – Templates for ingestion
• Hire Search Technologies!