Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata
Building Google-in-a-box · • Hadoop/Sqoop/Giraph committer • contributor across the Hadoop ecosystem) • Used to be root@Cloudera • Used to be a PHB at Yahoo! • Used to

May 20, 2020

Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Who’s this guy?

Roman Shaposhnik
@rhatr or [email protected]

• Sr. Manager at Pivotal Inc. building a team of ASF contributors
• ASF junkie
    • VP of Apache Incubator, former VP of Apache Bigtop
    • Hadoop/Sqoop/Giraph committer
    • contributor across the Hadoop ecosystem
• Used to be root@Cloudera
• Used to be a PHB at Yahoo!
• Used to be a UNIX hacker at Sun Microsystems
• First time author: “Giraph in Action”

What’s this all about?

This is NOT this kind of talk

This is this kind of a talk:

What are we building?

WWW analytics platform

[Architecture diagram. Components: WWW, Nutch, HBase, HDFS, MapReduce, Zookeeper, Lily HBase Indexer (Replication, Morphlines), Solr Cloud, Hive, Pig, Hue, DataSci]

Google papers
• GFS (Google FS) == HDFS
• MapReduce == MapReduce
• Bigtable == HBase
• Sawzall == Pig/Hive
• F1 == HAWQ/Impala

Storage design requirements
• Low-level storage layer: KISS
    • commodity hardware
    • massively scalable
    • highly available
    • minimalistic set of APIs (non-POSIX)
• Application specific storage layer
    • leverages the low-level storage layer (LLSL)
    • Fast r/w random access (vs. immutable streaming)
    • Scan operations

Design patterns
• HDFS is the “data lake”
    • Great simplification of storage administration (no SAN/NAS)
    • “Stateless” distributed applications persistence layer
• Applications are “stateless” compositions of various services
    • Can be instantiated anywhere (think YARN)
    • Can restart serving up the state from HDFS
    • Are coordinated via Zookeeper

Application design: SolrCloud

[Diagram: Solr svc instances, Zookeeper, HDFS. Messages: “I am alive”; “Who am I? What do I do?”]

Application design: SolrCloud

[Diagram: Solr svc, Zookeeper, HDFS. Messages: “Peer is dead”; “What do I do?”]

Application design: SolrCloud

[Diagram: Solr svc instances, Zookeeper, HDFS. Messages: “I am alive”; “Who am I? What do I do?”. Annotation: replication kicks in]

How do we build something like this?

The bill of materials
• HDFS
• Zookeeper
• HBase
• Nutch
• Lily HBase indexer
• SolrCloud
• Morphlines (part of Project Kite)
• Hue
• Hive/Pig/…

How about?

$ for comp in hadoop hbase zookeeper … ; do
    wget http://dist.apache.org/$comp
    tar xzvf $comp.tar.gz
    cd $comp ; mvn/ant/make install
    scp …
    ssh …
  done

We’ve seen this before!

[Diagram labels: GNU Software; Linux kernel]

Apache Bigtop!

[Diagram labels: HBase, Solr, …; Hadoop]

Let’s get down to business

Still remember this?

[Architecture diagram, reduced. Components: HDFS, HBase, Zookeeper, Lily HBase Indexer (Replication, Morphlines), Solr Cloud, Hue, DataSci]

HBase: row-key design

Row key               content:     anchor:a.com   anchor:b.com
com.cnn.www/a.html    <html>...    CNN            CNN.com
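
The row key on this slide looks like the page URL with its host name reversed, so that pages from one domain sort next to each other. A minimal sketch of that mapping follows; the function name row_key and the choice to drop the scheme and query string are assumptions for illustration, not something the slide specifies.

```python
from urllib.parse import urlsplit


def row_key(url: str) -> str:
    """Build a row key with the host labels reversed (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Reverse the dot-separated labels: www.cnn.com -> com.cnn.www
    reversed_host = ".".join(reversed(host.split(".")))
    # Keep the path so every page gets its own row; default to "/"
    path = parts.path or "/"
    return reversed_host + path


print(row_key("http://www.cnn.com/a.html"))  # com.cnn.www/a.html
```

With keys built this way, a scan over the prefix com.cnn. would cover every crawled host under cnn.com.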

Indexing: schema design
• Bad news: no more “schema on write”
• Good news: you can change it on the fly
• Let’s start with the simplest one:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="text" type="text_general" indexed="true" stored="true"/>
<field name="url" type="string" indexed="true" stored="true"/>
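
To make the three fields concrete, here is a small sketch that reads the field definitions and checks a candidate document for missing required fields. The wrapping <fields> element, the helper name missing_required and the sample document are assumptions added for illustration; a real Solr instance does this validation itself.

```python
import xml.etree.ElementTree as ET

# The three field definitions from the slide, wrapped in a root element
# (the wrapper is an assumption so that the snippet parses on its own).
FIELDS_XML = """
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="text" type="text_general" indexed="true" stored="true"/>
  <field name="url" type="string" indexed="true" stored="true"/>
</fields>
"""


def missing_required(doc: dict) -> list:
    """Return the names of required fields that are absent from doc."""
    fields = ET.fromstring(FIELDS_XML).findall("field")
    required = [f.get("name") for f in fields if f.get("required") == "true"]
    return [name for name in required if name not in doc]


# Hypothetical document for one crawled page
doc = {"id": "com.cnn.www/a.html", "url": "http://www.cnn.com/a.html", "text": "CNN ..."}
print(missing_required(doc))
```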

Deployment
• Single node pseudo distributed configuration
• Puppet-driven deployment
    • Bigtop comes with modules
    • You provide your own cluster topology in cluster.pp

Deploying the ‘data lake’
• Zookeeper
    • 3-5 members of the ensemble
      # vi /etc/zookeeper/conf/zoo.cfg
      # service zookeeper-server init
      # service zookeeper-server start
• HDFS
    • tons of configurations to consider: HA, NFS, etc.
    • see above, plus: /usr/lib/hadoop/libexec/init-hdfs.sh
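
As a rough illustration of what ends up in zoo.cfg for a three-member ensemble, the sketch below renders a minimal file. The host names zk1 to zk3, the data directory and the timing values are placeholders (assumptions), not settings taken from the talk; 2888 and 3888 are the conventional quorum and leader-election ports.

```python
def zoo_cfg(hosts, data_dir="/var/lib/zookeeper", client_port=2181):
    """Render a minimal zoo.cfg for a replicated ensemble (illustrative values)."""
    lines = [
        "tickTime=2000",
        "initLimit=10",
        "syncLimit=5",
        f"dataDir={data_dir}",
        f"clientPort={client_port}",
    ]
    # One server.N line per member; N has to match the myid file on that host.
    for n, host in enumerate(hosts, start=1):
        lines.append(f"server.{n}={host}:2888:3888")
    return "\n".join(lines) + "\n"


print(zoo_cfg(["zk1", "zk2", "zk3"]))
```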

HBase asynchronous indexing
• leveraging WAL for indexing
• can achieve infinite scalability of the indexer
• doesn’t slow down HBase (unlike co-processors)
• /etc/hbase/conf/hbase-site.xml:

<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>

Different clouds

[Diagram: HBase “cloud” (HBase Region Servers) → replication → Lily Indexer “cloud” (Lily Indexer Nodes) → Solr docs → SolrCloud (Solr Nodes). Caption: home of Morphline ETL]

Lily HBase indexer
• Pretends to be a region server on the receiving end
• Gets records
• Pipes them through the Morphline ETL
• Feeds the result to Solr
• All operations are managed via individual indexers

Creating an indexer

$ hbase-indexer add-indexer \
    --name web_crawl \
    --indexer-conf ./indexer.xml \
    --connection-param solr.zk=localhost/solr \
    --connection-param solr.collection=web_crawl \
    --zookeeper localhost:2181

indexer.xml

<indexer table="web_crawl" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">
  <param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/>
  <!-- <param name="morphlineId" value="morphline1"/> -->
</indexer>

Morphlines
• Part of Project Kite (look for it on GitHub)
• A very flexible ETL library (not just for HBase)
• “UNIX pipes” for bigdata
• Designed for NRT processing
• Record-oriented processing driven by HOCON definition
• Require a “pump” (most of the time)
• Have built-in sinks (e.g. loadSolr)
• Essentially a push-based data flow engine
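
The “push-based” and “pump” ideas can be illustrated with a toy chain in which each command pushes its output records to the next one. This is only a conceptual sketch in Python, not the Kite Morphlines API; the class Command, the helper build_chain and the three stage functions are made up for the example.

```python
class Command:
    """One stage in the chain; it pushes each output record to its child."""

    def __init__(self, fn, child=None):
        self.fn = fn          # maps one record to an iterable of 0..N records
        self.child = child    # next stage, or None for the last one

    def process(self, record):
        for out in self.fn(record):
            if self.child is not None:
                self.child.process(out)


def build_chain(fns):
    """Link the functions so the first feeds the second, and so on."""
    head = None
    for fn in reversed(fns):
        head = Command(fn, head)
    return head


indexed = []  # stands in for a sink such as loadSolr


def split_words(rec):
    return [{"word": w} for w in rec["text"].split()]


def drop_short(rec):
    return [rec] if len(rec["word"]) > 2 else []


def sink(rec):
    indexed.append(rec)
    return []


chain = build_chain([split_words, drop_short, sink])
# The "pump": something outside the chain has to push records in.
for raw in [{"text": "to index your bigdata"}]:
    chain.process(raw)
print(indexed)
```

One input record fans out into several records and some are dropped along the way, which is why the number of records can differ between stages.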

Different clouds

[Diagram: morphline command chain. WAL entries → extractHBaseCells → N records → convertHTML → M records → xquery → P records → logInfo → Solr docs]

Morphline spec

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]
    commands : [
      { extractHBaseCells {…} }
      { convertHTML {charset : UTF-8} }
      { xquery {…} }
      { logInfo { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

extractHBaseCells

{
  extractHBaseCells {
    mappings : [
      {
        inputColumn : "content:*"
        outputField : "_attachment_body"
        type : "byte[]"
        source : value
      }
    ]
  }
}

xquery

{
  xquery {
    fragments : [
      {
        fragmentPath : "/"
        queryString : """
          <fieldsToIndex>
            <webpage>
              {for $tk in //text() return concat($tk, ' ')}
            </webpage>
          </fieldsToIndex>
        """
      }
    ]
  }
}
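
The query above gathers every text node of the converted page into a single webpage field. A rough Python approximation of that step, assuming the input is already well-formed XHTML (which is what convertHTML is there to produce), could look like this; the function name page_text and the whitespace normalisation are choices made for the sketch.

```python
import xml.etree.ElementTree as ET


def page_text(xhtml: str) -> str:
    """Join all text nodes of a well-formed XHTML page into one string."""
    root = ET.fromstring(xhtml)
    # itertext() walks the text nodes in document order, similar to //text()
    joined = " ".join(root.itertext())
    # Collapse runs of whitespace so the result is a single clean line
    return " ".join(joined.split())


print(page_text("<html><body><h1>CNN</h1><p>Breaking <b>news</b></p></body></html>"))
# CNN Breaking news
```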

SolrCloud
• Serves up Lucene indices from HDFS
• A webapp running on bigtop-tomcat
    • gets configured via /etc/default/solr

SOLR_PORT=8983
SOLR_ADMIN_PORT=8984
SOLR_LOG=/var/log/solr
SOLR_ZK_ENSEMBLE=localhost:2181/solr
SOLR_HDFS_HOME=hdfs://localhost:8020/solr
SOLR_HDFS_CONFIG=/etc/hadoop/conf

Collections and instancedirs
• All of these objects reside in Zookeeper
    • An unfortunate trend we already saw with Lily indexers
• Collection
    • a distributed set of Lucene indices
    • an object defined by Zookeeper configuration
• Collections require (and can share) configurations in instancedir
• Bigtop-provided tool: solrctl
  $ solrctl [init|instancedir|collection|…]

Creating a collection

# solrctl init
$ solrctl instancedir --generate /tmp/web_crawl
$ vim /tmp/web_crawl/conf/schema.xml
$ vim /tmp/web_crawl/conf/solrconfig.xml
$ solrctl instancedir --create web_crawl /tmp/web_crawl
$ solrctl collection --create web_crawl -s 1

Hue
• Apache licensed, but not an ASF project
• A nice, flexible UI for the Hadoop bigdata management platform
• Follows an extensible app model

Demo time!

Where to go from here

Questions?