http://purdygoodengineering.com http://anant.us Accumulo and Spark With MLLib and GraphX

Machine Learning & Graph Processing w/ Spark and Accumulo

Apr 15, 2017

Page 1: Machine Learning & Graph Processing w/ Spark and Accumulo

Accumulo and Spark With MLLib and GraphX

Introduction
● Section 1: Understanding the Technology
  ○ Big Picture
  ○ Accumulo
  ○ Spark
  ○ Example Code
● Section 2: Use Cases
  ○ Multi-Tenant Data Processing
  ○ Machine Learning / Graph Processing in Spark
  ○ Example ML + Graph on Business Data
● Questions and Answers
● Contact Information

Section 1: Understanding the Technology
○ Big Picture
  ■ Why Accumulo
  ■ Why Spark
○ Accumulo
  ■ Key/Value Structure
  ■ Table Structure
  ■ Cell Level Security
  ■ Splits
  ■ Reads (scans)
  ■ Writes (upserts)
  ■ Deletes
○ Spark
  ■ Batch/Streaming
  ■ Machine Learning
  ■ Graph Processing
○ Example Code
  ■ Writing to Accumulo
  ■ Reading from Accumulo
  ■ Shell

Section 1: Big Picture
● Accumulo
  ○ Scalable, sorted, distributed key/value store with cell-level security
● Spark
  ○ General compute engine for large-scale data processing
    ■ Batch Processing
    ■ Streaming
    ■ Machine Learning Library
    ■ Graph Processing
● Use Spark for compute and Accumulo for storage to get a secure, distributed, scalable solution

Section 1: Accumulo: Key Structure

(image from accumulo.apache.org)
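
The key layout in the figure can be sketched in plain Python. An Accumulo key consists of a row ID, a column (family, qualifier, visibility), and a timestamp, and entries are kept sorted by key with the newest timestamp first. The `Key` class, the `sort_order` helper, and the sample data are illustrative assumptions for this sketch, not the Accumulo client API.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Illustrative model of the parts of an Accumulo key."""
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_order(key):
    # Ascending by row, family, qualifier, and visibility;
    # descending by timestamp so the newest version comes first.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


# Made-up entries: key -> value
entries = {
    Key("user2", "contact", "email", "private", 10): "b@example.com",
    Key("user1", "contact", "phone", "private", 10): "555-0100",
    Key("user1", "contact", "email", "public", 10): "a-old@example.com",
    Key("user1", "contact", "email", "public", 20): "a-new@example.com",
}

sorted_keys = sorted(entries, key=sort_order)
for key in sorted_keys:
    print(key, "->", entries[key])
```

Because all cells of a row sort next to each other, reading one row is a scan over one contiguous range of keys.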

Section 1: Accumulo: Key Structure

(figure: Accumulo table design compared with RDBMS table design)

Section 1: Accumulo: Table Structure

● Each table has many tablets (distributed across nodes)
● Tablet data is replicated (default is 3)
● All data for a row resides on the same tablet
  ○ A Row ID design strategy needs to ensure binning is evenly distributed
  ○ Each table has “splits” which determine binning
  ○ If rows are still too large, a sharding strategy is required

Section 1: Accumulo: Cell Level Security

● Each cell (or field) has its own access control determined by visibility

● Each user has authorizations which correspond to visibilities

● Only fields with visibilities which a user has authorization to access can be retrieved by that user

● Visibilities have limited logic such as AND and OR
  ○ e.g. private | system
  ○ e.g. public & dna_partner
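
How a visibility is matched against a user's authorizations can be sketched as a small expression evaluator. This is a simplified stand-in written for illustration, not Accumulo's own implementation: the function names `tokenize` and `is_visible` are assumptions, operators are applied left to right, and mixed `&` and `|` are expected to be grouped with parentheses.

```python
import re


def tokenize(expression):
    # Labels are runs of word characters; operators are &, |, ( and ).
    return re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)


def is_visible(expression, authorizations):
    """Return True if the authorizations satisfy the visibility expression."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # an empty visibility can be read by any user
    pos = 0

    def parse_term():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expression()
            pos += 1  # skip the closing ")"
            return result
        return token in authorizations

    def parse_expression():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            operator = tokens[pos]
            pos += 1
            right = parse_term()
            result = (result and right) if operator == "&" else (result or right)
        return result

    return parse_expression()


print(is_visible("private | system", {"system"}))        # one label is enough for OR
print(is_visible("public & dna_partner", {"public"}))    # AND needs both labels
```

A scan would apply a check like this to every cell and return only the cells for which it holds.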

Section 1: Splits

● Each table has a default split
● Splits can be added to tables
● Accumulo auto splits when tablets get too large
● Table splits and tablet max size are configurable
● Row IDs are generally hashed to support distribution
● Example splits based on hashing
  ○ 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f
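
The effect of hashed row IDs on binning can be sketched as follows. The sketch assumes that a split point is the inclusive end row of a tablet, prefixes each row ID with four hex characters of an MD5 digest, and uses the split list from the slide; the helper names and the `customer-N` keys are made up for illustration.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

# Split points from the slide; 16 split points give 17 tablets.
SPLITS = list("0123456789abcdef")


def hashed_row_id(natural_key):
    # Prefix the natural key with part of its hash so that row IDs
    # spread evenly over the hex-based split points.
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return digest[:4] + "_" + natural_key


def tablet_index(row_id, splits=SPLITS):
    # Index of the first split point >= row_id, i.e. the tablet whose
    # (inclusive) end row covers this row ID.
    return bisect_left(splits, row_id)


counts = Counter(
    tablet_index(hashed_row_id(f"customer-{i}")) for i in range(10000)
)
for index in sorted(counts):
    print(f"tablet {index:2d}: {counts[index]} rows")
```

With these split points the first tablet stays empty, because every hashed row ID sorts after the single character "0"; the remaining 16 tablets each receive a similar share of the rows.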

Section 1: Accumulo Reads

● Reads (are scans)
  ○ Scanner
  ○ BatchScanner (parallelizes over ranges)
● MapReduce/Spark
  ○ AccumuloInputFormat (one field at a time)
  ○ AccumuloRowInputFormat (one row at a time)
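
The difference between scanning one range and scanning several ranges in parallel can be sketched over a sorted list of row IDs. This is a toy model, not the Accumulo client API: `scan` and `batch_scan` are hypothetical helpers, and the rows are made up.

```python
from bisect import bisect_left, bisect_right
from concurrent.futures import ThreadPoolExecutor

# Row IDs kept in sorted order, as in a sorted key/value store.
ROWS = sorted(["a1", "a2", "b1", "b2", "c1", "d1"])


def scan(start, end):
    """Return the rows in the inclusive range [start, end], in sorted order."""
    lo = bisect_left(ROWS, start)
    hi = bisect_right(ROWS, end)
    return ROWS[lo:hi]


def batch_scan(ranges, threads=2):
    """Scan several ranges on a thread pool and combine the results."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        chunks = list(pool.map(lambda r: scan(*r), ranges))
    return [row for chunk in chunks for row in chunk]


print(scan("a1", "b1"))
print(batch_scan([("c1", "d1"), ("a1", "a2")]))
```

A single-range scan returns rows in sorted order; a caller of a multi-range scan should not rely on the overall order of the combined results.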

Section 1: Accumulo: Writes

● Writes
  ○ Writer
  ○ BatchWriter (parallelizes over tablets)
● MapReduce/Spark
  ○ AccumuloOutputFormat
  ○ AccumuloFileOutputFormat (bulk ingest)
● Both use Mutations to write to Accumulo

Section 1: Accumulo: Mutations (write and delete)

● Mutations are used to write and delete
● Mutation.put (to write)
● Mutation.putDelete (to delete)
● Writes are upserts (inserts or updates)
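
The upsert and delete behavior can be sketched with a toy in-memory table. The `Mutation` and `Table` classes below are illustrative stand-ins written for this sketch, not the Accumulo classes; they ignore timestamps, versions, and visibilities and keep only the latest value per cell.

```python
class Mutation:
    """Toy stand-in for a mutation: a batch of changes to one row."""

    def __init__(self, row):
        self.row = row
        self.updates = []  # (family, qualifier, value, is_delete)

    def put(self, family, qualifier, value):
        self.updates.append((family, qualifier, value, False))

    def put_delete(self, family, qualifier):
        self.updates.append((family, qualifier, None, True))


class Table:
    """Toy table keyed by (row, family, qualifier)."""

    def __init__(self):
        self.cells = {}

    def write(self, mutation):
        for family, qualifier, value, is_delete in mutation.updates:
            key = (mutation.row, family, qualifier)
            if is_delete:
                self.cells.pop(key, None)  # deleting a missing cell is a no-op
            else:
                self.cells[key] = value    # insert or overwrite: an upsert


table = Table()

insert = Mutation("acct-1")
insert.put("info", "name", "Acme")
insert.put("info", "status", "lead")
table.write(insert)

update = Mutation("acct-1")
update.put("info", "status", "closed")  # same key, new value
table.write(update)

delete = Mutation("acct-1")
delete.put_delete("info", "name")
table.write(delete)

print(table.cells)
```

The second write needs no separate update call: writing to an existing key replaces the value, which is what makes writes upserts.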

Section 1: Accumulo

● accumulo.apache.org
● Download Accumulo
● Examples
● Documentation

Concerned about scaling? How about 4T nodes and 70T edges in a graph; see http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf

Section 1: Spark: MapReduce first

● Hadoop MapReduce (batch processing)
  ○ Mapping
  ○ Reducing
  ○ Chain jobs
  ○ 95% IO (each job must read/write to disk)
  ○ Scalable

Section 1: Spark

● Batch Processing - MapReduce (many more functions)
● Streaming - mini batch processing
● Machine Learning - MLLib
● Graph Processing - GraphX
● Many Languages - (Java, Scala, Python, R)

Section 1: Spark

● spark.apache.org
● Download Spark
● Example code
● Documentation

Section 1: Example Code

Simple examples for bookkeeping with Spark and Accumulo

https://github.com/matthewpurdy/purdy-good/tree/master/purdy-good-spark/purdy-good-spark-accumulo

Section 2: Use Case(s) Machine Learning and Graph Processing
● Multi-Tenant Data Processing
● Machine Learning / Graph Processing in Spark
● Example Use Case of ML + Graph on Business Data

Section 2: Multi-Tenant Data Processing Needs
Customer (C) (P) & (C) Provider (P)

Team | Customer | Private Customer Data shared w/ Provider | Private Provider Data for Economy of Scale
Sales, Marketing | IBM | Indicators, Relationships, Classification | Classification Model, Relationship Graph
Marketing, Finance | Apple | Indicators, Correlation, Prediction | Correlation Model, Prediction Model
Sales, Marketing, Finance | Microsoft | Indicators, Relationships, Correlation, Prediction | Correlation Model, Prediction Model, Relationship Graph
Finance | Google | Indicators, Correlation, Prediction | Correlation Model, Prediction Model

Section 2: Multi-Tenant Data Processing Needs
Customer (C) (P) & (C) Provider (P)

C User | C Team | C Management | C Management, P Analytics | P Analytics, P Support
CU Manager, CU Employee | CT Sales | CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *
CU Manager, CU Employee | CT Marketing | CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *
CU Employee | CT Research | CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *
CU Employee | CT Finance | CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *

Section 2: Multi-Tenant Data Processing Needs
● Analyze Sales Team successes (Closed Accounts) to recommend companies to target for Marketing campaigns.
● Analyze Sales Team users' social accounts against social network users and recommended companies to create Call Lists.
● Correlate historic Marketing (Traffic & Conversions) data with historic Sales (Leads & Closed Accounts) data and historic Finance (Revenue & Profit) data to predict Sales from current Marketing & Sales activities.

Section 2: Out of the Box : MLLib in Spark
● Classification
● Regression
● Decision Trees
● Recommendation
● Clustering
● Topic Modeling
● Feature Transformations
● ML Pipelining / Persistence

● “Based on past performance in the companies in the CRM, the most successful sales have come from these categories, so go after these companies.”

Section 2: Out of the Box : MLLib in Spark
● Load Data
● Extract Features
● Train Model
● Find Best Model
● Use Model to Predict
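
The five steps above can be walked through in plain Python with a deliberately tiny model. This sketch does not use MLLib: the threshold classifier, the field names `meetings` and `emails`, and all records are assumptions made up for illustration, and the model assumes that closed deals have the higher feature value.

```python
from statistics import mean

# Load data: made-up CRM rows (hypothetical fields, not from the talk).
training = [
    {"meetings": 1, "emails": 9, "closed": False},
    {"meetings": 2, "emails": 3, "closed": False},
    {"meetings": 5, "emails": 4, "closed": True},
    {"meetings": 6, "emails": 8, "closed": True},
]
validation = [
    {"meetings": 1, "emails": 7, "closed": False},
    {"meetings": 4, "emails": 2, "closed": True},
    {"meetings": 7, "emails": 6, "closed": True},
]


def extract_features(record, feature):
    # Extract features: a single numeric feature per record.
    return float(record[feature])


def train(rows, feature):
    # Train model: threshold halfway between the class means.
    closed = [extract_features(r, feature) for r in rows if r["closed"]]
    lost = [extract_features(r, feature) for r in rows if not r["closed"]]
    return {"feature": feature, "threshold": (mean(closed) + mean(lost)) / 2.0}


def predict(model, record):
    return extract_features(record, model["feature"]) >= model["threshold"]


def accuracy(model, rows):
    return sum(predict(model, r) == r["closed"] for r in rows) / len(rows)


# Find best model: train one candidate per feature, compare on held-out data.
candidates = [train(training, feature) for feature in ("meetings", "emails")]
best_model = max(candidates, key=lambda m: accuracy(m, validation))

# Use model to predict on a new, unseen record.
prediction = predict(best_model, {"meetings": 5, "emails": 1})
print(best_model, prediction)
```

In Spark the same shape appears at larger scale: the data lives in distributed collections, and the training and model selection steps are delegated to library estimators instead of the hand-written threshold rule used here.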

Section 2: KeystoneML - End to End ML

http://keystone-ml.org/

Section 2: Out of the Box : GraphX in Spark
● PageRank
● Connected components
● Label propagation
● SVD++
● Strongly connected components
● Triangle count

● “Based on the social graph of sales team members and the companies in your CRM, talk to the companies you are “closest” to.”

Section 2: Out of the Box : GraphX in Spark
● Load Vertices RDD
● Load Edges RDD
● Create Graph from Vertices & Edges RDD
● Run Graph Process / Query
● Get Data

http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html

Section 2: Out of the Box : GraphX in Spark
● Load Edges into Graph
● Run PageRank
● Load Nodes into RDD
● Join Users RDD with Rank
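
The four steps above can be sketched in plain Python as a stand-in for the GraphX workflow. This is not the GraphX API: `page_rank` is a simple power-iteration version written for illustration, its scores are normalized to sum to 1, and the edge list and user names are made up.

```python
def page_rank(edges, iterations=50, damping=0.85):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    ranks = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_ranks = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # A node without outgoing edges spreads its rank over all nodes.
            targets = out_links[source] or nodes
            share = damping * ranks[source] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Load edges into the graph: (follower, followed) pairs.
edges = [(1, 2), (3, 2), (4, 2), (2, 1)]

# Run PageRank.
ranks = page_rank(edges)

# Load nodes: user ID -> user name.
users = {1: "alice", 2: "bob", 3: "carol", 4: "dave"}

# Join users with their rank and sort by rank, highest first.
ranked_users = sorted(
    ((users[node], ranks[node]) for node in users if node in ranks),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, rank in ranked_users:
    print(f"{name}: {rank:.4f}")
```

The join is the step that turns bare vertex IDs back into business data, which is why the node attributes are loaded separately from the edges.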

Questions and Answers?

Contact Information

Matthew Purdy
● [email protected]
● http://www.purdygoodengineering.com
● https://www.linkedin.com/in/matthewpurdy
● https://github.com/matthewpurdy

Rahul Singh
● [email protected]
● http://www.anant.us
● http://www.linkedin.com/in/xingh
● https://github.com/xingh