Bringing Algebraic Semantics to Apache Mahout Sebastian Schelter Infolunch @ Hasso Plattner Institut, Potsdam 05/07/2014

Aug 11, 2014

Page 1: Bringing Algebraic Semantics to Mahout

Bringing Algebraic Semantics to Apache Mahout

Sebastian Schelter

Infolunch @ Hasso Plattner Institut, Potsdam

05/07/2014

Page 2: Bringing Algebraic Semantics to Mahout

Mahout: Past & Future

Page 3: Bringing Algebraic Semantics to Mahout

Apache Mahout: History

• library for scalable machine learning (ML)

• started six years ago as ML on MapReduce

• focus on popular ML problems and algorithms

– Collaborative Filtering (Item-based, User-based, Matrix Factorization)

– Classification (Naive Bayes, Logistic Regression, Random Forest)

– Clustering (k-Means, Streaming k-Means)

– Dimensionality Reduction (Lanczos, Stochastic SVD)

+ in-memory math library (fork of Colt)

• large userbase (e.g. Adobe, AOL, Accenture, Foursquare, Mendeley, Researchgate, Twitter)

Page 4: Bringing Algebraic Semantics to Mahout

Apache Mahout: Problems

• Organisational

– onerous to provide support and documentation for scalable ML (need to convey technical and theoretical background at the same time)

– partly failed to compensate for developers who left the project

– barrier for contributions too low

• Technical

– MapReduce not well suited for ML: slow execution, especially for iterations; constrained programming model makes code hard to write, read and adjust

– varying quality of implementations

– lack of unified preprocessing machinery

Page 5: Bringing Algebraic Semantics to Mahout

Apache Mahout: Future

• Narrowing of focus

– removed a large number of rarely used or barely maintained algorithm implementations

• Abandonment of MapReduce

– will reject new MapReduce implementations

– widely used "legacy" implementations will be maintained

• "Reboot"

– usage of modern parallel processing systems which offer a richer programming model and more efficient execution → Apache Spark, possibly Stratosphere in the future

– DSL for linear algebraic operations as foundation for new implementations

– automatic optimization and parallelization of programs written in this DSL

Page 6: Bringing Algebraic Semantics to Mahout

Scala & Spark-Bindings

Page 7: Bringing Algebraic Semantics to Mahout

Requirements for an ideal ML environment

1. R/Matlab-like semantics

– type system that covers (a) linear algebra, (b) statistics and (c) data frames

2. Modern programming language qualities

– functional programming

– object oriented programming

– sensible byte code performance

– scriptable and interactive

3. Scalability

– automatic distribution and parallelization with sensible performance

4. Collection of off-the-shelf building blocks and algorithms

5. Visualization

Page 8: Bringing Algebraic Semantics to Mahout

Requirements for an ideal ML environment

→ Mahout's new Scala & Spark bindings address requirements 1(a), 2, 3 and 4

Page 9: Bringing Algebraic Semantics to Mahout

Scala & Spark Bindings

• Scala as programming/scripting environment

• R-like DSL:

val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q)

$G = B B^T - C - C^T + (\xi^T \xi)\, s_q s_q^T$

• Algebraic expression optimizer for distributed linear algebra

– provides a translation layer to distributed engines

– currently supports Apache Spark only

– might support Stratosphere in the future

Page 10: Bringing Algebraic Semantics to Mahout

Data Types

• Scalar real values

val x = 2.367

• In-memory vectors

– dense

– 2 types of sparse

val v = dvec(1, 0, 5)
val w = svec((0 -> 1) :: (2 -> 5) :: Nil)

• In-memory matrices

– sparse and dense

– a number of specialized matrices

val A = dense((1, 0, 5),
              (2, 1, 4),
              (4, 3, 1))

• Distributed Row Matrices (DRM)

– huge matrix, partitioned by rows

– lives in the main memory of the cluster

– provides small set of parallelized operations

– lazily evaluated operation execution

val drmA = drmFromHDFS(...)

Page 11: Bringing Algebraic Semantics to Mahout

Features (1)

• Matrix, vector, scalar operators: in-memory, out-of-core

drmA %*% drmB
A %*% x
A.t %*% drmB
A * B

• Slicing operators

A(5 until 20, 3 until 40)
A(5, ::); A(5, 5)
x(a to b)

• Assignments (in-memory only)

A(5, ::) := x
A *= B
A -=: B; 1 /:= x

• Vector-specific

x dot y; x cross y

• Summaries

A.nrow; x.length
A.colSums; B.rowMeans
x.sum; A.norm

Page 12: Bringing Algebraic Semantics to Mahout

Features (2)

• solving linear systems

val x = solve(A, b)

• in-memory decompositions

val (inMemQ, inMemR) = qr(inMemM)
val ch = chol(inMemM)
val (inMemV, d) = eigen(inMemM)
val (inMemU, inMemV, s) = svd(inMemM)

• out-of-core decompositions

val (drmQ, inMemR) = thinQR(drmA)
val (drmU, drmV, s) = dssvd(drmA, k = 50, q = 1)

• caching of DRMs

val drmA_cached = drmA.checkpoint()
drmA_cached.uncache()

Page 13: Bringing Algebraic Semantics to Mahout

Example

Page 14: Bringing Algebraic Semantics to Mahout

Cereals

Name                     protein  fat  carbo  sugars  rating
Apple Cinnamon Cheerios  2        2    10.5   10      29.509541
Cap'n'Crunch             1        2    12     12      18.042851
Cocoa Puffs              1        1    12     13      22.736446
Froot Loops              2        1    11     13      32.207582
Honey Graham Ohs         1        2    12     11      21.871292
Wheaties Honey Gold      2        1    16     8       36.187559
Cheerios                 6        2    17     1       50.764999
Clusters                 3        2    13     7       40.400208
Great Grains Pecan       3        3    13     4       45.811716

http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html

Page 15: Bringing Algebraic Semantics to Mahout

Linear Regression

• Assumption: target variable y generated by linear combination of feature matrix X with parameter vector β, plus noise ε: $y = X\beta + \varepsilon$

• Goal: find estimate of the parameter vector β that explains the data well

• Cereals example: X = weights of ingredients, y = customer rating

Page 16: Bringing Algebraic Semantics to Mahout

Data Ingestion

• Usually: load dataset as DRM from a distributed filesystem:

val drmData = drmFromHDFS(...)

• "Mimic" a large dataset for our example:

val drmData = drmParallelize(dense(
    (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
    (1, 2, 12, 12, 18.042851),    // Cap'n'Crunch
    (1, 1, 12, 13, 22.736446),    // Cocoa Puffs
    (2, 1, 11, 13, 32.207582),    // Froot Loops
    (1, 2, 12, 11, 21.871292),    // Honey Graham Ohs
    (2, 1, 16, 8, 36.187559),     // Wheaties Honey Gold
    (6, 2, 17, 1, 50.764999),     // Cheerios
    (3, 2, 13, 7, 40.400208),     // Clusters
    (3, 3, 13, 4, 45.811716)),    // Great Grains Pecan
  numPartitions = 2)

Page 17: Bringing Algebraic Semantics to Mahout

Data Preparation

• Cereals example: target variable y is customer rating, weights of ingredients are features X

• extract X as DRM by slicing, fetch y as in-core vector

val drmX = drmData(::, 0 until 4)

val y = drmData.collect(::, 4)

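$$
\mathrm{drmX} =
\begin{pmatrix}
2 & 2 & 10.5 & 10 \\
1 & 2 & 12 & 12 \\
1 & 1 & 12 & 13 \\
2 & 1 & 11 & 13 \\
1 & 2 & 12 & 11 \\
2 & 1 & 16 & 8 \\
6 & 2 & 17 & 1 \\
3 & 2 & 13 & 7 \\
3 & 3 & 13 & 4
\end{pmatrix},
\qquad
y =
\begin{pmatrix}
29.509541 \\ 18.042851 \\ 22.736446 \\ 32.207582 \\ 21.871292 \\ 36.187559 \\ 50.764999 \\ 40.400208 \\ 45.811716
\end{pmatrix}
$$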

Page 18: Bringing Algebraic Semantics to Mahout

Estimating β

• Ordinary Least Squares: minimizes the sum of squared residuals between the true target variable and the prediction of the target variable

• Closed-form expression for the estimate of β: $\hat{\beta} = (X^T X)^{-1} X^T y$

• Computing $X^T X$ and $X^T y$ is as simple as typing the formulas:

val drmXtX = drmX.t %*% drmX
val drmXty = drmX.t %*% y
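
The closed form follows from setting the gradient of the squared residual to zero. A short derivation, assuming $X^T X$ is invertible (X has full column rank), which the slide does not state explicitly:

```latex
\begin{aligned}
L(\beta) &= \lVert y - X\beta \rVert_2^2 = (y - X\beta)^T (y - X\beta) \\
\nabla_\beta L(\beta) &= -2\, X^T (y - X\beta) = 0 \\
X^T X \,\hat{\beta} &= X^T y \\
\hat{\beta} &= (X^T X)^{-1} X^T y
\end{aligned}
```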

Page 19: Bringing Algebraic Semantics to Mahout

Estimating β

• Solve the following linear system to get the least-squares estimate of β: $X^T X \hat{\beta} = X^T y$

• Fetch $X^T X$ and $X^T y$ onto the driver and use an in-core solver

– assumes $X^T X$ fits into memory

– uses an analogue of R's solve() function

val XtX = drmXtX.collect
val Xty = drmXty.collect(::, 0)
val betaHat = solve(XtX, Xty)
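
To see what the three lines compute without a cluster, here is a minimal plain-Scala sketch of the same normal-equations steps on the cereals data. It does not use Mahout: the object name, the helper functions and the Gaussian-elimination `solve` are illustrative assumptions that only stand in for the DSL's distributed operators and in-core solver.

```scala
// Plain-Scala stand-in for the DSL steps; none of these helpers are Mahout API.
object NormalEquationsSketch {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // Feature columns: protein, fat, carbo, sugars (rows as on the Cereals slide).
  val X: Mat = Array(
    Array(2.0, 2.0, 10.5, 10.0),
    Array(1.0, 2.0, 12.0, 12.0),
    Array(1.0, 1.0, 12.0, 13.0),
    Array(2.0, 1.0, 11.0, 13.0),
    Array(1.0, 2.0, 12.0, 11.0),
    Array(2.0, 1.0, 16.0, 8.0),
    Array(6.0, 2.0, 17.0, 1.0),
    Array(3.0, 2.0, 13.0, 7.0),
    Array(3.0, 3.0, 13.0, 4.0))

  // Target: customer rating.
  val y: Vec = Array(29.509541, 18.042851, 22.736446, 32.207582, 21.871292,
    36.187559, 50.764999, 40.400208, 45.811716)

  def transpose(a: Mat): Mat = a.head.indices.map(j => a.map(_(j))).toArray

  def matVec(a: Mat, v: Vec): Vec =
    a.map(row => row.zip(v).map { case (p, q) => p * q }.sum)

  def matMul(a: Mat, b: Mat): Mat = {
    val bt = transpose(b)
    a.map(row => bt.map(col => row.zip(col).map { case (p, q) => p * q }.sum))
  }

  // Gaussian elimination with partial pivoting on a copy of [a | b].
  def solve(a: Mat, b: Vec): Vec = {
    val n = b.length
    val m = a.indices.map(i => a(i) :+ b(i)).toArray
    for (col <- 0 until n) {
      val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
      val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
      for (r <- col + 1 until n) {
        val f = m(r)(col) / m(col)(col)
        for (c <- col to n) m(r)(c) -= f * m(col)(c)
      }
    }
    // Back substitution on the upper-triangular system.
    val x = new Array[Double](n)
    for (i <- n - 1 to 0 by -1) {
      val s = (i + 1 until n).map(j => m(i)(j) * x(j)).sum
      x(i) = (m(i)(n) - s) / m(i)(i)
    }
    x
  }

  val XtX: Mat = matMul(transpose(X), X)   // corresponds to drmX.t %*% drmX
  val Xty: Vec = matVec(transpose(X), y)   // corresponds to drmX.t %*% y
  val betaHat: Vec = solve(XtX, Xty)       // corresponds to solve(XtX, Xty)
}
```

Only the 4 x 4 matrix `XtX` and the length-4 vector `Xty` reach the solver, which is why the slide's approach only needs $X^T X$ to fit into driver memory, regardless of the number of rows in X.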

Page 20: Bringing Algebraic Semantics to Mahout

Estimating β

→ We have implemented distributed linear regression!

Page 21: Bringing Algebraic Semantics to Mahout

Goodness of fit

• Prediction of the target variable is a simple matrix-vector multiplication: $\hat{y} = X\hat{\beta}$

• Check L2 norm of the difference between true target variable and our prediction

val yHat = (drmX %*% betaHat).collect(::, 0)
(y - yHat).norm(2)
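
Written out, the quantity returned by `(y - yHat).norm(2)` is the Euclidean length of the residual vector. Here m denotes the number of rows (nine cereals in this example); the symbol is introduced for illustration:

```latex
\lVert y - \hat{y} \rVert_2 = \sqrt{\sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2},
\qquad \hat{y} = X\hat{\beta}
```

Smaller values indicate a closer fit. The value is in the units of the rating column, so it is mainly useful for comparing models on the same data.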

Page 22: Bringing Algebraic Semantics to Mahout

Adding a bias term

• Bias term left out so far

– constant factor added to the model, "shifts the line vertically"

• Common trick is to add a column of ones to the feature matrix

– bias term will be learned automatically

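$$
\begin{pmatrix}
2 & 2 & 10.5 & 10 \\
1 & 2 & 12 & 12 \\
1 & 1 & 12 & 13 \\
2 & 1 & 11 & 13 \\
1 & 2 & 12 & 11 \\
2 & 1 & 16 & 8 \\
6 & 2 & 17 & 1 \\
3 & 2 & 13 & 7 \\
3 & 3 & 13 & 4
\end{pmatrix}
\;\rightarrow\;
\begin{pmatrix}
2 & 2 & 10.5 & 10 & 1 \\
1 & 2 & 12 & 12 & 1 \\
1 & 1 & 12 & 13 & 1 \\
2 & 1 & 11 & 13 & 1 \\
1 & 2 & 12 & 11 & 1 \\
2 & 1 & 16 & 8 & 1 \\
6 & 2 & 17 & 1 & 1 \\
3 & 2 & 13 & 7 & 1 \\
3 & 3 & 13 & 4 & 1
\end{pmatrix}
$$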
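
Why the column of ones works can be seen by writing the model with an explicit intercept. The symbols $\beta_0$ (bias) and $\mathbf{1}$ (vector of ones) are notation introduced here for illustration:

```latex
\hat{y} = X\beta + \beta_0 \mathbf{1}
        = \begin{pmatrix} X & \mathbf{1} \end{pmatrix}
          \begin{pmatrix} \beta \\ \beta_0 \end{pmatrix}
```

The intercept becomes one more entry of the parameter vector, so the same least-squares procedure estimates it together with the other coefficients.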

Page 23: Bringing Algebraic Semantics to Mahout

Adding a bias term

• How do we add a new column to a DRM?

→ mapBlock() allows custom modifications to the matrix

val drmXwithBiasColumn = drmX.mapBlock(ncol = drmX.ncol + 1) {
  case (keys, block) =>
    // create a new block with an additional column
    val blockWithBiasColumn = block.like(block.nrow, block.ncol + 1)
    // copy data from current block into the new block
    blockWithBiasColumn(::, 0 until block.ncol) := block
    // last column consists of ones
    blockWithBiasColumn(::, block.ncol) := 1
    keys -> blockWithBiasColumn
}
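
The block-wise idea can be tried out without Mahout. The following plain-Scala sketch models a partition as row keys plus a dense block of rows; `Partition`, `addBiasColumn` and the sample data are hypothetical names chosen for illustration, not part of the Mahout API.

```scala
// Plain-Scala illustration of a per-block transformation; not Mahout code.
object BiasColumnSketch {
  type Block = Array[Array[Double]]

  // Hypothetical stand-in for one DRM partition: row keys plus the rows they index.
  final case class Partition(keys: Array[Int], block: Block)

  // Mirrors the mapBlock body: widen each row by one column holding the value 1.
  def addBiasColumn(p: Partition): Partition = {
    val widened = p.block.map(row => row :+ 1.0)
    Partition(p.keys, widened)
  }

  // Two partitions holding the first three cereal rows.
  val partitions: Seq[Partition] = Seq(
    Partition(Array(0, 1), Array(
      Array(2.0, 2.0, 10.5, 10.0),
      Array(1.0, 2.0, 12.0, 12.0))),
    Partition(Array(2), Array(
      Array(1.0, 1.0, 12.0, 13.0))))

  // Each partition is transformed independently; keys stay untouched.
  val withBias: Seq[Partition] = partitions.map(addBiasColumn)
}
```

Because the function only touches the rows of its own block, no data has to move between partitions, which is what makes this kind of modification cheap on a row-partitioned matrix.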

Page 24: Bringing Algebraic Semantics to Mahout

Under the covers

Page 25: Bringing Algebraic Semantics to Mahout

Underlying systems

• currently: prototype on Apache Spark

– fast and expressive cluster computing system

– general computation graphs, in-memory primitives, rich API, interactive shell

• future: add Stratosphere

– research system developed by TU Berlin, HU Berlin, HPI

– accepted into Apache Incubator recently

– functionality similar to Apache Spark, adds data flow optimization and efficient out-of-core execution

Page 26: Bringing Algebraic Semantics to Mahout

Optimization

• Execution is deferred, user composes logical operators

• Computational actions implicitly trigger optimization (= selection of physical plan) and execution

• Optimization factors: size of operands, orientation of operands, partitioning, sharing of computational paths

• e.g. matrix multiplication:

– 5 physical operators for drmA %*% drmB

– 2 operators for drmA %*% inMemA

– 1 operator for drmA %*% x

– 1 operator for x %*% drmA

val drmA = drmB.t %*% drmC
drmA.writeDrm(path)

val inMemA = (drmB.t %*% drmB).collect

Page 27: Bringing Algebraic Semantics to Mahout

Optimization Example (1)

• Computation of $X^T X$ in the linear regression example

• Logical optimization:

logical plan gets rewritten to use a special logical operator for Transpose-Times-Self multiplication via a rule in the optimizer

val drmXtX = drmX.t %*% drmX

val XtX = drmXtX.collect

[Figure: logical plan before the rewrite: OpAB(OpAt(drmX), drmX) → drmXtX; after the rewrite: OpAtA(drmX) → drmXtX]
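
A rule of this kind can be sketched as a pattern match over a small operator tree. The operator names OpAt, OpAB and OpAtA are taken from the slide; `Leaf`, `rewrite` and the overall structure are assumptions made for illustration and are not Mahout's actual optimizer code.

```scala
// Toy logical plan with one rewrite rule; a sketch, not Mahout's optimizer.
object RewriteSketch {
  sealed trait Op
  final case class Leaf(name: String) extends Op   // hypothetical: an existing DRM
  final case class OpAt(a: Op) extends Op          // transpose
  final case class OpAB(a: Op, b: Op) extends Op   // matrix product
  final case class OpAtA(a: Op) extends Op         // transpose-times-self

  // Recursive rewrite: a product of A-transposed with the same A becomes OpAtA(A).
  def rewrite(op: Op): Op = op match {
    case OpAB(OpAt(a), b) if a == b => OpAtA(rewrite(a))
    case OpAB(a, b)                 => OpAB(rewrite(a), rewrite(b))
    case OpAt(a)                    => OpAt(rewrite(a))
    case OpAtA(a)                   => OpAtA(rewrite(a))
    case leaf: Leaf                 => leaf
  }

  // Building the plan computes nothing; it only records the operators.
  val drmX: Op = Leaf("drmX")
  val logical: Op = OpAB(OpAt(drmX), drmX)   // drmX.t %*% drmX
  val optimized: Op = rewrite(logical)       // OpAtA(Leaf("drmX"))
}
```

Since the plan is just data until an action such as `collect` is called, the optimizer is free to inspect and rewrite it first. A product of two different operands is left unchanged by this rule.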

Page 28: Bringing Algebraic Semantics to Mahout

Optimization Example (2)

• Mahout computes $X^T X$ via the row-outer-product formulation of matrix multiplication, which can be executed efficiently in a single pass over the row-partitioned X

• optimizer chooses between two physical operators for OpAtA

– standard implementation:

• Map operator builds partial outer products from individual blocks based on an estimate of a good partitioning

• Reduce operator adds them up

– AtA_slim for tall but skinny matrices when an intermediate symmetric matrix requiring (X.ncol * X.ncol) / 2 entries fits into memory

$X^T X = \sum_{i=0}^{m} x_{i} x_{i}^T$
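
The row-outer-product formulation can be checked on a small example in plain Scala. In this sketch the helper names, the two-way split into partitions and the sample rows are illustrative assumptions; it mimics the map and reduce steps described above but is not the Mahout operator itself.

```scala
// Plain-Scala sketch of X^T X as a sum of per-row outer products.
object AtASketch {
  type Mat = Array[Array[Double]]

  // Outer product x * x^T of a single row.
  def outer(x: Array[Double]): Mat = x.map(a => x.map(b => a * b))

  // Element-wise sum of two equally sized matrices.
  def add(a: Mat, b: Mat): Mat =
    a.zip(b).map { case (r, s) => r.zip(s).map { case (p, q) => p + q } }

  // "Map" side: one partial result per row partition, built in one pass over its rows.
  def partialAtA(rows: Mat): Mat = rows.map(outer).reduce(add)

  // "Reduce" side: add up the partial results of all partitions.
  def atA(partitions: Seq[Mat]): Mat = partitions.map(partialAtA).reduce(add)

  // Reference: entry (i, j) is the dot product of columns i and j.
  def naiveAtA(x: Mat): Mat = {
    val n = x.head.length
    Array.tabulate(n, n)((i, j) => x.map(r => r(i) * r(j)).sum)
  }

  // First four cereal rows, split into two row partitions.
  val X: Mat = Array(
    Array(2.0, 2.0, 10.5, 10.0),
    Array(1.0, 2.0, 12.0, 12.0),
    Array(1.0, 1.0, 12.0, 13.0),
    Array(2.0, 1.0, 11.0, 13.0))
  val partitions: Seq[Mat] = Seq(X.take(2), X.drop(2))
}
```

Each partial result is only ncol x ncol, independent of how many rows a partition holds, so the data exchanged between the map and reduce steps stays small for tall, skinny matrices.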

Page 29: Bringing Algebraic Semantics to Mahout

Thank you. Questions?

Credits:

Design & implementation of the Scala & Spark Bindings and parts of this slideset by Dmitriy Lyubimov