Learn more about Advanced Analytics at http://www.alpinenow.com Innovation on DB Tsai [email protected] Sung Chung [email protected] Machine Learning Engineering @AlpineDataLabs August 14, 2014
Alpine innovation final v1.0

Oct 30, 2014

alpinedatalabs

Alpine has been innovating constantly since its founding, when the company built in-database analytics that went far beyond traditional in-memory, code-based desktop applications. This initial innovation built on the work of the MADlib team at Greenplum/Pivotal, ultimately inspired by the work of Joe Hellerstein’s team at UC Berkeley. The team then made all of this functionality available in a simple web interface, which enabled enterprise collaboration and a team-based approach to analytics. Later, Alpine released its first support for Hadoop, enabling complex analytics on Hadoop without any coding and handling the complexity of MapReduce and Hadoop configuration. Most recently, Alpine has been building new capabilities on top of Spark to offer Hadoop users a new level of performance and scale.
Transcript
Page 1: Alpine innovation final v1.0

Learn more about Advanced Analytics at http://www.alpinenow.com

Innovation on

DB Tsai [email protected]

Sung Chung [email protected]

Machine Learning Engineering @AlpineDataLabs

August 14, 2014

Page 2: Alpine innovation final v1.0

Alpine Data Labs

• Advanced Analytic Software Company
– Founded in 2011
– Agile Advanced Analytics, Collaboration and Management at Enterprise Scale
– Partnerships with EMC, Pivotal, MapR, Cloudera, QlikView and Tableau

• 50+ employees, based in San Francisco
– Machine Learning, Statistics and Big Data (Stanford, Berkeley, MIT)

• Growing in excess of 200% YOY with a broad international customer base
– Financial Services, Online Media, Government, Retail, Manufacturing…

Page 3: Alpine innovation final v1.0

Advanced Analytics on Big Data

Alpine Data Labs. Confidential and Proprietary.

Timeframe of Relevance

Work independently and re-use data scientist work. Collaborate across functions and teams. Iterate quickly.

Scalable Business Analytics

Allowing the Enterprise to manage “Data as an Asset.”

Scale and guard data practices

Data Science Productivity

Work faster, safer, in a more open manner. Industry leading machine learning algorithms built natively for parallel processing.

ALPINE CHORUS 4.0

ENTERPRISE DATA ENVIRONMENT

Data Scientist

Database Analyst

Data Engineer

Business Analyst

Campaign Manager

Sales Division

Customer Success

Product Manager

Page 4: Alpine innovation final v1.0

The Path to Innovation

TRADITIONAL DESKTOP → IN-DATABASE METHODS → WEB-BASED AND COLLABORATIVE → SIMPLIFIED CODE-FREE HADOOP & MPP DATABASE → ONGOING INNOVATION

Page 5: Alpine innovation final v1.0

The Path to Innovation

• Iterative algorithms scan through the data each time.

• With Spark, data is cached in memory after the first iteration.

• Quasi-Newton methods enhance in-memory benefits.

[Chart: time for the 1st iteration, 2nd iteration, and total under each of the three approaches above, on 150mm rows; labeled values: 921s and 97s]

Page 6: Alpine innovation final v1.0

Machine Learning in the Big Data Era

• Hadoop MapReduce solutions

• MapReduce scales well for batch processing.

• Many machine learning algorithms are iterative by nature.

• People resort to many tricks, such as training on subsamples of the data and then averaging the models. Why have big data if you’re only approximating?

Page 7: Alpine innovation final v1.0

Lightning-fast cluster computing

• Empower users to iterate through the data by utilizing the in-memory cache.

• Logistic regression runs up to 100x faster than Hadoop M/R in memory.

• We’re able to train exact models without doing any approximation.

Page 8: Alpine innovation final v1.0

Why does Alpine support MLlib?

• MLlib is a Spark subproject providing Machine Learning primitives.

• It’s built on Apache Spark, a fast and general engine for large-scale data processing.

• Shipped with Apache Spark since version 0.8

• High quality engineering design and effort

• More than 50 contributors since July 2014

• Alpine is 100% committed to open source to facilitate industry adoption that is driven by business needs.

Page 9: Alpine innovation final v1.0

AutoML

• Success of machine learning crucially relies on human machine learning experts, who select appropriate features, workflows, paradigms, algorithms, and their hyper-parameters.

• Even though the hyper-parameters can be chosen by grid search with cross-validation, a problem with more than two parameters becomes very difficult and challenging. It’s a non-convex optimization problem.

• There is a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge.

- AutoML workshop @ ICML’14

Page 10: Alpine innovation final v1.0

Random Forest

• An ensemble learning method for classification & regression that operates by constructing a multitude of decision trees at training time.

• A “black box” that needs little tuning and can automatically identify the structure, interactions, and relationships in the data.

• A technique to reduce the variance of single decision tree predictions by averaging the predictions of many de-correlated trees.

• De-correlation is achieved through Bagging and / or randomly selecting features per tree node.

NOTE: Most Kaggle competitions have at least one top entry that heavily uses Random Forests.

Page 11: Alpine innovation final v1.0

Sequoia Forest

Why Sequoia Forest?

MLlib already has a decision tree implementation, but it doesn’t support random features and is not optimized to train on large clusters.

What does Sequoia Forest do?

• Classification and Regression.

• Numerical and Categorical Features.

What’s next?

Gradient Boosting

Where can you find it?

https://github.com/AlpineNow/SparkML2

We’re merging it back into MLlib, and it is licensed under the Apache License.

More info: http://spark-summit.org/2014/talk/sequoia-forest-random-forest-of-humongous-trees

Page 12: Alpine innovation final v1.0

SPARK-1157: L-BFGS Optimizer

• No, it's not a blender!

Page 13: Alpine innovation final v1.0

What is SPARK-1157: L-BFGS Optimizer?

• Merged in Spark 1.0

• A popular algorithm for parameter estimation in Machine Learning.

• It’s a quasi-Newton method.

• The Hessian matrix of second derivatives doesn't need to be evaluated directly.

• The Hessian matrix is approximated using gradient evaluations.

• It converges much faster than the default optimizer in Spark, Gradient Descent.
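To make the "Hessian approximated from gradient evaluations" point concrete, here is a minimal pure-Python sketch of the standard L-BFGS two-loop recursion with a simple backtracking line search. It is not the Spark or Breeze implementation; the function names, the memory size of 5, and the toy objective are all assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lbfgs_minimize(f, grad, x0, memory=5, max_iter=100, tol=1e-8):
    """Minimize f from x0 with a basic L-BFGS loop (illustrative only)."""
    x, g = list(x0), grad(x0)
    history = []  # recent (s, y, rho) triples, oldest first

    for _ in range(max_iter):
        if math.sqrt(dot(g, g)) < tol:
            break

        # Two-loop recursion: multiply g by an inverse-Hessian approximation
        # built only from past steps (s) and gradient changes (y).
        q, alphas = list(g), []
        for s, y, rho in reversed(history):
            a = rho * dot(s, q)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if history:
            s, y, _ = history[-1]
            scale = dot(s, y) / dot(y, y)  # initial inverse-Hessian scaling
            q = [scale * qi for qi in q]
        for (s, y, rho), a in zip(history, reversed(alphas)):
            b = rho * dot(y, q)
            q = [qi + (a - b) * si for qi, si in zip(q, s)]
        direction = [-qi for qi in q]

        # Backtracking line search with the Armijo sufficient-decrease test.
        step, fx, slope = 1.0, f(x), dot(g, direction)
        x_new = [xi + step * di for xi, di in zip(x, direction)]
        while f(x_new) > fx + 1e-4 * step * slope and step > 1e-12:
            step *= 0.5
            x_new = [xi + step * di for xi, di in zip(x, direction)]

        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        curvature = dot(s, y)
        if curvature > 1e-10 * dot(s, s):  # keep only positive-curvature pairs
            history.append((s, y, 1.0 / curvature))
            history = history[-memory:]
        x, g = x_new, g_new
    return x


# Toy convex objective (hypothetical) with its minimum at (1, -2).
def objective(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2


def gradient(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


solution = lbfgs_minimize(objective, gradient, [0.0, 0.0])
```

Only gradients are ever evaluated; the second-order information lives entirely in the short history of (s, y) pairs, which is why the memory cost stays linear in the number of parameters.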

Page 14: Alpine innovation final v1.0

Page 15: Alpine innovation final v1.0

SPARK-2934: LogisticRegressionWithLBFGS

• Merged in Spark 1.1

• Uses L-BFGS to train Logistic Regression instead of the default Gradient Descent.

• Users don't have to construct their own objective function for Logistic Regression or implement all the details.

• Together with SPARK-2979, which minimizes the condition number, the convergence rate is further improved.

Page 16: Alpine innovation final v1.0

Page 17: Alpine innovation final v1.0

a9a Dataset Benchmark

[Chart: Logistic Regression with a9a Dataset (11M rows, 123 features, 11% non-zero elements); 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster. X-axis: Seconds (0 to 35). Y-axis: Log-Likelihood / Number of Samples (0.3 to 0.7). Series: L-BFGS Dense Features, L-BFGS Sparse Features, GD Sparse Features, GD Dense Features.]

Page 18: Alpine innovation final v1.0

a9a Dataset Benchmark

[Chart: Logistic Regression with a9a Dataset (11M rows, 123 features, 11% non-zero elements); 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster. X-axis: Iterations (-1 to 15). Y-axis: Log-Likelihood / Number of Samples (0.3 to 0.7). Series: L-BFGS, GD.]

Page 19: Alpine innovation final v1.0

rcv1 Dataset Benchmark

[Chart: Logistic Regression with rcv1 Dataset (6.8M rows, 677,399 features, 0.15% non-zero elements); 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster. X-axis: Seconds (0 to 30). Y-axis: Log-Likelihood / Number of Samples (0 to 0.8). Series: LBFGS Sparse Vector, GD Sparse Vector.]

Page 20: Alpine innovation final v1.0

news20 Dataset Benchmark

[Chart: Logistic Regression with news20 Dataset (0.14M rows, 1,355,191 features, 0.034% non-zero elements); 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster. X-axis: Seconds (0 to 80). Y-axis: Log-Likelihood / Number of Samples (0 to 1.2). Series: LBFGS Sparse Vector, GD Sparse Vector.]

Page 21: Alpine innovation final v1.0

SPARK-2979: Improve the convergence rate by standardizing the training features

Merged in Spark 1.1

Due to the invariance property of MLEs, the scale of your inputs is irrelevant.

However, the optimizer will not be happy with poor condition numbers, which can often be improved by scaling.

The model is trained in the scaled space, but the coefficients are converted back to the original space; as a result, it's transparent to users.

Without this, some training datasets that mix columns with different scales may not be able to converge.

Scikit and the glmnet package also standardize the features before training to improve convergence.

Only enabled in Logistic Regression for now.
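The "train in the scaled space, report in the original space" step can be sketched in a few lines. This example assumes each feature is divided by its sample standard deviation without centering, so a weight learned on a scaled feature maps back by dividing by the same standard deviation. The data, the weights, and the helper names are hypothetical and do not reproduce the MLlib code.

```python
import math


def column_std(rows):
    """Sample standard deviation of each column."""
    n = len(rows)
    stds = []
    for col in zip(*rows):
        mean = sum(col) / n
        stds.append(math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1)))
    return stds


def scale_row(row, stds):
    """Map a row into the scaled space: x_j / std_j."""
    return [v / s for v, s in zip(row, stds)]


def to_original_space(scaled_weights, stds):
    """Convert weights learned on scaled features back to raw-feature weights."""
    return [w / s for w, s in zip(scaled_weights, stds)]


def margin(weights, intercept, row):
    return intercept + sum(w * v for w, v in zip(weights, row))


# Two columns on very different scales (hypothetical data).
rows = [[0.001, 5200.0], [0.004, 4100.0], [0.002, 6100.0], [0.003, 4800.0]]
stds = column_std(rows)

# Hypothetical weights, as if an optimizer had produced them in the scaled space.
scaled_weights, intercept = [0.8, -1.2], 0.3
original_weights = to_original_space(scaled_weights, stds)

# Both parameterizations give the same margin for every row,
# which is why the rescaling can stay invisible to users.
margins_scaled = [margin(scaled_weights, intercept, scale_row(r, stds)) for r in rows]
margins_original = [margin(original_weights, intercept, r) for r in rows]
```

Because the margins agree, predictions are unchanged; only the optimizer sees the better-conditioned problem.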

Page 22: Alpine innovation final v1.0

SPARK-2272: Transformer

A spark, the soul of a transformer

Page 23: Alpine innovation final v1.0

SPARK-2272: Transformer

Merged in Spark 1.1

MLlib data preprocessing pipeline.

StandardScaler
– Standardizes features by removing the mean and scaling to unit variance.
– The RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models typically work better with zero mean and unit variance.

Normalizer
– Normalizes samples individually to unit L^n norm.
– A common operation for text classification or clustering, for instance.
– For example, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors.
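The cosine-similarity remark is easy to check numerically. The sketch below uses plain Python lists in place of MLlib vectors, and the two TF-IDF-like vectors are made-up values for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale v to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0.0 else list(v)


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical TF-IDF vectors for two documents.
doc_a = [0.0, 1.2, 0.0, 3.4, 0.5]
doc_b = [0.7, 0.9, 0.0, 2.1, 0.0]

similarity_from_normalized = dot(l2_normalize(doc_a), l2_normalize(doc_b))
similarity_direct = cosine_similarity(doc_a, doc_b)
```

Normalizing once up front means every later pairwise comparison is a plain dot product.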

Page 24: Alpine innovation final v1.0

StandardScaler

Page 25: Alpine innovation final v1.0

Normalizer

Page 26: Alpine innovation final v1.0

SPARK-1969: Online summarizer

Merged in Spark 1.1

Online algorithms for computing the mean, variance, min, and max in a streaming fashion.

Two online summarizers can be merged, so we can use one summarizer for each block of data in the map phase and merge all of them in the reduce phase to obtain the global summarizer.

A numerically stable one-pass algorithm is implemented to avoid the catastrophic cancellation of the naive implementation. Ref: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

Optimized for sparse vectors; the time complexity is O(non-zeros) instead of O(numCols) for each sample.

[Figures: two-pass algorithm; naive algorithm]
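A mergeable one-pass summarizer can be sketched for a single column as follows. This uses Welford's update and the pairwise merge formula described on the referenced Wikipedia page; the class name and fields are assumptions for illustration and do not mirror the MLlib class, which also handles vectors and sparsity.

```python
class OnlineSummarizer:
    """One-pass count, mean, variance, min, and max for a stream of floats."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.min = float("inf")
        self.max = float("-inf")

    def add(self, value):
        # Welford's update: avoids subtracting two large, nearly equal sums.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        self.min = min(self.min, value)
        self.max = max(self.max, value)
        return self

    def merge(self, other):
        # Combine two partial summaries, as a reduce phase would.
        if other.count == 0:
            return self
        if self.count == 0:
            self.count, self.mean, self.m2 = other.count, other.mean, other.m2
            self.min, self.max = other.min, other.max
            return self
        total = self.count + other.count
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.count * other.count / total
        self.mean += delta * other.count / total
        self.count = total
        self.min = min(self.min, other.min)
        self.max = max(self.max, other.max)
        return self

    @property
    def variance(self):
        """Sample variance (divides by count - 1)."""
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


# Values with a large offset, where the naive sum-of-squares formula is fragile.
block_a = [1e9 + 4.0, 1e9 + 7.0]
block_b = [1e9 + 13.0, 1e9 + 16.0]

summary_a = OnlineSummarizer()
for v in block_a:
    summary_a.add(v)  # "map" side, block A

summary_b = OnlineSummarizer()
for v in block_b:
    summary_b.add(v)  # "map" side, block B

merged = summary_a.merge(summary_b)  # "reduce" side
```

The merged summary matches what a single pass over all four values would produce, so each partition can be summarized independently.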

Page 27: Alpine innovation final v1.0

Page 28: Alpine innovation final v1.0

Page 29: Alpine innovation final v1.0

SPARK-2479: MLlib UnitTests

Merged in Spark 1.1

Floating point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors.

Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result.

That means that comparing two floats to see if they are equal is usually not what we want. As long as this imprecision stays small, it can usually be ignored.

Scala syntactic-sugar comparators are implemented using implicit conversions, making it easier for developers to write unit tests.
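The problem and the usual remedy can be shown in a few lines. This is a Python sketch of tolerance-based comparison, not the Scala comparators from the patch; the helper names and tolerances are assumptions.

```python
def approx_equal_abs(a, b, abs_tol):
    """True when a and b differ by at most abs_tol."""
    return abs(a - b) <= abs_tol


def approx_equal_rel(a, b, rel_tol):
    """True when a and b differ by at most rel_tol times the larger magnitude."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))


total = 0.1 + 0.2

# Exact comparison fails because 0.1, 0.2, and 0.3 are all rounded in binary.
exactly_equal = (total == 0.3)

# A tolerance-based comparison expresses what a test actually means.
close_enough = approx_equal_rel(total, 0.3, rel_tol=1e-9)
```

A relative tolerance suits values of arbitrary magnitude, while an absolute tolerance is the safer choice when the expected value is zero or very close to it.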

Page 30: Alpine innovation final v1.0

Page 31: Alpine innovation final v1.0

SPARK-1892: OWL-QN Optimizer (ongoing work)

It extends L-BFGS to handle L2 and L1 regularizations together (balanced with alpha as in elastic nets).

We fixed a couple of issues (#247) in Breeze's OWLQN implementation, and this work is based on that.

Blocked by SPARK-2505
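For readers unfamiliar with how alpha balances the two penalties, here is a small sketch of one common elastic-net parameterization (the glmnet-style form). The slide does not give the exact formula, so the form below, the function name, and the parameter names are assumptions.

```python
def elastic_net_penalty(weights, reg_param, alpha):
    """reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).

    alpha = 1 gives a pure L1 penalty and alpha = 0 a pure L2 penalty.
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (alpha * l1 + (1.0 - alpha) * 0.5 * l2_squared)


weights = [1.0, -2.0, 0.5]  # hypothetical model weights
pure_l1 = elastic_net_penalty(weights, reg_param=0.1, alpha=1.0)
pure_l2 = elastic_net_penalty(weights, reg_param=0.1, alpha=0.0)
mixed = elastic_net_penalty(weights, reg_param=0.1, alpha=0.5)
```

The L1 term is not differentiable at zero, which is why a plain L-BFGS cannot handle it and an orthant-wise method such as OWL-QN is needed.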

Page 32: Alpine innovation final v1.0

SPARK-2505: Weighted Regularization (ongoing work)

Each component of the weights can be penalized differently.

We can exclude the intercept from regularization in this framework.

Decouples regularization from the raw gradient update, which is not used in other optimization schemes.

Allows various update/learning rate schemes (adagrad, normalized adaptive gradient, etc.) to be applied independently of the regularization.

Smooth and L1 regularization will be handled differently in the optimizer.
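One way to picture per-component penalties is a vector of regularization strengths, one per weight, with the intercept's strength set to zero. The sketch below does this for a smooth L2 penalty only; the layout (intercept first), the names, and the numbers are hypothetical and are not taken from the proposed Spark design.

```python
def weighted_l2(weights, strengths):
    """Penalty 0.5 * sum_j strength_j * w_j^2 and its gradient."""
    penalty = 0.5 * sum(s * w * w for w, s in zip(weights, strengths))
    gradient = [s * w for w, s in zip(weights, strengths)]
    return penalty, gradient


def regularized_gradient(loss_gradient, weights, strengths):
    """Add the penalty gradient to a loss gradient computed separately."""
    _, reg_gradient = weighted_l2(weights, strengths)
    return [lg + rg for lg, rg in zip(loss_gradient, reg_gradient)]


weights = [0.7, -1.5, 2.0]    # assumed layout: intercept first
strengths = [0.0, 0.1, 0.4]   # strength 0 leaves the intercept unpenalized

penalty, reg_gradient = weighted_l2(weights, strengths)
full_gradient = regularized_gradient([0.2, -0.3, 0.1], weights, strengths)
```

Keeping the penalty in its own function, separate from the loss gradient, is what lets the learning-rate scheme and the regularizer vary independently.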

Page 33: Alpine innovation final v1.0

SPARK-2309: Multinomial Logistic Regression (ongoing work)

For a multinomial problem with K classes, we can generalize it via K - 1 linear models with logit link functions.

As a result, the weights will have dimension (K - 1)(N + 1), where N is the number of features.

The MLlib interface is designed for one set of parameters per model, so it requires some interface design changes.

Expected to be merged in the next release of MLlib, Spark 1.2

Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
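The K - 1 formulation can be sketched as follows, treating class 0 as the pivot class whose margin is fixed at zero. The weight layout (one row per non-pivot class, intercept first) and all numbers are assumptions for illustration, not the MLlib layout.

```python
import math


def predict_probabilities(weights, features):
    """Class probabilities from K - 1 linear models, with class 0 as the pivot.

    weights has K - 1 rows, each laid out as [intercept, w_1, ..., w_N].
    """
    margins = [
        row[0] + sum(w * x for w, x in zip(row[1:], features))
        for row in weights
    ]
    # Subtract the largest margin (the pivot's margin is 0) to avoid overflow.
    shift = max(0.0, max(margins))
    pivot = math.exp(-shift)
    exps = [math.exp(m - shift) for m in margins]
    total = pivot + sum(exps)
    return [pivot / total] + [e / total for e in exps]


num_classes, num_features = 3, 2            # K = 3, N = 2
weights = [[0.5, 1.0, -2.0],                # model for class 1 vs. pivot
           [-0.3, 0.4, 0.9]]                # model for class 2 vs. pivot
parameter_count = sum(len(row) for row in weights)

probabilities = predict_probabilities(weights, [1.5, -0.5])
```

With K = 3 and N = 2 the model holds (K - 1)(N + 1) = 6 parameters, which is the single flat weight vector that the one-parameter-set interface would have to accommodate.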

Page 34: Alpine innovation final v1.0

Technology we're using: Scala/Java, Akka, Spray, Hadoop, Spark/SparkSQL, Pig, Sqoop, JavaScript, D3.js, etc.

Actively involved in the open source community: almost all of our newly developed algorithms in Spark will be contributed back to MLlib.

Actively developing the application to/from Spark YARN communication infrastructure (application lifecycle, error reporting, progress monitoring, interactive commands, etc.)

In addition to Spark, we are the maintainers of several open source projects, including Chorus, an SBT plugin for JUnit test Listener, and an Akka-based R engine.

Weekly tech/ML talks. Speakers: David Hall (author of Breeze), Heather Miller (student of Martin Odersky), H.Y. Li (author of Tachyon), and Jason Lee (student of Trevor Hastie), etc…

Organizes the SF Machine Learning meetup (2k+ members). Speakers: Andrew Ng (Stanford), Michael Jordan (Berkeley), Xiangrui Meng (Databricks), Sandy Ryza (Cloudera), etc…

We’re open source friendly and tech driven!

Page 35: Alpine innovation final v1.0

We're hiring!

• Machine Learning Engineer

• Data Scientist

• UI/UX Engineer

• Platform Engineer

• Automation Test Engineer

Shoot me an email at [email protected]

Page 36: Alpine innovation final v1.0

For more information, contact us

1550 Bryant Street Suite 1000
San Francisco, CA 94103
USA

+1 (877) 542-0062

www.alpinenow.com

Get Started Today!

http://start.alpinenow.com