Internship final report @ Treasure Data Inc. (Aug. 1 - Sep. 30, 2016)

ITO Ryuichi

Outline

• Who am I?

• What did I do?
  • About Hivemall
  • Benchmark
  • Add several new features

Who am I?

• ITO Ryuichi (@amaya382)

• Graduate School of Information Science and Technology, Osaka University ('16-)

• Accelerating a graph processing engine: concurrency control, hardware-aware optimization

• (a little) Natural language processing: conversation system with context consistency

❤ Scala, C#

What did I do?

• About Hivemall
• Benchmark
• Add several new features

About Hivemall

• A scalable machine learning library running on Apache Hive (+ Spark, Pig)
  • Developed by @myui and others as OSS
  • Joined the Apache Incubator 🎉

• Many features are available via HQL (Hive Query Language, similar to SQL)
  • Classification
    • Perceptron, AdaGradRDA, Soft Confidence Weighted, etc.
  • Recommendation
    • Matrix Factorisation, Factorisation Machine, etc.
  • Utilities
    • Feature engineering, additional array operations, etc.
  • etc.

[Figure: the Hivemall logo. "Cute logo!"]

About Hivemall (cont.)

• How does Hivemall work on Hive?
  • Hivemall is a set of UDFs (User-Defined Functions)
    • UDF: projection, one entry -> one entry
    • UDTF (Table-generating): some entries -> some entries
    • UDAF (Aggregate): all entries -> one entry
  • Features are defined as UDFs that implement the Java interfaces provided by Hive
  • Loading the Hivemall JAR file makes these extra functions available in HQL

About Hivemall (cont.)

• Example: training by logistic regression
  • HQL alone is enough; there is no need to be familiar with programming. (And HQL (Hive) already sits close to the data!)

    CREATE TABLE model AS
    SELECT
      feature,
      AVG(weight) AS weight
    FROM (
      SELECT logress(features, label, ...) AS (feature, weight)
      FROM train_data
    ) t
    GROUP BY feature

Benchmark

• Based on benchm-ml (https://github.com/szilard/benchm-ml)
  • Several pre-defined test cases with prepared data sets
    1. Logistic Regression
    2. Random Forest
    3. Boosting
    4. Deep Learning
  • Several hyperparameters
  • Already tested with several tools (e.g. R, Python-sklearn, Spark, etc.)

NOTE: the tools basically share a common environment, but some cases use different environments. For more details, see the benchm-ml project.

• Tried in this work: 1. Logistic Regression and 2. Random Forest

Benchmark (cont.)

• Environment
  • Amazon Web Services
    • EMR (Elastic MapReduce)
    • m3.xlarge*3 + c3.xlarge*3
    • Hadoop: Amazon 2.7.2
    • Tez: 0.8.4
    • Hive: 2.1.0
    • Hivemall: 0.4.2-RC2
  • Misc.
    • Basically ran with a parallelism of six, matching the number of instances

Benchmark - Logistic Regression

• Using logress() on Hivemall

• Hivemall is relatively slow and yields a lower AUC (NOTE: Hivemall’s logress uses SGD)
  • But its scalability can be relied on

Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/1-linear/6-hivemall

(Time [sec] / AUC [%])

[Figure: benchmark chart of training time and AUC. Annotations on the chart: ✖, 1.3x, 3.9x, 4.9x, 10x, 12.5x, ±0, +0.4.]

• High initial overhead caused by Hive

Benchmark - Random Forest (1)

• Using train_randomforest_classifier() on Hivemall
  • Setting (1): 500 trees, three variables

• Hivemall does fairly well up to 0.1M, but cannot process 1M
  • The environment and parameters need tuning

Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/2-rf/7-hivemall


[Figure: Random Forest (1) benchmark results. One annotation on the chart reads "Amazing…".]


Benchmark - Random Forest (2)

• Using train_randomforest_classifier() on Hivemall
  • Setting (2): 100 trees, max depth 20

• Hivemall does well up to 1M, but cannot process 10M
  • The environment and parameters need tuning

Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/z-other-tools/10-hivemall


Add several new features (main topic!)

• systemtest module

• Feature binning

• Feature selection

• Some Spark integrations

Add new features - systemtest

• What’s systemtest?
  • A testing framework for UDFs
  • It can also be applied to other applications based on UDFs

• Tests already exist, don’t they? Why is this needed?
  • Yes, but the existing tests...
    • Cannot actually run on Hive; they only run as Java programs
    • Make it difficult to write comprehensive tests
      • e.g. a UDAF has several work flows depending on the kind of function, the data set and the environment
    • Make it difficult to use existing resources
    • Have low extensibility, etc.


• Example: a part of an existing test

    final SignalNoiseRatioUDAF snr = new SignalNoiseRatioUDAF();
    final ObjectInspector[] OIs = new ObjectInspector[] {
        ObjectInspectorFactory.getStandardListObjectInspector(
            PrimitiveObjectInspectorFactory.writableDoubleObjectInspector),
        ObjectInspectorFactory.getStandardListObjectInspector(
            PrimitiveObjectInspectorFactory.writableIntObjectInspector)};
    final SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator evaluator =
        (SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator) snr.getEvaluator(
            new SimpleGenericUDAFParameterInfo(OIs, false, false));
    evaluator.init(GenericUDAFEvaluator.Mode.PARTIAL1, OIs);
    final SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator.SignalNoiseRatioAggregationBuffer agg =
        (SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator.SignalNoiseRatioAggregationBuffer)
            evaluator.getNewAggregationBuffer();
    evaluator.reset(agg);
    ...
    for (int i = 0; i < features.length; i++) {
        final List<IntWritable> labelList = new ArrayList<IntWritable>();
        for (int label : labels[i]) {
            labelList.add(new IntWritable(label));
        }
        evaluator.iterate(agg,
            new Object[] {WritableUtils.toWritableList(features[i]), labelList});
    }
    final List<DoubleWritable> resultObj = (List<DoubleWritable>) evaluator.terminate(agg);
    ...
    Assert.assertArrayEquals(answer, result, 1e-5);



Notes on this excerpt:

• A lot is omitted
• Needlessly long initialization
• Many needless conversions
• And it does not run on Hive; it is only a logic test!


• Solution
  • A new module based on JUnit, HiveRunner and td-client-java

• What can it do?
  • Short and unified initialization
  • Write and combine HQL
  • Run on local Hive and on remote Treasure Data with the same code
  • The testbed is prepared and cleaned up automatically
  • External resources, e.g. TSV files, are easy to use
  • Tests are defined literally (in HQL), yet can be run under a debugger
  • A useful DSL

Add new features - systemtest

• How does it work?

[Figure: component diagram. Labels: User, Test code, SystemTestTeam, SystemTestRunner (interface), HiveSystemTestRunner and TDSystemTestRunner (implementations), HiveRunner, Treasure Data.]

1. Write tests based on the SystemTestRunner interface

2. Read the initialization and execute it via the implementations of SystemTestRunner
  • Works through JUnit @ClassRule
  • Prepares a database dedicated to each test class
  • Uses external resources as needed

3. Execute the first test
  • Works through JUnit @Rule
  • Runs as HQL and checks the return values
  • Rewrites the DSL & HQL for each environment

4. Reset the testbeds
  • Drop temporary tables

5. Execute the second test

6. Reset the testbeds ... and repeat for all tests

7. Finalize the test
  • Drop the temporary database and disconnect
  • Works through JUnit @ClassRule


• Example: initialization

    private static SystemTestCommonInfo ci = new SystemTestCommonInfo(HogeTest.class);

    private static HQBase createIrisTable = HQ.uploadByResourcePathAsNewTable(
        "iris0", ci.initDir + "iris0.csv",
        new LinkedHashMap<String, String>() {{
            put("a", "double");
            put("b", "double");
            put("c", "double");
            put("d", "double");
            put("c0", "int");
            put("c1", "int");
            put("c2", "int");
        }});

    @ClassRule
    public static HiveSystemTestRunner hRunner = new HiveSystemTestRunner(ci) {{
        initBy(createIrisTable);
        initBy(HQ.fromStatements(""
            + "CREATE TEMPORARY FUNCTION transpose_and_dot as 'hivemall.tools.matrix.TransposeAndDotUDAF';"
            + "CREATE TEMPORARY FUNCTION array_sum as 'hivemall.tools.array.ArraySumUDAF';"
            + "CREATE TEMPORARY FUNCTION array_avg as 'hivemall.tools.array.ArrayAvgGenericUDAF';"
            + "CREATE TEMPORARY FUNCTION chi2 as 'hivemall.ftvec.selection.ChiSquareUDF';"
            + "CREATE TEMPORARY FUNCTION snr as 'hivemall.ftvec.selection.SignalNoiseRatioUDAF';"));
    }};

    @ClassRule
    public static TDSystemTestRunner tRunner = new TDSystemTestRunner(ci) {{
        initBy(createIrisTable);
    }};

    @Rule
    public SystemTestTeam team = new SystemTestTeam(hRunner);



Notes on this example:

• No omission!
• createIrisTable: common initialization with external data
• hRunner and tRunner: testbed-specific initialization
• team: sets the common runner

Add new features - systemtest

• Example: test cases (1)

    @Test
    public void snr() throws Exception {
        team.set(HQ.fromStatement(""
            + "WITH iris AS ("
            + " SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0)"
            + "SELECT snr(X, Y)"
            + "FROM iris"), "$ANSWER");
        team.run();
    }

    @Test
    public void chi2() throws Exception {
        team.add(tRunner);
        team.set(HQ.fromStatement(""
            + "WITH iris AS ("
            + " SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0),"
            + "stats AS ("
            + " SELECT"
            + " transpose_and_dot(Y, X) AS observed,"
            + " array_sum(X) AS feature_count,"
            + " array_avg(Y) AS class_prob"
            + " FROM"
            + " iris),"
            + "test AS ("
            + " SELECT"
            + " transpose_and_dot(class_prob, feature_count) AS expected"
            + " FROM"
            + " stats)"
            + "SELECT"
            + " chi2(observed, expected) AS x "
            + "FROM"
            + " test JOIN stats"), "$ANSWER");
        team.run();
    }



Notes on this example:

• No omission!
• Tests are executed on clean testbeds, using the database created by the initialization
• snr(): runs on HiveRunner
• chi2(): runs on HiveRunner and Treasure Data


• Example: test cases (2)

    @Test
    public void someTest0() throws Exception {
        final String tableName = "color";
        team.initBy(HQ.uploadByResourcePathAsNewTable(
            tableName, ci.initDir + "color.tsv",
            new LinkedHashMap<String, String>() {{
                put("name", "string");
                put("red", "int");
                put("green", "int");
                put("blue", "int");
            }}));
        team.set(HQ.fromStatement(""
            + "SELECT CONCAT('rgb(', red, ',', green, ',', blue, ')') FROM "
            + tableName + " u LEFT JOIN color c on u.favorite_color = c.name"),
            "rgb(255,165,0)\trgb(255,192,203)");
        team.run();
    }

    @Test
    public void someTest1() throws Exception {
        team.set(HQ.autoMatchingByFileName("hoge"), ci);
        team.run();
    }




Notes on this example:

• No omission!
• someTest0(): test-specific initialization, which can also be chained
• someTest1(): uses HQL and answers written in external files


• More details?
  • https://github.com/myui/hivemall/issues/323
  • https://github.com/myui/hivemall/pull/336
  • And systemtest/README.md

Add new features - feature binning

• What’s feature binning?
  • A method to divide quantitative variables into meaningful categorical variables

• How does it work?
  • [UDAF] build_bins(weight, num_of_bins[, auto_shrink])
  • [UDF] feature_binning(features, quantiles_map) / (weight, quantiles)
  • The output of build_bins is passed to feature_binning


• [UDAF] build_bins(weight, num_of_bins[, auto_shrink])
  • Uses percentiles internally, so that all bins cover uniform areas of the distribution

• What’s auto_shrink?
  • A small or skewed data set sometimes produces void bins (!?) -> Exception!
  • auto_shrink is the option for this case


• [UDF] feature_binning(features, quantiles_map) / (weight, quantiles)
  • Distributes each variable into a bin according to its value
  • Example: Age:17, with bins 0, 1 and 2
    • 17 is between -Infinity and 18.0 ...
    • ... so it goes into bin 0
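The two binning steps above can be sketched in plain Java. This is only an illustration of equal-frequency binning, not Hivemall's implementation: the class and method names (BinningSketch, buildBins, featureBinning) are hypothetical, the inner boundaries use a nearest-rank percentile, and treating the upper boundary as inclusive is an assumption.

```java
import java.util.Arrays;

/** Plain-Java sketch of equal-frequency binning (hypothetical names, not Hivemall code). */
final class BinningSketch {

    /**
     * Returns numBins + 1 boundaries, from -Infinity to +Infinity.
     * Inner boundaries are nearest-rank percentiles (an assumption of this sketch).
     */
    static double[] buildBins(double[] values, int numBins) {
        if (values.length == 0 || numBins < 1) {
            throw new IllegalArgumentException("need data and at least one bin");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double[] bounds = new double[numBins + 1];
        bounds[0] = Double.NEGATIVE_INFINITY;
        bounds[numBins] = Double.POSITIVE_INFINITY;
        for (int i = 1; i < numBins; i++) {
            // 1-based rank ceil(i * n / numBins), computed with integers only
            int rank = (i * n + numBins - 1) / numBins;
            bounds[i] = sorted[rank - 1];
        }
        return bounds;
    }

    /** Returns the bin index i such that bounds[i] < value <= bounds[i + 1]. */
    static int featureBinning(double value, double[] bounds) {
        for (int i = 1; i < bounds.length; i++) {
            if (value <= bounds[i]) {
                return i - 1;
            }
        }
        return bounds.length - 2; // only reached for NaN input
    }
}
```

With the ages {12, 18, 25, 33, 41, 60} and three bins, the boundaries come out as -Infinity, 18.0, 33.0, +Infinity, so an age of 17 lands in bin 0, as in the slide's example. Duplicate boundaries (void bins) are not handled here.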


• More details?
  • https://github.com/myui/hivemall/issues/319
  • https://github.com/myui/hivemall/pull/322



Add new features - feature selection

• What’s feature selection?
  • A generic term for methods that select meaningful features
  • Used as a preprocessing step of machine learning

• Why is it used?
  • To improve results
  • To shorten learning time
  • To make the set of features human-understandable


• Kinds of feature selection
  • Use variance
  • Use the Chi-square value (implemented)
  • Use SNR (Signal Noise Ratio) (implemented)
  • mRMR (minimum Redundancy Maximum Relevance)
  • etc.


• Feature selection using the Chi-square value
  • Calculating the Chi-square value requires both observed values and expected values (= hypothesis)
    • Observed: aggregated features of each class
    • Expected: values calculated on the assumption that each feature and each class are independent
  • Calculate the Chi-square value
  • Select the top-k features


• How does it work on Hivemall?
  • [UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
  • [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
  • [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>


• [UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
  • Utility for matrix calculation, generic UDF
  • [Figure: matrices labeled Y and X^T]
  • You might think matrix multiplication requires repetition...
  • ... but it is calculated incrementally!
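The incremental calculation can be illustrated with a small plain-Java accumulator. This is a sketch under assumptions, not Hivemall's code: the class TransposeAndDotSketch and its iterate/merge/terminate methods are hypothetical names that only mirror the usual UDAF phases. The idea is that X^T * Y equals the sum, over all rows, of the outer product of that row's x and y, so each row can be folded in as it arrives and partial sums from different workers can simply be added.

```java
/** Sketch of accumulating X^T * Y row by row (hypothetical names, not Hivemall code). */
final class TransposeAndDotSketch {

    // acc[i][j] holds the running sum of x[i] * y[j] over all rows seen so far
    private final double[][] acc;

    TransposeAndDotSketch(int xLength, int yLength) {
        this.acc = new double[xLength][yLength];
    }

    /** Folds one row in by adding the outer product of x and y. */
    void iterate(double[] x, double[] y) {
        for (int i = 0; i < acc.length; i++) {
            for (int j = 0; j < acc[i].length; j++) {
                acc[i][j] += x[i] * y[j];
            }
        }
    }

    /** Combines a partial result computed elsewhere; partial sums just add up. */
    void merge(TransposeAndDotSketch other) {
        for (int i = 0; i < acc.length; i++) {
            for (int j = 0; j < acc[i].length; j++) {
                acc[i][j] += other.acc[i][j];
            }
        }
    }

    /** Returns the accumulated matrix X^T * Y. */
    double[][] terminate() {
        return acc;
    }
}
```

No row has to be kept after it has been folded in, which is what makes the operation fit the aggregate-function model.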


• [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
  • Calculates the Chi-square value and the p-value
    • [Formula: Chi-square value from the observed and expected values]
    • The p-value is calculated from that value and the Chi-square distribution
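As an illustration of the first half of this step, the sketch below computes one Chi-square value per feature with the standard Pearson statistic, the sum over classes of (observed - expected)^2 / expected. It assumes the matrices are laid out as [class][feature]; the class name Chi2Sketch and the method chiSquare are hypothetical and this is not Hivemall's code. The p-value is left out because it needs the Chi-square distribution function, which the Java standard library does not provide.

```java
/** Sketch of the per-feature Pearson Chi-square statistic (hypothetical names). */
final class Chi2Sketch {

    /**
     * observed and expected are indexed as [class][feature] (an assumption of this sketch).
     * Returns one Chi-square value per feature.
     */
    static double[] chiSquare(double[][] observed, double[][] expected) {
        int numClasses = observed.length;
        int numFeatures = observed[0].length;
        double[] chi2 = new double[numFeatures];
        for (int c = 0; c < numClasses; c++) {
            for (int f = 0; f < numFeatures; f++) {
                double diff = observed[c][f] - expected[c][f];
                // expected values are assumed to be non-zero
                chi2[f] += diff * diff / expected[c][f];
            }
        }
        return chi2;
    }
}
```

A feature whose observed values match the independence hypothesis exactly gets a value of 0; the larger the value, the more the feature depends on the class.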


• [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>
  • Selects the top-k elements of X according to importance_list (the slide's figure uses k = 2)
  • Generic UDF

NOTE: The current implementation expects importance_list and k to be the same for every row
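A minimal plain-Java sketch of top-k selection follows. It is not Hivemall's code: the names SelectKBestSketch and selectKBest are hypothetical, the importance values are taken as doubles for convenience, and returning the chosen elements in their original feature order is an assumption of this sketch.

```java
import java.util.Arrays;

/** Sketch of selecting the k most important elements (hypothetical names). */
final class SelectKBestSketch {

    /** Returns the k elements of x with the highest importance, in original order. */
    static double[] selectKBest(double[] x, double[] importance, int k) {
        if (x.length != importance.length) {
            throw new IllegalArgumentException("x and importance must have the same length");
        }
        int size = Math.min(Math.max(k, 0), x.length);

        // Sort feature indices by descending importance
        Integer[] order = new Integer[x.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(importance[b], importance[a]));

        // Keep the first k indices, then restore the original feature order
        int[] top = new int[size];
        for (int i = 0; i < size; i++) {
            top[i] = order[i];
        }
        Arrays.sort(top);

        double[] selected = new double[size];
        for (int i = 0; i < size; i++) {
            selected[i] = x[top[i]];
        }
        return selected;
    }
}
```

Because the same importance list is applied to every row, every row keeps the same feature positions, which matches the note above.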


• Feature selection using SNR
  • Aggregate the mean and variance of each feature for each class
  • On termination, calculate the Signal Noise Ratio between every combination of classes, for each feature
  • Sum up the Signal Noise Ratios for each feature


• How does it work on Hivemall?
  • [UDAF] snr(X::array<number>, label::array<int>)::array<double>
    • Aggregates the variance by Chan’s method
    • Calculates the Signal Noise Ratios and sums them up for each feature
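The aggregation can be sketched in plain Java for a single feature. Everything here is an illustration under stated assumptions, not Hivemall's code: the names SnrSketch and Stats are hypothetical, the per-pair ratio is taken as |mean_i - mean_j| / (std_i + std_j), and the population standard deviation is used. The merge step follows the pairwise update of Chan et al., which lets partial statistics from different workers be combined without revisiting the data.

```java
/** Sketch of SNR aggregation for one feature (hypothetical names, not Hivemall code). */
final class SnrSketch {

    /** Running count, mean and sum of squared deviations for one class. */
    static final class Stats {
        long n;
        double mean;
        double m2;

        /** Adds a single value (Welford's update). */
        void add(double value) {
            n++;
            double delta = value - mean;
            mean += delta / n;
            m2 += delta * (value - mean);
        }

        /** Merges another partial result (pairwise update of Chan et al.). */
        void merge(Stats other) {
            if (other.n == 0) {
                return;
            }
            long total = n + other.n;
            double delta = other.mean - mean;
            mean += delta * other.n / total;
            m2 += other.m2 + delta * delta * n * other.n / total;
            n = total;
        }

        /** Population standard deviation (an assumption of this sketch). */
        double std() {
            return Math.sqrt(m2 / n);
        }
    }

    /** Sums |mean_i - mean_j| / (std_i + std_j) over every pair of classes. */
    static double snr(Stats[] perClass) {
        double sum = 0.0;
        for (int i = 0; i < perClass.length; i++) {
            for (int j = i + 1; j < perClass.length; j++) {
                double noise = perClass[i].std() + perClass[j].std();
                // zero noise is not handled in this sketch
                sum += Math.abs(perClass[i].mean - perClass[j].mean) / noise;
            }
        }
        return sum;
    }
}
```

A full version would keep one Stats object per feature and per class, and return one summed ratio per feature.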


• More details?
  • https://github.com/myui/hivemall/issues/338
  • https://github.com/myui/hivemall/pull/352



Add new features - spark integration

• Integrated feature selection into the Spark module

• Improved the build flow to resolve the binary incompatibility between spark-1.6 and spark-2.0

Thank you for listening!

Any questions?
