Internship final report @Treasure Data Inc. (2016 8/1-9/30) ITO Ryuichi

Jan 23, 2017

Page 1: Internship final report@Treasure Data Inc.

Internship final report @Treasure Data Inc. (2016 8/1-9/30)

ITO Ryuichi

Page 2: Internship final report@Treasure Data Inc.

Outline

• Who am I?

• What did I do?
  • About Hivemall
  • Benchmark
  • Add several new features

Page 4: Internship final report@Treasure Data Inc.

Who am I?

• ITO Ryuichi (@amaya382)

• Graduate School of Information Science and Technology, Osaka University (2016-)

• Accelerating a graph processing engine: concurrency control, hardware-aware optimization

• (a little) Natural language processing: conversation system with context consistency

❤ Scala, C#

Page 6: Internship final report@Treasure Data Inc.

What did I do?

• About Hivemall
• Benchmark
• Add several new features

Page 7: Internship final report@Treasure Data Inc.

About Hivemall

• A scalable machine learning library running on Apache Hive (+ Spark, Pig)
  • Developed by @myui and others as OSS
  • Joined the Apache Incubator 🎉
• Many features are available via HQL (Hive Query Language, similar to SQL)
  • Classification
    • Perceptron, AdaGradRDA, Soft Confidence Weighted, etc.
  • Recommendation
    • Matrix Factorisation, Factorisation Machine, etc.
  • Utilities
    • Feature engineering, additional array operations, etc.
  • etc.

Page 11: Internship final report@Treasure Data Inc.

About Hivemall(cont.)

• How does Hivemall work on Hive?
  • Hivemall is a set of UDFs (User-Defined Functions)
    • UDF: projection, one entry -> one entry
    • UDTF (table-generating): some entries -> some entries
    • UDAF (aggregate): all entries -> one entry
  • Features are defined as UDFs by implementing the Java interfaces provided by Hive
  • Loading the Hivemall jar file makes the extra functions available in HQL
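The three function shapes can be illustrated with a minimal sketch in plain Java. It deliberately does not use Hive's UDF interfaces; the class and method names (`UdfShapesSketch`, `sigmoid`, `splitFeatures`, `average`) are hypothetical and only mirror the input/output cardinality of each kind.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: plain Java methods, not Hive's UDF API.
class UdfShapesSketch {

    // UDF shape: one entry in, one entry out (a projection).
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // UDTF shape: an entry in, any number of entries out.
    // Here a comma-separated feature string becomes one row per feature.
    static List<String> splitFeatures(String row) {
        List<String> out = new ArrayList<>();
        if (row.isEmpty()) {
            return out; // no features, no output rows
        }
        out.addAll(Arrays.asList(row.split(",")));
        return out;
    }

    // UDAF shape: all entries in, one entry out.
    static double average(List<Double> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("cannot average an empty group");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        System.out.println(sigmoid(0.0));
        System.out.println(splitFeatures("height:1.7,weight:60"));
        System.out.println(average(Arrays.asList(1.0, 2.0, 3.0)));
    }
}
```

In Hive itself, each shape corresponds to a different interface to implement, and the query engine decides when the methods are called.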

Page 12: Internship final report@Treasure Data Inc.

About Hivemall(cont.)

• Example: Training by logistic regression

• Only HQL is needed; no programming expertise is required. (And HQL on Hive is already close to the data!)

CREATE TABLE model AS
SELECT feature, AVG(weight) AS weight
FROM (
  SELECT logress(features, label, ...) AS (feature, weight)
  FROM train_data
) t
GROUP BY feature
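The outer part of the query, `AVG(weight) ... GROUP BY feature`, combines the (feature, weight) rows emitted by the inner training step into one weight per feature. A minimal sketch of that averaging step in plain Java follows, assuming the rows have already been produced; `ModelAveragingSketch`, `WeightRow` and `averageByFeature` are hypothetical names, and nothing here reflects how logress() is implemented internally.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "SELECT feature, AVG(weight) ... GROUP BY feature".
class ModelAveragingSketch {

    /** One (feature, weight) row, as emitted by a training task. */
    static final class WeightRow {
        final String feature;
        final double weight;

        WeightRow(String feature, double weight) {
            this.feature = feature;
            this.weight = weight;
        }
    }

    /** Groups rows by feature and averages the weights of each group. */
    static Map<String, Double> averageByFeature(List<WeightRow> rows) {
        // feature -> {sum of weights, number of rows}
        Map<String, double[]> acc = new LinkedHashMap<>();
        for (WeightRow r : rows) {
            double[] a = acc.computeIfAbsent(r.feature, k -> new double[2]);
            a[0] += r.weight;
            a[1] += 1.0;
        }
        Map<String, Double> model = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : acc.entrySet()) {
            model.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return model;
    }

    public static void main(String[] args) {
        List<WeightRow> rows = Arrays.asList(
            new WeightRow("age", 1.0),   // weight from one task
            new WeightRow("age", 3.0),   // weight from another task
            new WeightRow("income", 4.0));
        System.out.println(averageByFeature(rows));
    }
}
```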

Page 15: Internship final report@Treasure Data Inc.

Benchmark

• Based on benchm-ml (https://github.com/szilard/benchm-ml)
  • Several pre-defined test cases with prepared data sets
    1. Logistic Regression (tried)
    2. Random Forest (tried)
    3. Boosting
    4. Deep Learning
  • Several hyperparameter settings
  • Already tested with several tools (e.g. R, Python scikit-learn, Spark)

NOTE: A common environment is used in most cases, but some cases use different environments. For more details, see the benchm-ml project.

Page 16: Internship final report@Treasure Data Inc.

Benchmark(cont.)

• Environment
  • Amazon Web Services
    • EMR (Elastic MapReduce)
    • m3.xlarge*3 + c3.xlarge*3
    • Hadoop: Amazon 2.7.2
    • Tez: 0.8.4
    • Hive: 2.1.0
    • Hivemall: 0.4.2-RC2
  • Misc.
    • Basically six-way parallel processing, matching the number of instances

Page 26: Internship final report@Treasure Data Inc.

Benchmark - Logistic Regression

• Using logress() on Hivemall
  • Hivemall is relatively slow and has a lower AUC (NOTE: Hivemall's logress uses SGD)
  • But its scalability can be confirmed
  • High initial overhead caused by Hive

Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/1-linear/6-hivemall

[Figure: results table, Time[sec] / AUC[%]. Annotations on the table: ✖, 1.3x, 4.9x, 10x, 12.5x, 3.9x, ±0, +0.4]
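As a reminder of what the AUC column measures, here is a minimal sketch in plain Java that computes AUC as the fraction of (positive, negative) pairs ranked in the right order. It is not the evaluation code used in benchm-ml or Hivemall; `AucSketch` and `auc` are hypothetical names, and the quadratic pairwise loop is chosen for clarity, not speed.

```java
// Hypothetical illustration: pairwise definition of AUC, O(n^2) for clarity.
class AucSketch {

    /**
     * Returns the fraction of (positive, negative) pairs in which the positive
     * example gets the higher score; ties count as half.
     * labels[i] is 1 for a positive example and 0 for a negative one.
     */
    static double auc(double[] scores, int[] labels) {
        double concordant = 0.0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (labels[i] != 1) {
                continue; // i must be a positive example
            }
            for (int j = 0; j < scores.length; j++) {
                if (labels[j] != 0) {
                    continue; // j must be a negative example
                }
                pairs++;
                if (scores[i] > scores[j]) {
                    concordant += 1.0;
                } else if (scores[i] == scores[j]) {
                    concordant += 0.5;
                }
            }
        }
        if (pairs == 0) {
            throw new IllegalArgumentException("need at least one positive and one negative example");
        }
        return concordant / pairs;
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.4, 0.35, 0.8};
        int[] labels = {0, 0, 1, 1};
        // 3 of the 4 (positive, negative) pairs are ordered correctly.
        System.out.println(auc(scores, labels)); // prints 0.75
    }
}
```

Multiplying the result by 100 gives the percentage form used in the results table.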

Page 29: Internship final report@Treasure Data Inc.

Benchmark - Random Forest (1)

• Using train_randomforest_classifier() on Hivemall
  • (1) Conditions: 500 trees, three variables
  • Hivemall performs reasonably well up to 0.1M, but cannot process 1M
  • The environment and parameters need tuning

Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/2-rf/7-hivemall

[Figure: results table, Time[sec] / AUC[%]. Annotation on the table: "Amazing…"]

Page 31: Internship final report@Treasure Data Inc.

Benchmark - Random Forest (2)

• Using train_randomforest_classifier() on Hivemall
  • (2) Conditions: 100 trees, max depth 20
  • Hivemall performs well up to 1M, but cannot process 10M
  • The environment and parameters need tuning

Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/z-other-tools/10-hivemall

[Figure: results table, Time[sec] / AUC[%]]

Page 33: Internship final report@Treasure Data Inc.

What did I do?

• About Hivemall
• Benchmark
• Add several new features (main topic!)

Page 34: Internship final report@Treasure Data Inc.

Add several new features

• systemtest module

• Feature binning

• Feature selection

• Some Spark integrations

Page 35: Internship final report@Treasure Data Inc.

Add new features - systemtest

• What is systemtest?
  • A testing framework for UDFs
    • Can also be applied to other applications based on UDFs
• Tests already exist, don't they? Why is it needed?
  • Yes, but the existing tests...
    • Do not actually run on Hive; they run only as Java programs
    • Make it difficult to write comprehensive tests
      • e.g. a UDAF follows several workflows depending on the kind of function, the data set and the environment
    • Make it difficult to use existing resources
    • Have low extensibility, etc.

Page 40: Internship final report@Treasure Data Inc.

Add new features - systemtest

• Example: a part of an existing test (a lot is omitted)

final SignalNoiseRatioUDAF snr = new SignalNoiseRatioUDAF();
final ObjectInspector[] OIs = new ObjectInspector[] {
    ObjectInspectorFactory.getStandardListObjectInspector(
        PrimitiveObjectInspectorFactory.writableDoubleObjectInspector),
    ObjectInspectorFactory.getStandardListObjectInspector(
        PrimitiveObjectInspectorFactory.writableIntObjectInspector)};
final SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator evaluator =
    (SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator) snr.getEvaluator(
        new SimpleGenericUDAFParameterInfo(OIs, false, false));
evaluator.init(GenericUDAFEvaluator.Mode.PARTIAL1, OIs);
final SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator.SignalNoiseRatioAggregationBuffer agg =
    (SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator.SignalNoiseRatioAggregationBuffer)
        evaluator.getNewAggregationBuffer();
evaluator.reset(agg);
...
for (int i = 0; i < features.length; i++) {
    final List<IntWritable> labelList = new ArrayList<IntWritable>();
    for (int label : labels[i]) {
        labelList.add(new IntWritable(label));
    }
    evaluator.iterate(agg, new Object[] {WritableUtils.toWritableList(features[i]), labelList});
}
final List<DoubleWritable> resultObj = (List<DoubleWritable>) evaluator.terminate(agg);
...
Assert.assertArrayEquals(answer, result, 1e-5);

• Long, boilerplate initialization
• Many needless conversions
• And it does not run on Hive; it is only a logic test!

Page 41: Internship final report@Treasure Data Inc.

Add new features - systemtest

• Solution
  • A new module based on JUnit, HiveRunner and td-client-java
• What can it do?
  • Short and unified initialization
  • Write and combine HQL
  • Run on local Hive and on remote Treasure Data with the same code
  • The testbed is prepared and cleaned up automatically
  • Easy use of external resources, e.g. TSV files
  • Tests are defined as literals (HQL), but can still be run with a debugger
  • Useful DSL

Page 42: Internship final report@Treasure Data Inc.

Add new features - systemtest (1-7)

• How does it work?

[Diagram labels: User, Test code, SystemTestTeam, SystemTestRunner (interface), HiveSystemTestRunner and TDSystemTestRunner (implementations), HiveRunner, Treasure Data]

1. Write tests based on the SystemTestRunner interface
2. Read the initialization and execute it via the implementations of SystemTestRunner (works based on JUnit @ClassRule)
   • Prepare a database dedicated to each test class
   • Use external resources as needed
3. Execute the first test (works based on JUnit @Rule)
   • Run it as HQL and check the return values
   • Rewrite the DSL and HQL for each environment
4. Reset the testbeds (works based on JUnit @Rule)
   • Drop temporary tables
5. Execute the second test (works based on JUnit @Rule)
6. Reset the testbeds
   • ...repeat for all tests
7. Finalize the tests (works based on JUnit @ClassRule)
   • Drop the temporary database and disconnect
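The order of these steps can be sketched without JUnit as a plain Java simulation: class-level setup once, then run and reset for every test, then class-level teardown once. This is only an illustration of the ordering; `LifecycleSketch`, `FakeRunner` and their methods are hypothetical stand-ins and are not part of the systemtest module or of JUnit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of the lifecycle order; not the systemtest module itself.
class LifecycleSketch {

    /** Stand-in for a runner; it only records what it is asked to do. */
    static final class FakeRunner {
        final List<String> log = new ArrayList<>();

        void initDatabase() { log.add("init database"); }      // class-level setup
        void runQuery(String name) { log.add("run " + name); } // one test
        void resetTestbed() { log.add("reset testbed"); }      // after each test
        void dropDatabase() { log.add("drop database"); }      // class-level teardown
    }

    /** Runs all tests in order and returns the recorded sequence of actions. */
    static List<String> runAll(List<String> tests) {
        FakeRunner runner = new FakeRunner();
        runner.initDatabase();
        try {
            for (String test : tests) {
                try {
                    runner.runQuery(test);
                } finally {
                    runner.resetTestbed(); // reset even if the test fails
                }
            }
        } finally {
            runner.dropDatabase(); // always clean up the temporary database
        }
        return runner.log;
    }

    public static void main(String[] args) {
        for (String step : runAll(Arrays.asList("snr", "chi2"))) {
            System.out.println(step);
        }
    }
}
```

In the real module the same ordering comes from JUnit: the class-level steps are attached through @ClassRule and the per-test steps through @Rule.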

Page 53: Internship final report@Treasure Data Inc.

Add new features - systemtest

• Example: initialization (nothing omitted!)

private static SystemTestCommonInfo ci = new SystemTestCommonInfo(HogeTest.class);

// Common initialization with external data
private static HQBase createIrisTable = HQ.uploadByResourcePathAsNewTable(
    "iris0", ci.initDir + "iris0.csv",
    new LinkedHashMap<String, String>() {{
        put("a", "double");
        put("b", "double");
        put("c", "double");
        put("d", "double");
        put("c0", "int");
        put("c1", "int");
        put("c2", "int");
    }});

// Testbed-specific initialization
@ClassRule
public static HiveSystemTestRunner hRunner = new HiveSystemTestRunner(ci) {{
    initBy(createIrisTable);
    initBy(HQ.fromStatements(""
        + "CREATE TEMPORARY FUNCTION transpose_and_dot as 'hivemall.tools.matrix.TransposeAndDotUDAF';"
        + "CREATE TEMPORARY FUNCTION array_sum as 'hivemall.tools.array.ArraySumUDAF';"
        + "CREATE TEMPORARY FUNCTION array_avg as 'hivemall.tools.array.ArrayAvgGenericUDAF';"
        + "CREATE TEMPORARY FUNCTION chi2 as 'hivemall.ftvec.selection.ChiSquareUDF';"
        + "CREATE TEMPORARY FUNCTION snr as 'hivemall.ftvec.selection.SignalNoiseRatioUDAF';"));
}};

@ClassRule
public static TDSystemTestRunner tRunner = new TDSystemTestRunner(ci) {{
    initBy(createIrisTable);
}};

// Set the common runner
@Rule
public SystemTestTeam team = new SystemTestTeam(hRunner);

Page 58: Internship final report@Treasure Data Inc.

Add new features - systemtest• Example: test cases(1)

@Testpublic void snr() throws Exception { team.set(HQ.fromStatement("" + "WITH iris AS (" + " SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0)" + "SELECT snr(X, Y)" + "FROM iris"), "$ANSWER"); team.run(); } @Testpublic void chi2() throws Exception { team.add(tRunner); team.set(HQ.fromStatement("" + "WITH iris AS (" + " SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0)," + "stats AS (" + " SELECT" + " transpose_and_dot(Y, X) AS observed," + " array_sum(X) AS feature_count," + " array_avg(Y) AS class_prob" + " FROM" + " iris)," + "test AS (" + " SELECT" + " transpose_and_dot(class_prob, feature_count) AS expected" + " FROM" + " stats)" + "SELECT" + " chi2(observed, expected) AS x " + "FROM" + " test JOIN stats"), "$ANSWER"); team.run(); }

no o

mis

sion

!

Execute tests on clean testbeds using database created by initExecute tests on clean testbeds using database created by init

Run on HiveRunnerRun on HiveRunner

Page 59: Internship final report@Treasure Data Inc.

Add new features - systemtest

• Example: test cases(1), annotated
  • No omission! The two tests are shown in full
  • Tests execute on clean testbeds, using the database created by init
  • snr(): runs on HiveRunner
  • chi2(): runs on HiveRunner and TreasureData

Page 60: Internship final report@Treasure Data Inc.

Add new features - systemtest

• Example: test cases(2)

@Test
public void someTest0() throws Exception {
    final String tableName = "color";
    team.initBy(HQ.uploadByResourcePathAsNewTable(
        tableName, ci.initDir + "color.tsv",
        new LinkedHashMap<String, String>() {{
            put("name", "string");
            put("red", "int");
            put("green", "int");
            put("blue", "int");
        }}));
    team.set(HQ.fromStatement(""
        + "SELECT CONCAT('rgb(', red, ',', green, ',', blue, ')') FROM "
        + tableName
        + " u LEFT JOIN color c on u.favorite_color = c.name"),
        "rgb(255,165,0)\trgb(255,192,203)");
    team.run();
}

@Test
public void someTest1() throws Exception {
    team.set(HQ.autoMatchingByFileName("hoge"), ci);
    team.run();
}

Page 63: Internship final report@Treasure Data Inc.

Add new features - systemtest

• Example: test cases(2), annotated
  • No omission!
  • someTest0(): test-specific initialization with team.initBy(...); it can also be chained
  • someTest1(): uses HQL and answers written in external files

Page 64: Internship final report@Treasure Data Inc.

Add new features - systemtest

• More details? • https://github.com/myui/hivemall/issues/323 • https://github.com/myui/hivemall/pull/336 • And systemtest/README.md

Page 65: Internship final report@Treasure Data Inc.

Add new features - feature binning

• What’s feature binning?
  • A method that divides quantitative variables into meaningful categorical variables

Page 66: Internship final report@Treasure Data Inc.

Add new features - feature binning

• How does it work?
  • [UDAF] build_bins(weight, num_of_bins[, auto_shrink])
  • [UDF] feature_binning(features, quantiles_map) / (weight, quantiles)

(Figure: build_bins, followed by feature_binning)

Page 67: Internship final report@Treasure Data Inc.

Add new features - feature binning

• [UDAF] build_bins(weight, num_of_bins[, auto_shrink])

• Uses percentiles internally, so that every bin covers an equal share of the data
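The percentile idea can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class `BuildBinsSketch`, its method name, the exact sort, and the linear interpolation between ranks are all hypothetical choices and are not taken from Hivemall's implementation. The -Infinity and +Infinity end points follow the feature_binning example later in this deck.

```java
import java.util.Arrays;

/** Hypothetical helper: exact, sort-based equal-frequency bin boundaries. */
final class BuildBinsSketch {

    /** Returns numOfBins + 1 boundaries, with -Infinity and +Infinity at the ends. */
    static double[] buildBins(double[] values, int numOfBins) {
        if (values.length == 0 || numOfBins < 2) {
            throw new IllegalArgumentException("need data and at least 2 bins");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] quantiles = new double[numOfBins + 1];
        quantiles[0] = Double.NEGATIVE_INFINITY;
        quantiles[numOfBins] = Double.POSITIVE_INFINITY;
        for (int i = 1; i < numOfBins; i++) {
            // Position of the (i / numOfBins) percentile in the sorted data.
            double pos = (double) i / numOfBins * (sorted.length - 1);
            int lo = (int) Math.floor(pos);
            int hi = (int) Math.ceil(pos);
            // Linear interpolation between the two nearest ranks (an assumption of this sketch).
            quantiles[i] = sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
        }
        return quantiles;
    }

    public static void main(String[] args) {
        double[] ages = {10, 20, 30, 40, 50, 60, 70, 80, 90};
        System.out.println(Arrays.toString(buildBins(ages, 3)));
    }
}
```

Sorting the whole column is fine for a sketch, but an aggregate function running on a cluster would more plausibly rely on an approximate, mergeable summary.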

Page 71: Internship final report@Treasure Data Inc.

Add new features - feature binning

• [UDAF] build_bins(weight, num_of_bins[, auto_shrink])

• What’s auto_shrink?
  • A small or skewed data set sometimes produces void (empty) bins
  • With auto_shrink, the number of bins is shrunk automatically; without it, an exception is thrown
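One way to picture this is as a post-processing step over the bin boundaries. The sketch below assumes that a void bin shows up as a repeated boundary value; `AutoShrinkSketch`, `shrinkOrFail`, and the exception type are hypothetical names chosen for illustration, not Hivemall's API.

```java
import java.util.Arrays;

/** Hypothetical sketch of the auto_shrink idea on an ascending boundary array. */
final class AutoShrinkSketch {

    /** Drops repeated boundaries when autoShrink is true; otherwise rejects them. */
    static double[] shrinkOrFail(double[] quantiles, boolean autoShrink) {
        double[] out = new double[quantiles.length];
        int n = 0;
        for (double q : quantiles) {
            if (n > 0 && out[n - 1] == q) {
                // Two equal boundaries enclose a void bin.
                if (!autoShrink) {
                    throw new IllegalStateException(
                        "repeated quantile " + q + ": data too small or skewed, or too many bins");
                }
                continue; // merge the void bin into its neighbour
            }
            out[n++] = q;
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        double[] q = {Double.NEGATIVE_INFINITY, 1.0, 1.0, 1.0, 2.0, Double.POSITIVE_INFINITY};
        System.out.println(Arrays.toString(shrinkOrFail(q, true)));
    }
}
```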

Page 76: Internship final report@Treasure Data Inc.

Add new features - feature binning

• [UDF] feature_binning(features, quantiles_map) / (weight, quantiles)
  • Distributes each variable into a bin according to its value
  • Example: Age:17
    • Bins: bin 0, bin 1, bin 2, ...
    • 17 is between -Infinity and 18.0, so it goes into bin 0
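The lookup in the Age:17 example can be sketched as a binary search over the boundary array. The class and method names are hypothetical, the boundary 30.0 is made up for the demonstration, and treating an exact boundary hit as belonging to the bin on its left is an assumption of this sketch rather than a statement about Hivemall.

```java
import java.util.Arrays;

/** Hypothetical sketch of the bin lookup; not Hivemall's actual code. */
final class FeatureBinningSketch {

    /**
     * Returns the bin index i with quantiles[i] < value <= quantiles[i + 1].
     * Assumes quantiles is strictly increasing and bounded by -Infinity and +Infinity.
     */
    static int findBin(double[] quantiles, double value) {
        if (quantiles.length < 3) {
            throw new IllegalArgumentException("need at least two bins");
        }
        if (Double.isNaN(value)) {
            throw new IllegalArgumentException("NaN cannot be binned");
        }
        int res = Arrays.binarySearch(quantiles, value);
        if (res < 0) {
            return ~res - 1; // ~res is the insertion point, so the bin starts one slot earlier
        }
        return Math.max(res - 1, 0); // an exact boundary hit belongs to the bin on its left
    }

    public static void main(String[] args) {
        // 18.0 comes from the slide; 30.0 is an invented second boundary.
        double[] quantiles = {Double.NEGATIVE_INFINITY, 18.0, 30.0, Double.POSITIVE_INFINITY};
        System.out.println("Age 17 -> bin " + findBin(quantiles, 17.0));
    }
}
```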

Page 77: Internship final report@Treasure Data Inc.

Add new features - feature binning

• More details? • https://github.com/myui/hivemall/issues/319 • https://github.com/myui/hivemall/pull/322

Page 78: Internship final report@Treasure Data Inc.

Add new features - feature selection

• What’s feature selection?
  • A generic term for methods that select meaningful features
  • Used as a preprocessing step for machine learning

• Why use it?
  • Improves results
  • Shortens learning time
  • Makes the set of features human-understandable

Page 80: Internship final report@Treasure Data Inc.

Add new features - feature selection

• Kinds of feature selection
  • Use variance
  • Use Chi-square value (implemented)
  • Use SNR (Signal-to-Noise Ratio) (implemented)
  • mRMR (minimum Redundancy Maximum Relevance)
  • etc.

Page 81: Internship final report@Treasure Data Inc.

Add new features - feature selection

• Feature selection using Chi-square value
  • Calculating the Chi-square value needs both observed values and expected values (= hypothesis)
  • Observed: the features aggregated for each class
  • Expected: values calculated under the assumption that features and classes are independent
  • Calculate the Chi-square value
  • Select the top-k features

Chi-square

Page 82: Internship final report@Treasure Data Inc.

Add new features - feature selection

• How does it work on Hivemall?
  • [UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
  • [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
  • [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>

Chi-square

Page 87: Internship final report@Treasure Data Inc.

Add new features - feature selection

• [UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
  • Utility for matrix calculation, generic UDF
  • Computes X^T Y
  • You might think matrix multiplication requires repeated passes over the data, but the UDAF calculates it incrementally, one row at a time

Chi-square
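The incremental calculation works because X^T Y equals the sum, over all rows, of the outer product of that row of X with the matching row of Y. The following is a minimal sketch of an aggregator built on that identity; the class name and the `iterate`/`merge`/`terminate` method names merely mimic the usual UDAF life cycle and are not Hivemall's code.

```java
/** Hypothetical aggregator: accumulates X^T Y one row pair at a time. */
final class TransposeAndDotSketch {
    private double[][] acc; // shape (X.#cols, Y.#cols), allocated on the first row

    /** Consumes one row of X and the matching row of Y. */
    void iterate(double[] xRow, double[] yRow) {
        if (acc == null) {
            acc = new double[xRow.length][yRow.length];
        }
        for (int i = 0; i < xRow.length; i++) {
            for (int j = 0; j < yRow.length; j++) {
                acc[i][j] += xRow[i] * yRow[j]; // outer product of the two rows
            }
        }
    }

    /** Combines a partial result computed elsewhere (for example, on another mapper). */
    void merge(TransposeAndDotSketch other) {
        if (other.acc == null) {
            return;
        }
        if (acc == null) {
            acc = new double[other.acc.length][other.acc[0].length];
        }
        for (int i = 0; i < acc.length; i++) {
            for (int j = 0; j < acc[i].length; j++) {
                acc[i][j] += other.acc[i][j];
            }
        }
    }

    /** Returns the accumulated matrix, or null if no row was seen. */
    double[][] terminate() {
        return acc;
    }

    public static void main(String[] args) {
        TransposeAndDotSketch agg = new TransposeAndDotSketch();
        agg.iterate(new double[] {1, 2}, new double[] {1, 0, 1});
        agg.iterate(new double[] {3, 4}, new double[] {0, 1, 1});
        System.out.println(java.util.Arrays.deepToString(agg.terminate()));
    }
}
```

Because partial sums can simply be added, each worker only ever holds one small matrix in memory, regardless of the number of rows.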

Page 88: Internship final report@Treasure Data Inc.

Add new features - feature selection

• [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
  • Calculates the Chi-square value and the p-value of each feature
  • The p-value is derived from the Chi-square value and the Chi-square distribution

Chi-square
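As a rough illustration of the statistic itself, the sketch below follows the query in the systemtest example: the expected matrix is the outer product of class_prob and feature_count, and each feature's Chi-square value is the sum over classes of (observed - expected)^2 / expected. The class and method names are hypothetical, and the p-value step is left out because the Java standard library has no Chi-square distribution.

```java
/** Hypothetical sketch of the Chi-square statistic per feature (no p-value). */
final class Chi2Sketch {

    /** expected[i][j] = classProb[i] * featureCount[j], i.e. independence of class and feature. */
    static double[][] expected(double[] classProb, double[] featureCount) {
        double[][] e = new double[classProb.length][featureCount.length];
        for (int i = 0; i < classProb.length; i++) {
            for (int j = 0; j < featureCount.length; j++) {
                e[i][j] = classProb[i] * featureCount[j];
            }
        }
        return e;
    }

    /** Both matrices have shape (classes, features); returns one value per feature. */
    static double[] chi2(double[][] observed, double[][] expected) {
        int nFeatures = observed[0].length;
        double[] result = new double[nFeatures];
        for (int i = 0; i < observed.length; i++) {
            for (int j = 0; j < nFeatures; j++) {
                double diff = observed[i][j] - expected[i][j];
                result[j] += diff * diff / expected[i][j];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] observed = {{10, 20}, {30, 20}};
        double[][] exp = expected(new double[] {0.5, 0.5}, new double[] {40, 40});
        System.out.println(java.util.Arrays.toString(chi2(observed, exp)));
    }
}
```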

Page 89: Internship final report@Treasure Data Inc.

Add new features - feature selection

• [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>
  • Selects the top-k elements of X according to importance_list
  • Generic UDF

NOTE: The current implementation expects importance_list and k to be the same for every row

(Figure: example with k = 2)

Chi-square
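A small sketch of the selection step follows. It is written under assumptions: the names are hypothetical, importances are taken as doubles so that Chi-square values can be passed in directly, and the selected elements are returned in their original column order, which is a choice of this sketch and not something the slides state.

```java
import java.util.Arrays;

/** Hypothetical sketch: keep the k elements of x with the highest importance. */
final class SelectKBestSketch {

    static double[] selectKBest(double[] x, double[] importance, int k) {
        if (x.length != importance.length || k < 1 || k > x.length) {
            throw new IllegalArgumentException("bad arguments");
        }
        Integer[] idx = new Integer[x.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        // Sort indices by descending importance; the sort is stable, so ties keep the earlier index.
        Arrays.sort(idx, (a, b) -> Double.compare(importance[b], importance[a]));
        Integer[] top = Arrays.copyOf(idx, k);
        Arrays.sort(top); // back to original column order (assumption of this sketch)
        double[] out = new double[k];
        for (int i = 0; i < k; i++) {
            out[i] = x[top[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] row = {1.0, 2.0, 3.0, 4.0};
        double[] importance = {0.1, 0.9, 0.3, 0.8};
        System.out.println(Arrays.toString(selectKBest(row, importance, 2)));
    }
}
```

Passing the same importance array and k for every row keeps the selected columns aligned across the whole table, which matches the NOTE on the slide.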

Page 91: Internship final report@Treasure Data Inc.

Add new features - feature selection

• Feature selection using SNR
  • Aggregate the mean and variance of each feature for each class
  • At termination, calculate the signal-to-noise ratio between every pair of classes, for each feature
  • Sum up the signal-to-noise ratios of each feature

Signal Noise Ratio

Page 92: Internship final report@Treasure Data Inc.

Add new features - feature selection

• How does it work on Hivemall?

• [UDAF] snr(X::array<number>, label::array<int>)::array<double>

Signal Noise Ratio

Page 93: Internship final report@Treasure Data Inc.

Add new features - feature selection

• [UDAF] snr(X::array<number>, label::array<int>)::array<double>
  • Aggregates the variance with Chan’s method
  • Calculates the signal-to-noise ratios and sums them up for each feature

Signal Noise Ratio
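The two steps can be sketched for a single feature as follows. The merge is the pairwise update for count, mean and sum of squared deviations due to Chan et al. The signal-to-noise formula |mean_i - mean_j| / (sd_i + sd_j), the use of the sample standard deviation, and skipping pairs with a zero denominator are assumptions of this sketch, since the slides do not spell them out; all names are hypothetical.

```java
/** Hypothetical sketch: per-class statistics of one feature, merged with Chan's method. */
final class SnrSketch {

    /** Running count, mean and sum of squared deviations (m2) for one class. */
    static final class Stats {
        long n;
        double mean;
        double m2;

        /** Adds one value (the single-element case of the merge below). */
        void add(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        /** Pairwise merge of two partial aggregates (Chan et al.). */
        void merge(Stats o) {
            if (o.n == 0) {
                return;
            }
            if (n == 0) {
                n = o.n;
                mean = o.mean;
                m2 = o.m2;
                return;
            }
            long total = n + o.n;
            double delta = o.mean - mean;
            m2 += o.m2 + delta * delta * n * o.n / total;
            mean += delta * o.n / total;
            n = total;
        }

        /** Sample standard deviation; 0 when fewer than two values were seen. */
        double sd() {
            return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0;
        }
    }

    /** Sums |mean_i - mean_j| / (sd_i + sd_j) over all pairs of classes. */
    static double snr(Stats[] perClass) {
        double sum = 0.0;
        for (int i = 0; i < perClass.length; i++) {
            for (int j = i + 1; j < perClass.length; j++) {
                double denom = perClass[i].sd() + perClass[j].sd();
                if (denom > 0) { // pairs without any spread are skipped in this sketch
                    sum += Math.abs(perClass[i].mean - perClass[j].mean) / denom;
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Stats a = new Stats();
        Stats b = new Stats();
        for (double v : new double[] {1, 2, 3}) a.add(v);
        for (double v : new double[] {5, 6, 7}) b.add(v);
        System.out.println(snr(new Stats[] {a, b}));
    }
}
```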

Page 94: Internship final report@Treasure Data Inc.

Add new features - feature selection

• More details? • https://github.com/myui/hivemall/issues/338 • https://github.com/myui/hivemall/pull/352

Page 95: Internship final report@Treasure Data Inc.

Add new features - spark integration

• Integrated feature selection into spark module

• Improved the build flow to resolve the binary incompatibility between spark-1.6 and spark-2.0

Page 97: Internship final report@Treasure Data Inc.

Thank you for listening!

Any questions?