Introduction to Machine Learning for Oracle Database Professionals

Post on 25-May-2015

DESCRIPTION

Basic Machine Learning introduction for Oracle folks.

Transcript

Practical Machine Learning for DBAs

Alex Gorbachev

Las Vegas, NV

April 2014

Alex Gorbachev

• Chief Technology Officer at Pythian
• Blogger
• Cloudera Champion of Big Data
• OakTable Network member
• Oracle ACE Director
• Founder of BattleAgainstAnyGuess.com
• Founder of Sydney Oracle Meetup
• IOUG Director of Communities
• EVP, Ottawa Oracle User Group

Agenda

• What’s Machine Learning
  – Typical Machine Learning applications

• Why use Oracle Database for Machine Learning

• Practical examples
  – Classifying PL/SQL code
  – Classifying database schemas into good and bad
  – SQL statements clustering
  – Detecting anomalies in database workload

What is Machine Learning?

• data magic
• scientific data analysis
• modern practical AI
• building simplified models of the universe
• using probabilistic models

Tom Mitchell’s definition

• Machine Learning is the study of computer algorithms that improve automatically through experience.

• A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Why is it useful?

Classes of ML algorithms

• Supervised learning
  – Input: data + known facts; Output: predictions

• Unsupervised learning
  – Input: data; Output: hypothesis

• Other, less common classes of algorithms, such as reinforcement learning and recommenders

Supervised Learning: Linear Regression

Supervised Learning: Classification

Unsupervised Learning: Clustering

Unsupervised Learning: Anomaly Detection

Machine Learning workflow

• Gather

• Clean & transform

• Explore

• Model

• Interpret

• Produce value

} today’s focus

Why Machine Learning in Oracle Database?

Machine Learning in Oracle DB?

• That’s where the data is

• Data in an RDBMS is often clean

• Easy to transform data with SQL

• Powerful algorithms implemented
  – Oracle Data Mining option
  – Analytic SQL

Machine Learning by example

Applying Machine Learning to the business of DBAs

Problem: Detect bad PL/SQL

• Goal: automated PL/SQL code grading
  – Classify as Good or Bad

• Typical classification task
  – Assignment of labels to the set of unlabeled items based on prior observations

Classification process

• Parse input data

• Extract features
  – Manually or automatically, or they are already clearly defined (if a row is an item, its columns may be the features)

• Train – calculate model based on labeled input

• Verify – test model on labeled input

• Apply labels to unlabeled input

• Classification is supervised learning

Features definition - easy task?

Kittens vs …

Kittens vs Puppies

PL/SQL code features

• Automatically extract words from the text as features (tokenize)
  – EASY TO AUTOMATE

• Assign features intelligently
  – Code size
  – Author
  – Percent of comment lines
  – Presence of specific code patterns
  – DIFFICULT TO AUTOMATE
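As a rough illustration of the hand-assigned features listed above, the query below derives three of them with plain SQL. It is only a sketch: the table name plsql_source is hypothetical (the slides do not name the source table), and the regular expressions are simple approximations that ignore block comments and other edge cases.

```sql
-- plsql_source (OBJECT_ID, CODE) is a hypothetical table; patterns are illustrative only.
SELECT object_id,
       -- Code size in characters
       LENGTH(code) AS code_size,
       -- Lines that start with "--", as a percentage of all lines
       ROUND(100 * REGEXP_COUNT(code, '^\s*--', 1, 'm')
                 / (REGEXP_COUNT(code, CHR(10)) + 1), 1) AS pct_comment_lines,
       -- 1 if the code contains WHEN OTHERS THEN NULL, 0 otherwise
       CASE WHEN REGEXP_LIKE(code, 'when\s+others\s+then\s+null', 'i')
            THEN 1 ELSE 0 END AS has_when_others_null
FROM   plsql_source;
```

Each derived column could then be fed to a model as an ordinary numeric attribute alongside, or instead of, the tokenized text.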

Classification model workflow

1. Create Oracle Text policy (define lexer)

2. Configure and build the model on training set

3. Apply model to the testing set

4. Assess model performance

5. Adjust model settings/features/size and repeat

Basic probability lesson

• p(A) is the probability that A is true

[Figure: a region of total area 1, divided into "A is true" and "A is false"]

• Axioms of Probability

• Bayes Law
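For reference, the axioms and Bayes' law can be written as follows. This is the textbook form commonly used in introductory machine learning material; the notation is an assumption, since this transcript does not preserve the slide's own formulas.

```latex
\begin{align*}
& 0 \le p(A) \le 1 \\
& p(\text{True}) = 1, \qquad p(\text{False}) = 0 \\
& p(A \lor B) = p(A) + p(B) - p(A \land B) \\[1ex]
& p(B \mid A) = \frac{p(A \mid B)\, p(B)}{p(A)}, \qquad p(A) > 0
\end{align*}
```

The last line is Bayes' law: it turns the probability of seeing A given B into the probability of B given that A was observed.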

How Bayes Law can work for us?

• A – presence of a feature like WHEN OTHERS THEN NULL in PL/SQL

• B – bad PL/SQL code

[Figure: regions A and B overlapping inside a total area of 1; the overlap illustrates B|A]
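A worked example may make this concrete. All numbers below are invented purely for illustration and do not come from the talk: suppose 30% of the code is bad, the feature appears in 50% of bad code, and it appears in 5% of good code.

```latex
\begin{align*}
p(B) &= 0.3, \qquad p(A \mid B) = 0.5, \qquad p(A \mid \lnot B) = 0.05 \\
p(A) &= p(A \mid B)\, p(B) + p(A \mid \lnot B)\, p(\lnot B)
      = 0.5 \cdot 0.3 + 0.05 \cdot 0.7 = 0.185 \\
p(B \mid A) &= \frac{p(A \mid B)\, p(B)}{p(A)}
      = \frac{0.15}{0.185} \approx 0.81
\end{align*}
```

Under these assumed numbers, seeing WHEN OTHERS THEN NULL raises the probability that the code is bad from 30% to roughly 81%. A Naive Bayes classifier combines many such features, treating them as independent given the class.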

PL/SQL data source

• OBJECT_ID – case ID

• CODE – text column

• TARGET_VALUE – 0 is good and 1 is bad

• Training set – where mod(object_id, 10) < 5

• Testing set – where mod(object_id, 10) >= 5
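A minimal sketch of this split, assuming the labeled code sits in a table called plsql_source (a hypothetical name; the slides do not say where the data comes from). The target names plsql_build and plsql_test match the tables used by the model build and test statements in this deck.

```sql
-- plsql_source is a hypothetical table holding OBJECT_ID, CODE and TARGET_VALUE.

-- Training set: rows where MOD(object_id, 10) is 0 to 4
CREATE TABLE plsql_build AS
SELECT object_id, code, target_value
FROM   plsql_source
WHERE  MOD(object_id, 10) < 5;

-- Testing set: rows where MOD(object_id, 10) is 5 to 9
CREATE TABLE plsql_test AS
SELECT object_id, code, target_value
FROM   plsql_source
WHERE  MOD(object_id, 10) >= 5;
```

Splitting on MOD(object_id, 10) is deterministic and repeatable, and gives roughly equal halves as long as object IDs are spread evenly.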

Oracle Text policy

begin
  -- Drop the policy and the lexer preference if they already exist
  begin
    ctx_ddl.drop_policy('plsql_nb_policy');
  exception
    when others then null;
  end;
  begin
    ctx_ddl.drop_preference('plsql_nb_lexer');
  exception
    when others then null;
  end;
  -- Define a basic lexer and a policy that uses it to tokenize the code
  ctx_ddl.create_preference('plsql_nb_lexer', 'BASIC_LEXER');
  ctx_ddl.create_policy('plsql_nb_policy', lexer => 'plsql_nb_lexer');
end;
/

Model settings

CREATE TABLE plsql_nb_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000));

BEGIN
  -- Populate settings table
  INSERT INTO plsql_nb_settings VALUES
    (dbms_data_mining.algo_name, dbms_data_mining.algo_naive_bayes);
  INSERT INTO plsql_nb_settings VALUES
    (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
  INSERT INTO plsql_nb_settings VALUES
    (dbms_data_mining.odms_text_policy_name, 'plsql_nb_policy');
  -- INSERT INTO plsql_nb_settings VALUES
  --   (dbms_data_mining.NABS_PAIRWISE_THRESHOLD, 0.01);
  -- INSERT INTO plsql_nb_settings VALUES
  --   (dbms_data_mining.NABS_SINGLETON_THRESHOLD, 0.01);
  COMMIT;
END;
/

Build model

DECLARE
  xformlist dbms_data_mining_transform.TRANSFORM_LIST;
BEGIN
  -- Drop the model if it already exists
  BEGIN
    DBMS_DATA_MINING.DROP_MODEL('PLSQL_NB');
  EXCEPTION
    WHEN OTHERS THEN NULL;
  END;

  -- Treat the CODE column as text to be tokenized
  dbms_data_mining_transform.SET_TRANSFORM(
    xformlist, 'code', null, 'code', null, 'TEXT(TOKEN_TYPE:NORMAL)');

  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'PLSQL_NB',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'plsql_build',
    case_id_column_name => 'object_id',
    target_column_name  => 'target_value',
    settings_table_name => 'plsql_nb_settings',
    xform_list          => xformlist);
END;
/

Test model

-- Confusion matrix: actual vs. predicted label counts on the testing set
SELECT target_value AS actual_target,
       PREDICTION(plsql_nb USING *) AS predicted_target,
       COUNT(*) AS cases_count
FROM   plsql_test
GROUP BY target_value, PREDICTION(plsql_nb USING *)
ORDER BY 1, 2;
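To assess model performance with a single number, the confusion matrix can be reduced to overall accuracy. The query below is a sketch that reuses the plsql_nb model and the plsql_test table from the slides; the column aliases are illustrative, and accuracy alone can mislead when good and bad code are very unevenly represented.

```sql
-- Share of test cases where the predicted label matches the known label
SELECT COUNT(*) AS cases_count,
       SUM(CASE WHEN target_value = PREDICTION(plsql_nb USING *)
                THEN 1 ELSE 0 END) AS correct_count,
       ROUND(AVG(CASE WHEN target_value = PREDICTION(plsql_nb USING *)
                      THEN 1 ELSE 0 END), 4) AS accuracy
FROM   plsql_test;
```

Rerunning this after each change to settings or features gives a quick way to compare iterations of the model.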

Demo

Skyline and Oculus by Etsy: black-box anomaly detection

Thanks and Q&A

Contact info

gorbachev@pythian.com

+1-877-PYTHIAN

To follow us

pythian.com/blog

@alexgorbachev @pythian

linkedin.com/company/pythian
