Predicting Unix Commands With Decision Tables and Decision Trees
Kathleen Durant
Third International Conference on Data Mining Methods and Databases
September 25, 2002, Bologna, Italy
Page 1:

Predicting Unix Commands With Decision Tables and Decision Trees

Kathleen Durant
Third International Conference on Data Mining Methods and Databases
September 25, 2002, Bologna, Italy

Page 2:

How Predictable Are a User’s Computer Interactions?

- Command sequences
- The time of day
- The type of computer you're using
- Clusters of command sequences
- Command typos

Page 3:

Characteristics of the Problem

- Time-sequenced problem with dependent variables
- Not a standard classification problem
- Predicting a nominal value rather than a Boolean value
- Concept shift

Page 4:

Dataset

- Davison and Hirsh, Rutgers University
- Collected history sessions of 77 different users for 2–6 months
- Three categories of users: professor, graduate, undergraduate
- Average number of commands per session: 2184
- Average number of distinct commands per session: 77

Page 5:

Rutgers Study

- 5 different algorithms implemented:
  - C4.5, a decision-tree learner
  - An omniscient predictor
  - The most recent command just issued
  - The most frequently used command of the training set
  - The longest matching prefix to the current command
- Most successful: C4.5, with a predictive accuracy of 38%

Page 6:

Typical History Session

961007 20:13:31 green-486 vs100 BLANK
961007 20:13:31 green-486 vs100 vi
961007 20:13:31 green-486 vs100 ls
961007 20:13:47 green-486 vs100 lpr
961007 20:13:57 green-486 vs100 vi
961007 20:14:10 green-486 vs100 make
961007 20:14:33 green-486 vs100 vis
961007 20:14:46 green-486 vs100 vi

Page 7:

WEKA System

Provides:
- Learning algorithms
- A simple format for importing data (the ARFF format)
- A graphical user interface

Page 8:

History Session in ARFF Format

@relation user10
@attribute ct-2 {BLANK,vi,ls,lpr,make,vis}
@attribute ct-1 {BLANK,vi,ls,lpr,make,vis}
@attribute ct0 {vi,ls,lpr,make,vis}
@data
BLANK,vi,ls
vi,ls,lpr
ls,lpr,make
lpr,make,vis
make,vis,vi
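The sliding-window encoding above is easy to reproduce. The sketch below is hand-rolled Python, not WEKA's own ARFF tooling; the relation name and the command history are taken from the example slides:

```python
# Turn a command history into ARFF rows where the two previous
# commands (ct-2, ct-1) are the attributes and ct0 is the class.
def history_to_arff(commands, relation="user10", window=2):
    ctx_vals = ",".join(sorted(set(commands)))          # values seen in context slots
    tgt_vals = ",".join(sorted(set(commands[window:]))) # values the class can take
    lines = [f"@relation {relation}"]
    lines += [f"@attribute ct-{i} {{{ctx_vals}}}" for i in range(window, 0, -1)]
    lines += [f"@attribute ct0 {{{tgt_vals}}}", "@data"]
    lines += [",".join(commands[i - window : i + 1])
              for i in range(window, len(commands))]
    return "\n".join(lines)

print(history_to_arff(["BLANK", "vi", "ls", "lpr", "make", "vis", "vi"]))
```

Note that the nominal value sets are derived from the history itself, which is why BLANK appears only in the context attributes.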

Page 9:

Learning Techniques

- Decision tree using the 2 previous commands as attributes
  - Minimize the size of the tree
  - Maximize information gain
- Boosted decision trees (AdaBoost)
- Decision table
  - Match determined by k nearest neighbors (IBk)
  - Match determined by majority
  - Verification by 10-fold cross-validation
  - Verification by splitting the data into training/test sets
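The "maximize information gain" split criterion can be sketched in a few lines of Python; the four (previous command, next command) examples below are invented for illustration:

```python
# Information gain of splitting next-command examples on a context
# attribute (here, the previous command ct-1).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr):
    """examples: list of (context_dict, next_command) pairs."""
    labels = [y for _, y in examples]
    by_value = {}
    for ctx, y in examples:
        by_value.setdefault(ctx[attr], []).append(y)
    remainder = sum(len(ys) / len(examples) * entropy(ys)
                    for ys in by_value.values())
    return entropy(labels) - remainder

examples = [
    ({"ct-1": "vi"}, "make"),
    ({"ct-1": "vi"}, "make"),
    ({"ct-1": "make"}, "vi"),
    ({"ct-1": "ls"}, "vi"),
]
gain = information_gain(examples, "ct-1")  # here the split removes all uncertainty
```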

Page 10:

Learning a Decision Tree

[Figure: a decision tree over command values (emacs, ls, pwd, more, make, pine, man, gcc, vi, dir). Internal nodes test the commands at time = -2 and time = -1; leaves give the predicted command at time = 0.]

Page 11:

Boosting a Decision Tree

[Figure: boosting builds a solution set from multiple decision trees.]

Page 12:

Learning a Decision Table

[Figure: example of learning a decision table, with matches determined by k-nearest neighbors (IBk).]
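One way to read the decision-table-with-IBk idea: look up the exact (ct-2, ct-1) context in the table, and fall back to the k nearest stored contexts when the context is unseen. The sketch below is my reconstruction of that idea, not WEKA's implementation:

```python
# Decision table keyed on (ct-2, ct-1) with an IBk-style fallback.
from collections import Counter

def build_table(history):
    """Map each (ct-2, ct-1) context to a count of next commands."""
    table = {}
    for i in range(2, len(history)):
        key = (history[i - 2], history[i - 1])
        table.setdefault(key, Counter())[history[i]] += 1
    return table

def predict(table, ctx, k=3):
    if ctx in table:                           # exact match in the table
        return table[ctx].most_common(1)[0][0]
    # Fallback: vote among the k stored contexts that agree with
    # ctx in the most positions.
    nearest = sorted(table, key=lambda key: -sum(a == b for a, b in zip(key, ctx)))
    votes = Counter()
    for key in nearest[:k]:
        votes.update(table[key])
    return votes.most_common(1)[0][0]

history = ["BLANK", "vi", "ls", "lpr", "vi", "make", "vis", "vi"]
table = build_table(history)
```

Because the table is just per-context counts, it can be updated incrementally as commands arrive, which is what makes it attractive for an online system.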

Page 13:

Prediction Metrics

- Macro-average: average predictive accuracy per person. What was the average predictive accuracy for the users in the study?
- Micro-average: average predictive accuracy over the commands in the study. What percentage of the commands in the study did we predict correctly?
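The distinction can be made concrete with a toy computation (the per-user counts are invented): the macro-average weighs every user equally, while the micro-average weighs every command equally, so heavy users dominate it:

```python
# Toy illustration of macro- vs micro-average predictive accuracy.
users = {
    "professor":     {"correct": 300, "total": 1000},
    "graduate":      {"correct": 450, "total": 1000},
    "undergraduate": {"correct": 100, "total": 200},
}

# Macro-average: mean of per-user accuracies (each user counts equally).
macro = sum(u["correct"] / u["total"] for u in users.values()) / len(users)

# Micro-average: pooled accuracy over all commands (busy users count more).
micro = (sum(u["correct"] for u in users.values())
         / sum(u["total"] for u in users.values()))
```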

Page 14:

Macro-average Results

[Bar chart: macro-average predictive accuracy, on a 35–45% scale, for the IBk decision table, the majority-match decision table, the percentage-split decision table, decision trees, and boosted decision trees.]

Page 15:

Micro-average Results

[Bar chart: micro-average predictive accuracy, on a 35–45% scale, for the IBk decision table, the majority-match decision table, the percentage-split decision table, decision trees, and boosted decision trees.]

Page 16:

Results: Decision Trees

Decision trees gave the expected results:
- A compute-intensive algorithm
- Predictability results are similar to those of simpler algorithms
- No interesting findings
- Duplicated the Rutgers study results

Page 17:

Results: AdaBoost

AdaBoost was very disappointing:
- Few or no boosting iterations were performed; only 12 decision trees were boosted
- Boosted trees' predictability increased by only 2.4% on average
- Correctly predicted 115 more commands than decision trees (out of 118,409 wrongly predicted commands)
- Very compute-intensive, with no substantial increase in predictability

Page 18:

Results: Decision Tables

The decision table gave satisfactory results:
- Good predictability results
- Relatively speedy
- Validation is done incrementally
- A potential candidate for an online system

Page 19:

Summary of Prediction Results

- The IBk decision table produced the highest micro-average
- Boosted decision trees produced the highest macro-average
- The difference was negligible: 1.37% for the micro-average, 2.21% for the macro-average

Page 20:

Findings

- IBk decision tables can be used in an online system
- Not a compute-intensive algorithm
- Predictability is as good as or better than decision trees
- Consistent results were achieved on fairly small log sessions (> 100 commands)
- No improvement in prediction for larger log sessions (> 1000 commands), due to concept shift

Page 21:

Summary of Benefits

- Automatic typo correction
- Savings in keystrokes averages 30%:
  - The average command length is 3.77 characters
  - A predicted command can be issued with 1 keystroke
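As a rough sanity check of the 30% figure (my reconstruction; the model and the ~41% accuracy value are assumptions, not stated on the slide): if a correct prediction turns an average 3.77-character command into a single keystroke, then at roughly 41% predictive accuracy the overall savings lands near 30%:

```python
# Reconstruction of the keystroke-savings figure. Only the 3.77-character
# average command length comes from the slide; the rest is assumed.
avg_len = 3.77                           # average command length (slide)
accuracy = 0.41                          # assumed predictive accuracy (~results charts)
saved_per_hit = (avg_len - 1) / avg_len  # a correct prediction costs 1 keystroke
overall_savings = accuracy * saved_per_hit
print(f"{overall_savings:.1%}")
```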

Page 22:

Questions

Page 23:

AdaBoost Description

The algorithm. Let D_t(i) denote the weight of example i in round t.

- Initialization: assign each example (x_i, y_i) ∈ E the weight D_1(i) := 1/n.
- For t = 1 to T:
  - Call the weak learning algorithm with example set E and weights given by D_t.
  - Get a weak hypothesis h_t : X → {-1, +1}.
  - Update the weights of all examples.
- Output the final hypothesis, generated from the hypotheses of rounds 1 to T.
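The loop above can be sketched in Python for binary labels in {-1, +1}. The decision-stump weak learner and all names here are my own illustration; the study itself boosted C4.5 decision trees in WEKA:

```python
# Minimal AdaBoost sketch for binary labels in {-1, +1}, using a
# threshold decision stump as the weak learner.
import math

def stump_learner(X, y, D):
    """Pick the stump (feature, threshold, polarity) minimizing weighted error."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for pol in (1, -1):
                preds = [pol if x[f] >= thr else -pol for x in X]
                err = sum(d for d, p, t in zip(D, preds, y) if p != t)
                if err < best_err:
                    best, best_err = (f, thr, pol), err
    return best, best_err

def adaboost(X, y, T=10):
    n = len(X)
    D = [1.0 / n] * n                       # D_1(i) := 1/n
    ensemble = []
    for _ in range(T):
        (f, thr, pol), err = stump_learner(X, y, D)
        if err >= 0.5:                      # weak learner no better than chance
            break
        err = max(err, 1e-12)               # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        preds = [pol if x[f] >= thr else -pol for x in X]
        # Update the weights: mistakes go up, correct examples go down.
        D = [d * math.exp(-alpha * p * t) for d, p, t in zip(D, preds, y)]
        Z = sum(D)
        D = [d / Z for d in D]
        ensemble.append((alpha, f, thr, pol))
    return ensemble

def predict(ensemble, x):
    """Final hypothesis: sign of the alpha-weighted vote of rounds 1..T."""
    score = sum(a * (p if x[f] >= thr else -p) for a, f, thr, p in ensemble)
    return 1 if score >= 0 else -1

X = [[0], [1], [2], [3]]
y = [-1, -1, 1, 1]
ensemble = adaboost(X, y, T=5)
```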

Page 24:

Complete Set of Results

[Bar chart: macro- and micro-average predictive accuracy, on a 35–43% scale, for the decision table using IBk, the decision table using majority match, the decision table using percentage split, decision trees, and AdaBoost.]

Page 25:

Learning a Decision Tree

[Figure: a decision tree whose root tests the command at time = t-2 (branches for ls and make), whose internal nodes test the command at time = t-1 (values such as make, dir, grep, emacs, ls, pwd), and whose leaves give the predicted command at time = t (e.g. grep).]