Page 1: Final pink panthers_03_30

The Titanic:

Machine Learning

from Disaster

Data Mining and Machine Learning. Winter 2014. Final Project

Jean Callao | Michelle Darling | Paul Marxhausen

Page 2: Final pink panthers_03_30

AGENDA

In depth analysis: by Jean Callao

• Logistic Regression: glm

• Tree-based methods: rpart, ctree

In depth analysis: by Paul Marxhausen

• Ensemble Methods: randomForest, cForest

Summary: by Michelle Darling

• Data Visualization

• Machine Learning Kaggle Results

Page 3: Final pink panthers_03_30

Titanic: Machine Learning from Disaster

Why we picked this project:

● Historical context to understand "What does the data mean?"

● Learn one data set well, and then apply different algorithms and modelling tools.

● Practice the steps of data analysis:

○ Data exploration and visualization.

○ Model selection, building and testing.

● Prize: $0 + "knowledge & confidence" to go on to more challenging data science problems.

kaggle.com provides:

Online data science competitions.

Structured problems, tutorials, help forums and discussion groups.

Easy, consistent way to test models and track results.

>>> Focus <<<

Page 4: Final pink panthers_03_30

April 1912

The Titanic Disaster

Page 5: Final pink panthers_03_30

RMS Titanic, April 1912

A priori knowledge from problem domain

What factors contributed to survival?

Gender, Age, Passenger Class, Fare, Family

More likely to survive

• Females

• Children, Adults<50

• 1st Class

• Paid higher fares

• Travelling with family

More likely to perish

• Males

• Adults >50

• 2nd, 3rd class

• Paid lower fares

• Travelling alone

• Immigrants

Page 6: Final pink panthers_03_30

Titanic Dataset: Predictor & Target Variables

RESPONSE VARIABLE
Survived     Survived (1 = Yes; 0 = No)

PREDICTOR VARIABLES    DESCRIPTION
Pclass       Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name         Passenger Name
Sex          Sex ("male", "female")
Age          Age (numeric; can be fractional, e.g., 1.5)
Fare         Passenger Fare
SibSp        Number of Siblings/Spouses Aboard
Parch        Number of Parents/Children Aboard
Ticket       Ticket Number
Cabin        Cabin
Embarked     Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Age, Fare, SibSp and Parch are QUANTITATIVE variables; the rest are QUALITATIVE.

Page 7: Final pink panthers_03_30

Feature Engineering

Data relating to one's location on the ship

data$cabin.last.digit <- str_sub(data$Cabin, -1)
data$Side <- "Unknown"
data$Side[which(isEven(data$cabin.last.digit))] <- "port"
data$Side[which(isOdd(data$cabin.last.digit))] <- "starboard"

Classifying Fares

combi$Fare2 <- '30+'
combi$Fare2[combi$Fare < 30 & combi$Fare >= 20] <- '20-30'
combi$Fare2[combi$Fare < 20 & combi$Fare >= 10] <- '10-20'
combi$Fare2[combi$Fare < 10] <- '<10'

Title - Extract from name to find wealthy passengers:

combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess')] <- 'Lady'
combi$Title[combi$Title %in% c('Capt', 'Col', 'Don', 'Dr', 'Jonkheer', 'Major', 'Rev', 'Sir')] <- 'Noble'
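The Title consolidation above assumes a Title column has already been pulled out of Name; that step is not shown on this slide. A minimal sketch, assuming Name follows the Kaggle format "Surname, Title. Given names":

# Hypothetical extraction step: split Name on comma/period and keep the second piece
combi$Title <- sapply(combi$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
combi$Title <- sub(' ', '', combi$Title)  # drop the leading space left by the split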

FamilySize - Combining spouse, siblings and parents

combi$FamilySize <- combi$SibSp + combi$Parch + 1

Page 8: Final pink panthers_03_30

Decision Trees and

Logistic Regression

Presented by Jean Callao

Page 9: Final pink panthers_03_30

Decision Trees

• A decision tree is a simple but powerful form of multiple variable analysis. It displays a tree-like graph of decisions and their possible consequences.

• Recursive Partitioning -> at each step, we identify a question that we use to partition the data.

Advantages:

• Data-driven: makes no prior assumptions; selects significant predictors based on the greatest information gain.

• Flexible: no data pre-processing needed! Handles numeric and categorical data.

• Easy to interpret and explain to others.

Page 10: Final pink panthers_03_30

Decision Tree with New Variables

tree <- rpart(Survived~ Class + Sex + Age + SibSp + Parch + Fare + Title + Side,

data=train, method="class", control = rpart.control(minsplit = 0, minbucket = 0, maxdepth = 10))

fancyRpartPlot(tree)

Prediction <- predict(tree, test, type = "class")

table(Prediction)

Perished Survived

262 156

Page 11: Final pink panthers_03_30

Decision Tree with New Variables

Root node -> 62% perished, 38% survived

Mr or Noble -> 84% perished, 16% survived

Not a Mr or Noble -> 28% perished, 72% survived

3rd class -> 52% perished, 48% survived

Not 3rd class -> 5% perished, 95% survived

Paid >= $23 -> 91% perished, 9% survived

Paid < $23 -> 38% perished, 62% survived

If >= 36 yrs -> 86% perished, 14% survived

If < 36 yrs -> 36% perished, 64% survived

Page 12: Final pink panthers_03_30

Overfitted rpart Decision Tree

Disadvantages of rpart:

• Can suffer from:
  o High Variance
  o High Bias

• Decision tree algorithms can result in overly complex or overfitted trees.

Function ctree() in package party addresses these weaknesses by providing:

• Unbiased variable selection

• Statistical stopping rules to optimize tree growth.
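Not covered on the slides, but worth noting: rpart itself offers cost-complexity pruning as a remedy for overgrown trees. A minimal sketch, assuming the tree object built earlier with rpart():

# Prune back to the cp value with the lowest cross-validated error (xerror)
printcp(tree)  # inspect the cp table
best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
tree.pruned <- prune(tree, cp=best.cp)
fancyRpartPlot(tree.pruned)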

Page 13: Final pink panthers_03_30

Conditional Tree: ctree

train.ctree <- ctree(Survived ~ Class + Sex + Age + Fare + Title + Side, data=train)

plot(train.ctree)

Prediction2 <- predict(train.ctree, newdata=test, type="response")

table(Prediction2)

Perished Survived

256 162

Page 14: Final pink panthers_03_30

Conditional Tree: ctree

Mr or Noble -> Side: Port or Starboard: 40% chance of surviving, 60% chance of dying

Mr or Noble -> Side: Unknown: 16% chance of surviving, 84% chance of dying

Not a Mr or Noble -> 1st or 2nd Class: 98% chance of surviving, 2% chance of dying

Not a Mr or Noble -> 3rd Class -> Fare <= $23.25: 61% chance of surviving, 39% chance of dying

Not a Mr or Noble -> 3rd Class -> Fare > $23.25: 14% chance of surviving, 86% chance of dying

Page 15: Final pink panthers_03_30

Logistic Regression

Least squares linear regression: predicted probabilities can be greater than 1 or less than 0 if used for classification!

LOGISTIC REGRESSION:
• Used for a binary qualitative response.
• Using the logit ensures all probabilities are between 0 and 1.

Why use Logistic Regression? It allows us to establish a relationship between a binary outcome variable and a group of predictor variables. It can be used as:

• CLASSIFICATION METHOD: classifies a binary response (e.g., Yes/No, Pass/Fail, Survived/Perished).

• REGRESSION METHOD: calculates the probability (0.0 to 1.0) of the response.

Page 16: Final pink panthers_03_30

The "logit" model solves the problem:

Where:

• "p" is the probability that Y equals 1 for a case, P(Y=1).

• "1 - p" is the probability that Y equals 0.

Transformed, the "log odds" are linear.

Log odds (logit) as a linear combination:

    ln( p / (1 - p) ) = B0 + B1*X

Solving for the odds:

    p / (1 - p) = e^(B0 + B1*X)

Probability (logistic function), which produces an S-shaped curve:

    p = e^(B0 + B1*X) / (1 + e^(B0 + B1*X))
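As an illustration (not from the slides), the S-shaped curve can be drawn directly in R:

# The inverse-logit (logistic) function; base R's plogis() computes the same thing
logistic <- function(x) exp(x) / (1 + exp(x))
curve(logistic, from=-6, to=6, ylab="p", main="S-shaped logistic curve")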

Page 17: Final pink panthers_03_30

Confirming “women &

children first” policy

Titanic.glm <- glm(Survived~ I(Sex=="female") + Class + I(Age<=10) + Embarked + Fare2,

data = train, family=binomial("logit"))

table(test$Survived)

Perished Survived

252 166

summary(Titanic.glm)

The logistic regression coefficients give the

change in the log odds of the outcome for a

one unit increase in the predictor variable.
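The prediction step that produces the Perished/Survived counts is not shown on the slide. A minimal sketch, assuming Titanic.glm as above, a test set carrying the same engineered columns, and a 0.5 probability cutoff:

# Convert predicted probabilities into class labels, then tabulate
prob <- predict(Titanic.glm, newdata=test, type="response")
test$Survived <- ifelse(prob > 0.5, 1, 0)
table(factor(test$Survived, levels=c(0, 1), labels=c("Perished", "Survived")))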

Page 18: Final pink panthers_03_30

Making Predictions

A female passenger (Sex=="female") who is 10 years old has an estimated survival probability of:

    p = e^(12.3958 + 2.6816(1) + 1.6133(10)) / (1 + e^(12.3958 + 2.6816(1) + 1.6133(10))) ≈ 0.99

A 2nd class man who paid 20 dollars for a ticket has an estimated survival probability of:

    p = e^(12.3958 + 2.6816(0) + (-0.9530)(2) + (-0.6531)(20)) / (1 + e^(12.3958 + 2.6816(0) + (-0.9530)(2) + (-0.6531)(20))) ≈ 0.70

Page 19: Final pink panthers_03_30

Interpreting Coefficients…

summary(Titanic.glm)

Estimate > 0 -> higher probability of surviving

Estimate < 0 -> lower probability of surviving

Page 20: Final pink panthers_03_30

Passengers travelling with relatives

have higher chances of survival.

Titanic.glm2<- glm(Survived ~ Class+I(FamilySize>=2) + Parch+I(SibSp>=2),data = train, family=binomial("logit"))

table(test$Survived)

Perished Survived

276 142

summary(Titanic.glm2)

We see that Pclass is a strong predictor, supporting the hypotheses about:

• location on the ship

• lifeboat access.

Page 21: Final pink panthers_03_30

First class adult males

have lower chances of survival

Titanic.glm3<- glm(Survived ~ Class + I(Title=="Mr")+ I(Title=="Noble") + I(Age>=30 & Age<=50)+I(Fare>=27),data = train, family=binomial("logit"))

table(test$Survived)

Perished Survived

239 179

summary(Titanic.glm3)

Page 22: Final pink panthers_03_30

"Any data relating to one's location on the ship could

prove helpful to survival predictions…"

Page 23: Final pink panthers_03_30

First class adult males had

lower chances of survival

summary(Titanic.glm3)

Those in the upper decks (1st class) had more timely, accurate information and a shorter journey to the lifeboats… Yet why did 1st Class males have lower survival rates?

Possible explanation:

• 1st Class Males were expected to be

"gentlemen" and perish with the ship.

"No woman shall be left aboard this ship

because Ben Guggenheim was a coward."

• 1st Class Male Survivors were

condemned by society:

> Bruce Ismay – had to resign as

Chairman of White Star Line.

> William Carter – divorced by wife.

Page 24: Final pink panthers_03_30

Third class adult males had

lower chance of survival

summary(Titanic.glm4)

Those located in the bow or

lower decks (3rd Class) had less

chance of survival.

Titanic.glm4 <- glm(Survived ~ Class + I(Age>=30 & Age<=65) + I(Title=="Mr" & Class=="Third") + I(Fare<=10), data = train, family=binomial("logit"))

table(test$Survived)

Perished Survived

258 160

Page 25: Final pink panthers_03_30

Ensemble Methods:

randomForest and cforest

Presented by Paul Marxhausen

Page 26: Final pink panthers_03_30

Random Forests

• An example of an ENSEMBLE METHOD -- combines multiple models to produce one result.

• Unlike single decision trees, which can suffer from high variance or high bias, Random Forests use random sampling and averaging to find a natural balance between the two extremes.

Advantages:

• Easy to use: can be used quite efficiently with default parameters.

• Ideal for people without a deep background in statistics.

• Produces fairly strong predictions with only a small amount of coding.

Page 27: Final pink panthers_03_30

Random Forests: Data pre-processing

Disadvantages:

• Data has to be pre-processed to remove NAs, NULLs, blanks.

• Factor levels must be <= 32.

• We have to fix Age, Fare, Embarked and FamilyID to meet these requirements.

DATA PRE-PROCESSING TASKS: Age, Fare, Embarked, FamilyID

# Fill in Fare NAs
summary(combi$Fare)
which(is.na(combi$Fare))
combi$Fare[1044] <- median(combi$Fare, na.rm=TRUE)
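The slide shows only the Fare fix; Age and Embarked need similar treatment. One common approach (an assumption, not code from the slides) is to impute missing Ages with a small anova-mode rpart model and to fill blank Embarked values with the most common port:

library(rpart)
# Predict missing Ages from the other features, then fill them in
age.fit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Title + FamilySize,
                 data=combi[!is.na(combi$Age), ], method="anova")
combi$Age[is.na(combi$Age)] <- predict(age.fit, combi[is.na(combi$Age), ])
# Blank Embarked entries get the most common port ("S"), then re-factor
combi$Embarked[combi$Embarked == ""] <- "S"
combi$Embarked <- factor(combi$Embarked)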

Page 28: Final pink panthers_03_30

Model: RF using ‘randomForest’ package

# Build Random Forest Ensemble

set.seed(415)

fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID2, data=train, importance=TRUE, ntree=2000)

# Now let's make a prediction

Prediction <- predict(fit, test)
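Because importance=TRUE was set above, variable importance can be inspected, and the prediction can be written out in Kaggle's submission format (the file name below is only an example):

# Which variables did the forest rely on most?
varImpPlot(fit)
# Kaggle submission file: two columns, PassengerId and Survived
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "randomforest_submission.csv", row.names = FALSE)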

kaggle.com score

0.81818

Page 29: Final pink panthers_03_30

Models: RFs using party package

# Build condition inference tree forest

fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID, data = train, controls=cforest_unbiased(ntree=4000, mtry=2))

# Now let's make a prediction and write a submission file

Prediction <- predict(fit, test, OOB=TRUE, type = "response")

kaggle.com score

0.81818

Page 30: Final pink panthers_03_30

randomForest vs. party

randomForest package
• randomForest(…) function
• mtry defaults to floor(sqrt(p)), the number of features randomly selected at each split.
• randomForest is computationally faster.
• Popular in applied research.

party package
• cforest(…) function
• mtry is set to 5 by default, for technical reasons.
• Resulting forests are unbiased, even when the predictor variables are of different types.
• Importance measure: helps evaluate the importance of correlated predictor variables.
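A sketch of that conditional importance measure, assuming the fit object from the cforest slide (this can be slow on large forests):

library(party)
vi <- varimp(fit, conditional=TRUE)  # conditional=TRUE adjusts for correlated predictors
sort(vi, decreasing=TRUE)            # rank predictors by importance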

Page 31: Final pink panthers_03_30

Ensemble Methods: kaggle results

Model                    Description                                              Result
fit <- cforest(…)        Changed ntree from 2000 to 4000, and mtry from 3 to 2.   0.81818
fit <- randomForest(…)   Traditional Random Forest (randomForest package)         0.81818
fit <- cforest(…)        Conditional inference tree forest (party package)        0.81340

Page 32: Final pink panthers_03_30

Summary:

Data Visualization

Algorithms & kaggle results

by Michelle Darling | www.datastudentblog.wordpress.com

Page 33: Final pink panthers_03_30

Data Visualization

Summary

1. Created a conceptual data model
• to understand the denormalized data file.

2. Tried lots of visualizations:
• Categorical vs. Continuous
• Uni-, Bi- and Multivariate

3. Compared datasets:
• Titanic vs. train vs. test ARE similar

4. Created rule-based models using the most significant predictors:
• Sex == "female"
• Sex == "female" OR Age < 10
• Sex : Child : Fare : FamilySize

Data Visualization prototyping tools:

• MS Excel

• wordle.net

• Google Fusion

• R {rattle} package

[Conceptual data model entities: PORT (Embarked: S=Southampton, C=Cherbourg, Q=Queenstown) – TICKET (Ticket, Pclass, Cabin) – PASSENGER (PassengerId, Name, Age, SibSp, Parch, Fare, Survived)]

Page 34: Final pink panthers_03_30

Do family members affect survival?

> table(Survived, Parch)
        Parch
Survived   0   1   2   3   4   5   6
       0 445  53  40   2   4   4   1
       1 233  65  40   3   0   1   0

> table(Survived, SibSp)
        SibSp
Survived   0   1   2   3   4   5   8
       0 398  97  15  12  15   5   7
       1 210 112  13   4   3   0   0

Survival is higher for passengers with Parch==3 (60%), or SibSp==1 (54%)

Page 35: Final pink panthers_03_30

What is the relationship between Embarked, Pclass, Ticket, and Fare?

[Chart groups: Cherbourg, France; Southampton, England; Queenstown, Ireland]

All three Embarked Ports (C,Q,S) boarded passengers from all classes (1st, 2nd, 3rd).

But 50% of Cherbourg Passengers were 1st Class; they paid much higher fares (blue spikes).

Based on this, Fare is likely a stronger predictor of survival than Embarked.

Graph created in MS Excel using data from table(Embarked, Pclass, Fare, Ticket).

Page 36: Final pink panthers_03_30

Text Analysis of

Passenger Name

[Word clouds: SURVIVORS | PERISHED]

Word clouds created in www.wordle.net for Survivors$Name and Perished$Name:
Survivors <- train[train$Survived==1,]; Perished <- train[train$Survived==0,]

Sex ("male" vs. "female") is

an important predictor of survival.

Page 37: Final pink panthers_03_30

Google Fusion Tables: Geospatial Heatmap, Network Diagrams

Google Fusion Heatmap, GEOCODED by Embarkation Port:

• Southampton, UK – 644 passengers

• Cherbourg, France – 168 passengers

• Queenstown, Ireland – 77 passengers

[Network diagram labels: SURVIVORS | PERISHED | No Lifeboat]

Network Diagrams showing

Lifeboats (orange) vs. Embarkation Port (blue)

Based on external data (Encyclopedia Titanica)

imported into Google Fusion Tables.

Page 38: Final pink panthers_03_30

Data Visualization in R

R Visualization Packages:

• Base R: plot, barplot, boxplot, hist, dotchart, heatmap, pairs

• ggplot2: qplot, ggplot

• lattice: xyplot, dotplot, parallelplot

• vcd: "Visualizing Categorical Data" mosaic, assoc

• rcmdr: "Rcommander" scatter3d

• rattle: Explore tab (latticist, ggobi)

Page 39: Final pink panthers_03_30

Continuous vs. Discrete (Categorical) Variables

CORRELOGRAM: {base R} pairs()

t <- data.frame(Survived, Pclass, Sex, Age, Fare, Embarked, SibSp, Parch)
pairs(t, col=t$Pclass+2)  # Shift base R color palette by 2
# 1st class – green (1+2=3)
# 2nd class – blue (2+2=4)
# 3rd class – cyan (3+2=5)
# base R color wheel is not very subtle!

• Correlogram is meant to show pair-wise relationships.

• Continuous variables appear as "clouds"

• Discrete variables appear as "bands"

Page 40: Final pink panthers_03_30

Continuous, Multivariate

Intensity Map: {base R} heatmap()

• Useful for visualizing and

comparing data sets.

• Requires a data matrix.

• Values must be numeric (recode qualitative variables e.g.,

Pclass, Gender).

• Can use custom color palette

(e.g., RColorBrewer)

test does not have a Survived attribute.

PassengerId 1:891 = train (891 obs.); PassengerId 892:1309 = test (418 obs.)

train is representative of test.

"Soup Analogy": values look like

they are randomly distributed and

"well-stirred" – no big chunks of

dark or light bands.

Models based on train can be used

to predict test fairly accurately.
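The slides show only the resulting heatmap image; a minimal sketch of the kind of call involved (the column choices and palette below are assumptions):

library(RColorBrewer)
# heatmap() needs an all-numeric matrix, so recode the qualitative variables first
num <- na.omit(data.frame(Pclass = train$Pclass,
                          Sex    = as.numeric(factor(train$Sex)),
                          Age    = train$Age,
                          Fare   = train$Fare,
                          SibSp  = train$SibSp,
                          Parch  = train$Parch))
heatmap(as.matrix(num), Rowv=NA, Colv=NA, scale="column",
        col=brewer.pal(9, "Blues"))  # custom palette via RColorBrewer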

Page 41: Final pink panthers_03_30

Continuous, Univariate

Histogram: {base R} hist()

Show range, density

and distribution of a

single, continuous

variable.

# Use a 2x2 grid
par(mfrow=c(2,2))
hist(test$Age)
hist(test$Fare)
hist(train$Age)
hist(train$Fare)

"Small Multiples"

concept by Tukey:

Displaying multiple small

plots side-by-side is

effective for analysis.

test and train have

similar distributions for

continuous variables.

Page 42: Final pink panthers_03_30

"Small Multiples" of Bar Plots for categorical variables. E.g., barplot(table(test$Child))

Categorical, Univariate

Bar Plots: {base R} barplot()

test and train have similar

distributions for

categorical variables.

Page 43: Final pink panthers_03_30

Continuous, Univariate

Dot Plot: {lattice} dotplot()

library(lattice)
attach(train)
# Each dot is a passenger.
# Survived==1: red; Survived==0: black

dotplot(Age,pch=1,col=Survived, main="train$Age")

dotplot(Fare,pch=1,col=Survived,main="train$Fare")

[Plot annotations: a cluster of survivors (young children); outliers; a cluster of perished passengers (who paid the lowest fares).]

Page 44: Final pink panthers_03_30

Continuous, Univariate

Box Plot: {Base R} boxplot()

Shows interquartile range (IQR),

Median, outliers.

# Plot Age grouped by Pclass
par(mfrow=c(1,2))
Survivors <- train[train$Survived==1,]
Perished <- train[train$Survived==0,]

boxplot(Age ~ Pclass, data = Survivors, col = "light blue", main="Survived", xlab="Passenger Class", ylab="Age")

boxplot(Age ~ Pclass, data = Perished, col = "gray", main="Perished", xlab="Passenger Class", ylab="Age")

Survivors had a younger age range than those who perished, across all three passenger classes.

[Box plot medians: Survived – 33.50 (1st class), 28.00 (2nd), 27.00 (3rd); Perished – 38.50 (1st class), 30.00 (2nd), 28.00 (3rd)]

Page 45: Final pink panthers_03_30

Categorical, Multivariate

Spine Plot = 3 Bar Plots

[Spine plot annotations: 314 females vs. 577 males (35% / 65% of passengers); survivors: 233 female, 109 male (68% / 32%); perished: 81 female, 468 male (15% / 85%). FEMALES: greater than expected survival rate. MALES: greater than expected mortality rate.]

Class: a mutually exclusive, rectilinear partition, e.g., female survivors.

Probability: frequency count / whole set, e.g., 233 female survivors / 891 passengers ≈ 26% (233/342 = 68% of all survivors).

The spine plot is a visualization of a rules-based model; it exhaustively describes the feature space = Titanic passengers (female vs. male).

Page 46: Final pink panthers_03_30

Categorical, Multivariate Spine Plot: {base R} spineplot()

[Plot annotation: indicates a higher than expected survival rate.]

Page 47: Final pink panthers_03_30

Categorical, Multivariate
Mosaic Plot: {vcd} mosaic()

Visualization of a contingency table. vcd = "Visualizing Categorical Data".
Blue – high frequency; Gray – neutral; Red – low frequency count.

Example: 3rd class male (Sex=="male" & Pclass==3)
• High frequency: Survived==0
• Low frequency: Survived==1

# Mosaic Plot
library(vcd)
attach(train)
t <- table(Sex, Survived, Child)
mosaic(t, shade=TRUE, main="train dataset")

[Plot labels: female adults, female children, male adults, male children; overall 60% Perished, 40% Survived.]

Page 48: Final pink panthers_03_30

The Mosaic Plot and the Decision Tree tell a similar story: females and male children survived; male adults perished.

• Females (survived): 36% of all passengers, 77% of all survivors.

• Male adults (perished): 61% of all passengers, 83% of all who perished.

• Overall: 60% Perished, 40% Survived.

Page 49: Final pink panthers_03_30

Continuous, Multivariate

Marginal Plots:

{rattle} latticist

• {rattle} is an R package

• latticist is an interactive GUI

for Data Visualization

Page 50: Final pink panthers_03_30

Which variables are correlated? (Models perform better when variables are independent!)

Correlation plots created using the {rattle} R package.

[Variables shown in the correlation plots: FamilySize, SibSp, Parch, Fare3, Fare, Age]

Page 51: Final pink panthers_03_30

Rule-Based Models: Everyone Survived vs. Everyone Perished

# Model: Everyone survived
test$Survived <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_0.csv", row.names = FALSE)

Result: 0.37321 ☹

# Model: Everyone perished
test$Survived <- 0
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_1.csv", row.names = FALSE)

Result: Your Best Entry: 0.62679 ☺

You improved on your best score by 0.25359.

You just moved up 12 positions on the leaderboard

The survival rate in test is similar to that of the RMS Titanic overall.

Page 52: Final pink panthers_03_30

Rule-Based Models: Random vs. Informed Guess

# Model: Random Guess

test$Survived <- sample(c(0,1), 418, replace = TRUE)

submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)

write.csv(submit, file = "mdarling_model_1random.csv", row.names = FALSE)

Your submission scored 0.50718 ☹, which is not an improvement on your best score.

Model: Informed Guess

● Used problem domain info, data visualizations and intuition to make an "informed guess" about each passenger.

● Manually typed in 1,0 into test.csv file

with 418 rows…

Your Best Entry: 0.70335! ☺ You improved on your best score by 0.07656!

Process is similar to

everyday human

decision-making

(no machine learning).

Score is much better

than random chance!

Page 53: Final pink panthers_03_30

Rule-Based Models"Females" / "Women or Children"

# Model: Females Survive
test$Survived <- 0
test$Survived[test$Sex=='female'] <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_female.csv", row.names = FALSE)

Your Best Entry: 0.76555 ☺ You improved on your best score by 0.06220.

# Model: Women OR Children Survive
test$Survived <- 0
test$Survived[test$Sex=='female'] <- 1
test$Survived[test$Age<10] <- 1  # Tried different age cutoffs until the score improved.
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_wc.csv", row.names = FALSE)

Your Best Entry: 0.77033 ☺ You improved on your best score by 0.00478.

Page 54: Final pink panthers_03_30

Rule-based model (70 rules): Sex : Child : Fare2 : FamilySize

• Inspired by Principal Components Analysis (PCA).

• Performed better than naiveBayes, qda, glm, and svm (radial, sigmoid, polynomial)!

aggregate(Survived~Sex+Child+Fare2+FamilySize, data=train, FUN=function(x) {sum(x)/length(x)})
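The aggregate() call above yields a survival rate for each Sex : Child : Fare2 : FamilySize group (roughly 70 rules). A hypothetical way to turn those rates into test predictions (it assumes test carries the same Child and Fare2 columns as train):

rates <- aggregate(Survived ~ Sex + Child + Fare2 + FamilySize, data=train,
                   FUN=function(x) {sum(x)/length(x)})
rates$Predict <- ifelse(rates$Survived > 0.5, 1, 0)  # a group survives if its rate exceeds 50%
key.test  <- paste(test$Sex,  test$Child,  test$Fare2,  test$FamilySize)
key.rates <- paste(rates$Sex, rates$Child, rates$Fare2, rates$FamilySize)
pred <- rates$Predict[match(key.test, key.rates)]
test$Survived <- ifelse(is.na(pred), 0, pred)        # default to 0 for unseen combinations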

Page 55: Final pink panthers_03_30

Summary: kaggle.com results so far…

Model              Description                                                      Result
70-Rule Model      aggregate(Survived~Sex+Child+Fare2+FamilySize, data=train,
                   FUN=function(x) {sum(x)/length(x)})                              0.77512
Female OR Child    test$Sex=='female' | test$Age < 10                               0.77033
Female             test$Sex=='female'                                               0.76555
Informed Guess     Data visualization + problem domain info + manually typing
                   1,0 into the .csv file                                           0.70335
Random Guess       sample(c(1,0), 418, replace=TRUE)                                0.50718
Everyone Perished  test$Survived <- 0                                               0.62679
Everyone Survived  test$Survived <- 1                                               0.37321

Page 56: Final pink panthers_03_30

Machine Learning: Titanic Dataset

START: Is training data available?
• No -> UNSUPERVISED LEARNING
• Yes (train.csv) -> SUPERVISED LEARNING
  ○ Continuous target -> REGRESSION
  ○ Categorical target (Survived) -> CLASSIFICATION
    • Multivariate classification
    • BINARY classification (1, 0)
      ○ SINGLE CLASSIFIERS: glm, knn, qda, naiveBayes, rpart, ctree, svm
      ○ ENSEMBLE METHODS: randomForest, cforest

Page 57: Final pink panthers_03_30

Overview of Machine Learning Algorithms

Page 58: Final pink panthers_03_30

QDA (0.75598) vs. Logistic Regression (0.76077)

• Logistic Regression: linear model = straight-line decision boundaries. Better fit for the Titanic data set.

• QDA: polynomial (quadratic) model = curved decision boundaries.

• Both are eager learners. Two-step process: 1) Fit the model using global info. 2) Predict test using the reusable model.

Page 59: Final pink panthers_03_30

Naïve Bayes (0.76555) vs. KNN (0.77990)

ptm <- proc.time()
partimat(Survived~., data=train_bc, method="sknn")
end <- (proc.time() - ptm)
# 769.72 milliseconds – MORE TIME CONSUMING, but MORE CUSTOMIZED BOUNDARIES -> greater accuracy.

ptm <- proc.time()
partimat(Survived~., data=train_bc, method="naiveBayes")
end <- (proc.time() - ptm)
# 39.99 milliseconds – only 5% of the knn time.

Page 60: Final pink panthers_03_30

AdaBoost (0.77990 – same as KNN)

# rattle model output
Summary of the Ada Boost model:

Call:
ada(Survived ~ ., data = crs$dataset[crs$train, c(crs$input, crs$target)],
    control = rpart.control(maxdepth = 30, cp = 0.01, minsplit = 20, xval = 10),
    iter = 50)

Loss: exponential   Method: discrete   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value   0   1
         0 350  23
         1  45 205

Train Error: 0.109
Out-Of-Bag Error: 0.136   iteration = 50

Additional Estimates of number of iterations:
train.err1 train.kap1
        50         50

Variables actually used in tree construction:
[1] "Age" "FamilyID2" "Fare" "Sex" "Title"

Frequency of variables actually used:
FamilyID2  Fare  Title  Age  Sex
       49    49     48   46    8

Time taken: 3.42 secs

Only 50 trees, compared to 4000 trees in cforest, hence the lower performance.

Page 61: Final pink panthers_03_30

Support Vector Machines (2D): SVM Kernels & Decision Boundary Shapes

• Linear -> line
• Radial -> circle
• Polynomial -> C curve
• Sigmoid -> S curve

Tuning results (2D):
• linear, cost=1: 68% correct
• radial, cost=100: 73.4% correct
• polynomial, cost=10: 68% correct
• sigmoid, cost=0.1: 66% correct

"Goodness of fit" – svm: radial performed best with two dimensions (0.77033).
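For reference, a minimal sketch of the kind of two-dimensional kernel comparison described above, using the e1071 package (the two predictors and cost values are assumptions based on the slide):

library(e1071)
d <- na.omit(train[, c("Survived", "Age", "Fare")])  # svm needs complete cases
fit.radial <- svm(factor(Survived) ~ Age + Fare, data=d, kernel="radial", cost=100)
mean(predict(fit.radial, d) == d$Survived)           # in-sample accuracy
# tune() can search over cost for a given kernel:
tune.out <- tune(svm, factor(Survived) ~ Age + Fare, data=d,
                 kernel="radial", ranges=list(cost=c(0.1, 1, 10, 100)))
summary(tune.out)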

Page 62: Final pink panthers_03_30

Scatterplots for visualizing SVM: 2D {ggplot2} qplot vs. 3D {Rcmdr} scatter3d

# Interactive 3D hyperplane with spline
library(Rcmdr); attach(train)
scatter3d(Age, Survived, Fare)

# Point and line scatterplot
library(ggplot2); attach(train)
qplot(Age, Fare, data=train, geom=c("point","line"), colour=Survived,
      main="Titanic Passengers")

Page 63: Final pink panthers_03_30

SVM

using 11 inputs

Advantages of SVM:

• Minimal pre-processing needed.

• Tuning improves accuracy.

• Helps reveal best fit

(linear/poly/radial/sigmoid).

• Immune to "Curse of

Dimensionality".

• Instead of worsening, accuracy

improved when dimensions

increased from 2 to 11

attributes.

Score: 0.79904 – good, but still not better than cforest or randomForest (0.81818).

Page 64: Final pink panthers_03_30

cforest (0.81818) + Lifeboat Data Fusion = 0.83732

# Added 12 male survivors based on merged
# lifeboat data from Encyclopedia Titanica.
ciforest2 <- read.csv("ciforest2.csv")
testlb <- read.csv("test_lifeboats.csv")

ensembles <- merge(ciforest2, testlb, by.x="PassengerId", by.y="PassengerId")
ensembles$Survived[ensembles$Lifeboat==1] <- 1
table(ensembles$Survived)
#   0   1
# 272 146

submit <- data.frame(PassengerId = ensembles$PassengerId, Survived = ensembles$Survived)
write.csv(submit, file = "ensembles_5.csv", row.names = FALSE)

Page 65: Final pink panthers_03_30

"Ensemble of ensembles":randomForest + cForest + random tiebreaker

# Code for 95/05 tiebreaker (score 0.81818)

# Merge randomForest and cForest and average# the results. Reuse unanimous votes.ensembles <- merge(rforest, ciforest2, by.x="PassengerId", by.y="PassengerId")ensembles$Vote <-(as.numeric(ensembles$Survived.x)+ as.numeric(ensembles$Survived.y))/2ensembles$Survived[ensembles$Vote==1.0] <-1ensembles$Survived[ensembles$Vote==0.0] <-0

# Create vector of 418 random 0s and 1sset.seed(pi)probs<-c(.95,.05)ensembles$rvote <-sample(c(0,1), 418,replace = TRUE,prob=probs)

#For each tie, use a random voteensembles$Survived[ensembles$Vote==0.5] <-ensembles$rvote[ensembles$Vote==0.5]table(ensembles$Survived)

0 1 281 137

What if we combine results from randomForest and

cForest? Use random tiebreaker for non-unanimous votes.

Results: Combinations did not outperform individuals,

even when lifeboat data was added.

Page 66: Final pink panthers_03_30

Machine Learning Summary

• Data mining using lifeboat info = competitive edge. The 12 additional male survivors are highly significant because they countered social norms and survived "against the odds".

• Ensemble methods (randomForest, cforest) outperform single classifiers: "Many models work better than one."

• Embedded feature selection models (svm, ctree, rpart) outperform models that need "manual" feature selection. Decision trees are great communication tools.

• knn has the same accuracy as glm and AdaBoost, but takes far more processing time.

• Simple rule-based models can outperform naiveBayes if the features are chosen by Principal Components Analysis (PCA).

• Social norms ("Women and children first", "Male survivors are cowards") greatly influenced survival.

• Human decision-making outperforms random chance, and can outperform machine learning (depending on the human's expertise).

• Math-based models like glm are sensitive to feature selection. "Goodness of fit" determines performance: linear and radial fits (glm, svm:linear/radial) outperformed the others (qda, svm:polynomial/sigmoid).