Poverty Prediction
Shao Hung Lin, Chendong Cai
Background
Competition platform: DrivenData
Data source: World Bank
Purpose of the project: build a model that accurately predicts poverty status from various survey data
Data Summary (household)
Household data: 8203 observations, 346 features (4 numerical features)
Data Summary (individual)
Individual data: 37560 observations, 44 features (1 numerical feature)
Performance metric: mean log loss
$\mathrm{MeanLogLoss} = -\dfrac{1}{N}\sum_{i=1}^{N}\bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\bigr]$, where $\hat{y}_i$ is the predicted probability that household $i$ is poor.
id      predicted probability of Poor
418     0.32
41249   0.28
16205   0.58
97051   0.36
67756   0.63
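As an illustration (not part of the original slides), the metric can be computed directly in NumPy; the true labels in this toy example are hypothetical:

```python
import numpy as np

def mean_log_loss(y_true, y_pred, eps=1e-15):
    """Mean log loss for binary labels; probabilities are clipped to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Predicted probabilities from the table above; the true labels are made up for illustration
y_pred = np.array([0.32, 0.28, 0.58, 0.36, 0.63])
y_true = np.array([0, 0, 1, 0, 1])
print(mean_log_loss(y_true, y_pred))
```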
Workflow
1. Deal with missing values
2. One-hot encoding of categorical features
3. Merge household data with individual data
4. Build logistic regression with gradient descent in Python
5. Build logistic regression in scikit-learn with regularization and parameter tuning
6. Results evaluation and comparison
Data Preprocessing
Dealing with missing values
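A minimal sketch of one plausible way to handle the missing values, assuming pandas DataFrames; the exact strategy (dropping very sparse columns, median/mode imputation) is an assumption, not taken from the slides:

```python
import pandas as pd

def fill_missing(df, drop_thresh=0.5):
    """Drop columns that are mostly missing, then impute the rest.

    Assumption: numeric columns are filled with the median,
    categorical columns with the most frequent value.
    """
    df = df.loc[:, df.isna().mean() < drop_thresh].copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df
```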
Data Preprocessing
One-hot encoding for categorical features
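A small sketch of the one-hot encoding step using pandas `get_dummies`; the feature names in the example are hypothetical:

```python
import pandas as pd

def one_hot_encode(df):
    """One-hot encode all non-numeric columns, keeping numeric columns unchanged."""
    categorical_cols = df.select_dtypes(include="object").columns
    return pd.get_dummies(df, columns=list(categorical_cols))

# Tiny example with hypothetical survey features
example = pd.DataFrame({"wall_material": ["brick", "mud", "brick"], "hh_size": [4, 6, 3]})
print(one_hot_encode(example))
```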
Data Preprocessing
Merge household data with individual data
[Diagram: individual data and household data combined in the merge step]
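A hedged sketch of the merge step, assuming both tables share a household id column; the key name and the per-household aggregation (averaging numeric individual features) are assumptions:

```python
import pandas as pd

def merge_household_individual(hh, indiv, key="id"):
    """Aggregate individual-level rows to one row per household, then left-join onto household data."""
    indiv_agg = indiv.groupby(key).mean(numeric_only=True).add_prefix("indiv_")
    return hh.merge(indiv_agg, on=key, how="left")
```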
Logistic Regression
Sigmoid Function:
• $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
Derivative of Sigmoid Function:
• $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
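A direct NumPy translation of these two formulas (an illustration, not the authors' exact code):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))"""
    s = sigmoid(z)
    return s * (1.0 - s)
```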
Logistic Regression
In logistic regression, we define:
$h_\theta(x) = \sigma(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}}$
$P(y = 1 \mid x;\ \theta) = h_\theta(x)$
$P(y = 0 \mid x;\ \theta) = 1 - h_\theta(x)$
$\Rightarrow P(y \mid x;\ \theta) = h_\theta(x)^{y}\,\bigl(1 - h_\theta(x)\bigr)^{1-y}$
Logistic Regression
The likelihood of the parameters is
$L(\theta) = P(\vec{y} \mid X;\ \theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{\,y^{(i)}} \bigl(1 - h_\theta(x^{(i)})\bigr)^{1 - y^{(i)}}$
Maximize the log likelihood,
$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr]$
Use gradient ascent to maximize the log-likelihood.
Calculate the partial derivative (for a single training example $(x, y)$):
$\dfrac{\partial \ell(\theta)}{\partial \theta_j} = \left( y\,\dfrac{1}{\sigma(\theta^T x)} - (1 - y)\,\dfrac{1}{1 - \sigma(\theta^T x)} \right)\dfrac{\partial}{\partial \theta_j}\,\sigma(\theta^T x)$
$= \left( y\,\dfrac{1}{\sigma(\theta^T x)} - (1 - y)\,\dfrac{1}{1 - \sigma(\theta^T x)} \right)\sigma(\theta^T x)\bigl(1 - \sigma(\theta^T x)\bigr)\dfrac{\partial}{\partial \theta_j}\,\theta^T x$
$= \bigl( y\,(1 - \sigma(\theta^T x)) - (1 - y)\,\sigma(\theta^T x) \bigr)\,x_j$
$= \bigl( y - h_\theta(x) \bigr)\,x_j$
$\Rightarrow$ updating rule: $\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \bigl( y^{(i)} - h_\theta(x^{(i)}) \bigr)\,x_j^{(i)}$
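A minimal vectorized NumPy sketch of this gradient-ascent update, using the learning rate and iteration count mentioned later in the experiments; variable names are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gradient_ascent(X, y, alpha=0.0003, n_iter=200):
    """Batch gradient ascent on the log-likelihood.

    X: (m, n) feature matrix (prepend a column of ones for the intercept)
    y: (m,) binary labels
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        h = sigmoid(X @ theta)            # predicted probabilities h_theta(x)
        theta += alpha * X.T @ (y - h)    # theta_j += alpha * sum_i (y_i - h_i) * x_ij
    return theta
```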
Implementing Logistic Regression in Python
Vectorization
Feature scaling
Gradient Descent converges much faster with feature scaling than
without it.
[Figure: without feature scaling the contours of the cost function are oval-shaped; with scaling they are circle-shaped]
Feature scaling
Before feeding the data into the model, we standardize it first:
$z_{ij} = \dfrac{x_{ij} - \bar{x}_i}{s_i}$
• $x_{ij}$ is the $j$-th data point of feature $i$
• $\bar{x}_i$ is the sample mean of feature $i$
• $s_i$ is the standard deviation of feature $i$
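A short NumPy sketch matching the standardization formula above (an illustration; the original implementation is not shown in the slides):

```python
import numpy as np

def standardize(X):
    """Column-wise z-score: subtract the sample mean, divide by the standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0   # guard against constant features
    return (X - mean) / std
```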
Experiments: 0.0003 learning rate, 200 iterations
Experiments: 0.0001 learning rate, 200 iterations
Results and comparison
                     Our model    LogisticRegression in scikit-learn
Training log loss    0.1903       0.1901
Test log loss        0.3725       0.3687
Optimize the model by introducing Regularization
Cost Function with L1 Regularization:
$J(\theta) = -\dfrac{1}{m}\sum_{i=1}^{m}\Bigl[ y_i \log h_\theta(x_i) + (1 - y_i)\log\bigl(1 - h_\theta(x_i)\bigr) \Bigr] + \dfrac{\lambda}{m}\sum_{j=1}^{n} |\theta_j|$
Cost Function with L2 Regularization:
$J(\theta) = -\dfrac{1}{m}\sum_{i=1}^{m}\Bigl[ y_i \log h_\theta(x_i) + (1 - y_i)\log\bigl(1 - h_\theta(x_i)\bigr) \Bigr] + \dfrac{\lambda}{2m}\sum_{j=1}^{n} \theta_j^2$
In each case the first term is the negative log-likelihood and the second term is the regularization term.
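A hedged sketch of how the regularized scikit-learn model could be fit and tuned; the solver, the grid of C values, and the cross-validation setup are assumptions, not taken from the slides:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# In scikit-learn, C is the inverse of the regularization strength (roughly 1 / lambda)
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid,
    scoring="neg_log_loss",
    cv=5,
)
# grid.fit(X_train, y_train)   # X_train, y_train: preprocessed training data (hypothetical names)
```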
Minimize the cost function using Newton's method.
Gradient:
$\nabla_\theta J = \begin{bmatrix} \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_0^{(i)} \\ \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_1^{(i)} + \frac{\lambda}{m}\theta_1 \\ \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_2^{(i)} + \frac{\lambda}{m}\theta_2 \\ \vdots \\ \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_n^{(i)} + \frac{\lambda}{m}\theta_n \end{bmatrix}$
Hessian matrix:
$H = \frac{1}{m}\sum_{i=1}^{m} h_\theta(x^{(i)})\bigl(1 - h_\theta(x^{(i)})\bigr)\,x^{(i)}(x^{(i)})^T + \frac{\lambda}{m}\begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$
Updating rule: $\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla_\theta J$
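A compact NumPy sketch of this regularized Newton update, as an illustration of the formulas above; variable names and the fixed iteration count are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, lam=1.0, n_iter=20):
    """Newton's method for L2-regularized logistic regression.

    X: (m, n) design matrix whose first column is all ones (the intercept);
    the intercept theta_0 is not regularized, matching the gradient above.
    """
    m, n = X.shape
    theta = np.zeros(n)
    reg = np.ones(n)
    reg[0] = 0.0                              # no penalty on theta_0
    for _ in range(n_iter):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y) / m + (lam / m) * reg * theta
        H = (X.T * (h * (1 - h))) @ X / m + (lam / m) * np.diag(reg)
        theta -= np.linalg.solve(H, grad)     # theta <- theta - H^{-1} * grad
    return theta
```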
Results of model with regularization
Metric (average)    Training    Test
Log loss            0.2294      0.2684
Accuracy            90.55%      86.38%
Comparison of regularized and non-regularized models
From the experiments we can see that the regularized model outperforms the non-regularized logistic regression on the test set.