Transcript
  • Poverty Prediction

    Shao Hung Lin, Chendong Cai

  • Background

    Competition platform: DrivenData

    Data source: World Bank

    Purpose of the project: build a model to accurately predict the poverty status using various survey data

  • Data Summary (household)

    Household data: 8203 observations, 346 features (4 numerical features)

  • Data Summary (individual)

    Individual data: 37560 observations, 44 features (1 numerical feature)

  • Performance metric: mean log loss

    $\mathrm{MeanLogLoss} = -\frac{1}{N}\sum_{n=1}^{N}\left[\, y_n \log \hat{y}_n + (1 - y_n)\log(1 - \hat{y}_n)\,\right]$

    id      Poor predicted probability
    418     0.32
    41249   0.28
    16205   0.58
    97051   0.36
    67756   0.63
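As a reference, here is a minimal sketch of how the competition metric could be computed in Python; the true labels below are hypothetical, only the predicted probabilities come from the table above.

```python
import numpy as np

def mean_log_loss(y_true, y_pred, eps=1e-15):
    """Mean log loss between binary labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so log(0) never occurs.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Predicted probabilities from the table above; the labels are made up for illustration.
print(mean_log_loss([0, 0, 1, 0, 1], [0.32, 0.28, 0.58, 0.36, 0.63]))
```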

  • Workflow

    1. Deal with missing values
    2. One-hot encoding on categorical features
    3. Merge household data with individual data
    4. Build logistic regression with gradient descent in Python
    5. Build logistic regression in scikit-learn with regularization and parameter tuning
    6. Results evaluation / comparison

  • Data Preprocessing

    Dealing with missing values
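A minimal sketch of one possible missing-value strategy in pandas; the slides do not state which imputation rule was actually used, so the median/mode filling here is an assumption.

```python
import pandas as pd

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and categorical columns with the mode.

    An assumed strategy for illustration; the slides only say that missing
    values are dealt with, not how.
    """
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        elif df[col].notna().any():
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df
```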

  • Data Preprocessing

    One-hot encoding for categorical features
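A minimal sketch of the encoding step with pandas; scikit-learn's OneHotEncoder would work equally well.

```python
import pandas as pd

def one_hot_encode(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode every categorical (object) column; numeric columns pass through."""
    cat_cols = df.select_dtypes(include="object").columns
    return pd.get_dummies(df, columns=list(cat_cols))
```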

  • Data Preprocessing

    Merge household data with individual data

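A sketch of the merge step under the assumption that the individual table has several rows per household and is keyed by a household "id" column; the aggregation used here (means of the encoded individual features plus a household-size count) is illustrative, since the slides only say the two tables are merged.

```python
import pandas as pd

def merge_tables(household: pd.DataFrame, individual: pd.DataFrame) -> pd.DataFrame:
    """Aggregate individual rows to one row per household, then join onto the household table."""
    agg = individual.groupby("id").mean(numeric_only=True)   # per-household means of individual features
    agg["hh_size"] = individual.groupby("id").size()         # number of individuals per household
    return household.join(agg, on="id", rsuffix="_ind")
```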

  • Logistic Regression

    Sigmoid Function:

    • $g(z) = \dfrac{1}{1 + e^{-z}}$

    Derivative of the Sigmoid Function:

    • $g'(z) = g(z)\,\bigl(1 - g(z)\bigr)$
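The same two functions in NumPy, as a small sketch:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z)); works elementwise on arrays."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)
```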

  • Logistic Regression

    In logistic regression, we define:

    $h_\theta(x) = g(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}}$

    $P(y = 1 \mid x;\, \theta) = h_\theta(x)$

    $P(y = 0 \mid x;\, \theta) = 1 - h_\theta(x)$

    $\Longrightarrow\; P(y \mid x;\, \theta) = h_\theta(x)^{\,y}\,\bigl(1 - h_\theta(x)\bigr)^{1-y}$

  • Logistic Regression

    The likelihood of the parameters is,

    $L(\theta) = P(\vec{y} \mid X;\, \theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{\,y^{(i)}} \bigl(1 - h_\theta(x^{(i)})\bigr)^{1 - y^{(i)}}$

    Maximize the log-likelihood,

    $\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[\, y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \right]$
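A small sketch of the log-likelihood in NumPy (the probability clipping is our numerical safeguard, not part of the slides):

```python
import numpy as np

def log_likelihood(theta, X, y):
    """l(theta) = sum[ y*log(h) + (1 - y)*log(1 - h) ] for an (m, n) matrix X and 0/1 labels y."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    h = np.clip(h, 1e-15, 1 - 1e-15)   # avoid log(0)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```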

  • Use gradient ascent to maximize log-likelihood

    Calculate the partial derivative:

    $\dfrac{\partial \ell(\theta)}{\partial \theta_j}
      = \left( y\,\dfrac{1}{g(\theta^T x)} - (1 - y)\,\dfrac{1}{1 - g(\theta^T x)} \right) \dfrac{\partial}{\partial \theta_j}\, g(\theta^T x)$

    $\quad = \left( y\,\dfrac{1}{g(\theta^T x)} - (1 - y)\,\dfrac{1}{1 - g(\theta^T x)} \right) g(\theta^T x)\bigl(1 - g(\theta^T x)\bigr)\, \dfrac{\partial}{\partial \theta_j}\,\theta^T x$

    $\quad = \bigl( y\,(1 - g(\theta^T x)) - (1 - y)\, g(\theta^T x) \bigr)\, x_j$

    $\quad = \bigl( y - h_\theta(x) \bigr)\, x_j$

    $\Longrightarrow\; \text{Updating Rule: } \theta_j := \theta_j + \alpha \sum_{i=1}^{m} \bigl( y^{(i)} - h_\theta(x^{(i)}) \bigr)\, x_j^{(i)}$

  • Implement Logistic Regression in Python

    Vectorization
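The slides do not show the implementation itself, so the following is a minimal vectorized sketch of the update rule derived above, assuming X already contains a bias column of ones and has been standardized; the default learning rate and iteration count match the experiments reported later.

```python
import numpy as np

def fit_logistic_gradient_ascent(X, y, alpha=0.0003, n_iters=200):
    """Vectorized gradient ascent on the log-likelihood.

    theta := theta + alpha * X.T @ (y - h) applies the coordinate-wise rule
    theta_j := theta_j + alpha * sum_i (y_i - h_theta(x_i)) * x_ij to all j at once.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x) for every row
        theta += alpha * X.T @ (y - h)         # simultaneous update of all theta_j
    return theta

# Predicted probabilities for new data: 1 / (1 + np.exp(-X_new @ theta))
```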

  • Feature scaling

    Gradient Descent converges much faster with feature scaling than without it.

    [Figure: contour of the cost function, 'oval shaped' without scaling vs. 'circle shaped' with scaling]

  • Feature scaling

    Before feeding the data into the model, we need to standardize it first:

    $z_{ij} = \dfrac{x_{ij} - \bar{x}_j}{s_j}$

    • $x_{ij}$ is the value of feature $j$ for the $i$-th data point
    • $\bar{x}_j$ is the sample mean of feature $j$
    • $s_j$ is the sample standard deviation of feature $j$
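A minimal sketch of the standardization step; using the training mean and standard deviation for the test set is our assumption, since the slides only give the formula.

```python
import numpy as np

def standardize(X_train, X_test):
    """z = (x - mean) / std, with the statistics computed on the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0   # guard against constant (zero-variance) columns
    return (X_train - mean) / std, (X_test - mean) / std
```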

  • Experiments: 0.0003 learning rate, 200 iterations

  • Experiments: 0.0001 learning rate, 200 iterations

  • Results and comparison

    Metric              Our model   LogisticRegression in scikit-learn
    Training Log Loss   0.1903      0.1901
    Test Log Loss       0.3725      0.3687

  • Optimize the model by introducing Regularization

    Cost Function with L1 Regularization:

    $J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[\, y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \right] + \dfrac{\lambda}{m} \sum_{j=1}^{n} |\theta_j|$

    Cost Function with L2 Regularization:

    $J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[\, y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \right] + \dfrac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$

    In both cases the first term is the negative log-likelihood and the second is the regularization term.
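The scikit-learn counterpart of this slide would look roughly like the sketch below; `penalty` and `C` play the role of the regularization term above (C is roughly the inverse of lambda), and the grid values are placeholders since the slides do not list the tuned settings.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tune penalty type and strength by cross-validated log loss.
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]}
clf = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),  # liblinear supports both L1 and L2
    param_grid,
    scoring="neg_log_loss",
    cv=5,
)
# clf.fit(X_train, y_train)  # X_train, y_train come from the preprocessing steps above
```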

  • Minimize the cost function using Newtonโ€™s method

    Gradient:

    $\nabla_\theta J = \begin{bmatrix}
      \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_0^{(i)} \\
      \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_1^{(i)} + \frac{\lambda}{m}\,\theta_1 \\
      \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_2^{(i)} + \frac{\lambda}{m}\,\theta_2 \\
      \vdots \\
      \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_n^{(i)} + \frac{\lambda}{m}\,\theta_n
    \end{bmatrix}$

    Hessian matrix:

    $H = \dfrac{1}{m} \sum_{i=1}^{m} h_\theta(x^{(i)})\bigl(1 - h_\theta(x^{(i)})\bigr)\, x^{(i)} (x^{(i)})^T + \dfrac{\lambda}{m}
    \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$

    Updating Rule: $\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_\theta J$
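A sketch of one Newton step implementing the gradient, Hessian, and updating rule above in NumPy; X is assumed to carry a leading column of ones, and the intercept theta_0 is left unregularized as in the formulas.

```python
import numpy as np

def newton_step(theta, X, y, lam):
    """Return theta - H^{-1} grad for L2-regularized logistic regression."""
    m, n = X.shape
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    grad = X.T @ (h - y) / m + (lam / m) * np.r_[0.0, theta[1:]]   # no penalty on theta_0
    D = np.diag(np.r_[0.0, np.ones(n - 1)])                        # diag(0, 1, ..., 1)
    H = (X.T * (h * (1 - h))) @ X / m + (lam / m) * D
    return theta - np.linalg.solve(H, grad)
```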

  • Results of model with regularization

    Metric (average)   Training   Test
    Log Loss           0.2294     0.2684
    Accuracy           90.55%     86.38%

  • Comparison of regularized and non-regularized model

    From the experiments we can see that the regularized model outperforms the non-regularized logistic regression on the test set (test log loss 0.2684 vs. 0.3725), i.e. regularization reduces overfitting.