
Introduction to Machine Learning

Fairness in Machine Learning

Varun Chandola
Computer Science & Engineering
State University of New York at Buffalo, Buffalo, NY, USA
chandola@buffalo.edu
CSE 474/574


Outline

Introduction to Fairness

Toy Example

Why fairness?

Defining Fairness

Fairness in Classification Problems

Quantitative Metrics for Fairness: Independence, Separation, Sufficiency

Case Study in Credit Scoring

References


Introduction

- Main text: Fairness and Machine Learning, https://fairmlbook.org [1], by Solon Barocas, Moritz Hardt, and Arvind Narayanan
- Other recommended resources:
  - Fairness in machine learning (NeurIPS 2017)
  - 21 fairness definitions and their politics (FAT* 2018)
  - Machine Bias - COMPAS study
- Must read: The Machine Learning Fairness Primer by Dakota Handzlik
- Programming Assignment 3 and Gradiance Quiz #7
- Also see: The Mozilla Responsible Computer Science Challenge


Toy Example

- Task: learn an ML-based job hiring algorithm
- Inputs: GPA, interview score
- Target: average performance review
- Sensitive attribute: binary (denoted by □ and △), represents some demographic group
- We note that GPA is correlated with the sensitive attribute

Process
1. Fit a regression model to predict the target
2. Apply a threshold (denoted by the green line) to select candidates


Toy Example

- The ML model does not use the sensitive attribute
- Does that mean it is fair?
- It depends on the definition of fairness

Fairness-as-blindness notion
- Two individuals with similar features get similar treatment
- Under this notion, the model is fair


What about a different definition of fairness?

- Are candidates from the two groups equally likely to be hired?
- No: triangles are more likely to be hired than squares
- Why did the model become unfair under this definition?
  - In the training data, the average performance review is lower for squares than for triangles


Why this disparity in the data?

- Many factors could have led to this:
  - Managers who score employees' performance might have a bias
  - The workplace might be biased against one group
  - The socio-economic background of one group might have resulted in poorer educational outcomes
  - Some intrinsic reason
  - A combination of these factors
- Let us assume that this disparity, as learnt by the ML model, is unjustified
- How do we get rid of it?


Making the ML model bias-free

- Option 1: ignore GPA as a feature
  - Might result in poor accuracy of the model
- Option 2: pick different thresholds for each sub-group
  - Model is no longer “blind”
- Option 3: add a diversity reward to the objective function
  - Could still result in poor accuracy


Why fairness?

- We want/expect everything to be fair and bias-free
- Machine learning driven systems are everywhere
- Obviously, we want them to be fair as well
- Closely related are issues of ethics, trust, and accountability


What does fairness mean?

- Consequential decision making: an ML system makes a decision that impacts individuals
  - admissions, job offers, bail granting, loan approvals
- Should use factors that are relevant to the outcome of interest


Amazon same-day delivery

- A data-driven system to determine which neighborhoods to offer same-day delivery service
- In many U.S. cities, white residents were more than twice as likely as black residents to live in one of the qualifying neighborhoods
- Source: https://www.bloomberg.com/graphics/2016-amazon-same-day/


ML - Antithesis to fairness

- Machine learning algorithms are based on generalization
- Trained on historical data, which can be unfair
  - Our society has always been unfair
- Can perpetuate historical prejudices


Continuing with the Amazon example

- Amazon claims that race was not a factor in their model (not a feature)
- The system was designed based on efficiency and cost considerations
- Race was implicitly coded


When is there a fairness issue?

- What if the Amazon system were such that zip codes ending in an odd digit are selected for same-day delivery?
- It is biased, and maybe unfair to individuals living in even-numbered zip codes
- But will that trigger a similar reaction?
- Is the system unfair?


What do we want to do?

- Make machine learning algorithms fair
- Need a quantifiable fairness metric
  - Similar to other performance metrics such as precision, recall, accuracy, etc.
- Incorporate the fairness metric in the learning process
- Often leads to a tension with other metrics


How does an ML algorithm become unfair?

- The “ML for People” pipeline:

  state of the world → (measurement) → data → (learning) → model → (action) → individuals → (feedback) → state of the world


Issues with the state of society

- Most ML applications are about people
  - Even a pothole identification algorithm
- Demographic disparities exist in society
- These get embedded into the training data
- As ML practitioners we are not focused on removing these disparities
- But we do not want ML to reinforce these disparities
  - The dreaded feedback loops [3]


Measurement Issues

- Measurement of data is fraught with subjectivity and technical issues
- Measuring race, or any categorical variable, depends on how the categories are defined
- Most critical: defining the target variable
  - Often this is “made up” rather than measured objectively
  - credit-worthiness of a loan applicant
  - attractiveness of a face (beauty.ai, FaceApp)

Criminal risk assessment
1. Target variable: bail or not?
2. Target variable: will the person commit a crime later or not (recidivism)?


Measurement Issues

- Technical issues can often lead to bias
  - Default settings of cameras are usually optimized for lighter skin tones [5]
- Most image data sets used to train object recognition systems are biased relative to each other
  - http://people.csail.mit.edu/torralba/research/bias/


How to fix the measurement bias?

- Understand the provenance of the data
  - Even though you (the ML practitioner) are working with data “given” to you
- “Clean” the data


Issues with models

- We know the training data can have biases
- Will the ML model preserve, mitigate, or exacerbate these biases?
- The ML model will learn any pattern in the data that assists in optimizing the objective function
- Some patterns are useful (smoking is associated with cancer); some are not (girls like pink and boys like blue)
- But the ML algorithm has no way of distinguishing between these two types of patterns
  - The distinction is established by social norms and moral judgements
- Without a specific intervention, the ML algorithm will extract stereotypes


An Example

- Machine translation


How to make the ML model more fair

- The model reflects biases in the data
- Withhold sensitive attributes (gender, race, ...)
- Is that enough?

Unfortunately not
- There could be proxies or redundant encodings
- Example: using “programming experience in years” might indirectly encode gender bias
  - The age at which someone starts programming is well known to be correlated with gender


How to make the ML model more fair

- Better objective functions that are fair to all sub-groups
  - More about this in the next lecture
- Ensure equal error rates for all sub-groups

The Nymwars controversy
- Google, Facebook, and other companies blocking users with uncommon names (presumably fake)
- Higher error rate for cultures with a diverse set of names


The pitfalls of action

- As ML practitioners, our world may seem to end once we have trained a good model
- But this model will impact people
- We need to understand that impact in the larger socio-technical system
  - Are there disparities in the error across different sub-groups?
  - How do these disparities change over time (drift)?
  - What is the perception of society about the model?
- Ethics, trustworthiness, accountability
  - Explainability and interpretability
  - Correlation is not causation


The perils of feedback loops

- The “actions” made by individuals based on the predictions of the ML model could be fed back into the system, either explicitly or implicitly
  - Self-fulfilling predictions
  - Predictions impacting the training data
  - Predictions impacting the society


Problem Setup

Notation
- Predict Y given X
- Y is our target class, Y ∈ {0, 1}
- X represents the input feature vector

Example
- Y: will an applicant pay the loan back?
- X: applicant characteristics (credit history, income, etc.)


Supervised Learning

- Given training data: (x_1, y_1), ..., (x_N, y_N)
- Either learn a function f such that

  y* = f(x*)

- Or assume that the data was drawn from a probability distribution
- In either case, we can consider the classification output as a random variable Ŷ
- Now we have three random variables:

  X, Y, Ŷ

- We are going to ignore how we get Ŷ from X in these discussions


How do we measure the quality of a classifier?

- So far we have been looking at accuracy

A different way to look at accuracy

  Accuracy ≡ P(Ŷ = Y)

- The probability of the predicted label being equal to the true label
- How do we calculate this?
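A minimal sketch (the labels and predictions below are hypothetical): P(Ŷ = Y) is estimated empirically as the fraction of test examples where the prediction matches the truth.

```python
import numpy as np

# Hypothetical test labels and predictions, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 1])

# Empirical estimate of P(Yhat = Y): fraction of matching labels.
accuracy = np.mean(y_pred == y_true)
print(accuracy)  # 0.8
```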


Accuracy is not everything!

- Consider a test data set with 90 examples of true class 1 and 10 examples of true class 0
- A degenerate classifier that classifies everything as label 1 would still have 90% accuracy on this data set

Other evaluation criteria

  Event   Condition   Metric
  Ŷ = 1   Y = 1       true positive rate (recall on positive class)
  Ŷ = 0   Y = 1       false negative rate
  Ŷ = 1   Y = 0       false positive rate
  Ŷ = 0   Y = 0       true negative rate (recall on negative class)

- Here we treat class label 1 as the positive class and class label 0 as the negative class.
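A small sketch of these four rates (function and variable names are mine); run on the degenerate classifier from above, it shows 90% accuracy hiding a false positive rate of 1.0:

```python
import numpy as np

def error_rates(y_true, y_pred):
    """Empirical TPR/FNR/FPR/TNR: the event is the predicted label,
    the condition is the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos, neg = y_true == 1, y_true == 0
    return {
        "TPR": np.mean(y_pred[pos] == 1),  # P(Yhat=1 | Y=1)
        "FNR": np.mean(y_pred[pos] == 0),  # P(Yhat=0 | Y=1)
        "FPR": np.mean(y_pred[neg] == 1),  # P(Yhat=1 | Y=0)
        "TNR": np.mean(y_pred[neg] == 0),  # P(Yhat=0 | Y=0)
    }

# The degenerate "always predict 1" classifier on the 90/10 data set.
y_true = np.array([1] * 90 + [0] * 10)
y_pred = np.ones(100, dtype=int)
print(error_rates(y_true, y_pred))
# {'TPR': 1.0, 'FNR': 0.0, 'FPR': 1.0, 'TNR': 0.0}
```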


We can swap the condition and the event

  Event   Condition   Metric
  Y = 1   Ŷ = 1       precision (on positive class)
  Y = 0   Ŷ = 0       precision (on negative class)


Score Functions

- Often classification involves computing a score and then applying a threshold
- E.g., logistic regression: first calculate P(Y = 1 | X = x), then apply a threshold of 0.5
- Or, support vector machine: first calculate wᵀx and then apply a threshold of 0

Conditional expectation

  r(x) = E[Y | X = x]

- We can treat it as a random variable too: R = E[Y | X]
- This is what logistic regression uses.
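A sketch on synthetic data (the data and variable names are mine): scikit-learn's LogisticRegression exposes exactly this score r(x) = P(Y = 1 | X = x) through predict_proba.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                        # synthetic features
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)
scores = model.predict_proba(X)[:, 1]                # r(x) = P(Y=1 | X=x)
y_hat = (scores >= 0.5).astype(int)                  # default threshold t = 0.5
```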


From scores to classification

- Use a threshold t:

  ŷ = 1 if r(x) ≥ t, and 0 otherwise

- What threshold to choose?
  - If t is high, only the few examples with very high scores will be classified as 1 (accepted)
  - If t is low, only the few examples with very low scores will be classified as 0 (rejected)
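Continuing the sketch above, sweeping t makes the trade-off concrete: the higher the threshold, the fewer candidates are accepted.

```python
# Acceptance rate P(Yhat = 1) as the threshold t varies.
for t in (0.2, 0.5, 0.8):
    y_hat_t = (scores >= t).astype(int)
    print(f"t = {t}: acceptance rate = {y_hat_t.mean():.2f}")
```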


The Receiver Operating Characteristic (ROC) Curve

- Explore the entire range of t
- Each point on the plot is the (FPR, TPR) pair for a given value of t
- The area under the ROC curve, or AUC, is a quantitative metric derived from the ROC curve
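Using the scores from the earlier sketch, scikit-learn traces this curve by sweeping t over all observed score values:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Each (fpr[i], tpr[i]) pair is one point on the ROC curve,
# obtained at threshold thresholds[i].
fpr, tpr, thresholds = roc_curve(y, scores)
print("AUC =", roc_auc_score(y, scores))  # area under the ROC curve
```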


Sensitive Attributes

- Let A denote the attribute representing the sensitive characteristic of an individual
- There could be more than one sensitive attribute


Things to remember

- It is not always easy to identify A and differentiate it from X
- Removing the sensitive attribute from X does not guarantee fairness
- Removing the sensitive attribute could make the classifier less accurate
- It is not always a good idea to remove the impact of sensitive attributes


Quantifying Fairness

- Let us define some reasonable ways of measuring fairness
  - There are several ways to do this
  - All are debatable
- Three different categories:

  Independence   Separation     Sufficiency
  Ŷ ⊥⊥ A         Ŷ ⊥⊥ A | Y     Y ⊥⊥ A | Ŷ

- Y: true label; Ŷ: predicted label; A: sensitive attribute

Conditional independence

  A ⊥⊥ B | C ⇔ P(A, B | C) = P(A | C) P(B | C)

- Example: amount of speeding fine ⊥⊥ type of car | speed


Independence

  P(Ŷ = 1 | A = a) = P(Ŷ = 1 | A = b), ∀a, b ∈ A

- Referred to as demographic parity, statistical parity, group fairness, disparate impact, etc.
- The probability of an individual being assigned a class is equal for each group

Disparate impact law

  P(Ŷ = 1 | A = a) / P(Ŷ = 1 | A = b) ≥ 1 − ε

- For ε = 0.2, this is the 80 percent rule
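A minimal sketch of auditing independence (all names are mine): compute the group-wise acceptance rates and their ratio, then compare against the 1 − ε bound.

```python
import numpy as np

def acceptance_rates(y_hat, a):
    """P(Yhat = 1 | A = g) for every group value g in a."""
    return {g: np.mean(y_hat[a == g]) for g in np.unique(a)}

def disparate_impact(y_hat, a, group_a, group_b):
    """Ratio P(Yhat=1 | A=group_a) / P(Yhat=1 | A=group_b);
    the 80 percent rule asks for this to be at least 0.8."""
    r = acceptance_rates(y_hat, a)
    return r[group_a] / r[group_b]
```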


Issues with independence measures

- The self-fulfilling prophecy [2]
- Consider a hiring scenario where the model picks p excellent candidates from group a and p poor-quality candidates from group b
  - Meets the independence criterion
  - However, it is still unfair


How to satisfy fairness criteria?

1. Pre-processing phase: adjust the feature space to be uncorrelated with the sensitive attribute.
2. Training phase: build the constraint into the optimization process for the classifier.
3. Post-processing phase: adjust a learned classifier so that it is uncorrelated with the sensitive attribute.


Separation

  Ŷ ⊥⊥ A | Y

- Alternatively, the true positive rate and the false positive rate are equal for any pair of groups:

  P(Ŷ = 1 | Y = 1, A = a) = P(Ŷ = 1 | Y = 1, A = b)
  P(Ŷ = 1 | Y = 0, A = a) = P(Ŷ = 1 | Y = 0, A = b)
  ∀a, b ∈ A

- Addresses the issue noted above with the independence criterion
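Separation can be audited the same way (a sketch; the names are mine): compute TPR and FPR per group and check that they agree.

```python
import numpy as np

def groupwise_rates(y_true, y_hat, a):
    """Per-group (TPR, FPR); separation holds when these match across groups."""
    out = {}
    for g in np.unique(a):
        m = a == g
        tpr = np.mean(y_hat[m & (y_true == 1)] == 1)  # P(Yhat=1 | Y=1, A=g)
        fpr = np.mean(y_hat[m & (y_true == 0)] == 1)  # P(Yhat=1 | Y=0, A=g)
        out[g] = (tpr, fpr)
    return out
```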


How to achieve separation

- Apply a post-processing step using the ROC curve
  - Plot the ROC curve for each group
  - Within the constraint region (the overlap), pick a classifier that minimizes the given cost


Sufficiency

  Y ⊥⊥ A | R

- Alternatively, for every score value, the probability of a positive outcome is equal for any pair of groups:

  P(Y = 1 | R = r, A = a) = P(Y = 1 | R = r, A = b)
  ∀r ∈ dom(R) and ∀a, b ∈ A


Achieving sufficiency by calibration

What is calibration?
- Let us return to the score R
- Recall that Ŷ was obtained by applying a threshold to R
- R is calibrated if, for all r in the domain of R:

  P(Y = 1 | R = r) = r

- Of course, this means that R should be between 0 and 1
- Platt scaling converts an uncalibrated score to a calibrated score [4]
- Calibration by group implies sufficiency
  - Apply Platt scaling separately to each group defined by the sensitive attribute
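A simplified sketch of group-wise Platt scaling (assuming scores, labels, and group membership as NumPy arrays; the function name is mine): fit a one-dimensional sigmoid mapping raw score to P(Y = 1 | R) within each group.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale_by_group(scores, y_true, a):
    """Fit a sigmoid from raw score to P(Y=1 | R) separately per group;
    calibrating each group this way targets the sufficiency criterion.
    Assumes every group contains examples of both outcome classes."""
    calibrated = np.empty_like(scores, dtype=float)
    for g in np.unique(a):
        m = a == g
        lr = LogisticRegression().fit(scores[m].reshape(-1, 1), y_true[m])
        calibrated[m] = lr.predict_proba(scores[m].reshape(-1, 1))[:, 1]
    return calibrated
```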


Case Study: Credit Scoring

- Extend a loan or not, based on the risk that the applicant will default on the loan
- Data from the Federal Reserve
  - A: demographic information (race)
  - R: credit score
  - Y: default or not (defined by the credit bureau)

Table: Credit score distribution by race

  Race or ethnicity   Samples with both score and outcome
  White               133,165
  Black                18,274
  Hispanic             14,702
  Asian                 7,906
  Total               174,047


Group-wise distribution of credit score

- The credit score distribution strongly depends on the group


Using credit score for classification

- How to make the classifier fair?


Four Strategies

1. Maximum profit: pick group-dependent score thresholds in a way that maximizes profit.
2. Single threshold: pick a single uniform score threshold for all groups in a way that maximizes profit.
3. Separation: achieve equal true/false positive rates in all groups. Subject to this constraint, maximize profit.
4. Independence: achieve an equal acceptance rate in all groups. Subject to this constraint, maximize profit.

What is the profit?
- Need to assume a reward for a true positive classification and a cost/penalty for a false positive classification
- We will assume that the cost of a false positive is 6 times greater than the reward for a true positive (a sketch of the maximum-profit strategy follows below)


Comparing different criteria


References I

[1] S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairmlbook.org, 2019. http://www.fairmlbook.org.

[2] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS '12, pages 214–226, New York, NY, USA, 2012. Association for Computing Machinery.

[3] D. Ensign, S. A. Friedler, S. Neville, C. Scheidegger, and S. Venkatasubramanian. Runaway feedback loops in predictive policing. In Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA, volume 81 of Proceedings of Machine Learning Research, pages 160–171. PMLR, 2018.


References II

[4] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10, 2000.

[5] L. Roth. Looking at Shirley, the ultimate norm: Colour balance, image technologies, and cognitive equity. Canadian Journal of Communication, 34:111–136, 2009.
