Robust regression: Robust estimation of regression coefficients in the linear regression model
Jiří Franc, Czech Technical University, Faculty of Nuclear Sciences and Physical Engineering, Department of Mathematics
16. 11. 2009
In the linear regression model Y_i = X_i^T β^0 + e_i, i = 1, . . . , n:
X_i = (X_{i,1}, X_{i,2}, . . . , X_{i,p})^T is called the vector of explanatory variables or regressors; it is a sequence of deterministic p-dimensional vectors or a sequence of random variables,
Y_i is called the response variable (dependent variable) and is the i-th element of the random sequence of observations,
β^0 = (β^0_1, β^0_2, . . . , β^0_p)^T is the p-dimensional vector of true regression coefficients,
e_i is called the sequence of disturbances (error terms); it represents unexplained variation in the dependent variable and is a sequence of random variables.
Under certain conditions OLS is the best linear unbiased estimator of β^0.
Under certain conditions OLS is the best estimator among all unbiased estimators (ordinary least squares is optimal for multiple regression when the iid errors are normally distributed).
OLS is not robust and consequently often gives false results for real data (even a single regression outlier can completely spoil the OLS estimator).
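The effect of a single outlier on OLS can be illustrated numerically. A minimal sketch (the data, seed and coefficients are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, n)   # true beta0 = (2, 3)

beta_clean, *_ = np.linalg.lstsq(X, y, rcond=None)

y_bad = y.copy()
y_bad[-1] = -100.0                            # a single gross outlier in y
beta_bad, *_ = np.linalg.lstsq(X, y_bad, rcond=None)

print("clean fit:", beta_clean)               # close to (2, 3)
print("with one outlier:", beta_bad)          # slope pulled far from 3
```

One corrupted observation out of fifty is enough to move the OLS slope by more than a full unit.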
Goals of robust regression:
description of the structure best fitting the bulk of the data,
identification of deviating data points (outliers) or deviating substructures for further treatment,
identification of highly influential data points (leverage points), or at least a warning about them,
dealing with unsuspected serial correlations.
Two ways to deal with regression outliers:
Regression diagnostics: certain quantities are computed from the data with the purpose of pinpointing influential points, after which these outliers can be removed or corrected.
Robust regression: tries to devise estimators that are not so strongly affected by outliers.
where (Y_(1), . . . , Y_(n)) is the order statistic and α is a fixed coefficient, α ∈ [0, 1].
α-regression quantile (Koenker, Bassett (1978))
β̂(L_rq, α) = arg min_{β∈R^p} Σ_{i=1}^n ρ_α(r_i(β)),
where r_i(β) = Y_i − X_i^T β, i = 1, . . . , n, and
ρ_α(r) = α|r| if r ≥ 0, (1 − α)|r| if r < 0.
L-estimators are scale equivariant and regression equivariant. However, the breakdown point is still 0%, and for α = 0.5 the regression quantile estimator coincides with the least absolute values (L1) estimator.
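The 0.5-regression quantile (least absolute values fit) can be computed exactly as a linear program, splitting each residual into its positive and negative parts. A sketch using scipy (the toy data and the outlier are invented for illustration):

```python
import numpy as np
from scipy.optimize import linprog

def quantile_fit(X, y, alpha=0.5):
    # LP formulation: write r_i = u_i - v_i with u_i, v_i >= 0 and minimize
    # sum(alpha * u_i + (1 - alpha) * v_i) subject to X b + u - v = y
    n, p = X.shape
    c = np.concatenate([np.zeros(p), alpha * np.ones(n), (1 - alpha) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

x = np.arange(10.0)
X = np.column_stack([np.ones(10), x])
y = 1.0 + 2.0 * x
y[9] = 100.0                 # one gross outlier in y
beta = quantile_fit(X, y)    # median regression recovers (1, 2) despite it
```

Setting alpha to other values in (0, 1) gives the corresponding regression quantile.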
OLS and L1 are also M-estimators, with ψ(t) = t for OLS and ψ(t) = sgn(t) for the L1 estimate.
M-estimators are unfortunately not scale equivariant, even though they are regression equivariant. Hence one necessarily has to studentize the M-estimators by an estimate σ̂ of the scale of the disturbances:
β̂(M) = arg min_{β∈R^p} Σ_{i=1}^n ρ(r_i(β)/σ̂).
One possibility is to use the median absolute deviation (MAD):
σ̂ = C · med_i |r_i − med_j r_j|,
where C is a constant (correction factor) which depends on the distribution. For normally distributed data C = 1.4826.
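A small numpy sketch of this scale estimate (the test data are invented; the constant 1.4826 makes MAD consistent for the standard deviation under normality):

```python
import numpy as np

def mad_scale(r, c=1.4826):
    # sigma_hat = C * med_i | r_i - med_j r_j |
    return c * np.median(np.abs(r - np.median(r)))

rng = np.random.default_rng(1)
r = rng.normal(0.0, 2.0, 100_000)
s_clean = mad_scale(r)                      # approximately 2.0

r_bad = np.concatenate([r, np.full(1000, 1e6)])
s_bad = mad_scale(r_bad)                    # barely moves: MAD ignores the outliers
```

Contaminating 1% of the sample with enormous values changes the estimate only marginally, in contrast to the sample standard deviation.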
Let the vectors (Y_i, X_i^T)^T be iid with distribution function F(X, Y). If the function ρ has an absolutely continuous derivative ψ, σ = 1 for simplicity, and T(F) is the functional corresponding to the M-estimator, then T(F) is the solution of
∫ ψ(Y − X^T T(F)) X dF(X, Y) = 0.
Define
M(ψ, F) := ∫ ψ′(Y − X^T T(F)) X X^T dF(X, Y).
Then the influence function of T at a distribution F (on R^p × R) is given by
IF(X_0, Y_0; T, F) = M^{-1}(ψ, F) X_0 ψ(Y_0 − X_0^T T(F)).
The influence function with respect to Y_0 can be bounded by the choice of ψ, but the influence function of M-estimators is unbounded with respect to X_0.
The breakdown point of M-estimators is 0% due to this vulnerability to leverage points.
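A studentized M-estimator can be computed by iteratively reweighted least squares. The sketch below uses Huber's ψ with a MAD scale estimate; the tuning constant k = 1.345 and the test data are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

def huber_m_fit(X, y, k=1.345, n_iter=50):
    # IRLS for the Huber M-estimator, studentized by the MAD scale
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # OLS starting point
    for _ in range(n_iter):
        r = y - X @ beta
        s = 1.4826 * np.median(np.abs(r - np.median(r)))  # MAD scale
        u = r / s
        w = np.where(np.abs(u) <= k, 1.0, k / np.abs(u))  # w(u) = psi(u)/u
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)        # weighted LS step
    return beta

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 60)
X = np.column_stack([np.ones(60), x])
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, 60)
y[:5] += 50.0                                  # vertical outliers (in y only)
beta_m = huber_m_fit(X, y)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Note that the outliers here are vertical; consistent with the influence-function analysis above, this estimator still offers no protection against leverage points in X.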
Generalized M-estimators are introduced in order to bound the influence function at outlying X_i's by means of some weight function w.
GM-estimators
β̂(GM) = arg min_{β∈R^p} Σ_{i=1}^n w(X_i) ρ(r_i(β)/σ̂).
The definition can be rewritten as
Σ_{i=1}^n w(X_i) ψ(r_i/σ̂) X_i = 0.
Unfortunately, Maronna, Bustos and Yohai (1979) showed that the breakdown point of GM-estimators can be no better than a bound that decreases as 1/p, where p is the number of regression coefficients.
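One concrete choice (a Mallows-type sketch, with ad hoc constants, not the only possibility) is to derive w(X_i) from a robust distance in the regressor space and combine it with Huber residual weights in an IRLS loop:

```python
import numpy as np

def mad(v):
    # MAD scale with the normal-consistency factor
    return 1.4826 * np.median(np.abs(v - np.median(v)))

def gm_fit(X, x, y, k=1.345, b=2.0, n_iter=50):
    # Mallows-type GM sketch: position weights w(X_i) from a robust
    # distance in x, combined with Huber residual weights (constants ad hoc)
    d = np.abs(x - np.median(x)) / mad(x)           # robust distance of each x_i
    wx = np.minimum(1.0, b / np.maximum(d, 1e-12))  # downweight leverage points
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        r = y - X @ beta
        s = mad(r)
        u = r / s
        wr = np.where(np.abs(u) <= k, 1.0, k / np.abs(u))
        w = wx * wr
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return beta

rng = np.random.default_rng(5)
x = np.linspace(0.0, 10.0, 40)
xx = np.append(x, 100.0)               # one extreme leverage point
yy = 1.0 + 2.0 * xx + rng.normal(0.0, 0.1, 41)
yy[-1] = 0.0                           # bad leverage point: wrong y at x = 100
X = np.column_stack([np.ones(41), xx])
beta_gm = gm_fit(X, xx, yy)            # slope stays near 2
```

The position weight wx practically removes the bad leverage point, which a plain M-estimator cannot do.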
R-estimation is a procedure based on the ranks of the residuals. The idea of using rank statistics was extended to the domain of multiple regression by Jurečková (1971) and Jaeckel (1972).
Let R_i(β) be the rank of Y_i − X_i^T β, and further let a_n(i), i = 1, . . . , n, be a nondecreasing sequence of scores.
S-estimators were introduced by Rousseeuw and Yohai (1984); they are derived from a scale statistic in an implicit way, and they are regression, scale and affine equivariant.
S-estimators are defined by minimization of the dispersion of the residuals:
S-estimator
β̂(S, c, K) = arg min_{β∈R^p} s(r_1(β), . . . , r_n(β)) = arg min_{β∈R^p} s(β),
where s(β) is an estimator of scale defined as the solution of
(1/n) Σ_{i=1}^n ρ(r_i(β)/s) = K.
MM-estimators: high-breakdown and high-efficiency estimators, where the initial estimate is obtained with an S-estimator and is then improved with an M-estimator.
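The two-stage idea can be sketched as follows. For simplicity the high-breakdown start is approximated here by random elemental fits (standing in for a proper S-estimator), the fixed scale by MAD, and the refinement uses Tukey's bisquare with the conventional c = 4.685; all of these are illustrative assumptions:

```python
import numpy as np

def elemental_start(X, y, n_subsets=200, seed=0):
    # crude high-breakdown start (stands in for the S-step of an MM-estimator)
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best, best_crit = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(n, size=p, replace=False)
        try:
            b = np.linalg.solve(X[idx], y[idx])
        except np.linalg.LinAlgError:
            continue
        crit = np.median(np.abs(y - X @ b))
        if crit < best_crit:
            best_crit, best = crit, b
    return best

def mm_fit(X, y, c=4.685, n_iter=50, seed=0):
    # MM sketch: robust start + fixed robust scale + bisquare IRLS refinement
    beta = elemental_start(X, y, seed=seed)
    r = y - X @ beta
    s = 1.4826 * np.median(np.abs(r - np.median(r)))   # scale held fixed
    for _ in range(n_iter):
        u = (y - X @ beta) / s
        w = np.where(np.abs(u) <= c, (1 - (u / c) ** 2) ** 2, 0.0)  # bisquare
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return beta

rng = np.random.default_rng(4)
x = np.linspace(0.0, 10.0, 50)
X = np.column_stack([np.ones(50), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, 50)
y[:15] = 40.0                          # 30% contamination
beta_mm = mm_fit(X, y)
```

Holding the scale fixed at the value from the robust start is what preserves the high breakdown point while the M-step buys back efficiency.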
LMS is probably the first really applicable 50% breakdown point estimator, introduced by Rousseeuw (1984). The idea of the LMS is to replace the sum operator by a median, which is very robust.
Least median of squares estimator
β̂(LMS) = arg min_{β∈R^p} med_i r_i²(β).
There always exists a solution for the LMS estimator. The LMS estimator is regression equivariant, scale equivariant and affine equivariant. If p > 1, then the breakdown point of the LMS method is (⌊n/2⌋ − p + 2)/n.
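Because the LMS objective is non-smooth and non-convex, it is in practice approximated by exact fits to random p-point subsets (the resampling idea of Rousseeuw). A toy sketch, with the number of subsets and the contaminated test data chosen ad hoc:

```python
import numpy as np

def lms_fit(X, y, n_subsets=500, seed=0):
    # approximate arg min_beta med_i r_i^2 by exact fits to random p-subsets
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_crit = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])   # exact fit to p points
        except np.linalg.LinAlgError:
            continue
        crit = np.median((y - X @ beta) ** 2)
        if crit < best_crit:
            best_crit, best_beta = crit, beta
    return best_beta

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 50)
X = np.column_stack([np.ones(50), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, 50)
y[:15] = 40.0                        # 30% contamination
beta_lms = lms_fit(X, y)             # still tracks the clean majority
```

With 30% of the responses corrupted, any fit through the outliers leaves a large median squared residual, so the search settles on the clean majority.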
The drawback of the LTS is that the objective function requires sorting of the squared residuals, which takes O(n log n) operations.
Robust high breakdown point estimators like the LTS can be very sensitive to a very small change of the data or to the deletion of even one point from the data set (i.e. a small change of the data can cause a large change of the estimate).
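A crude sketch of how the LTS objective is approximated in practice, in the spirit of FAST-LTS: random elemental starts improved by concentration steps (refit OLS on the h observations with the smallest squared residuals). The subset size h, the number of starts and the test data are ad hoc assumptions:

```python
import numpy as np

def lts_fit(X, y, h=None, n_starts=50, n_csteps=10, seed=0):
    # approximate LTS: minimize the sum of the h smallest squared residuals
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if h is None:
        h = (n + p + 1) // 2             # roughly half the data -> high breakdown
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])   # elemental start
        except np.linalg.LinAlgError:
            continue
        for _ in range(n_csteps):        # C-steps never increase the objective
            r2 = (y - X @ beta) ** 2
            keep = np.argsort(r2)[:h]
            beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta

rng = np.random.default_rng(4)
x = np.linspace(0.0, 10.0, 50)
X = np.column_stack([np.ones(50), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, 50)
y[:15] = 40.0                            # 30% contamination
beta_lts = lts_fit(X, y)
```

Each concentration step only needs a partial sort and one least-squares fit, which is what makes the approximation fast; the sensitivity to small data changes noted above still applies.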
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986), Robust Statistics: The Approach Based on Influence Functions, J. Wiley, New York.
Rousseeuw, P. J. and Leroy, A. M. (1987), Robust Regression and Outlier Detection, J. Wiley, New York.
Jurečková, J. (2001), Robustní statistické metody, Karolinum, Prague.
Víšek, J. Á. (2000), Regression with high breakdown point, Robust 2000, 2001, 324-356.