Unit 1c: Detecting Influential Data Points and Assessing Their Impact
© Andrew Ho, Harvard Graduate School of Education — Unit 1c, Slide 1
http://xkcd.com/539/
• Three measures of atypicality
• Leverage × Discrepancy = Influence
• Sensitivity Analyses
Course Roadmap: Unit 1c (Today's Topic Area)

Multiple Regression Analysis (MRA): Y_i = β_0 + β_1·X_1i + β_2·X_2i + ε_i

Do your residuals meet the required assumptions?
• Test for residual normality.
• Use influence statistics to detect atypical data points.
• If your residuals are not independent, replace OLS with GLS regression analysis:
  - Use individual growth modeling.
  - Specify a multi-level model.
• If time is a predictor, you need discrete-time survival analysis…
• If your outcome is categorical, you need to use:
  - Binomial logistic regression analysis (dichotomous outcome).
  - Multinomial logistic regression analysis (polytomous outcome).
• If you have more predictors than you can deal with:
  - Create taxonomies of fitted models and compare them.
  - Form composites of the indicators of any common construct.
  - Conduct a Principal Components Analysis.
  - Use Factor Analysis: EFA or CFA?
  - Use Cluster Analysis.
• If your outcome vs. predictor relationship is non-linear:
  - Transform the outcome or predictor.
  - Use non-linear regression analysis.
Anscombe's Quartet: four datasets with identical summary statistics. Same means, standard deviations, correlations, regression lines, t statistics, F statistics, and R².
A powerful argument for Exploratory Data Analysis.
The model: Ŷ = β̂_0 + β̂_1X (Unit 1d: Next Class)
How might we detect and describe these "atypical" observations?
Predicting children's understanding of causes of illness (n=194)

             Model 1    Model 2    Model 3    Model 4    Model 5    Model 6
D           -0.837***  -0.968***   3.591      3.478     -4.444
A           -0.936***  -0.898***   2.728*     2.969*    -1.457
LAGE                    2.090***   2.494***   2.511***   3.010***   2.511***
DxLAGE                            -0.939*    -0.899*     0.684
AxLAGE                            -0.754**   -0.788**    0.134
SES                                          -0.0857     1.268     -0.0875
DxSES                                                    2.549
AxSES                                                    0.999
LAGExSES                                                -0.277
DxLAGExSES                                              -0.505
AxLAGExSES                                              -0.211
ILL                                                                 3.130**
ILLxLAGE                                                           -0.823***
_cons        4.604***  -5.464***  -7.411***  -7.338***  -9.773***  -7.337***
R-sq         0.197      0.671      0.689      0.694      0.708      0.693
mss         39.74     135.2      138.8      139.8      142.8      139.7
rss        161.8       66.37      62.77      61.77      58.75      61.82
df_m         2          3          5          6         11          4
df_r       191        190        188        187        182        189
F           23.45     129.0       83.13      70.52      40.21     106.8

* p<0.05, ** p<0.01, *** p<0.001
D=Diabetic, A=Asthmatic, ILL=Diabetic or Asthmatic, LAGE=log-Age in months
When do we check for atypical observations? At least first… and last…
• Atypical observations are often spotted in initial exploratory analyses.
• However, an atypical observation may not be revealed until the addition of a particular predictor. A "multivariate outlier" is a multidimensional generalization of this bivariate outlier: an observation that is an outlier neither on X alone nor on Y alone, but an outlier on both jointly.
• In practice, we conduct initial exploratory analyses, build our model, and then conduct additional exploratory analyses on our "final" model.
We should be nervous here that some atypical observation has particular influence on all of these statistics. We check these assumptions now.
Classifying Atypical Observations

Discrepancy: Extreme-on-Y. A Large Residual.
• Does not necessarily have influence on the estimated regression coefficients.
• However, a large residual inflates SS_Error:
  - Increases MSE and RMSE
  - Reduces R²
  - Increases standard errors
  - Decreases t-statistics
  - Increases p-values
  - Makes it more difficult to reject H₀

Leverage: Extreme-on-X. Distant on One or More Predictors.
• Does not necessarily have influence on the estimated regression coefficients.
• However, the influence is unpredictable and may be quite large:
  - Erratic impact on SSE, MSE, RMSE, and R²
  - Erratic impact on standard errors, t-statistics, and hypothesis testing

Influence: a high-discrepancy, high-leverage observation will have a strong influence on estimated regression coefficients and an impact on all model fit statistics and hypothesis testing.
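To make the distinction concrete, here is a small numerical sketch (not from the slides; synthetic data, numpy only) in which the same baseline regression gains one extra point of each type. A high-discrepancy point at the center of X barely moves the slope, a high-leverage point lying near the line barely moves it either, but a point that is extreme on both pulls the slope hard.

```python
# Illustrative sketch: discrepancy vs. leverage vs. influence on a fitted slope.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 0.5, size=50)

def slope(xv, yv):
    """OLS slope of y on x (intercept included)."""
    X = np.column_stack([np.ones_like(xv), xv])
    return np.linalg.lstsq(X, yv, rcond=None)[0][1]

b_base = slope(x, y)

# High discrepancy, low leverage: large outlier on Y, at the center of X.
b_disc = slope(np.append(x, 5.0), np.append(y, 40.0))

# High leverage, low discrepancy: extreme X, but lying on the true line.
b_lev = slope(np.append(x, 30.0), np.append(y, 2.0 + 0.5 * 30.0))

# High discrepancy AND high leverage: extreme X, far from the line -> influence.
b_infl = slope(np.append(x, 30.0), np.append(y, 40.0))

print(b_base, b_disc, b_lev, b_infl)
```

Because the discrepant point sits exactly at the mean of X, it changes the intercept and the error variance but leaves the slope untouched; only the high-leverage, high-discrepancy point moves the coefficient substantially.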
Discrepancy (1): The Raw Residual

*--------------------------------------------------------------------------------
* Refit final regression model, estimate & output selected influence statistics
*--------------------------------------------------------------------------------
* Refit Model 6, the final regression model:
  eststo: regress ILLCAUSE ILL LAGE ILLxLAGE SES

* Output and summarize the predicted values ("capture" suppresses an error that
* would otherwise be generated if you ran this again, redefining the variable):
  capture predict PREDICTED, xb
  summarize PREDICTED

* Output and summarize the residuals:
* Raw residuals:
  capture predict RESID, residuals
  summarize RESID
Refit Model 6.
Save predicted values, Ŷ_i (that is, X_iβ̂).
Save residuals, e_i = Y_i − Ŷ_i, the "miss."
A one-predictor illustration of residuals:
. summarize RESID

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       RESID |       194    1.08e-09    .5659794  -1.649514   1.729051

The mean of residuals will always be 0. Why? The standard deviation will be close to the RMSE. Why?

http://hspm.sph.sc.edu/courses/J716/demos/LeastSquares/LeastSquaresDemo.html

. summarize PREDICTED

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   PREDICTED |       205    4.141373    .8490271   2.384122    5.86962
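The two "Why?" questions can be checked numerically. This is an illustrative numpy-only sketch with simulated data (the n and predictor count mirror Model 6, but the data are hypothetical): with an intercept in the model, the normal equations force the residuals to average exactly zero, and the residual standard deviation sits just below the RMSE because it divides the same sum of squares by n − 1 instead of n − k − 1.

```python
# Why residuals average to 0, and why their SD is close to (but below) the RMSE.
import numpy as np

rng = np.random.default_rng(1)
n, k = 194, 4                                   # same n and predictor count as Model 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5, 0.0]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

mean_resid = resid.mean()                       # ~0 exactly, by the normal equations
sd_resid = resid.std(ddof=1)                    # divides SSE by n - 1
rmse = np.sqrt((resid @ resid) / (n - k - 1))   # divides SSE by n - k - 1

print(mean_resid, sd_resid, rmse)
```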
Discrepancy (2, 3): The PRESS and Standardized Residual

* PRESS (Predicted Residual Sum of Squares) residuals:
  capture predict HATSTAT, leverage
  generate PRESS = RESID/(1-HATSTAT)
  summarize PRESS

* Standardized residuals:
  capture predict STDRESID, rstandard
  summarize STDRESID
Funny thing about atypical observations… they mask themselves. They draw the regression line to themselves, reducing their residuals.

The PRESS residual "unmasks" the atypicality of an observation by calculating a residual for a regression line that is estimated from a dataset that does not include the observation itself.

The standardized residual (also, confusingly, the standardized PRESS residual, and the internally studentized residual) is the PRESS residual expressed in terms of predicted standard deviations of residuals. This arguably results in a more interpretable statistic, where a residual of 2 or 3 standard deviations starts to seem "atypical."

. summarize PRESS

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       PRESS |       194   -.0002356    .5822095   -1.70661   1.799791

. summarize STDRESID

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    STDRESID |       194   -.0002017    1.003628  -2.933572   3.084371

This does not mean that we should discard observations with standardized residuals > 3, say. If we did, and we recalculated standardized residuals, what might we find?
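The "unmasking" claim can be verified directly. The sketch below (illustrative, numpy only, simulated data) computes the PRESS residual for one case from the full fit as e_i/(1 − h_i), then brute-forces the same quantity by actually deleting the case, refitting, and predicting it. The two agree exactly, which is why no n separate refits are needed.

```python
# PRESS residual e_i/(1-h_i) equals the true leave-one-out residual.
import numpy as np

rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix: puts the "hat" on Y
h = np.diag(H)                                  # leverages
press = resid / (1 - h)                         # PRESS residuals, from the full fit

# Brute force: delete observation 0, refit, and predict the deleted case.
Xd, yd = np.delete(X, 0, axis=0), np.delete(y, 0)
beta_d, *_ = np.linalg.lstsq(Xd, yd, rcond=None)
loo_resid = y[0] - X[0] @ beta_d

print(press[0], loo_resid)
```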
One outlier has a large residual. The other does not: it masks itself.
Leverage: Extremity of an observation along predictor variables
• We should consider both the discrepancy from the regression line and the leverage exerted on the regression line.
• With simple linear regression, leverage looks somewhat familiar (see bottom left).
• With many predictors, the leverage of an observation is a single expression of distance from many predictor means.
• Called h_i because leverage is an element of the "hat matrix," H = X(X′X)⁻¹X′: it puts the hat on the Ŷ.

http://www.stat.sc.edu/~west/javahtml/Regression.html

The leverage h_i enters the standardized residual:

    e_std(i) = (Y_i − Ŷ_i) / (RMSE · √(1 − h_i))

. summarize HATSTAT

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     HATSTAT |       205    .0258016    .0127337   .0103571   .0827226

Larger values, more leverage. The h_i sum to k + 1, so they tend to seem small in magnitude.
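The "sum to k + 1" property is easy to confirm. An illustrative numpy-only sketch with simulated predictors (the 205 observations mirror the summarize output above, but the data are hypothetical): the leverages are the diagonal of H = X(X′X)⁻¹X′, they are all positive, and they sum to the number of estimated coefficients, so any single h_i tends to look small.

```python
# Leverages: diagonal of the hat matrix; positive; sum to k + 1.
import numpy as np

rng = np.random.default_rng(3)
n, k = 205, 4                                   # 205 obs, 4 predictors + intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

print(h.sum(), h.mean())                        # sum = k + 1 = 5; mean = (k+1)/n
```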
[Figure: leverage (vertical) versus normalized residual squared (horizontal), the lvr2plot, with points labeled by child ID.]
Cook's Distance: Influence = Discrepancy × Leverage

Discrepancy: Extreme-on-Y. A Large Residual.
Leverage: Extreme-on-X. Distant on One or More Predictors.
Influence: High Discrepancy, High Leverage.

    d_i = [ h_i / (1 − h_i) ] × [ e_std(i)² / (k + 1) ]

The lvr2plot graphs these two ingredients against each other: leverage on the vertical axis, squared (normalized) standardized residual on the horizontal axis.

Influence Demo:
. findit regpt
. regpt
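The factored formula above is not just a heuristic; it is algebraically identical to Cook's D computed from its definition, the summed squared change in all fitted values when case i is deleted, scaled by (k + 1)·MSE. A numpy-only sketch with simulated data (illustrative, not the course's Stata code) checks both routes for one case:

```python
# Cook's D: factored formula vs. the delete-one-case definition.
import numpy as np

rng = np.random.default_rng(4)
n, k = 30, 1
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
mse = (resid @ resid) / (n - k - 1)
e_std = resid / np.sqrt(mse * (1 - h))          # internally studentized residual

# Route 1: the factored formula from the slide, for case 0.
d_formula = (h[0] / (1 - h[0])) * (e_std[0] ** 2 / (k + 1))

# Route 2: the definition -- refit without case 0, compare all fitted values.
Xd, yd = np.delete(X, 0, axis=0), np.delete(y, 0)
beta_d, *_ = np.linalg.lstsq(Xd, yd, rcond=None)
d_def = np.sum((X @ beta - X @ beta_d) ** 2) / ((k + 1) * mse)

print(d_formula, d_def)
```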
Exploratory Analysis of Discrepancy, Leverage, and Influence
* Identify cases that are extreme-on-Y:
* Plot standardized residuals versus ID to pick out the extreme-on-Y cases:
  graph twoway (scatter STDRESID ID, mlabel(ID) msize(small)), name(Unit1c_g2, replace)
* Sort and list only those cases atypical on the standardized residuals.
* Recall that they can be both positive and negative:
  sort STDRESID
  list ID STDRESID if STDRESID !=. in F/20, clean
  list ID STDRESID if STDRESID !=. in -20/L, clean

* Second, identify the cases that are extreme-on-X:
* Plot HATSTAT versus ID to pick out the extreme-on-X cases:
  graph twoway (scatter HATSTAT ID, mlabel(ID) msize(small)), name(Unit1c_g3, replace)
* Sort and list only the cases atypical on HATSTAT.
* Recall that the HAT statistic measures the "horizontal" distance of each
* case from the "center" of the data in the predictor plane. It can only
* take on positive values, and so only atypical cases with large positive
* and non-missing values are in contention:
  sort HATSTAT
  list ID HATSTAT if HATSTAT !=. in -20/L, clean

* Third, identify the cases that are most influential overall
* (COOKSDSTAT is assumed to have been created earlier, e.g., via
* "capture predict COOKSDSTAT, cooksd"):
* Plot COOKSDSTAT versus ID to pick out the most influential cases overall:
  graph twoway (scatter COOKSDSTAT ID, mlabel(ID) msize(small)), name(Unit1c_g4, replace)
* Sort and list the atypical cases that are most influential overall.
* Recall that Cook's D statistic measures overall impact on the generic fit.
* It can only take on positive values and so it is only atypical cases with
* large positive and non-missing values that are in contention:
  sort COOKSDSTAT
  list ID COOKSDSTAT if COOKSDSTAT !=. in -20/L, clean

Plot the value of each influence statistic versus the case ID; then list the IDs of the cases with extreme values. Use scatterplots of statistics on ID and list sorted statistics by ID.
[Figure: standardized residuals (vertical) versus Child Identification Code (horizontal, IDs 300–800), with points labeled by ID.]
Exploring Discrepancy with Standardized Residuals: Extreme on Y
         ID   STDRESID
   1.   444  -2.933572
   2.   307  -2.640082
   3.   441  -2.495741
   4.   617  -2.462099
   5.   424  -1.921746
   6.   310  -1.759873
   7.   602  -1.722051

         ID   STDRESID
 188.   745   2.100213
 189.   502   2.155126
 190.   702   2.262468
 191.   726   2.450768
 192.   621   2.522492
 193.   423   2.744562
 194.   553   3.084371

Note that extreme values of residuals can be positive or negative.
[Figure: leverage (HATSTAT, vertical) versus Child Identification Code (horizontal), with points labeled by ID.]
Exploring Leverage with the Hat Statistic: Extreme on X

         ID    HATSTAT
 199.   634   .0540155
 200.   331   .0542818
 201.   573   .0542818
 202.   537   .0563484
 203.   322   .0621393
 204.   568   .0677331
 205.   700   .0827226

Note that extreme values of leverage can only be positive.
[Figure: Cook's D (vertical) versus Child Identification Code (horizontal), with points labeled by ID.]
Exploring Influence with Cook's Distance: Extreme on X and Y

         ID  COOKSDSTAT
 188.   336    .0387276
 189.   424    .0418666
 190.   441    .0432583
 191.   745    .0460144
 192.   307    .0557836
 193.   444    .0595759
 194.   553    .0778421

Note that extreme values of Cook's Distance can only be positive.
[Figure: leverage (vertical) versus normalized residual squared (horizontal), the lvr2plot again, with points labeled by child ID.]
Back to the Leverage by Residual Squared Plot

For our sensitivity analysis, we'll take a look at these five points. In a real analysis, we would conduct a full substantive investigation of these children using all available qualitative and quantitative data.
         ID  HICOOK  HIRESID  HIHAT  ILLCAUSE  PREDIC~D  ILL  AGE  SES
 201.   700     0       0       1     3        2.910231    0   68    4
 202.   444     0       1       0     2.571    4.220514    1  181    4
 203.   307     0       1       0     2.857    4.337627    1  194    4
 204.   423     0       1       0     5.143    3.585113    1  112    2
 205.   553     1       1       0     4.571    2.841949    1   80    4
*--------------------------------------------------------------------------------
* Describe atypical observations.
*--------------------------------------------------------------------------------
* Identify atypical cases by their ID #s from the previous analysis.
* First, tag high-discrepancy observations:
  generate HIRESID = (ID==444 | ID==307 | ID==423 | ID==553)
* Second, tag high-leverage observations:
  generate HIHAT = (ID==700)
* Third, tag cases w/ high overall influence:
  generate HICOOK = (ID==553)

* Sort the atypical cases by their ID and selected characteristics:
  sort HICOOK HIRESID HIHAT ILLCAUSE ILL AGE SES ID
* List cases for inspection, in a table, sorted by type of atypicality:
  list ID HICOOK HIRESID HIHAT ILLCAUSE PREDICTED ILL AGE SES ///
       if HIRESID==1 | HIHAT==1 | HICOOK==1, sepby(HICOOK HIRESID HIHAT)
Exploring Atypical Observations
Classify atypical observations by their basis for being flagged. Sort and list cases by selected variables. Similar observations may be chunked together for sensitivity analyses (e.g., #307 and #444).
Bivariate Visualizations of Atypical Observations

[Figure: three scatterplots highlighting cases 700, 444, 307, 423, and 553: LAGE versus Hollingshead SES; Understanding of Illness Causality Score versus LAGE; and Understanding of Illness Causality Score versus Hollingshead SES.]
Sensitivity of predictions of children's understanding of causes of illness to atypical observations (n=194)

             Baseline    No 700     No 553     No 423     No 444     No 307   No 307&444
ILL           3.130**    3.160**    2.794*     3.026**    2.845*     2.827*     2.519*
LAGE          2.511***   2.517***   2.515***   2.510***   2.508***   2.509***   2.505***
ILLxLAGE     -0.823***  -0.829***  -0.754**   -0.806***  -0.764**   -0.760**   -0.695**
SES          -0.0875    -0.0892    -0.104*    -0.0789    -0.0722    -0.0737    -0.0572
_cons        -7.337***  -7.362***  -7.323***  -7.344***  -7.350***  -7.349***  -7.363***
R-sq          0.693      0.691      0.708      0.704      0.704      0.702      0.714
mss         139.7      138.4      142.6      141.2      140.1      140.4      140.9
rss          61.82      61.82      58.71      59.36      59.01      59.54      56.53
df_m          4          4          4          4          4          4          4
df_r        189        188        188        188        188        188        187
F           106.8      105.3      114.2      111.8      111.6      110.8      116.5

* p<0.05, ** p<0.01, *** p<0.001
ILL=Diabetic or Asthmatic, LAGE=log-Age in months
Sensitivity Analysis
• The effects of ILL and log-age (LAGE), both as main effects and in their two-way interaction, are fairly robust to the removal of the atypical data points. The basic substantive story is preserved throughout.
• The main effect of SES fluctuates in magnitude and significance as atypical observations are excluded: the usefulness of socioeconomic status in the prediction of ILLCAUSE is sensitive to the inclusion of atypical observations in this model.
• In practice, we typically restrict investigation to high-influence points, unless there is a serious question about whether the other atypical observations are actually part of the target population.
• Goodness-of-fit (the R² statistic) is robust to the removal of atypical observations. It even rises a little when low-SES children are omitted, due to the removal of two large negative residuals from SSE.
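The refit-with-each-case-deleted loop behind a table like this can be sketched compactly. This is a minimal numpy-only illustration with hypothetical data (not the course's ILLCAUSE dataset): one atypical observation is planted, and the loop reports how far the coefficient vector moves when each flagged case is dropped.

```python
# Sensitivity-analysis sketch: refit without each flagged case, compare coefficients.
import numpy as np

rng = np.random.default_rng(5)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)
y[0] += 8.0                                     # plant one atypical observation

X = np.column_stack([np.ones(n), x1, x2])
baseline, *_ = np.linalg.lstsq(X, y, rcond=None)

flagged = [0, 1, 2]                             # hypothetical case indices to probe
for i in flagged:
    Xd, yd = np.delete(X, i, axis=0), np.delete(y, i)
    b, *_ = np.linalg.lstsq(Xd, yd, rcond=None)
    print(f"drop case {i}: max |coef change| = {np.abs(b - baseline).max():.4f}")
```

Dropping the planted outlier moves the coefficients far more than dropping either typical case, which is the pattern the table above probes column by column.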