Sunday, October 20, 2013

How to spot outliers in regression analysis

Much of the debate over the possible pause in surface temperatures since 1998 really hinges on 1998 being an outlier.  And not only an outlier but an influential data point, which means that its very presence changes the overall regression trend.  In this post, I'll show how to identify outliers, high-leverage data points, and influential data points.

First, some basic definitions.  An outlier is any data point that falls outside the normal range for that data set, usually set as being 2 standard deviations from the average.  In regression analyses, an outlier is any data point where its residual falls outside the normal range.  High leverage data points are made at extreme values for the independent variables such that there are few other data points around, regardless of whether or not those data points change the overall trend.  An influential data point is an extreme outlier with high leverage that alters the overall trend.

Now for the analysis, starting with the basics.  First, create the regression model, using the subset argument to limit the time period.

Model=lm(UAH~Time, data=Climate, subset=Time>=1998)
summary(Model)
Call:
lm(formula = UAH ~ Time, data = Climate, subset = Year.1 >=
    1998)

Residuals:
     Min       1Q   Median       3Q      Max
-0.47575 -0.11244  0.01165  0.09604  0.53415

Coefficients:
                      Estimate      Std. Error    t value    Pr(>|t|)
(Intercept)   -11.176493    5.631428    -1.985     0.0487 *
Time            0.005656     0.002808     2.015     0.0454 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1741 on 186 degrees of freedom
Multiple R-squared: 0.02135,    Adjusted R-squared: 0.01609
F-statistic: 4.059 on 1 and 186 DF,  p-value: 0.04539
par(mfrow=c(2,2)  #Create a 2 x 2 plot matrix
plot(Model)  #Default diagnostic plots 



The default diagnostic plots reveal a good fit except for several months during 1998, which is especially obvious in the Residual vs Fitted plot.  Digging deeper requires more tools than the default offers, which the car package (Companion to Applied Regression) offers.
install.packages(car)
library(car)
influencePlot(Model, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")


Standardized residuals above 2 and below -2 are outliers.  Points with Hat-Values above 0.025 and standardized residuals between -2 and 2 are high leverage points.  Data points with Hat-Values above 0.025 and standardized residuals above 2 and below -2 are influential points that significantly alter the overall trend.  According to this analysis, there are multiple outliers, with two influential points and one boarderline.  In R, the code I gave will allow you to directly identify the points by clicking on them.  The two influential points are numbers 2989 and 2990 which can then be pulled out of the main data frame.
Climate[2989,]
          Year  Month Time  GISS  UAH   HadCRUT4   NCDC   HadSST3    RSS    PDO    AMO
2989  1998     1       1998     0.6     0.47       0.488          0.5967     0.419       0.55     0.83      0.15
            MEI    Sunspots      TSI         AOD    Arctic.Ice    Antarctic.Ice    CO2
          2.483       31.9        1365.913  0.004      14.67             4.46            365.18
Climate[2990,]
          Year    Month    Time     GISS   UAH   HadCRUT4   NCDC   HadSST3   RSS    PDO   AMO
2990   1998     2        1998.08   0.86     0.65         0.754        0.8501       0.478      0.736  1.56    0.311
           MEI     Sunspots      TSI         AOD     Arctic.Ice   Antarctic.Ice    CO2
          2.777       40.3        1365.808   0.0037       15.7          2.99             365.98
The boarderline point is 2992.
Climate[2992,]
          Year   Month    Time      GISS   UAH   HadCRUT4   NCDC   HadSST3    RSS    PDO   AMO  
2992  1998     4         1998.25   0.62     0.66       0.621          0.7371      0.489       0.857    1.27    0.315
            MEI    Sunspots      TSI          AOD      Arctic.Ice    Antarctic.Ice     CO2
           2.673     53.4        1365.928   0.0031         14.84             6.85           368.61
Now that those points are identified, we can determine how much they influence the trend by rerunning the analysis while excluding those points.
Climate.ex = Climate[c(-2989, -2990, -2992),]
Model.2 = lm(UAH~Time, data=Climate.ex, subset=Time>=1998)
summary(Model.2)
Call:
lm(formula = UAH ~ Time, data = Climate.ex, subset = Time >= 1998)

Residuals:
     Min       1Q   Median       3Q      Max
-0.45049 -0.11412  0.01253  0.08884  0.46394

Coefficients:
                    Estimate        Std. Error    t value    Pr(>|t|)  
(Intercept)   -17.174211   5.429297     -3.163     0.00183 **
Time              0.008642    0.002707      3.193     0.00166 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1639 on 183 degrees of freedom
Multiple R-squared: 0.05277,    Adjusted R-squared: 0.0476
F-statistic:  10.2 on 1 and 183 DF,  p-value: 0.001658

Excluding just those three points raised the estimated trend from 0.005656ºC per year to 0.008642ºC, showing that those three outliers artificially tilted the trend toward showing less warming and demonstrating the main problem with either starting or ending regression trends with extreme outliers.

A sharp observer may ask why I did not use a generalized least squares (gls) analysis from the nlme package to factor out autocorrelation for my examples.  The reason is simply that the tools in the car package are not built to handle gls analyses.  Additionally, autocorrelation does not really matter for identifying outliers, high-leverage points, and influential points.

No comments:

Post a Comment