### How to spot outliers in regression analysis

Much of the debate over the possible pause in surface temperatures since 1998 really hinges on 1998 being an outlier. And not only an outlier but an influential data point, which means that its very presence changes the overall regression trend. In this post, I'll show how to identify outliers, high-leverage data points, and influential data points.

First, some basic definitions. An outlier is any data point that falls outside the normal range for its data set, usually defined as more than 2 standard deviations from the average; in regression analyses, an outlier is any data point whose residual falls outside the normal range. High-leverage data points lie at extreme values of the independent variable, where few other data points fall, regardless of whether they change the overall trend. An influential data point is an extreme outlier with high leverage that alters the overall trend.
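These definitions can be sketched numerically. The snippet below (in plain Python rather than R, with made-up data — it is an illustration, not part of the original analysis) fits a one-predictor regression and flags each category, using the closed-form leverage for simple regression, h_i = 1/n + (x_i − x̄)²/Sxx:

```python
import math

# Made-up data: a clean linear trend, plus one point that is far out in x
# (high leverage) AND far off the line (outlier) -- i.e., influential.
x = [float(i) for i in range(20)] + [40.0]
y = [0.5 * xi for xi in x[:20]] + [5.0]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
intercept = ybar - slope * xbar

resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(r * r for r in resid) / (n - 2))    # residual std. error
hat = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]    # leverage (hat values)
std_resid = [r / (s * math.sqrt(1 - h)) for r, h in zip(resid, hat)]

for i in range(n):
    is_outlier = abs(std_resid[i]) > 2        # residual outside +/- 2 SD
    high_leverage = hat[i] > 2 * 2 / n        # rule-of-thumb cutoff 2(k+1)/n
    if is_outlier and high_leverage:
        print(i, "influential")
    elif is_outlier:
        print(i, "outlier")
    elif high_leverage:
        print(i, "high leverage")
```

Only the appended point (index 20) gets flagged, and it gets flagged as influential: it is both far from the other x-values and far off the trend line.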

Now for the analysis, starting with the basics. First, create the regression model, using the subset argument to limit the time period.

```r
Model = lm(UAH ~ Time, data = Climate, subset = Time >= 1998)
summary(Model)
```

```
Call:
lm(formula = UAH ~ Time, data = Climate, subset = Time >= 1998)

Residuals:
     Min       1Q   Median       3Q      Max
-0.47575 -0.11244  0.01165  0.09604  0.53415

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -11.176493   5.631428  -1.985   0.0487 *
Time          0.005656   0.002808   2.015   0.0454 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1741 on 186 degrees of freedom
Multiple R-squared:  0.02135, Adjusted R-squared:  0.01609
F-statistic: 4.059 on 1 and 186 DF,  p-value: 0.04539
```

```r
par(mfrow = c(2, 2))  # Create a 2 x 2 plot matrix
plot(Model)           # Default diagnostic plots
```

The default diagnostic plots reveal a good fit except for several months during 1998, which is especially obvious in the Residuals vs Fitted plot. Digging deeper requires more tools than base R provides; the car package (Companion to Applied Regression) supplies them.

```r
install.packages("car")   # note: the package name must be quoted
library(car)
influencePlot(Model, id.method="identify", main="Influence Plot",
              sub="Circle size is proportional to Cook's Distance")
```

Standardized residuals above 2 or below -2 mark outliers. Points with hat-values above 0.025 and standardized residuals between -2 and 2 are high-leverage points. Points with hat-values above 0.025 and standardized residuals above 2 or below -2 are influential points that significantly alter the overall trend. According to this analysis, there are multiple outliers, with two influential points and one borderline. In R, the influencePlot() call above lets you identify the points directly by clicking on them. The two influential points are numbers 2989 and 2990, which can then be pulled out of the main data frame.
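As an aside on where that 0.025 cutoff plausibly comes from: a common rule of thumb flags hat-values above 2(k+1)/n. That rule is an assumption on my part, but with k = 1 predictor and n = 188 observations (the 186 residual degrees of freedom above plus 2 estimated parameters) it lands close to 0.025:

```python
# Rule-of-thumb hat-value cutoff 2*(k+1)/n (assumed, not from the post itself)
n, k = 188, 1              # 186 residual df + 2 estimated parameters
cutoff = 2 * (k + 1) / n
print(round(cutoff, 4))    # 0.0213, in line with the 0.025 used in the text
```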

```r
Climate[2989,]
```

```
     Year Month Time GISS  UAH HadCRUT4   NCDC HadSST3  RSS  PDO  AMO
2989 1998     1 1998  0.6 0.47    0.488 0.5967   0.419 0.55 0.83 0.15
       MEI Sunspots      TSI   AOD Arctic.Ice Antarctic.Ice    CO2
2989 2.483     31.9 1365.913 0.004      14.67          4.46 365.18
```

```r
Climate[2990,]
```

```
     Year Month    Time GISS  UAH HadCRUT4   NCDC HadSST3   RSS  PDO   AMO
2990 1998     2 1998.08 0.86 0.65    0.754 0.8501   0.478 0.736 1.56 0.311
       MEI Sunspots      TSI    AOD Arctic.Ice Antarctic.Ice    CO2
2990 2.777     40.3 1365.808 0.0037       15.7          2.99 365.98
```

The borderline point is 2992.

```r
Climate[2992,]
```

```
     Year Month    Time GISS  UAH HadCRUT4   NCDC HadSST3   RSS  PDO   AMO
2992 1998     4 1998.25 0.62 0.66    0.621 0.7371   0.489 0.857 1.27 0.315
       MEI Sunspots      TSI    AOD Arctic.Ice Antarctic.Ice    CO2
2992 2.673     53.4 1365.928 0.0031      14.84          6.85 368.61
```

Now that those points are identified, we can determine how much they influence the trend by rerunning the analysis while excluding them.

```r
Climate.ex = Climate[c(-2989, -2990, -2992),]
Model.2 = lm(UAH ~ Time, data = Climate.ex, subset = Time >= 1998)
summary(Model.2)
```

```
Call:
lm(formula = UAH ~ Time, data = Climate.ex, subset = Time >= 1998)

Residuals:
     Min       1Q   Median       3Q      Max
-0.45049 -0.11412  0.01253  0.08884  0.46394

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.174211   5.429297  -3.163  0.00183 **
Time          0.008642   0.002707   3.193  0.00166 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1639 on 183 degrees of freedom
Multiple R-squared:  0.05277, Adjusted R-squared:  0.0476
F-statistic:  10.2 on 1 and 183 DF,  p-value: 0.001658
```

Excluding just those three points raises the estimated trend from 0.005656ºC per year to 0.008642ºC per year. Those three outliers artificially tilted the trend toward showing less warming, demonstrating the main problem with either starting or ending a regression trend on extreme outliers.
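As a quick sanity check of that effect (with synthetic numbers, not the actual UAH series), the sketch below builds a monthly series with a known trend of 0.0086 per year, spikes the first three months the way a strong El Niño would, and compares the fitted slopes with and without those points:

```python
import random

# Synthetic monthly series, 188 points starting in 1998, true trend 0.0086/yr
random.seed(42)
t = [1998 + i / 12 for i in range(188)]
y = [0.0086 * (ti - 1998) + random.gauss(0, 0.15) for ti in t]
for i in range(3):
    y[i] += 0.5                      # warm spike at the start of the record

def ols_slope(t, y):
    n = len(t)
    tbar, ybar = sum(t) / n, sum(y) / n
    sxx = sum((ti - tbar) ** 2 for ti in t)
    return sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y)) / sxx

with_spike = ols_slope(t, y)
without_spike = ols_slope(t[3:], y[3:])
print(with_spike, without_spike)     # the spiked start yields the smaller slope
```

Because the spiked points sit at one end of the record, they have high leverage as well as large residuals, so a handful of them is enough to visibly flatten the fitted trend.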

A sharp observer may ask why I did not use a generalized least squares (gls) analysis from the nlme package to factor out autocorrelation in my examples. The reason is simply that the tools in the car package are not built to handle gls analyses. Additionally, autocorrelation mainly affects the standard errors and significance tests, not the identification of outliers, high-leverage points, and influential points.
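That last point can be illustrated with a small simulation: even when the noise is strongly autocorrelated, an injected outlier still dominates the ordinary standardized residuals. This is a toy AR(1) sketch in plain Python, not a proof:

```python
import math, random

# AR(1) noise (coefficient 0.6) under a weak linear trend, with one
# injected outlier at index 60
random.seed(7)
n = 120
noise = [0.0]
for _ in range(n - 1):
    noise.append(0.6 * noise[-1] + random.gauss(0, 0.1))
x = list(range(n))
y = [0.01 * xi + e for xi, e in zip(x, noise)]
y[60] += 1.0                         # injected outlier, ~8x the noise sd

# Plain OLS fit, ignoring the autocorrelation (as lm() does)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(r * r for r in resid) / (n - 2))
hat = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
std_resid = [r / (s * math.sqrt(1 - h)) for r, h in zip(resid, hat)]

print(max(range(n), key=lambda i: abs(std_resid[i])))   # 60, the outlier
```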
