Multiple Regression (Part 3) Diagnostics
In the exercises below we cover some more material on multiple regression diagnostics in R. This includes added-variable (partial-regression) plots, component+residual (partial-residual) plots, CERES plots, VIF values, tests for heteroscedasticity (nonconstant variance), tests for Normality, and a test for autocorrelation of residuals. These are perhaps less common than the diagnostics we have seen in Multiple Regression (Part 2), but they are valuable aids for investigating our model’s assumptions.
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Multiple Regression (Part 2) Diagnostics can be found here.
As usual, we will be using the dataset state.x77, which is part of the state datasets available in R. (Additional information about the dataset is available on its help page, for example by running help(state.x77).)
First, please run the following code to obtain and format the data as usual:
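state77 <- as.data.frame(state.x77)
names(state77)[4] <- "Life.Exp"  # column 4 is "Life Exp"; give it a syntactic name
names(state77)[6] <- "HS.Grad"   # column 6 is "HS Grad"; give it a syntactic name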
Exercise 1
For the model with Life.Exp as dependent variable, and HS.Grad and Murder as predictors, suppose we would like to study the marginal effect of each predictor variable, given that the other predictor is in the model.
a. Use a function from the car package to obtain added-variable (partial regression) plots for this purpose.
b. Re-create the added-variable plots from part a., labeling the two most influential points in the plots (according to Mahalanobis distance).
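To see what an added-variable plot actually displays, it can help to build one by hand in base R. The sketch below is not the car-based solution the exercise asks for. It assumes the model Life.Exp ~ HS.Grad + Murder, and the object names (fit, res_y, res_x, av_fit) are chosen here purely for illustration.

```r
# Data setup as in the exercise set
state77 <- as.data.frame(state.x77)
names(state77)[4] <- "Life.Exp"
names(state77)[6] <- "HS.Grad"

# Full model (assumed for this sketch)
fit <- lm(Life.Exp ~ HS.Grad + Murder, data = state77)

# Added-variable plot for HS.Grad, built by hand:
# remove the linear effect of the other predictor (Murder) from both
# the response and HS.Grad, then plot residuals against residuals.
res_y <- residuals(lm(Life.Exp ~ Murder, data = state77))
res_x <- residuals(lm(HS.Grad ~ Murder, data = state77))

plot(res_x, res_y,
     xlab = "HS.Grad | Murder", ylab = "Life.Exp | Murder",
     main = "Added-variable plot for HS.Grad (by hand)")
av_fit <- lm(res_y ~ res_x)
abline(av_fit)

# The slope of the fitted line equals the HS.Grad coefficient in the full model
av_slope  <- unname(coef(av_fit)["res_x"])
full_coef <- unname(coef(fit)["HS.Grad"])
c(av_slope = av_slope, full_coef = full_coef)
```

The two printed numbers should agree, which is why the plot can be read as showing the marginal effect of HS.Grad once Murder is already in the model.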
Exercise 2
a. Illiteracy is highly correlated with both HS.Grad and Murder. To illustrate problems that occur when multicollinearity exists, suppose we would like to study the marginal effect of Illiteracy (only), given that HS.Grad and Murder are in the model. Use a function from the car package to get the relevant added-variable plot.
b. From the correlation matrix in the previous Exercise Set, we know that Population and Area are the variables least strongly correlated with Life.Exp. Create added-variable plots for each of these two variables, given that all other six variables are in the model.
Exercise 3
Consider the model with
Area as predictors. Create component+residual (partial-residual) plots for this model.
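As background, a component+residual plot for a predictor shows the ordinary residuals with that predictor's fitted linear component added back, plotted against the predictor itself. The base-R sketch below builds one such plot by hand. The model used (Life.Exp ~ Murder + Area) is an assumption made only for this illustration, and the object names are hypothetical.

```r
# Data setup as in the exercise set
state77 <- as.data.frame(state.x77)
names(state77)[4] <- "Life.Exp"
names(state77)[6] <- "HS.Grad"

# Illustrative model only (an assumption for this sketch)
fit <- lm(Life.Exp ~ Murder + Area, data = state77)

# Partial residuals for Area: ordinary residuals plus the fitted Area component
b_area    <- unname(coef(fit)["Area"])
pres_area <- residuals(fit) + b_area * state77$Area

plot(state77$Area, pres_area,
     xlab = "Area", ylab = "Component + residual (Area)")
pr_fit <- lm(pres_area ~ state77$Area)
abline(pr_fit, lty = 2)                              # least-squares line
lines(lowess(state77$Area, pres_area), col = "red")  # smooth, to reveal curvature

# The least-squares line through the partial residuals has slope b_area
pr_slope <- unname(coef(pr_fit)[2])
```

A smooth that departs clearly from the dashed line suggests that the predictor may enter the model nonlinearly.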
Exercise 4
Create CERES plots for the model in Exercise 3.
Exercise 5
As an illustration of strong collinearity, compute VIF (Variance Inflation Factor) values for a model with Life.Exp as the response and all the other variables as predictors. Which variables seem to be causing the most problems?
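For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The base-R sketch below computes these values by hand rather than with a package function; the names predictors and vif_by_hand are chosen here for illustration.

```r
# Data setup as in the exercise set
state77 <- as.data.frame(state.x77)
names(state77)[4] <- "Life.Exp"
names(state77)[6] <- "HS.Grad"

# All columns except the response act as predictors
predictors <- setdiff(names(state77), "Life.Exp")

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j
# on the remaining predictors
vif_by_hand <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = state77))$r.squared
  1 / (1 - r2)
})
round(vif_by_hand, 2)
```

Larger values indicate predictors that are well explained by the others, so their coefficient estimates have inflated variance.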
Exercise 6
Using a function from the lmtest package, conduct a Breusch-Pagan test for heteroscedasticity (non-constant variance) for the model in Exercise 1.
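The idea behind the test can be sketched in base R: regress the squared residuals on the predictors and check whether that auxiliary regression explains anything. The code below computes the studentized (Koenker) form of the statistic, n times the auxiliary R². It assumes the model Life.Exp ~ HS.Grad + Murder, and it is a hand calculation for illustration, not the lmtest call the exercise asks for.

```r
# Data setup as in the exercise set
state77 <- as.data.frame(state.x77)
names(state77)[4] <- "Life.Exp"
names(state77)[6] <- "HS.Grad"

# Model assumed for this sketch
fit <- lm(Life.Exp ~ HS.Grad + Murder, data = state77)

# Auxiliary regression: squared residuals on the model's predictors
e2  <- residuals(fit)^2
aux <- lm(e2 ~ HS.Grad + Murder, data = state77)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary fit,
# referred to a chi-squared distribution with df = number of predictors
bp_stat <- nrow(state77) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(statistic = bp_stat, df = bp_df, p.value = bp_p)
```

A small p-value would suggest that the residual variance changes linearly with the predictors.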
Exercise 7
Re-do the test in the previous exercise by using a function from the car package.
Exercise 8
The test in Exercise 6 (and 7) is for linear forms of heteroscedasticity. To test for nonlinear heteroscedasticity (e.g., “bowtie-shape” in a residual plot), conduct White’s test.
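One common way to carry out White's test is to extend the auxiliary regression of the squared residuals so that it also includes the squares and the cross-product of the predictors. The base-R sketch below follows that recipe by hand. It assumes the model Life.Exp ~ HS.Grad + Murder, and the object names are illustrative.

```r
# Data setup as in the exercise set
state77 <- as.data.frame(state.x77)
names(state77)[4] <- "Life.Exp"
names(state77)[6] <- "HS.Grad"

# Model assumed for this sketch
fit <- lm(Life.Exp ~ HS.Grad + Murder, data = state77)
e2  <- residuals(fit)^2

# Auxiliary regression with levels, squares, and the cross-product
aux <- lm(e2 ~ HS.Grad * Murder + I(HS.Grad^2) + I(Murder^2), data = state77)

# Test statistic: n * R^2, with df = number of auxiliary regressors
white_stat <- nrow(state77) * summary(aux)$r.squared
white_df   <- length(coef(aux)) - 1
white_p    <- pchisq(white_stat, df = white_df, lower.tail = FALSE)
c(statistic = white_stat, df = white_df, p.value = white_p)
```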
Exercise 9
a. Conduct the Kolmogorov-Smirnov normality test for the residuals from the model in Exercise 1.
b. Now conduct the Shapiro-Wilk normality test.
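Both tests are available in base R's stats package. The sketch below shows one possible way to call them, assuming the model Life.Exp ~ HS.Grad + Murder; the object names are illustrative. Note that when the mean and standard deviation are estimated from the same residuals, the Kolmogorov-Smirnov p-value tends to be conservative.

```r
# Data setup as in the exercise set
state77 <- as.data.frame(state.x77)
names(state77)[4] <- "Life.Exp"
names(state77)[6] <- "HS.Grad"

# Model assumed for this sketch
fit <- lm(Life.Exp ~ HS.Grad + Murder, data = state77)
res <- residuals(fit)

# Kolmogorov-Smirnov test against a normal distribution whose
# mean and sd are estimated from the residuals themselves
ks_out <- ks.test(res, "pnorm", mean = mean(res), sd = sd(res))

# Shapiro-Wilk test
sw_out <- shapiro.test(res)

ks_out
sw_out
```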
Note: More Normality tests can be found in the
Exercise 10
For illustration purposes only, conduct the Durbin-Watson test for autocorrelation in residuals. (NOTE: This test is ONLY appropriate when the response variable is a time series, or is somehow time-related, e.g., ordered by data collection time.)
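The Durbin-Watson statistic itself is simple to compute: it is the sum of squared successive differences of the residuals divided by the sum of squared residuals. The base-R sketch below computes only the statistic, not a p-value, and assumes the model Life.Exp ~ HS.Grad + Murder. The rows of state.x77 are ordered alphabetically by state rather than by time, so the result is an illustration and nothing more.

```r
# Data setup as in the exercise set
state77 <- as.data.frame(state.x77)
names(state77)[4] <- "Life.Exp"
names(state77)[6] <- "HS.Grad"

# Model assumed for this sketch
fit <- lm(Life.Exp ~ HS.Grad + Murder, data = state77)
e   <- residuals(fit)

# Durbin-Watson statistic: values near 2 suggest little first-order
# autocorrelation; values toward 0 or 4 suggest positive or negative
# autocorrelation, respectively
dw <- sum(diff(e)^2) / sum(e^2)
dw
```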