Weighted regression - worked example
The problem of heteroscedasticity
In regression analysis heteroscedasticity means a situation in which the variance of the dependent variable (Y) varies across the levels of the independent data (X). Heteroscedasticity can complicate analysis because regression analysis is based on an assumption of equal variance across the levels of the independent data.
Homoscedasticity | Heteroscedasticity |
Weighted regression can be used to correct for heteroscedasticity. In a Weighted regression procedure more weight is given to the observations with smaller variance because these observations provide more reliable information about the regression function than those with large variances.
Neter et al. (1996). suggest the following process for estimating the regression coefficients in the presence of heteroscedasticity:
- Fit the regression model by unweighted least squares and analyze the residuals.
- Estimate the variance function or the standard deviation function by regressing either the squared residuals or the absolute residuals on the appropriate predictor(s).
- Use the fitted values from the estimated variance or standard deviation function to obtain the weights wi.
- Estimate the regression coefficients using these weights.
How to do this automatically
MedCalc will perform these steps automatically when you select the dummy variable "*** AutoWeight 1/SD^2 ***" for "Weights" in the dialog boxes for regression.
How to perform each step in MedCalc
If you would like to have more control over the process, perhaps because you require some modifications of one or more steps, you can perform each of these steps using Weighted regression and other tools available in MedCalc.
This process is described below in detail.
The data for this example are available in the MedCalc sample files folder, file "Weighted regression (Neter).mc1". This file contains Age and Diastolic blood pressure (DBP) data collected on 54 subjects.
Step 1. Fit the regression model by unweighted least squares and analyze the residuals
We click Regression on the Statistics menu and complete the dialog box as follows.
Variable Y, the dependent variable is DBP (Diastolic blood pressure) and Variable X, the independent variable is Age.
We do not select a variable for Weights because in this first step we perform ordinary unweighted least squares regression.
We obtain the following results:
In the results window, we click the hyperlink "Save residuals" to save the residuals in a new column of the spreadsheet.
Residuals are the differences between the observed values of the dependent variable DBP and the values calculated using the regression equation.
In the subsequent dialog box, we click OK.
This will create a new column in the spreadsheet containing the residuals (variable "REGR_Resid1"):
Step 2. Estimate the variance function or the standard deviation function
In this step, we build a regression model of the standard deviation against Age. We do that by regressing the absolute values of the residuals against Age, since the absolute residuals are an estimator of the standard deviation of DBP at different values of Age.
We click Regression on the Statistics menu and complete the dialog box as follows:
For Variable Y, we first select the new variable "REGR_Resid1" and next edit the selection and change the variable into "abs(REGR_Resid1)".
We obtain the following results:
Step 3. Use the fitted values from the estimated variance or standard deviation function to obtain the weights
In the last results window, we click the hyperlink "Save predicted values" to save the predicted values in a new column of the spreadsheet.
In the subsequent dialog box, we click OK.
This will create a new column in the spreadsheet containing the predicted (or estimated) values of the standard deviation (variable "REGR_Pred1"):
Step 4. Estimate the regression coefficients using these weights
Finally, we can build our weighted regression model.
For weights we use the reciprocal of the squared predicted values for standard deviation (variance is the standard deviation squared): observations with large standard deviation are given less weight than observations with smaller standard deviation.
We click Regression on the Statistics menu and complete the dialog box as follows.
For Weights, we first select the new variable "REGR_Pred1" and next edit the selection and change the variable into "1/REGR_Pred1^2" (we could also use "1/(REGR_Pred1*REGR_Pred1)" or "1/Power(REGR_Pred1,2)".
We obtain the following results:
The final (weighted) regression equation is
DBP = 55.5658 + 0.5963 Age
which is not much different from the original (unweighted) regression equation
DBP = 56.1569 + 0.5800 Age
However, the standard errors of the regression coefficients are smaller, resulting in more narower confidence intervals.
Literature
- Neter J, Kutner MH, Nachtsheim CJ, Wasserman W (1996) Applied linear statistical models. 4th ed. Boston: McGraw-Hill.