
Multiple regression

Description

Multiple regression is a statistical method used to examine the relationship between one dependent variable Y and one or more independent variables Xi. The regression parameters or coefficients bi in the regression equation

$$ Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + \ldots + b_k X_k $$

are estimated using the method of least squares. In this method, the sum of squared residuals between the regression plane and the observed values of the dependent variable is minimized. The regression equation represents a (hyper)plane in a (k+1)-dimensional space, where k is the number of independent variables X1, X2, X3, ... Xk and the additional dimension corresponds to the dependent variable Y.
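As an illustration of the least-squares principle (a sketch only, not MedCalc's internal code; the data below are invented), the coefficients can be estimated with NumPy:

```python
import numpy as np

# Hypothetical data: two independent variables X1, X2 and a dependent variable Y.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.5, 1.0, 4.0, 3.5, 5.0, 6.5])
Y  = np.array([3.1, 3.9, 7.3, 7.8, 10.2, 12.0])

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares estimates b0, b1, b2: minimize the sum of squared residuals.
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_est = X @ b              # predicted values from the regression equation
residuals = Y - Y_est      # observed minus predicted
print("coefficients:", b)
print("sum of squared residuals:", np.sum(residuals**2))
```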

Required input

The following need to be entered in the Multiple regression dialog box:

Multiple regression - dialog box

Dependent variable

The variable whose values you want to predict.

Independent variables

Select at least one variable you expect to influence or predict the value of the dependent variable. Also called predictor variables or explanatory variables.

Weights

Optionally select a variable containing relative weights that should be given to each observation (for weighted multiple least-squares regression). Select the dummy variable "*** AutoWeight 1/SD^2 ***" for an automatic weighted regression procedure to correct for heteroscedasticity (Neter et al., 1996). This dummy variable appears as the first item in the drop-down list for Weights.
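For illustration, here is a minimal sketch of weighted least squares, assuming the relative weights are already known (the AutoWeight procedure itself estimates the weights internally and is not reproduced here; the data and weights below are invented):

```python
import numpy as np

# Hypothetical data and per-observation weights (e.g. 1/SD^2 from replicate measurements).
X = np.column_stack([np.ones(5), [1.0, 2.0, 3.0, 4.0, 5.0]])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
w = np.array([1.0, 0.5, 2.0, 1.0, 0.8])   # assumed relative weights

# Weighted least squares: scale each row by sqrt(weight), then solve ordinary least squares.
sw = np.sqrt(w)
b, *_ = np.linalg.lstsq(X * sw[:, None], Y * sw, rcond=None)
print("weighted regression coefficients:", b)
```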

Filter

Optionally enter a data filter in order to include only a selected subgroup of cases in the analysis.

Options

  • Method: select the way independent variables are entered into the model (a forward-selection sketch follows this list).
    • Enter: enter all variables into the model in one single step, without any significance checking
    • Forward: enter significant variables sequentially
    • Backward: first enter all variables into the model and then remove the non-significant variables sequentially
    • Stepwise: enter significant variables sequentially; after entering a variable into the model, check and possibly remove variables that have become non-significant.
  • Enter variable if P<

    A variable is entered into the model if its associated significance level is less than this P-value.
  • Remove variable if P>

    A variable is removed from the model if its associated significance level is greater than this P-value.
  • Report Variance Inflation Factor (VIF): option to show the Variance Inflation Factor in the report. A high Variance Inflation Factor is an indicator of multicollinearity of the independent variables. Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.
  • Zero-order and simple correlation coefficients: option to create a table with correlation coefficients between the dependent variable and all independent variables separately, and between all independent variables.
  • Residuals: you can select an optional Test for Normal distribution of the residuals.
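To make the Forward option above more concrete, here is a minimal sketch of P-value-based forward selection using statsmodels. It is an illustration under assumptions (invented data, a p_enter threshold corresponding to "Enter variable if P<"), not MedCalc's implementation:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(y, X, p_enter=0.05):
    """Add, one at a time, the candidate variable with the smallest P-value,
    as long as that P-value is below p_enter (cf. 'Enter variable if P<')."""
    selected = []
    candidates = list(X.columns)
    while candidates:
        pvals = {}
        for var in candidates:
            model = sm.OLS(y, sm.add_constant(X[selected + [var]])).fit()
            pvals[var] = model.pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] < p_enter:
            selected.append(best)
            candidates.remove(best)
        else:
            break
    return selected

# Hypothetical example data: Y depends on X1 and X2 but not on X3.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["X1", "X2", "X3"])
y = 2.0 + 1.5 * X["X1"] + 0.8 * X["X2"] + rng.normal(size=50)
print(forward_select(y, X))
```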

Results

After clicking OK the following results are displayed in the results window:

Multiple regression - results

In the results window, the following statistics are displayed:

Sample size: the number of data records n

Coefficient of determination R2: this is the proportion of the variation in the dependent variable explained by the regression model, and is a measure of the goodness of fit of the model. It can range from 0 to 1, and is calculated as follows:

$$ R^2 = \frac {explained\ variation} {total\ variation} = \frac {\sum_{}^{}{(Y_{est}-\bar{Y})^2}} {\sum_{}^{}{(Y-\bar{Y})^2}} $$

where Y are the observed values for the dependent variable, $\bar{Y}$ is the average of the observed values and Yest are predicted values for the dependent variable (the predicted values are calculated using the regression equation).

R2-adjusted: this is the coefficient of determination adjusted for the number of independent variables in the regression model. Unlike the coefficient of determination, R2-adjusted may decrease if variables are entered in the model that do not add significantly to the model fit.

$$ R^2_{adj} = 1 - \frac {unexplained\ variation / (n-k-1)} {total\ variation / (n-1) } $$

or

$$ R^2_{adj} = 1 - \frac {\sum_{}^{}{(Y-Y_{est})^2}} {\sum_{}^{}{(Y-\bar{Y})^2}} \cdot \frac {n-1}{n-k-1} $$

Multiple correlation coefficient: this coefficient is a measure of how tightly the data points cluster around the regression plane, and is calculated by taking the square root of the coefficient of determination.

When discussing multiple regression analysis results, generally the coefficient of multiple determination is used rather than the multiple correlation coefficient.

Residual standard deviation: the standard deviation of the residuals (residuals = differences between observed and predicted values). It is calculated as follows:

$$ s_{res} = \sqrt{\frac{\sum_{}^{}{(Y-Y_{est})^2}}{n-k-1}} $$
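A small numeric sketch that ties the three formulas above together, using invented data rather than MedCalc output:

```python
import numpy as np

# Hypothetical data with k = 2 independent variables.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.5, 1.0, 4.0, 3.5, 5.0, 6.5])
Y  = np.array([3.1, 3.9, 7.3, 7.8, 10.2, 12.0])
n, k = len(Y), 2

# Fit the regression so that the predicted values Y_est are consistent with Y.
X = np.column_stack([np.ones(n), X1, X2])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_est = X @ b

ss_total       = np.sum((Y - Y.mean())**2)       # total variation
ss_explained   = np.sum((Y_est - Y.mean())**2)   # explained variation
ss_unexplained = np.sum((Y - Y_est)**2)          # residual (unexplained) variation

r2     = ss_explained / ss_total                                    # coefficient of determination
r2_adj = 1 - (ss_unexplained / (n - k - 1)) / (ss_total / (n - 1))  # adjusted R-squared
s_res  = np.sqrt(ss_unexplained / (n - k - 1))                      # residual standard deviation
print(round(r2, 4), round(r2_adj, 4), round(s_res, 4))
```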

The regression equation: the different regression coefficients bi with standard error sbi, 95% Confidence Interval, t-value, P-value, partial and semipartial correlation coefficients rpartial and rsemipartial.

  • If P is less than the conventional 0.05, the regression coefficient can be considered to be significantly different from 0, and the corresponding variable contributes significantly to the prediction of the dependent variable.
  • Partial correlation coefficient rpartial: partial correlation is the correlation between an independent variable and the dependent variable after the linear effects of the other variables have been removed from both the independent variable and the dependent variable (the correlation of the variable with the dependent variable, adjusted for the effect of the other variables in the model).
  • Semipartial correlation coefficient rsemipartial (in SPSS called part correlation): semipartial correlation is the correlation between an independent variable and the dependent variable after the linear effects of the other independent variables have been removed from the independent variable only. The squared semipartial correlation is the proportion of (unique) variance accounted for by the independent variable, relative to the total variance of the dependent variable Y.
  • Optionally the table includes the Variance Inflation Factor (VIF). A high Variance Inflation Factor is an indicator of multicollinearity of the independent variables. Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. A sketch of how the VIF can be computed follows this list.
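As an illustration only (not MedCalc's code), the VIF of each independent variable can be computed from the R² obtained when that variable is regressed on all the other independent variables; the data below are invented:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (no intercept column).
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on all
    remaining columns; the tolerance is the reciprocal, 1 - R_j^2."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ b
        r2_j = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean())**2)
        out.append(1.0 / (1.0 - r2_j))
    return np.array(out)

# Hypothetical predictors: the third column is almost a combination of the first two,
# so the VIF values will be high, signalling multicollinearity.
rng = np.random.default_rng(0)
X12 = rng.normal(size=(40, 2))
X3  = X12 @ np.array([1.0, 0.5]) + 0.01 * rng.normal(size=40)
print(vif(np.column_stack([X12, X3])))
```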

Variables not included in the model: a variable may be excluded from the model for one of two reasons:

  • You have selected a stepwise model and the variable was removed because the P-value of its regression coefficient was above the threshold value.
  • The tolerance of the variable was very low (less than 0.0001). The tolerance is the inverse of the Variance Inflation Factor (VIF) and equals 1 minus the squared multiple correlation of this variable with all other independent variables in the regression equation. If the tolerance of a variable in the regression equation is very small then the regression equation cannot be evaluated.

Analysis of variance: the analysis of variance table divides the total variation in the dependent variable into two components, one which can be attributed to the regression model (labeled Regression) and one which cannot (labeled Residual). If the significance level for the F-test is small (less than 0.05), then the hypothesis that there is no (linear) relationship can be rejected, and the multiple correlation coefficient can be called statistically significant.
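In the standard formulation (stated here for completeness; this page does not spell it out), the F-statistic in this table is the ratio of the mean square for regression to the mean square for residuals:

$$ F = \frac {explained\ variation\ /\ k} {unexplained\ variation\ /\ (n-k-1)} $$

with k and n-k-1 degrees of freedom.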

Zero-order and simple correlation coefficients: this optional table shows the correlation coefficients between the dependent variable (Y) and all independent variables Xi separately, and between all independent variables.

Analysis of residuals

Multiple linear regression analysis assumes that the residuals (the differences between the observations and the estimated values) follow a Normal distribution. This assumption can be evaluated with a formal test, or by means of graphical methods.

The different formal Tests for Normal distribution may not have enough power to detect deviation from the Normal distribution when sample size is small. On the other hand, when sample size is large, the requirement of a Normal distribution is less stringent because of the central limit theorem.

Therefore, it is often preferred to visually evaluate the symmetry and peakedness of the distribution of the residuals using the Histogram, Box-and-whisker plot, or Normal plot.

To do so, click the hyperlink "Save residuals" in the results window. This will save the residual values as a new variable in the spreadsheet. You can then use this new variable in the different distribution plots.
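Once the residuals have been saved (or exported), their distribution can also be checked outside MedCalc. Here is a minimal sketch with SciPy and Matplotlib, assuming the residuals are available as a plain array (the values below are invented):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical residuals, e.g. exported from the spreadsheet after "Save residuals".
residuals = np.array([-0.4, 0.2, 0.5, -0.1, -0.7, 0.3, 0.6, -0.2, 0.1, -0.3])

# Formal test for Normality (low power for small samples, as noted above).
stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk: W = {stat:.3f}, P = {p:.3f}")

# Graphical checks: histogram and Normal (Q-Q) plot.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(residuals, bins="auto")
ax1.set_title("Histogram of residuals")
stats.probplot(residuals, dist="norm", plot=ax2)
plt.tight_layout()
plt.show()
```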

Repeat procedure

If you want to repeat the Multiple regression procedure, possibly to add or remove variables in the model, then you only have to press function key F7. The dialog box will re-appear with the previous entries and selections (see Recall dialog).

Literature

  • Altman DG (1991) Practical statistics for medical research. London: Chapman and Hall.
  • Armitage P, Berry G, Matthews JNS (2002) Statistical methods in medical research. 4th ed. Blackwell Science.
  • Neter J, Kutner MH, Nachtsheim CJ, Wasserman W (1996) Applied linear statistical models. 4th ed. Boston: McGraw-Hill.
