Regression

Command

Statistics > Regression > Regression

Description

Regression analysis is a statistical method used to describe the relationship between two variables and to predict one variable from another (if you know one variable, then how well can you predict a second variable?).

Whereas for correlation the two variables need to have a Normal distribution, in regression analysis only the dependent variable Y should have a Normal distribution. The variable X does not need to be a random sample with a Normal distribution (the values for X can be chosen by the experimenter). However, the variability of Y should be the same for each value of X.

Required input

When you select Regression in the menu, the following box appears on the screen:

Dialog box for regression

In this dialog box you identify 2 variables. To select a variable from the variables list, click the drop-down button and select the variable in the list. Next, move the cursor to the Independent X field and again click the drop-down button to select the variable in the list.

Optionally, you may also enter selection criteria in order to include only a selected subgroup of cases in the statistical analysis. Again, you can click the drop-down button to obtain a list of selection criteria already used for the current data.

Finally, a regression equation (regression model, equation of approximating curve) has to be selected. The program offers a choice of 5 different equations:

 

Y = a + b X                 straight line
Y = a + b Log(X)            logarithmic curve
Log(Y) = a + b X            exponential curve
Log(Y) = a + b Log(X)       geometric curve
Y = a + b X + c X²          quadratic regression (parabola)

 

where X represents the independent variable and Y the dependent variable. The coefficients a, b and c are calculated by the program using the method of least squares.
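As an illustration of how the two-coefficient equations reduce to the same least-squares problem, here is a minimal sketch in Python. It is not the program's own implementation: the function names `fit_line` and `fit_model` are hypothetical, the logarithm is assumed to be base 10, and the parabola (which needs a third coefficient) is not covered.

```python
import math

def fit_line(u, v):
    """Least-squares intercept a and slope b for v = a + b*u."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    sxx = sum((ui - mean_u) ** 2 for ui in u)
    sxy = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    b = sxy / sxx
    return mean_v - b * mean_u, b

def fit_model(x, y, model):
    """Fit a two-coefficient equation by transforming X and/or Y first."""
    log = math.log10  # assumption: base-10 logarithm; values must be > 0
    if model == "straight line":
        return fit_line(x, y)
    if model == "logarithmic":
        return fit_line([log(v) for v in x], y)
    if model == "exponential":
        return fit_line(x, [log(v) for v in y])
    if model == "geometric":
        return fit_line([log(v) for v in x], [log(v) for v in y])
    raise ValueError("unknown model: " + model)

# Made-up data lying exactly on Y = 1 + 2 X
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
a, b = fit_model(x, y, "straight line")
print(a, b)
```

For the exponential and geometric curves the coefficients returned apply to Log(Y), so predictions have to be transformed back to the original scale.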

Results

The following statistics will be displayed in the results window:

Regression results

Sample size: the number of data pairs n

Coefficient of determination R²: this is the proportion of the variation in the dependent variable explained by the regression model, and is a measure of the goodness of fit of the model. It can range from 0 to 1, and is calculated as follows:

$$R^2 = \frac{\text{explained variation}}{\text{total variation}} = \frac{\sum (Y_{est} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$$

where Y are the observed values for the dependent variable, Ȳ is the average of the observed values and Yest are the predicted values for the dependent variable (the predicted values are calculated using the regression equation).
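The coefficient of determination can be computed directly from the observed and predicted values. The sketch below uses a small made-up data set; the function name `r_squared` is hypothetical, and the predicted values are assumed to come from the least-squares line Y = 2.2 + 0.6 X fitted to X = 1, 2, 3, 4, 5.

```python
def r_squared(y, y_est):
    """Explained variation divided by total variation."""
    mean_y = sum(y) / len(y)
    explained = sum((e - mean_y) ** 2 for e in y_est)
    total = sum((v - mean_y) ** 2 for v in y)
    return explained / total

# Observed values and the values predicted by Y = 2.2 + 0.6 X
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_est = [2.8, 3.4, 4.0, 4.6, 5.2]
r2 = r_squared(y, y_est)
print(round(r2, 3))
```

This ratio is only guaranteed to lie between 0 and 1 when the predicted values come from a least-squares fit that includes an intercept.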

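Residual standard deviation: the standard deviation of the residuals (residuals = differences between observed and predicted values). It is calculated as follows:

$$s_{res} = \sqrt{\frac{\sum (Y - Y_{est})^2}{n - 2}}$$

where the divisor n − 2 is the number of data pairs minus the two estimated coefficients a and b (for the parabola, which has a third coefficient c, the divisor is n − 3).

The residual standard deviation is sometimes called the Standard error of estimate (Spiegel, 1961).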
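A minimal sketch of the residual standard deviation in Python, using the same kind of made-up data. The function name `residual_sd` and its `n_coefficients` parameter are hypothetical; the divisor is assumed to be the number of data pairs minus the number of estimated coefficients.

```python
import math

def residual_sd(y, y_est, n_coefficients=2):
    """Standard deviation of the residuals Y - Yest."""
    ss_res = sum((v - e) ** 2 for v, e in zip(y, y_est))
    # Degrees of freedom: data pairs minus estimated coefficients
    return math.sqrt(ss_res / (len(y) - n_coefficients))

# Observed values and the values predicted by Y = 2.2 + 0.6 X
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_est = [2.8, 3.4, 4.0, 4.6, 5.2]
s_res = residual_sd(y, y_est)
print(round(s_res, 4))
```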

The equation of the regression curve: the selected equation with the calculated values for a and b (and for a parabola a third coefficient c). E.g. Y = a + b X

Next, the standard errors are given for the intercept (a) and the slope (b), followed by the t-value and the P-value for the hypothesis that these coefficients are equal to 0. If the P-values are low (e.g. less than 0.05), then you can conclude that the coefficients are significantly different from 0.
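For the straight-line model, these quantities can be sketched with the standard textbook formulas. This is an illustration under stated assumptions, not the program's code: the names `coefficient_tests` and `t_two_sided_p` are hypothetical, the two-sided P-value is assumed to come from a t distribution with n − 2 degrees of freedom, and the tail area is approximated by Simpson's rule rather than an exact routine.

```python
import math

def t_two_sided_p(t, df, steps=2000):
    """Two-sided P-value of the t distribution (Simpson's rule).
    Suitable for small to moderate df; math.gamma overflows for very large df."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    h = abs(t) / steps
    area = pdf(0) + pdf(abs(t))
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    return 1 - 2 * area * h / 3

def coefficient_tests(x, y):
    """Coefficient, standard error, t and P for intercept a and slope b.
    Assumes the fit is not perfect (residual SD > 0)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_res = math.sqrt(ss_res / (n - 2))
    se_a = s_res * math.sqrt(1 / n + mean_x ** 2 / sxx)
    se_b = s_res / math.sqrt(sxx)
    results = {}
    for name, coef, se in (("a", a, se_a), ("b", b, se_b)):
        t = coef / se
        results[name] = (coef, se, t, t_two_sided_p(t, n - 2))
    return results

# Made-up data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
tests = coefficient_tests(x, y)
for name, (coef, se, t, p) in tests.items():
    print(name, round(coef, 3), round(se, 3), round(t, 3), round(p, 3))
```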

Note that when you use the regression equation for prediction, you may only apply it to values in the range of the actual observations. E.g. when you have calculated the regression equation for height and weight for school children, this equation cannot be applied to adults.
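One way to enforce this restriction in your own calculations is to check each new X value against the observed range before predicting. The helper below is a hypothetical sketch for the straight-line equation; the name `predict_within_range` and the coefficient values are assumptions made for illustration.

```python
def predict_within_range(a, b, x_observed, x_new):
    """Predict Y = a + b X, refusing to extrapolate beyond the observed X range."""
    lowest, highest = min(x_observed), max(x_observed)
    if not lowest <= x_new <= highest:
        raise ValueError(
            "X = %g lies outside the observed range %g to %g" % (x_new, lowest, highest)
        )
    return a + b * x_new

x_observed = [1.0, 2.0, 3.0, 4.0, 5.0]
y_new = predict_within_range(2.2, 0.6, x_observed, 3.5)  # inside the range
print(y_new)
```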

Analysis of variance: the analysis of variance table divides the total variation in the dependent variable into two components, one which can be attributed to the regression model (labelled Regression) and one which cannot (labelled Residual). If the significance level for the F-test is small (less than 0.05), then the hypothesis that there is no (linear) relationship can be rejected.
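The decomposition in the analysis of variance table can be sketched as follows. The function name `anova_table` is hypothetical, and the sketch stops at the F ratio: the significance level requires the F distribution, which is not included here.

```python
def anova_table(y, y_est, n_coefficients=2):
    """Split the total variation into Regression and Residual components."""
    n = len(y)
    mean_y = sum(y) / n
    ss_regression = sum((e - mean_y) ** 2 for e in y_est)
    ss_residual = sum((v - e) ** 2 for v, e in zip(y, y_est))
    df_regression = n_coefficients - 1
    df_residual = n - n_coefficients
    f_ratio = (ss_regression / df_regression) / (ss_residual / df_residual)
    return {
        "Regression": (ss_regression, df_regression),
        "Residual": (ss_residual, df_residual),
        "F": f_ratio,
    }

# Observed values and the values predicted by Y = 2.2 + 0.6 X
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_est = [2.8, 3.4, 4.0, 4.6, 5.2]
table = anova_table(y, y_est)
print(table)
```

For a straight line the F ratio equals the square of the t-value of the slope, so both tests lead to the same conclusion.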

Presentation of results

If the analysis shows that the relationship between the two variables is too weak to be of practical help, then there is little point in quoting the equation of the fitted line or curve. If you give the equation, you should also report the standard error of the slope, together with the corresponding P-value. The residual standard deviation should also be reported (Altman, 1980). The number of decimal places of the regression coefficients should correspond to the precision of the raw data.

The accompanying scatter diagram should include the fitted regression line when this is appropriate. This figure can also include the 95% confidence interval, or the 95% prediction interval, which can be more informative, or both. The legend of the figure must clearly identify the interval that is represented.
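To show how the two intervals differ, the sketch below computes their half-widths around a predicted value for a straight-line fit, using the standard textbook formulas. The function name `interval_half_widths` is hypothetical, and the critical value 3.182 (t distribution, 3 degrees of freedom, 95% two-sided) is assumed to be taken from a t table.

```python
import math

def interval_half_widths(x, s_res, x0, t_crit):
    """Half-widths of the confidence and prediction intervals at X = x0."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    leverage = 1 / n + (x0 - mean_x) ** 2 / sxx
    ci = t_crit * s_res * math.sqrt(leverage)      # interval for the fitted line
    pi = t_crit * s_res * math.sqrt(1 + leverage)  # interval for a new observation
    return ci, pi

x = [1.0, 2.0, 3.0, 4.0, 5.0]
s_res = math.sqrt(0.8)  # residual standard deviation of the made-up data
ci, pi = interval_half_widths(x, s_res, x0=3.0, t_crit=3.182)
print(round(ci, 3), round(pi, 3))
```

The prediction interval is always the wider of the two, because it adds the scatter of individual observations to the uncertainty of the fitted line.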

Literature

  • Armitage P, Berry G, Matthews JNS (2002) Statistical methods in medical research. 4th ed. Blackwell Science.
  • Bland M (2000) An introduction to medical statistics. 3rd ed. Oxford: Oxford University Press.
  • Altman DG (1991) Practical statistics for medical research. London: Chapman and Hall.
