Scatter diagram & regression line
Command: Statistics > Regression > Scatter diagram & regression line
Description
In a scatter diagram, the relation between two numerical variables is presented graphically. One variable (the independent variable X) defines the horizontal axis and the other (the dependent variable Y) defines the vertical axis. Each row in the data spreadsheet gives one point in the diagram, positioned by the values of the two variables on that row.
Required input
The dialog box for the scatter diagram is similar to the one for Regression:
Variables
- Variable Y and Variable X: select the dependent and independent variables Y and X.
- Weights: select a variable containing relative weights that should be given to each observation (for weighted least-squares regression). Select the dummy variable "*** AutoWeight 1/SD^2 ***" for an automatic weighted regression procedure to correct for heteroscedasticity (Neter et al., 1996). This dummy variable appears as the first item in the drop-down list for Weights.
- Filter: you may also enter a data filter in order to include only a selected subgroup of cases in the statistical analysis.
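As an illustration of what relative weights do in a weighted least-squares fit, the following is a minimal Python sketch. It is not MedCalc's implementation and it does not reproduce the AutoWeight procedure: the function name `weighted_linear_fit`, the data and the per-observation SD values are all hypothetical, and the SDs are simply assumed to be known.

```python
def weighted_linear_fit(x, y, w):
    """Fit y = a + b*x by minimizing sum(w * (y - a - b*x)**2)."""
    sw = sum(w)
    # Weighted means of x and y
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Weighted sums of squares and cross-products around the weighted means
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b


# Hypothetical data in which the spread of y grows with x (heteroscedasticity).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.5]
sd = [0.1, 0.2, 0.3, 0.4, 0.5]          # assumed known SD per observation
weights = [1.0 / s ** 2 for s in sd]    # relative weights 1/SD^2

a, b = weighted_linear_fit(x, y, weights)
print(f"y = {a:.3f} + {b:.3f} x")
```

Only the relative size of the weights matters: multiplying all weights by the same constant leaves the fitted line unchanged.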
Regression equation
By default the option Include constant in equation is selected. This is the recommended option that will result in ordinary least-squares regression. When you need regression through the origin (no constant a in the equation), you can uncheck this option (an example of when this is appropriate is given in Eisenhauer, 2003).
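The difference between the two options can be sketched in a few lines of Python. This is an illustration using the textbook least-squares formulas, not MedCalc code; the function names and the data are hypothetical.

```python
def fit_with_constant(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


def fit_through_origin(x, y):
    """Least squares for y = b*x (no constant); returns b."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# Hypothetical data lying exactly on y = 1 + 2x
x = [1.0, 2.0, 3.0]
y = [3.0, 5.0, 7.0]

print(fit_with_constant(x, y))    # intercept and slope
print(fit_through_origin(x, y))   # slope only, line forced through (0, 0)
```

Because these data have a non-zero intercept, forcing the line through the origin changes the slope, which is why the option should only be unchecked when the model genuinely has no constant.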
MedCalc offers a choice of 5 different regression equations (x represents the independent variable and y the dependent variable):
- y = a + b x (straight line)
- y = a + b log(x) (logarithmic curve)
- log(y) = a + b x (exponential curve)
- log(y) = a + b log(x) (geometric curve)
- y = a + b x + c x^2 (quadratic regression, parabola)
When you select an equation that contains a logarithmic transformation for one of the variables, the program will use a logarithmic scale for the corresponding variable.
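All five equations are linear in their coefficients once the variables are transformed, so each can be fitted by ordinary least squares on the transformed data. The Python sketch below illustrates this with a small normal-equations solver. It is an assumption-laden illustration, not MedCalc's algorithm: the names `MODELS`, `fit`, `least_squares` and `solve` are hypothetical, and base-10 logarithms are assumed here because the text above does not state the base.

```python
import math


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def least_squares(columns, target):
    """Coefficients minimizing the squared error, via the normal equations."""
    k = len(columns)
    A = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
         for i in range(k)]
    b = [sum(p * t for p, t in zip(columns[i], target)) for i in range(k)]
    return solve(A, b)


# For each equation: (transformed x terms, transformation applied to y)
MODELS = {
    "straight line": (lambda x: [x], lambda y: y),
    "logarithmic": (lambda x: [math.log10(x)], lambda y: y),
    "exponential": (lambda x: [x], math.log10),
    "geometric": (lambda x: [math.log10(x)], math.log10),
    "quadratic": (lambda x: [x, x * x], lambda y: y),
}


def fit(model, xs, ys):
    """Return [a, b] (or [a, b, c] for the quadratic) for the chosen equation."""
    fx, fy = MODELS[model]
    rows = [[1.0] + fx(x) for x in xs]             # leading 1.0 is the constant a
    columns = [list(col) for col in zip(*rows)]
    return least_squares(columns, [fy(y) for y in ys])


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [100.0 * x ** 2 for x in xs]                  # hypothetical data, y = 100 x^2
print(fit("geometric", xs, ys))                    # coefficients of log(y) = a + b log(x)
```

The logarithmic transformations require strictly positive values for the transformed variable.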
Options
- 95% Confidence: two curves will be drawn next to the regression line. These curves represent a 95% confidence interval for the regression line. This interval includes the true regression line with 95% probability.
- 95% Prediction: two curves will be drawn next to the regression line. These curves represent the 95% prediction interval for the regression curve. The 95% prediction interval is much wider than the 95% confidence interval. For any given value of the independent variable, a new observation of the dependent variable is expected to fall within this interval with 95% probability.
- Line of equality: option to draw a line of equality (y=x) in the graph.
- Heat map: option to display a heatmap, where background color coding indicates density of points, suggesting clusters of observations.
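To show why the prediction interval is always wider than the confidence interval, the following Python sketch computes both for a straight-line fit using the standard textbook pointwise formulas. Whether MedCalc uses exactly these formulas is not stated here, so treat this as an illustration only. The function name `intervals` and the data are hypothetical, and the critical t value (2.306 for 8 degrees of freedom at the 95% level) is passed in by hand because the Python standard library has no t distribution.

```python
import math


def intervals(x, y, x0, t_crit):
    """Fitted value at x0 with confidence and prediction intervals.

    t_crit is the two-sided critical t value for n - 2 degrees of freedom.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    # Residual standard deviation
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    fitted = a + b * x0
    h = 1.0 / n + (x0 - mx) ** 2 / sxx       # grows as x0 moves away from mean x
    ci = t_crit * s * math.sqrt(h)           # uncertainty of the line itself
    pi = t_crit * s * math.sqrt(1.0 + h)     # adds the scatter of a new observation
    return fitted, (fitted - ci, fitted + ci), (fitted - pi, fitted + pi)


# Hypothetical data, n = 10, so n - 2 = 8 degrees of freedom
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.3, 3.8, 6.4, 7.9, 10.2, 11.7, 14.5, 15.8, 18.1, 20.4]

fitted, conf, pred = intervals(x, y, x0=5.5, t_crit=2.306)
print(fitted, conf, pred)
```

The extra `1.0` under the square root is what makes the prediction interval wider, and the `(x0 - mx) ** 2` term is why both pairs of curves flare out toward the ends of the observed range.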
Residuals
In regression analysis, residuals are the differences between the predicted values and the observed values for the dependent variable. The residual plot allows the visual evaluation of the goodness of fit of the selected model.
To obtain a residuals plot, select this option in the dialog box. This graph will be displayed in a second window.
Subgroups
Click the Subgroups button if you want to identify subgroups in the scatter diagram. A new dialog box is displayed in which you can select a categorical variable. The graph will use different markers for the different categories in this variable, and optionally will show regression lines for all cases and for each subgroup.
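The idea of one regression line for all cases plus one per subgroup can be sketched in Python as follows. This is an illustration only, with hypothetical function names and data; it fits a straight line with a constant in each category.

```python
from collections import defaultdict


def linear_fit(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def fit_by_subgroup(x, y, group):
    """Return (fit for all cases, {category: fit for that category})."""
    buckets = defaultdict(lambda: ([], []))
    for xi, yi, gi in zip(x, y, group):
        buckets[gi][0].append(xi)
        buckets[gi][1].append(yi)
    per_group = {g: linear_fit(xs, ys) for g, (xs, ys) in buckets.items()}
    return linear_fit(x, y), per_group


# Hypothetical data with a categorical variable taking the values "A" and "B"
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0, 5.0, 6.0, 7.0]
group = ["A", "A", "A", "B", "B", "B"]

overall, per_group = fit_by_subgroup(x, y, group)
print(overall)
print(per_group)
```

In this example the two subgroups follow clearly different lines, and the line for all cases matches neither of them.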
Examples
Scatter diagram with regression line
Regression line and 95% confidence interval
Regression line and 95% prediction interval
Regression line, 95% confidence interval and 95% prediction interval
Regression line and heatmap
When you click a point on the regression line, the program will give the x-value and the f(x) value calculated using the regression equation.
You can press Ctrl+P to print the scatter diagram, or function key F10 to save the picture as a file on disk. To define other titles or colors in the graph, or to change the axis scaling, see Format graph.
If you want to repeat the scatter diagram, possibly to select a different regression equation, then you only have to press function key F7. The dialog box will re-appear with the previous entries (see Recall dialog).
Extrapolation
MedCalc only shows the regression line in the range of observed values. As a rule, it is not recommended to extrapolate the regression line beyond the observed range. For particular applications, however, such as the evaluation of stability data, extrapolation may be useful (see, for example, the ICH guideline Evaluation of Stability Data).
To allow extrapolation, right-click in the graph and click Allow extrapolation on the context menu.
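The default behavior and the effect of allowing extrapolation can be mimicked in a short Python sketch. The function `predict` and its `allow_extrapolation` parameter are hypothetical names chosen for this illustration and do not correspond to any MedCalc function.

```python
def predict(a, b, x0, x_observed, allow_extrapolation=False):
    """Evaluate y = a + b*x0, by default only inside the observed x range."""
    lo, hi = min(x_observed), max(x_observed)
    if not allow_extrapolation and not (lo <= x0 <= hi):
        return None                    # outside the observed range: no value
    return a + b * x0


x_observed = [1.0, 2.0, 3.5, 5.0]      # hypothetical observed x values
print(predict(1.0, 2.0, 3.0, x_observed))                              # inside the range
print(predict(1.0, 2.0, 10.0, x_observed))                             # outside: None
print(predict(1.0, 2.0, 10.0, x_observed, allow_extrapolation=True))   # extrapolated
```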
Residuals plot
When you select the option Residuals plot in the Regression line dialog box, the program will display a second window with the residuals plot. Residuals are the differences between the predicted values and the observed values for the dependent variable. The residual plot allows for the visual evaluation of the goodness of fit of the selected model or equation. Residuals may point to possible outliers (unusual values) in the data or to problems with the regression model. If the residuals display a certain pattern, you should consider selecting a different regression model.
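A small Python sketch shows what a patterned set of residuals looks like when the wrong equation is chosen. It fits a straight line to hypothetical data that actually follow a parabola. The function names are hypothetical, and the sign convention used here (observed minus predicted) is an assumption; the opposite convention only flips the signs.

```python
def linear_fit(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def residuals(x, y, a, b):
    """Observed minus predicted value for each observation."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Hypothetical data following y = x^2, deliberately fitted with a straight line
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [xi ** 2 for xi in x]

a, b = linear_fit(x, y)
res = residuals(x, y, a, b)
print(res)
```

The residuals are positive at both ends of the range and negative in the middle. A systematic U-shape like this, rather than a random scatter around zero, suggests that a different equation (here the quadratic one) would fit better.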
Literature
- Altman DG (1991) Practical statistics for medical research. London: Chapman and Hall.
- Eisenhauer JG (2003) Regression through the origin. Teaching Statistics 25:76-80.
- Neter J, Kutner MH, Nachtsheim CJ, Wasserman W (1996) Applied linear statistical models. 4th ed. Boston: McGraw-Hill.