Cox proportional-hazards regression
Whereas the Kaplan-Meier method with log-rank test is useful for comparing survival curves in two or more groups, Cox regression (or proportional hazards regression) allows analyzing the effect of several risk factors on survival.
The probability of the endpoint (death, or any other event of interest, e.g. recurrence of disease) is called the hazard. The hazard is modeled as:
H(t) = H0(t) × exp(b1X1 + b2X2 + ... + bkXk)
where X1 ... Xk are a collection of predictor variables and H0(t) is the baseline hazard at time t, representing the hazard for a person with the value 0 for all the predictor variables.
By dividing both sides of the above equation by H0(t) and taking logarithms, we obtain:
ln(H(t) / H0(t)) = b1X1 + b2X2 + ... + bkXk
We call H(t) / H0(t) the hazard ratio. The coefficients b1 ... bk are estimated by Cox regression, and can be interpreted in a similar manner to those of multiple logistic regression.
Suppose the covariate (risk factor) is dichotomous and is coded 1 if present and 0 if absent. Then the quantity exp(bi) can be interpreted as the instantaneous relative risk of an event, at any time, for an individual with the risk factor present compared with an individual with the risk factor absent, given both individuals are the same on all other covariates.
Suppose the covariate is continuous. Then the quantity exp(bi) is the instantaneous relative risk of an event, at any time, for an individual with an increase of 1 in the value of the covariate compared with another individual, given both individuals are the same on all other covariates.
Survival time: The name of the variable containing the time to reach the event of interest, or the time of follow-up.
Endpoint: The name of a variable containing codes 1 for the cases that reached the endpoint, or code 0 for the cases that have not reached the endpoint, either because they withdrew from the study, or the end of the study was reached.
Predictor variables: Names of variables that you expect to predict survival time.
The Cox proportional-hazards regression model assumes that the effects of the predictor variables are constant over time. Furthermore, there should be a linear relationship between the logarithm of the hazard and the predictor variables. Predictor variables that have a highly skewed distribution may require logarithmic transformation to reduce the effect of extreme values. Logarithmic transformation of a variable var can be obtained by entering LOG(var) as predictor variable.
Select: A selection criterion to include only a selected subgroup of cases in the graph.
In the example (taken from Bland, 2000), "survival time" is the time to recurrence of gallstones following dissolution (variable Time). Recurrence is coded in the variable Recurrence (1 = yes, 0 = no). Predictor variables are Dis (number of months previous gallstones took to dissolve), Mult (1 in case of multiple previous gallstones, 0 in case of single previous gallstones), and Diam (maximum diameter of previous gallstones).
This table shows the number of cases that reached the endpoint (Number of events), the number of cases that did not reach the endpoint (Number censored), and the total number of cases.
Overall Model Fit
The Chi-squared statistic tests the relationship between time and all the covariates in the model.
Coefficients and Standard Errors
Using the Forward selection method, the two covariates Dis and Mult were entered in the model. Both contribute significantly to the prediction of time (P = 0.0096 for Dis and P = 0.0063 for Mult).
MedCalc lists the regression coefficient b, its standard error, the Wald statistic (b/SE)^2, the P value, Exp(b) and the 95% confidence interval for Exp(b).
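The quantities in this table are related by simple formulas, which can be sketched as follows. This is an illustrative sketch, not MedCalc's implementation: it assumes the P value comes from a chi-squared distribution with 1 degree of freedom and that the confidence interval is a Wald-type interval, exp(b ± 1.96 × SE). The function names and the input values b = 0.5 and SE = 0.25 are hypothetical.

```python
import math

def wald_test(b, se):
    """Return the Wald statistic (b/SE)^2 and its P value (chi-squared, 1 df)."""
    wald = (b / se) ** 2
    # For 1 degree of freedom, P(chi-squared > w) equals erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value

def exp_b_interval(b, se, z=1.959964):
    """Return a Wald-type 95% confidence interval for Exp(b) (assumed form)."""
    return math.exp(b - z * se), math.exp(b + z * se)

# Hypothetical coefficient and standard error, for illustration only.
b, se = 0.5, 0.25
wald, p_value = wald_test(b, se)
lower, upper = exp_b_interval(b, se)
print(wald, p_value, math.exp(b), lower, upper)
```

With these hypothetical inputs the Wald statistic is 4 and the interval for Exp(b) lies entirely above 1, which corresponds to a P value below 0.05.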
The coefficient for months for dissolution (continuous variable Dis) is 0.0429. Exp(b) = Exp(0.0429) is 1.0439 (with 95% Confidence Interval 1.0107 to 1.0781), meaning that for an increase of 1 month to dissolution of previous gallstones, the hazard of recurrence increases by a factor 1.04. For 2 months the hazard increases by a factor 1.04^2 (about 1.09).
The coefficient for multiple gallstones (dichotomous variable Mult) is 0.9635. Exp(b) = Exp(0.9635) is 2.6208 (with 95% Confidence Interval 1.3173 to 5.2141), meaning that a case with multiple previous gallstones is 2.62 times more likely to have a recurrence at any time than a case with a single stone.
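The arithmetic behind these interpretations can be checked with a short sketch. The coefficient for Mult is taken as 0.9635, the value consistent with the reported Exp(b) = 2.6208. Because the coefficients are rounded to four decimals, the results may differ from the reported values in the last digit. The helper relative_hazard is a hypothetical name introduced only for this illustration.

```python
import math

b_dis = 0.0429   # coefficient for Dis (months to dissolution)
b_mult = 0.9635  # coefficient for Mult (value consistent with Exp(b) = 2.6208)

hr_one_month = math.exp(b_dis)        # 1 extra month to dissolution
hr_two_months = math.exp(2 * b_dis)   # 2 extra months: the 1-month factor squared
hr_multiple = math.exp(b_mult)        # multiple versus single previous gallstones

def relative_hazard(dis_diff, mult_diff):
    """Hazard of one case relative to another, given covariate differences."""
    return math.exp(b_dis * dis_diff + b_mult * mult_diff)

print(round(hr_one_month, 2), round(hr_two_months, 2), round(hr_multiple, 2))
```

Because the model is multiplicative, the effects combine by multiplication: relative_hazard(2, 1) equals hr_two_months times hr_multiple.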
Variables not included in the model
The variable Diam was found not to significantly contribute to the prediction of time, and was not included in the model.
Baseline cumulative hazard function
Finally, the program lists the baseline cumulative hazard H0(t), with the cumulative hazard and survival at mean of all covariates in the model.
The baseline cumulative hazard can be used to calculate the survival probability S(t) for any case at time t:
S(t) = exp(-H0(t) × exp(PI))
where PI is a prognostic index:
PI = b1X1 + b2X2 + ... + bkXk
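As a sketch of this calculation, the following assumes the standard Cox relation S(t) = exp(-H0(t) × exp(PI)), with PI the sum of each coefficient times the case's covariate value. The baseline cumulative hazard of 0.05 and the covariate values of the case are hypothetical and are not taken from the example output; the function names are also introduced only for illustration.

```python
import math

def prognostic_index(coefficients, covariates):
    """PI = b1*x1 + b2*x2 + ... + bk*xk."""
    return sum(b * x for b, x in zip(coefficients, covariates))

def survival_probability(baseline_cum_hazard, pi):
    """S(t) = exp(-H0(t) * exp(PI)), with H0(t) the baseline cumulative hazard."""
    return math.exp(-baseline_cum_hazard * math.exp(pi))

coefficients = [0.0429, 0.9635]  # Dis, Mult
case = [6, 1]                    # hypothetical case: 6 months to dissolve, multiple stones
h0_t = 0.05                      # hypothetical baseline cumulative hazard at time t

pi = prognostic_index(coefficients, case)
s_t = survival_probability(h0_t, pi)
print(round(pi, 4), round(s_t, 3))
```

A case with all covariates equal to 0 has PI = 0, so its survival probability reduces to exp(-H0(t)).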
The graph displays the survival curves for all categories of the categorical variable Mult (1 in case of multiple previous gallstones, 0 in case of single previous gallstones), and for mean values for all other covariates in the model.
If no covariate was selected for Graph - Subgroups, or if the selected variable was not included in the model, then the graph displays a single survival curve at mean of all covariates in the model.
Sample size considerations
Sample size calculation for logistic regression is a complex problem, but based on the work of Peduzzi et al. (1995) the following guideline for a minimum number of cases to include in your study can be suggested.
Let p be the smaller of the proportions of negative and positive cases in the population, and k the number of predictor variables. The minimum number of cases to include is then:
N = 10 k / p
For example: you have 3 predictor variables to include in the model and the proportion of positive cases in the population is 0.20 (20%). The minimum number of cases required is
N = 10 x 3 / 0.20 = 150
If the resulting number is less than 100 you should increase it to 100 as suggested by Long (1997).
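The guideline above can be written as a small helper. This is a minimal sketch: the function name minimum_cases and its floor parameter are hypothetical, and rounding the result up to a whole number of cases is an assumption not stated in the text.

```python
import math

def minimum_cases(k, p, floor=100):
    """Minimum number of cases N = 10 k / p, raised to `floor` if smaller.

    k: number of predictor variables
    p: smaller of the proportions of negative and positive cases (0 < p <= 0.5)
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0 < p <= 0.5:
        raise ValueError("p must be in the interval (0, 0.5]")
    # Round first to avoid floating-point noise, then round up to whole cases.
    n = math.ceil(round(10 * k / p, 9))
    return max(n, floor)

print(minimum_cases(3, 0.20))  # 150
```

For a single predictor with p = 0.5 the formula gives 20, so the function returns the floor of 100 instead.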