Inter-rater agreement

Command: Tests

Description

Use Inter-rater agreement to evaluate the agreement between two classifications (nominal or ordinal scales).

If the raw data are available in the spreadsheet, use Inter-rater agreement in the Statistics menu to create the classification table and calculate Kappa (Cohen 1960; Cohen 1968; Fleiss et al., 2003).

Agreement is quantified by the Kappa (K) statistic:

K is 1 when there is perfect agreement between the classification systems.
K is 0 when there is no agreement better than chance.
K is negative when agreement is worse than chance.
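These three cases follow from the definition of Kappa (Cohen 1960) as (observed agreement - chance agreement) / (1 - chance agreement). The Python sketch below illustrates that definition on a square table of counts. It is not MedCalc's code; the function name cohen_kappa and the small 2x2 tables are made up for illustration.

```python
def cohen_kappa(table):
    """Unweighted Kappa for a square table of counts (list of rows)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Observed agreement: proportion of cases on the diagonal.
    p_observed = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: expected diagonal proportion from the marginal totals.
    p_expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical tables, chosen only to show the three cases.
perfect = [[10, 0], [0, 10]]   # all cases on the diagonal
chance = [[5, 5], [5, 5]]      # agreement equal to chance
worse = [[0, 10], [10, 0]]     # systematic disagreement

print(cohen_kappa(perfect))  # 1.0
print(cohen_kappa(chance))   # 0.0
print(cohen_kappa(worse))    # -1.0
```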

Required input

In the dialog box, enter the counts for the two classification systems in a frequency table of up to 6x6 categories.

Select Weighted Kappa (Cohen 1968) if the data come from an ordered scale. If the data come from a nominal scale, do not select Weighted Kappa.

Use linear weights when the difference between the first and second category has the same importance as a difference between the second and third category, etc. If the difference between the first and second category is less important than a difference between the second and third category, etc., use quadratic weights.
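The text above does not give the weight formulas. The sketch below assumes the usual agreement weights for k ordered categories: linear weights 1 - |i - j| / (k - 1) and quadratic weights 1 - (i - j)^2 / (k - 1)^2. The function name agreement_weights is hypothetical, and MedCalc's internal weights may be defined differently.

```python
def agreement_weights(k, kind="linear"):
    """Return a k x k matrix of agreement weights (1 on the diagonal)."""
    if kind not in ("linear", "quadratic"):
        raise ValueError("kind must be 'linear' or 'quadratic'")
    power = 1 if kind == "linear" else 2
    return [
        [1 - (abs(i - j) ** power) / ((k - 1) ** power) for j in range(k)]
        for i in range(k)
    ]


# Weights for a 3-category ordered scale.
linear_3 = agreement_weights(3, "linear")
quadratic_3 = agreement_weights(3, "quadratic")

for row in linear_3:
    print(row)
for row in quadratic_3:
    print(row)
```

With three categories, a disagreement of one category receives a weight of 0.5 under linear weights and 0.75 under quadratic weights, while a disagreement of two categories receives a weight of 0 under both.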

Inter-rater agreement (Kappa) test - dialog box

In this example, of the 6 cases that observer B has placed in class 1, observer A has placed 5 in class 1 and 1 in class 2. Of the 19 cases that observer B has placed in class 2, observer A has placed 3 in class 1, 12 in class 2 and 4 in class 3. Of the 12 cases that observer B has placed in class 3, observer A has placed 2 in class 1, 2 in class 2 and 8 in class 3.
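The counts described in this example can be written as a 3x3 table and checked by hand. The Python sketch below is an illustration only, not MedCalc's implementation. It assumes rows are observer A and columns are observer B (Kappa is the same if the table is transposed), and it assumes the usual linear and quadratic agreement weights. The names example_table and weighted_kappa are hypothetical.

```python
# Rows: observer A classes 1-3; columns: observer B classes 1-3.
example_table = [
    [5, 3, 2],
    [1, 12, 2],
    [0, 4, 8],
]


def weighted_kappa(table, kind=None):
    """Kappa for a square count table; kind is None, 'linear' or 'quadratic'."""
    k = len(table)
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        if kind is None:          # unweighted: only exact agreement counts
            return 1.0 if i == j else 0.0
        power = 1 if kind == "linear" else 2
        return 1 - (abs(i - j) ** power) / ((k - 1) ** power)

    cells = [(i, j) for i in range(k) for j in range(k)]
    p_observed = sum(weight(i, j) * table[i][j] for i, j in cells) / n
    p_expected = sum(weight(i, j) * rows[i] * cols[j] for i, j in cells) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


kappa_unweighted = weighted_kappa(example_table)
kappa_linear = weighted_kappa(example_table, "linear")
kappa_quadratic = weighted_kappa(example_table, "quadratic")

print(f"{kappa_unweighted:.4f}")  # 0.4955
print(f"{kappa_linear:.4f}")      # 0.5168
print(f"{kappa_quadratic:.4f}")   # 0.5426
```

The observed agreement is 25/37 (the diagonal cells) and the chance agreement is 489/1369, which gives an unweighted Kappa of 436/880.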

After you have entered the data, click Test.

Results

MedCalc calculates the value of Kappa with its standard error and 95% confidence interval (CI).

MedCalc calculates the inter-rater agreement statistic Kappa according to Cohen (1960) and weighted Kappa according to Cohen (1968). Computational details are also given in Altman (1991, pp. 406-407). The standard error and 95% confidence interval are calculated according to Fleiss et al. (2003).

The standard errors reported by MedCalc are the appropriate standard errors for testing the hypothesis that the underlying value of weighted kappa is equal to a prespecified value other than zero (Fleiss et al., 2003).
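The exact computation used by MedCalc is not shown here. As a rough sketch, the Python code below applies the commonly cited large-sample standard error for (weighted) Kappa under a non-zero true value, together with a normal-approximation interval. The formula as coded, the function name kappa_with_ci and the z value of 1.96 are assumptions for illustration, so results may differ from MedCalc's output.

```python
import math


def kappa_with_ci(table, weights=None, z=1.96):
    """Return (kappa, standard error, (lower, upper)) for a square count table.

    weights is a k x k matrix of agreement weights; None means unweighted.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    if weights is None:
        weights = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    p = [[table[i][j] / n for j in range(k)] for i in range(k)]
    row = [sum(p[i]) for i in range(k)]
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]
    cells = [(i, j) for i in range(k) for j in range(k)]

    p_obs = sum(weights[i][j] * p[i][j] for i, j in cells)
    p_exp = sum(weights[i][j] * row[i] * col[j] for i, j in cells)
    kappa = (p_obs - p_exp) / (1 - p_exp)

    # Marginal-weighted averages of the weights for each row and column.
    w_row = [sum(col[j] * weights[i][j] for j in range(k)) for i in range(k)]
    w_col = [sum(row[i] * weights[i][j] for i in range(k)) for j in range(k)]

    total = sum(
        p[i][j] * (weights[i][j] - (w_row[i] + w_col[j]) * (1 - kappa)) ** 2
        for i, j in cells
    )
    variance = (total - (kappa - p_exp * (1 - kappa)) ** 2) / (n * (1 - p_exp) ** 2)
    se = math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding
    return kappa, se, (kappa - z * se, kappa + z * se)


# Counts from the dialog box example (rows: observer A, columns: observer B).
counts = [[5, 3, 2], [1, 12, 2], [0, 4, 8]]
kappa, se, (lower, upper) = kappa_with_ci(counts)
print(f"Kappa = {kappa:.4f}, SE = {se:.4f}, 95% CI = {lower:.4f} to {upper:.4f}")
```

The interval is a plain normal approximation and is not truncated at 1, so for tables with very high agreement the upper limit can exceed the maximum possible value of Kappa.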

The K value can be interpreted as follows (Altman, 1991):

Value of K	Strength of agreement
< 0.20	Poor
0.21 - 0.40	Fair
0.41 - 0.60	Moderate
0.61 - 0.80	Good
0.81 - 1.00	Very good
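The table can be expressed as a simple lookup. The Python sketch below is illustrative only; the function name strength_of_agreement is hypothetical. Because the table leaves a value of exactly 0.20 and values that fall between bands (such as 0.405) unassigned, the sketch assumes each band includes its upper limit.

```python
def strength_of_agreement(kappa):
    """Map a Kappa value to the label in the interpretation table."""
    # Assumption: each band includes its upper limit (e.g. 0.20 is "Poor").
    bands = [
        (0.20, "Poor"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Good"),
        (1.00, "Very good"),
    ]
    for upper_limit, label in bands:
        if kappa <= upper_limit:
            return label
    raise ValueError("Kappa cannot be greater than 1")


print(strength_of_agreement(0.4955))  # Moderate
print(strength_of_agreement(-0.10))   # Poor
print(strength_of_agreement(0.85))    # Very good
```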

In an optional Comment input field you can enter a comment or conclusion that will be included on the printed report.

Literature

Altman DG (1991) Practical statistics for medical research. London: Chapman and Hall.
Cohen J (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20:37-46.
Cohen J (1968) Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70:213-220.
Fleiss JL, Levin B, Paik MC (2003) Statistical methods for rates and proportions, 3rd ed. Hoboken: John Wiley & Sons.