
Selection of a Rating Scale in Receiver Operating Characteristic Studies: Some Remaining Issues

Rationale and Objectives

The aim of this study is to compare the ratings of a group of readers who used two different rating scales in a receiver operating characteristic (ROC) study and to clarify some remaining issues that arise when selecting a rating scale for such studies.

Materials and Methods

We reanalyzed a previously conducted ROC study in which readers used both a 5-point and a 101-point scale to identify abdominal masses in 95 cases. Summary statistics include the distribution of scores by reader for each of the rating scales, the proportion of tied scores on the 5-point scale that were correctly resolved on the 101-point scale, and the proportion of paired normal-abnormal cases for which the two rating scales resulted in a different selection of the abnormal case.
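
To make these summary statistics concrete, the sketch below shows how two of them can be computed from paired ratings. The data, the 5-point cut points, and all variable names are hypothetical illustrations, not values from the study.

```python
import numpy as np

# Hypothetical paired ratings for one reader: each case scored on a
# 101-point (0-100) scale; a 5-point score derived via illustrative cut points.
rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=95)                     # 1 = abnormal, 0 = normal
score_101 = rng.integers(0, 101, size=95)               # 0-100 confidence rating
score_5 = np.digitize(score_101, [10, 30, 60, 85]) + 1  # categories 1-5

# Distribution of scores on the 5-point scale (first summary statistic).
cats, counts = np.unique(score_5, return_counts=True)
print(dict(zip(cats.tolist(), counts.tolist())))

# Of the normal-abnormal pairs tied on the 5-point scale, the proportion that
# the 101-point scale resolves correctly (abnormal rated higher than normal).
normals = np.flatnonzero(truth == 0)
abnormals = np.flatnonzero(truth == 1)
tied = resolved_correctly = 0
for n in normals:
    for a in abnormals:
        if score_5[n] == score_5[a]:                    # tied on the coarse scale
            tied += 1
            resolved_correctly += int(score_101[a] > score_101[n])
print(f"ties resolved correctly: {resolved_correctly}/{tied}")
```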

Results

As a group, the readers used 84 of the 101 available categories on the 101-point scale, but the specific categories used differed across individual readers. All readers tended to resolve the majority of ties on the 5-point scale in favor of correct decisions and to maintain correct decisions when the more refined scale was used.

Conclusions

The reanalysis presented here provides additional evidence that readers in an ROC study can adjust to a 101-point scale and that the use of such a refined scale can increase discriminative ability. However, the selection of an appropriate scale should also take into account the underlying abnormality in question and relevant clinical considerations.

In some fields, including but not limited to radiology, the application of receiver operating characteristic (ROC)-type rating systems often assumes an underlying continuous scale that is approximated by a discrete categorization. Historically, a 5-point (or 6-point) rating scale has been used for this purpose, and this approach may have advantages when it is closely related to a set of commonly used diagnostic decisions or recommendations ( ). More recently, a 101-point scale has been suggested for this purpose ( ). Because of its large number of categories, a 101-point scale can be treated as a continuous scale and can therefore avoid some of the analytic complexities associated with a discrete ordinal scale. Although several authors have discussed the limitations associated with each of these two approaches ( ), some general issues remain. Furthermore, the possibility that some decisions in radiology may be more appropriately viewed as inherently binary ( ) potentially increases the magnitude of the differences that can occur between discrete and continuous scales, because a binary decision may be viewed as the use of a 2-point (discrete) scale. The purpose of this article is to clarify several issues regarding the selection of a rating scale in an ROC study by comparing the actual ratings used by a group of readers in a study that employed both a 5-point and a 101-point scale to identify abdominal masses ( ). We also present some summary statistics useful for describing the effect of refining a given ordinal scale.

First, it should be recognized that the statistical aspects of contrasting different rating systems depend on the true underlying categorization. If the true underlying scale is continuous, then a discrete scale by definition carries less information and will ultimately be inferior when compared using standard statistical measures. Wagner et al ( 6 ) demonstrated this in a comparison between a 5-point and a continuous scale when data are generated from an underlying continuous scale. Conversely, if the true underlying scale is discrete, then using a larger number of possible ratings may increase the variance. Gur et al ( 7 ) demonstrated this in a simulation study comparing a dichotomous rating with a continuous rating when the true underlying scale is dichotomous. Thus, the scale with the most desirable statistical properties often depends on the scale that is conceptually considered “correct” (or perhaps clinically relevant).
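
The first point can be illustrated with a small simulation in the spirit of the comparison in Wagner et al ( 6 ), though not a reproduction of it: when the latent scores are truly continuous, collapsing them into 5 categories can only discard ordering information, so the empirical area under the ROC curve (AUC) from the binned ratings tends to be slightly lower. The distributions and cut points below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_auc(neg, pos):
    """P(positive scores higher than negative) + 0.5 * P(tie), over all pairs."""
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Latent scores generated from a truly continuous (binormal) model.
neg = rng.normal(0.0, 1.0, size=500)   # nondiseased cases
pos = rng.normal(1.0, 1.0, size=500)   # diseased cases

# Collapse the continuous scores into 5 ordered categories (illustrative cuts).
cuts = np.quantile(np.r_[neg, pos], [0.2, 0.4, 0.6, 0.8])
print("continuous AUC:", round(empirical_auc(neg, pos), 3))
print("5-point AUC:  ", round(empirical_auc(np.digitize(neg, cuts),
                                            np.digitize(pos, cuts)), 3))
```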

Methods

Results

Table 1

Range of Continuous Scores Corresponding to Discrete Scores for Each of the Readers Evaluated in this Study

| Discrete Rating | Reader 1 | Reader 2 | Reader 3 | Reader 4 | Reader 5 |
| --- | --- | --- | --- | --- | --- |
| 1 | 0–5 | 0–10 | 0–12 | 0–8 | 0–5 |
| 2 | 5–21 | 10–20 | 12–50 | 8–35 | 5–24 |
| 3 | 21–80 | 20–60 | 50–71 | 35–62 | 24–63 |
| 4 | 80–95 | 60–90 | 71–89 | 62–84 | 63–94 |
| 5 | 95–100 | 90–100 | 89–100 | 84–100 | 94–100 |
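
A table of this form can be derived mechanically from the paired ratings: for each reader, take the minimum and maximum 101-point score assigned within each 5-point category. A minimal sketch, assuming the ratings sit in a long-format table with hypothetical columns reader, score_5, and score_101:

```python
import pandas as pd

# Toy long-format ratings; column names and values are hypothetical.
df = pd.DataFrame({
    "reader":    [1, 1, 1, 2, 2, 2],
    "score_5":   [1, 1, 2, 1, 2, 2],
    "score_101": [0, 5, 12, 3, 15, 20],
})

# Range of 101-point scores observed within each 5-point category, per reader.
ranges = (df.groupby(["reader", "score_5"])["score_101"]
            .agg(["min", "max"])
            .unstack("reader"))
print(ranges)
```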

Table 2

Cross-Classification of All Possible Pairings of Negative and Positive Cases Compared with the Verified Clinical Truth

| 5-Point \ 101-Point | Correct (ND < D) | Tied | Incorrect (D < ND) | Total |
| --- | --- | --- | --- | --- |
| Correct (ND < D) | 8,339 | 122 | 341 | 8,802 |
| Tied | 790 | 120 | 312 | 1,222 |
| Incorrect (D < ND) | 327 | 33 | 446 | 806 |
| Total | 9,456 | 275 | 1,099 | 10,830 |

ND, rating assigned to the negative (nondiseased) case of a pair; D, rating assigned to the positive (diseased) case. Rows classify each pairing by the 5-point scale; columns by the 101-point scale.
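
A useful identity for reading this table: the empirical AUC equals the proportion of correctly ordered normal-abnormal pairs plus half the proportion of tied pairs. Applying it to the marginal totals gives a back-of-envelope check (not the article's formal analysis):

```python
# Empirical AUC = (correctly ordered pairs + 0.5 * tied pairs) / all pairs,
# using the marginal totals from Table 2.
total = 10_830
auc_5   = (8_802 + 0.5 * 1_222) / total   # 5-point scale (row margins)
auc_101 = (9_456 + 0.5 * 275)   / total   # 101-point scale (column margins)
print(f"5-point AUC:   {auc_5:.3f}")      # 0.869
print(f"101-point AUC: {auc_101:.3f}")    # 0.886
```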

Discussion

Conclusion

References

  • 1. Zhou X.H., Obuchowski N.A., McClish D.K.: Statistical methods in diagnostic medicine. New York: John Wiley & Sons; 2002.

  • 2. Rockette H.E., Gur D., Metz C.E.: The use of continuous and discrete confidence judgments in receiver operating characteristic studies of diagnostic imaging techniques. Invest Radiol 1992; 27: pp. 169-172.

  • 3. Metz C.E., Herman B.A., Shen J.H.: Maximum likelihood estimation of ROC curves from continuously distributed data. Stat Med 1998; 17: pp. 1033-1053.

  • 4. Houn F.H., Bright R.A., Busher H.F., et. al.: Study design in the evaluation of breast cancer imaging technologies. Acad Radiol 2000; 7: pp. 684-692.

  • 5. Berbaum K.S., Dorfman D.D., Franken E.A., et. al.: An empirical comparison of discrete ratings and subjective probability ratings. Acad Radiol 2002; 9: pp. 756-763.

  • 6. Wagner R.F., Beiden S.V., Metz C.E.: Continuous vs. categorical data for ROC analysis. Acad Radiol 2001; 8: pp. 328-334.

  • 7. Gur D., Rockette H.E., Bandos A.I.: “Binary” and “non-binary” detection tasks: are current performance measures optimal?. Acad Radiol 2007; 14: pp. 871-876.

  • 8. Walsh S.J.: Limitations to the robustness of binormal ROC curves: effects of model misspecification and location of decision thresholds on bias, precision, size and power. Stat Med 1997; 16: pp. 669-679.

  • 9. Hadjiiski L., Chan H.P., Sahiner B., et. al.: Quasi-continuous and discrete confidence rating scales for observer performance studies: Effects on ROC analysis. Acad Radiol 2007; 14: pp. 38-48.

  • 10. Metz C.E., Pan X.: “Proper” binormal ROC curves: theory and maximum likelihood estimation. J Math Psychol 1999; 43: pp. 1-33.
