Examples of Criterion Validity

For examples of applications of criterion validity, we turn to two recent studies in the psychiatric literature. The first study (Addington et al., 1993) provides a fairly typical example of the use of correlational techniques. At issue was whether a self-report instrument can be used in populations of patients with schizophrenia to obtain valid ratings of depression. To examine this question, the authors compared self-report ratings obtained using the Beck Depression Inventory (BDI) with ratings of the Calgary Depression Scale (CDS), a semistructured interview designed to assess depression in schizophrenics. In this study, the CDS is the criterion because it makes use of informed judgements by trained clinicians, which form the current "gold standard" for identifying depression in clinical populations. BDI and CDS scores were compared by calculating the Pearson product-moment correlation coefficient (e.g., see Woolson, 1987), after creating scatterplots to examine the joint distribution of BDI and CDS scores and to identify any outliers. The latter step was essential because the presence of even a single outlier (i.e., an extreme and atypical value) could easily distort the product-moment correlation (e.g., see Simpson, 1982).
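The sensitivity of the product-moment correlation to a single outlier can be illustrated with a minimal sketch. The paired scores below are hypothetical and are not data from Addington et al. (1993); the function name `pearson_r` is likewise only illustrative.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired ratings (assumed for illustration only).
bdi = [2, 5, 8, 11, 14, 17, 20, 23]
cds = [1, 2, 4, 5, 7, 8, 10, 11]

r_clean = pearson_r(bdi, cds)

# Add one extreme, atypical pair: a very high BDI score with a CDS score of 0.
r_outlier = pearson_r(bdi + [60], cds + [0])

print(f"r without the outlier: {r_clean:.2f}")
print(f"r with one outlier:    {r_outlier:.2f}")
```

In this constructed example, one atypical pair is enough to turn a nearly perfect positive correlation into a weak one, which is why inspecting a scatterplot before computing the coefficient is worthwhile.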

Another important methodologic step employed by Addington et al. (1993) was to compare correlations between the BDI and CDS in clinically distinct subgroups of schizophrenic patients: inpatients versus outpatients, and (within these subgroups) patients who either did or did not require assistance in completing the self-report instrument. In this particular study, the correlation between the BDI and CDS was stronger among inpatients than outpatients, regardless of whether the patients required assistance (r = 0.84 vs. r = 0.96). However, the substantially greater percentage of inpatients requiring assistance (34% of inpatients vs. 12% of the outpatients) led the authors to conclude that "depressed affect can be assessed in patients with schizophrenia by both self-report and structured interview, but the Beck Depression Inventory poses difficulties in use with inpatients" (Addington et al., 1993, p. 561).

For our purposes, however, the substantive findings of this study were less important than the fact that this study admirably illustrated the critical importance of selecting and describing validation samples that are clinically meaningful in the context of the measurement instrument of interest (Streiner, 1993). In particular, users of such instruments need to be aware that published validation studies might have used "samples of convenience" (e.g., university students) that do not approximate the clinical population the user has in mind and that the results of such studies do not necessarily generalize to other samples.

Our second example of criterion validity in psychiatric research (Somervell et al., 1993) also illustrates the critical importance of the validation sample. This study investigated the validity of using a questionnaire (the Center for Epidemiologic Studies Depression Scale, or CES-D; Radloff, 1977) as a case identification tool in studies of mood disorders among Native Americans. CES-D scores were compared with DSM-III-R diagnoses (American Psychiatric Association, 1987) based on a structured psychiatric interview (the Lifetime Version of the Schedule for Affective Disorders and Schizophrenia; Endicott and Spitzer, 1978). The authors had concerns about the cross-cultural applicability not only of the screening instrument but also of the criterion itself (e.g., DSM-III-R diagnoses of affective disorders). For purposes of the study, however, it was assumed that DSM-III-R diagnoses would be relevant among Native Americans.

Although the CES-D, like the BDI in the above example, yields a numerical score, its proposed use as a screening instrument for depression was for the purpose of identifying not the degree of depression, but the presence of a particular clinical syndrome, namely, DSM-III-R major depression. The criterion was therefore a categorical (i.e., qualitative) rating rather than a numerical (i.e., quantitative) rating, making it inappropriate to use correlational procedures. Instead, to evaluate the validity of the instrument for case identification, the authors employed statistical methods that have been expressly developed for qualitative data, including sensitivity, specificity, and receiver operating characteristic (ROC) analysis.

Sensitivity and specificity are both calculated using data that have been summarized in a 2 x 2 table of frequencies (see Table 1 for definitions and computational formulas). In the example at hand, a 2 x 2 table was used to cross-classify the numbers of screened persons with and without the criterion (e.g., a DSM-III-R diagnosis of major depression) who either did or did not score above the cutoff for depression in the screening instrument, the CES-D. (ROC analysis was used to determine the optimal cutoff value for the CES-D.) As an illustrative finding, the sensitivity for DSM-III-R major depression was 100% (i.e., all three persons in the sample with a diagnosis of major depression scored above the cutoff on the CES-D). The corresponding value of specificity was 82% (i.e., 82% of those persons in the sample who did not have diagnoses of major depression scored below the CES-D cutoff for depression). It follows directly from the reported specificity value of 82% that 18% (100% - 82%) of the persons in the sample with no psychiatric diagnoses or with DSM-III-R diagnoses other than major depression scored above the CES-D cutoff and would have been classified as depressed by that screening instrument.

TABLE 1. Computation of Indices of Criterion Validity and Predictive Value

                    Criterion present    Criterion absent    Total
Positive rating     a                    b                   a + b
Negative rating     c                    d                   c + d
Total               a + c                b + d               N

Note: a, b, c, d, and N are frequencies (e.g., numbers of persons rated).
Sensitivity = a/(a + c); the probability of a positive rating among those possessing the criterion.
Specificity = d/(b + d); the probability of a negative rating among those lacking the criterion.
Positive predictive value = a/(a + b); the probability of having the criterion among those with positive ratings.
Negative predictive value = d/(c + d); the probability that those with negative ratings will not have the criterion.
Prevalence = (a + c)/N; the base rate of the criterion in the validation sample.
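The formulas in Table 1 can be sketched in a few lines of code. The cell counts used below are not reported in this chapter; they are inferred, as an assumption, from the figures quoted in the text (three cases among 120 persons, and a specificity of 82.1% at a cutoff of 16). The function name `criterion_indices` is hypothetical.

```python
def criterion_indices(a, b, c, d):
    """Indices of Table 1 from the four cell frequencies of a 2 x 2 table.

    a: positive rating, criterion present
    b: positive rating, criterion absent
    c: negative rating, criterion present
    d: negative rating, criterion absent
    """
    n = a + b + c + d
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "false_positive_rate": b / (b + d),  # equals 1 - specificity
        "positive_predictive_value": a / (a + b),
        "negative_predictive_value": d / (c + d),
        "prevalence": (a + c) / n,
    }

# Inferred (not reported) counts: 3 cases, all above the cutoff;
# 117 non-cases, of whom 96 fall below the cutoff and 21 above it.
table = criterion_indices(a=3, b=21, c=0, d=96)

for name, value in table.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, the false-positive rate is simply the complement of specificity, which is the 18% figure discussed above.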

Whether or not this degree of misclassification error (or invalidity) is considered to be an unacceptably high "false-positive" rate depends on the proposed use of the instrument and on the comparable "operating characteristics" of alternative instruments. For example, a higher CES-D cutoff value could be expected to decrease the false-positive rate (via increased specificity), but at the expense of sensitivity. In this particular study, a higher CES-D cutoff actually increased specificity without decreasing sensitivity, but this was probably attributable to the small number of cases with DSM-III-R diagnoses of major depression. In most studies there is a systematic trade-off between sensitivity and specificity, and for that reason both of these indices of criterion validity must be considered together in determining whether a particular instrument is more valid than the available alternatives. ROC analysis provides a useful framework for making such comparisons (e.g., see Murphy et al., 1987). In the present example, the non-negligible false-positive rate was consistent with the investigators' concerns (based on previous research by a number of researchers using other samples) that the CES-D might be reflecting symptoms of not only major depression but also increased levels of anxiety, demoralization, or even physical ill health (Somervell et al., 1993).
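The trade-off between sensitivity and specificity can be made concrete by sweeping the cutoff over a set of scores, which is the basic idea behind an ROC analysis. The sketch below uses hypothetical scores and diagnoses (not data from Somervell et al., 1993) and assumes that a score at or above the cutoff counts as a positive rating.

```python
def sens_spec_at_cutoff(scores, has_criterion, cutoff):
    """Return (sensitivity, specificity), treating score >= cutoff as positive."""
    pairs = list(zip(scores, has_criterion))
    tp = sum(1 for s, y in pairs if y and s >= cutoff)
    fn = sum(1 for s, y in pairs if y and s < cutoff)
    fp = sum(1 for s, y in pairs if not y and s >= cutoff)
    tn = sum(1 for s, y in pairs if not y and s < cutoff)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical screening scores and criterion diagnoses (illustration only).
scores = [4, 7, 9, 12, 15, 17, 18, 21, 24, 26, 30, 35]
has_criterion = [False, False, False, False, False, True,
                 False, False, True, False, True, True]

# Each cutoff yields one point on the ROC curve: (1 - specificity, sensitivity).
roc_points = []
for cutoff in (10, 16, 22, 28):
    sens, spec = sens_spec_at_cutoff(scores, has_criterion, cutoff)
    roc_points.append((cutoff, sens, spec))
    print(f"cutoff {cutoff}: sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

In this constructed sample, raising the cutoff from 10 to 16 improves specificity at no cost in sensitivity, as happened in the study, whereas raising it further trades sensitivity for specificity.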

The study by Somervell et al. (1993) also illustrates the difference between criterion validity and the related, but nevertheless distinct, concept of predictive value. Positive predictive value is literally the predictive value of a positive rating, that is, the probability of having the criterion of interest given a positive rating on the instrument under investigation. (Formulas for calculating positive predictive value, and the related index, negative predictive value, are given in Table 1.) Since the criterion (e.g., DSM-III-R major depression) is frequently of more direct clinical importance than the rating (e.g., a particular CES-D score), positive and negative predictive values are often more clinically meaningful than sensitivity and specificity. For example, most clinicians would probably be more interested in the usefulness of the CES-D for predicting major depression than the other way around. However, positive predictive value is a joint function of sensitivity, specificity, and prevalence, such that low prevalence values can severely constrain the values of positive predictive value that can be realistically attained, even with very high sensitivity and specificity values (Baldessarini et al., 1983; Glaros and Kline, 1988). (Negative predictive value is similarly constrained by high prevalence values.)
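The dependence of predictive value on prevalence can be written out explicitly. The expressions below follow from the Table 1 definitions by Bayes' theorem; the symbols Se (sensitivity), Sp (specificity), and P (prevalence) are shorthand introduced here and are not notation used in the original.

```latex
\[
\mathrm{PPV} = \frac{Se \cdot P}{Se \cdot P + (1 - Sp)(1 - P)},
\qquad
\mathrm{NPV} = \frac{Sp\,(1 - P)}{Sp\,(1 - P) + (1 - Se)\,P}
\]
```

When P is small, the term (1 - Sp)(1 - P) in the denominator of the first expression dominates unless Sp is very close to 1, which is why positive predictive value stays low in rare disorders. The second expression shows the mirror-image constraint on negative predictive value when P is large.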

In the study by Somervell et al. (1993), the prevalence of major depression can be estimated from the rate of major depression in the sample as 3/120 = 0.025. Using a cutoff value of 16 on the CES-D, the reported specificity value of 82.1% therefore corresponds to a positive predictive value of 0.125. In other words, even though sensitivity was perfect (100%) and specificity was very high, only one of every eight persons who scored above the CES-D cutoff of 16 would be expected actually to have major depression. Even increasing the CES-D cutoff to improve specificity would not dramatically change this result. Again, this is due to the constraint imposed by the low estimated prevalence of major depression in the study population. (With the CES-D cutoff set at 28, the reported specificity value of 96.6% corresponds to a positive predictive value of 0.429.) In conclusion, this example shows that even though an instrument may have excellent criterion validity as assessed using standard indices (namely, sensitivity and specificity), the actual predictive value of the instrument could be much more limited, depending on the prevalence of the disorder of interest, which in turn may vary with the composition of the validation sample.
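The arithmetic behind these figures can be checked with a short sketch that applies Bayes' theorem to the reported sensitivity, specificity, and prevalence. The function name is hypothetical, and the higher prevalence values in the loop are assumed purely for comparison. Because the reported specificities are rounded, the computed values may differ from the published ones in the third decimal place.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of having the criterion given a positive rating."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)

prevalence = 3 / 120  # estimated from the sample, as in the text

ppv_16 = positive_predictive_value(1.0, 0.821, prevalence)  # cutoff of 16
ppv_28 = positive_predictive_value(1.0, 0.966, prevalence)  # cutoff of 28
print(f"cutoff 16: PPV = {ppv_16:.3f}")
print(f"cutoff 28: PPV = {ppv_28:.3f}")

# Same sensitivity and specificity at hypothetical higher prevalences.
for p in (0.025, 0.10, 0.30):
    ppv = positive_predictive_value(1.0, 0.821, p)
    print(f"prevalence {p:.3f}: PPV = {ppv:.3f}")
```

Holding sensitivity and specificity fixed, the positive predictive value rises steadily as the assumed prevalence increases, which is the constraint described in the paragraph above.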
