The usual way to evaluate the validity of an epidemiological instrument is to compare the diagnoses it produces in a sample of respondents with the diagnoses produced by clinicians who later reinterview the same respondents. The problem with this method is that it identifies neither the source of disagreements nor whether the fault lies with the epidemiological instrument or with the clinician's reinterview.
Note that the validity of the diagnoses produced by an epidemiological instrument is the product of the validity of decisions made at a series of steps. A satisfactory measure of validity provides evaluation at each of these steps so that authors are guided to make necessary improvements.
Steps at Which to Assess the Validity of a Diagnostic Interview
1. Do the interview's questions accurately and exhaustively map onto the symptoms and other criteria in the Diagnostic Manual?
2. Are the skip rules correct, so that respondents are not asked questions to which only one answer is logically possible?
3. Do respondents understand the questions as intended?
4. Do the respondents have the information necessary to answer the questions?
5. Are the questions acceptable to respondents?
6. Was the interviewing situation conducive to frank and complete answers?
7. Have the respondents' answers been correctly entered into the database?
8. Does the computer diagnostic program accurately map onto the diagnostic algorithms in the Manual?
9. Does the computer program distinguish cases where a disorder's absence is equivocal from cases known to be negative?
Validity should be assessed at each of these steps through a method appropriate to that particular step. Such methods have not yet been fully specified, but some useful suggestions can be made.
For Step 1, a panel of experts can evaluate interview questions against the Manual's description of each symptom and criterion to see whether the questions are precisely on target and the target is fully covered. When the experts are not satisfied, the questions need to be rewritten and reevaluated.
Errors at Step 2 can be detected by checking that an instruction to skip a subsequent question has been inserted wherever a past response signifies that the respondent cannot meaningfully answer that question or that the answer would only repeat information already obtained. Conversely, there must be no skip instruction before a question to which an answer can be informative. Careful editing can discover skip errors.
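The skip logic described above can be sketched in code. This is a minimal illustration, not any instrument's actual implementation; the question identifiers and the single rule are invented for the example.

```python
# Hypothetical skip-rule sketch. Question names ("symptom_ever",
# "symptom_onset", "symptom_frequency") are illustrative only.

def should_skip(responses, question_id):
    """Return True if a prior response makes question_id uninformative."""
    # Example rule: if the respondent denied ever having the symptom,
    # follow-up questions about its onset and frequency cannot be
    # meaningfully answered and must be skipped.
    if question_id in ("symptom_onset", "symptom_frequency"):
        return responses.get("symptom_ever") == "no"
    return False

responses = {"symptom_ever": "no"}
print(should_skip(responses, "symptom_onset"))   # True: follow-up is skipped
print(should_skip(responses, "symptom_ever"))    # False: stem is always asked
```

Checking the interview amounts to verifying, for every question, that a rule like this exists exactly when a prior answer makes the question uninformative, and never otherwise.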
Step 3. Whether questions are understood as the authors intended can be determined by asking a small group of members of the population to be sampled to rephrase the questions in their own words. Experts must judge whether the rephrasing means the same thing as the original question. If it does not, they can pursue the source of the misinterpretation, rewrite the question, and retest it in the same way. Lack of shared meanings is particularly likely when an interview is given to respondents who have different educational levels or come from regions with their own idioms. To overcome these differences in understanding, interviews should be written in simple language and should not contain idiomatic expressions.
Evaluating the validity of questions written in a different language is particularly challenging. Currently, the most popular method of testing correctness of translation is back-translation, that is, having a bilingual speaker who is unfamiliar with the interview in its original language translate the translation into the original language. Discrepancies are taken to mean that the translation is incorrect. That conclusion is probably justified. What is not justified is assuming that agreements with the original interview show that the translation is adequate. This is because it may be easier for a bilingual back-translator to guess the form of the question in the original language when the translation is particularly poor, that is, when it uses the grammatical structure and idioms of the original language and uses cognates that have a different meaning in the two languages. Several solutions have been suggested, for example, translating both the original and the translation into the same third language, or having the translation done independently by several bilingual persons who then meet to reach consensus for each question on which translation best matches the meaning of the original.
Step 4. An abundance of "I don't know" responses to any question indicates that respondents do not have the information necessary to answer the question. Examples of questions likely to have this problem concern the etiology of symptoms and their dating and frequency. A solution is to ask the question in a form that minimizes demands for precision. Instead of "What caused (SYMPTOM)?" ask "Did any of your lab tests have positive results?" Instead of "How often did that happen?" ask "Did it happen more than 10 times?" Instead of "How old were you the first time it happened?" ask "Did it happen for the first time before you were 30?"
Step 5. Unacceptability of a question can be judged by respondents' frequent refusals to answer it, by their breaking off the interview after it is asked, or by their answering it dishonestly. Honesty can be judged by comparing answers with data from other sources, such as vital statistics and criminal justice records, and by checking for inconsistent responses.
Step 6. Invalidity due to a poor interviewing situation can be reduced by requiring that the interviewer find a time and place to be alone with the respondent. Privacy has been found in a number of studies to be a major factor in honesty.
Step 7. Being sure that the data entered into a data set are complete and accurate requires an editor who reviews each interview for omissions and legibility. It also requires a data entry and cleaning program that prevents violation of the skip rules and prevents illogical entries, such as the end of a symptom's predating its onset. Typing errors can be minimized by double entry of interview responses into the computer.
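The kinds of logical checks a cleaning program would enforce can be sketched as follows. The field names and the two rules are invented for illustration; a real cleaning program would encode the full set of skip rules and logical constraints of the particular interview.

```python
# Illustrative data-cleaning checks for entered interview records.
# Field names ("symptom_ever", "onset_age", "end_age") are assumptions.

def find_entry_errors(record):
    """Return a list of logical-consistency violations in one record."""
    errors = []
    # A symptom's remission must not predate its onset.
    onset, end = record.get("onset_age"), record.get("end_age")
    if onset is not None and end is not None and end < onset:
        errors.append("symptom ends before it begins")
    # Skip-rule violation: follow-ups must be blank when the stem was denied.
    if record.get("symptom_ever") == "no" and record.get("onset_age") is not None:
        errors.append("follow-up answered after symptom was denied")
    return errors

print(find_entry_errors({"symptom_ever": "yes", "onset_age": 30, "end_age": 25}))
# ['symptom ends before it begins']
```

Double entry is then a separate pass: the same interview is keyed twice and the two records compared field by field, with any mismatch resolved against the paper form.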
Step 8. The validity of diagnostic programs is enhanced by their transparency. Transparent programs allow persons other than experienced computer programmers to compare the program to the Manual's algorithms and check to be sure that the correct variables are used to evaluate each criterion. Techniques that improve transparency include using question numbers as names of variables and the Manual's names for the criteria as the computer's names for these criteria. Transparency is also enhanced by avoiding difficult-to-understand computer languages, even when they may be more efficient.
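A transparent scoring fragment in this style might look like the sketch below. The disorder, question numbers, and criterion labels are all invented; the point is only that variables named after question numbers and criteria named as in the Manual let a non-programmer check the code against the algorithm line by line.

```python
# Hypothetical transparent scoring fragment. Question numbers (q12-q18)
# and criterion labels (A, B, C) are illustrative, not from any manual.

def diagnose_example_disorder(r):
    """Score one respondent's record r against three named criteria."""
    # Criterion A: core symptom present (question 12).
    criterion_A = r["q12"] == "yes"
    # Criterion B: at least three of five associated symptoms (questions 13-17).
    criterion_B = sum(r[q] == "yes" for q in ("q13", "q14", "q15", "q16", "q17")) >= 3
    # Criterion C: clinically significant impairment (question 18).
    criterion_C = r["q18"] == "yes"
    return criterion_A and criterion_B and criterion_C

record = {"q12": "yes", "q13": "yes", "q14": "yes", "q15": "yes",
          "q16": "no", "q17": "no", "q18": "yes"}
print(diagnose_example_disorder(record))  # True: A, B (3 of 5), and C all met
```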
There has been little development of formal methods for assessing the accuracy of scoring programs, a serious lack (Marcus and Robins, 1998). One suggested solution requires two independently constructed scoring programs, which are then applied to the same data set. Discrepancies in their results require making corrections in one program or the other.
Step 9. Scoring programs must give a distinctive code to cases with too much missing information to justify a negative diagnosis. This code will make it possible to drop these cases from the denominator when calculating prevalence rates.
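The effect of such a distinctive code on prevalence calculation can be shown in a few lines. The three codes here are assumptions for the example; the point is that indeterminate cases leave both the numerator and the denominator.

```python
# Sketch: an "indeterminate" code keeps cases with too much missing
# information out of the prevalence calculation entirely.

POSITIVE, NEGATIVE, INDETERMINATE = "positive", "negative", "indeterminate"

def prevalence(diagnoses):
    """Prevalence among cases whose diagnostic status could be determined."""
    determinate = [d for d in diagnoses if d != INDETERMINATE]
    return sum(d == POSITIVE for d in determinate) / len(determinate)

print(prevalence([POSITIVE, NEGATIVE, NEGATIVE, INDETERMINATE]))
# 1 positive / 3 determinate cases, about 0.33 -- not 1/4
```

Without the distinctive code, the indeterminate case would silently be counted as negative and the prevalence estimate biased downward.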
If no problem is found with any of these nine steps, a study using this interview will produce valid results; that is, it will accomplish what the interview's authors intended. However, there is no guarantee that the diagnoses it produces will be correct. That will depend on whether the Manual's criteria are correct and correspond to real disorders. This distinction between an interview's validity, that is, its ability to assess the Manual's criteria correctly, and the Manual's validity is seldom made.
The usual test for validity of a standardized interview, rather than validating each step of its application as recommended above, is to compare its results with those obtained by an interview given by a mental health clinician. Disagreement of the epidemiological instrument's results with a clinician's diagnosis is not necessarily evidence for invalidity, because clinicians do not always apply criteria as specified in the manuals. They often ignore some of the disorders covered by the interview, and may make diagnoses based on presenting complaints, ignoring disorders now in remission. However, there are now structured clinical interviews—the WHO SCAN for ICD-10 and DSM-IV and the SCID for DSM-IV (Spitzer et al., 1988)—that guarantee that the clinicians administering them consider all the relevant diagnoses. They have been shown to have reasonably good agreement between clinicians trained to use them (Williams, 1992). These clinical instruments are certainly better "gold standards" than ordinary clinical interviews for evaluating the validity of lay-interviewer or self-administered interviews. But they are not immune to problems.
Clinicians using these interviews are free to ignore any positive response they feel is irrelevant to the "real" diagnosis. They may also press the respondent further if they doubt that the denial of a symptom was correct. These, of course, are the strengths of a clinical interview, the very reason that they may serve as a test of a standardized interview. But it is impossible to tell whether a clinician's disagreement with the lay interview is due to the invalidity of the lay interview or to the clinician's idiosyncratic interpretation of the Manual. Fully standardized interviews like the CIDI and DIS are committed to following the Manual precisely as written, with no room for interpretation. Clinicians' inclination to vary in their interpretations of the Manual, even when using a semistructured interview, was shown in validity studies of DIS-III as used in the Epidemiological Catchment Area project (Robins, 1985). In St. Louis and Baltimore, the DIS produced almost identical prevalences of specific disorders. Yet psychiatrists in the two cities, who reinterviewed subjects using different semistructured interviews both intended to make DSM-III diagnoses, produced very different patterns and rates of disorder, depending on where they had been trained. It was obvious that they applied the DSM-III rules very differently. Thus, disagreements between clinicians' results and results from the DIS and CIDI might reflect psychiatrists' dissatisfaction with some of the criteria in the Manual, rather than the invalidity of the epidemiological interview.
The validity criteria offered by Robins and Guze (1970), namely elevated rates of the disorder in family members, stability of the diagnosis over time, forecasting of known outcomes, and consistent laboratory findings, are frequently cited as excellent ways of assessing validity. Unfortunately for our purposes, these criteria were designed to validate the descriptions of disorders that one might find in a Diagnostic Manual, not the validity of interviews that try to represent these descriptions. They are not helpful in assessing how well a diagnostic interview achieves the goal of validly operationalizing the Manual.