Yale Center for Teaching and Learning

Considering Validity in Assessment Design

Validity is a key concept in assessment. Many methods are used to measure student learning and skill development. Scores (i.e., grades) are often reported back to students and function either as feedback to promote future progress or as a final evaluation that demonstrates a degree of mastery. However, if scores do not accurately represent student performance on a topic, unit, or course, they become impossible to interpret. It is therefore critically important that scores are meaningful and accurately reflect the purpose of the assessment. Instructors can improve the validity of their classroom assessments both when designing the assessment and when gathering and using evidence as scores are interpreted and reported back to students.

The definition and conceptualization of validity have evolved over time. Early definitions separated validity into distinct types, including content validity, face validity, curricular validity, construct validity, convergent validity, discriminant validity, criterion validity, predictive validity, and concurrent validity. However, contextual factors, the populations being tested, and the purposes for which tests are used vary and change over time. As a result, scholars argued that a test itself cannot be valid or invalid. Validity is now considered to be the “process of constructing and evaluating arguments for and against the identified interpretation of test scores and their relevance to the proposed use” (AERA, APA, NCME, 2014, p. 11). This conceptualization was agreed upon by a joint committee of scholars from the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education, and the standards the committee published in 2014 represent the current professional consensus on validity.

The standards emphasize that validity is the joint responsibility of test developers and the individuals who administer tests. Developers suggest appropriate interpretations of test scores for a specified population and provide initial evidence to support their process and arguments. Test users and administrators then examine and gather evidence for additional arguments that both the interpretation of the scores and the consequences of their use are appropriate, given the purpose of the instrument and the population being evaluated. Both groups must continue to gather validity evidence as the consequences of using the scores become more apparent.

Types of Validity Evidence and Recommended Strategies

The standards outline several general categories of validity evidence: evidence based on test content, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and evidence based on the consequences of testing.

Evidence Based on Test Content

This form of evidence is used to demonstrate that the content of the test (e.g., items, tasks, questions, and wording) is related to the construct it is intended to measure. For example, a classroom assessment should not include items or criteria that measure topics unrelated to the objectives of the course. Instructors can design a table of specifications for a test to ensure, and to communicate, how the content of a course or unit is being measured. For larger-scale assessments, a panel of experts is usually convened to design the table of specifications and to review questions to ensure that they are representative of the field of knowledge being measured.
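
As a rough illustration, a table of specifications can be laid out as a grid that maps course topics to cognitive levels and records how many items address each cell. The sketch below, in Python, uses invented topics, levels, and item counts; the point is simply to check that the weight given to each topic on the test matches the emphasis it received in the course.

    # Hypothetical table of specifications: rows are course topics, columns are
    # cognitive levels, and each cell is the number of items targeting that cell.
    table_of_specifications = {
        "Chemical bonding":     {"Recall": 4, "Application": 3, "Analysis": 1},
        "Chemical equilibrium": {"Recall": 3, "Application": 4, "Analysis": 2},
        "Acid-base chemistry":  {"Recall": 2, "Application": 2, "Analysis": 1},
    }

    # Intended emphasis of each topic in the course (hypothetical targets).
    intended_emphasis = {
        "Chemical bonding": 0.40,
        "Chemical equilibrium": 0.40,
        "Acid-base chemistry": 0.20,
    }

    total_items = sum(sum(levels.values()) for levels in table_of_specifications.values())
    for topic, levels in table_of_specifications.items():
        share = sum(levels.values()) / total_items
        print(f"{topic}: {share:.0%} of items (target {intended_emphasis[topic]:.0%})")

Comparing the observed share of items to the intended emphasis makes under- or over-represented topics visible before the test is administered.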

Evidence Based on Response Processes

This form of evidence is used to demonstrate that the assessment actually requires participants to engage in the processes the task is intended to elicit. For instance, if an item is designed to measure reading comprehension, it is important to determine whether participants are attempting to comprehend the passages or are instead relying on other test-taking strategies. Instructors can gather evidence based on response processes by analyzing qualitative responses to identify how students arrived at their answers or by asking students how they approached specific questions or problems. Larger-scale testing requires a more systematic interviewing process and often relies on think-aloud protocols.
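
As a hedged sketch (the strategy codes and responses below are invented), an instructor who has coded student explanations or think-aloud transcripts might tally how often each item was answered through the intended process rather than a test-taking shortcut:

    from collections import Counter

    # Hypothetical codes assigned to students' explanations of how they answered
    # two reading items: "comprehension" means the student reasoned from the
    # passage; "keyword_match" and "guess" are test-taking shortcuts.
    coded_responses = {
        "item_1": ["comprehension", "comprehension", "keyword_match", "comprehension", "guess"],
        "item_2": ["keyword_match", "keyword_match", "guess", "keyword_match", "comprehension"],
    }

    for item, codes in coded_responses.items():
        counts = Counter(codes)
        intended_share = counts["comprehension"] / len(codes)
        print(f"{item}: {intended_share:.0%} used the intended process; codes = {dict(counts)}")

    # Items answered mostly through shortcuts (item_2 here) are candidates for
    # revision or for closer review in follow-up interviews.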

Evidence Based on Internal Structure

This form of evidence is used to demonstrate that the relationships among scores on individual test items align with the construct(s) being measured. For example, if an assessment measures both chemical bonding and chemical equilibrium, scores on different chemical bonding items should relate strongly to one another, and scores on different chemical equilibrium items should relate strongly to one another. Instructors can gather evidence based on internal structure by conducting item-level analyses (see Reliability) or by running an exploratory or confirmatory factor analysis to determine how well similar items relate to each other.
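
A minimal sketch of this kind of check, using invented right/wrong item scores for a handful of students, is to inspect the inter-item correlation matrix and see whether items written for the same topic hang together more strongly than items written for different topics; an exploratory or confirmatory factor analysis applies the same logic more formally.

    import pandas as pd

    # Hypothetical right/wrong (1/0) scores for six students; "bond_*" items
    # target chemical bonding and "equi_*" items target chemical equilibrium.
    scores = pd.DataFrame({
        "bond_1": [1, 1, 0, 1, 0, 1],
        "bond_2": [1, 1, 0, 1, 0, 0],
        "equi_1": [0, 1, 1, 0, 1, 1],
        "equi_2": [0, 1, 1, 0, 1, 0],
    })

    # If the internal structure matches the two intended constructs, within-topic
    # correlations should be noticeably stronger than cross-topic correlations.
    print(scores.corr().round(2))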

Evidence Based on Relations to Other Variables

This form of evidence demonstrates that a score measuring a defined construct relates to other scores measuring the same construct (convergent evidence) and does not relate as strongly to scores measuring different constructs (discriminant evidence). For example, a score representing mathematical problem solving on one test should relate strongly to a score representing mathematical problem solving on another test. Similarly, mathematical problem-solving scores should not relate as strongly to scores that represent reading comprehension. Instructors should gather several different types of data about students’ ability or knowledge of a particular construct in order to generate validity evidence based on relations to other variables. When developing a scale or test for educational research purposes, it is important to demonstrate how the scale relates to other established instruments that measure the same or similar constructs.
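
A minimal sketch of convergent and discriminant checks, using invented scores for eight students, is to correlate the new score with an established measure of the same construct and with a measure of a different construct:

    import numpy as np

    # Hypothetical scores for eight students.
    new_math_score   = np.array([55, 62, 70, 48, 80, 66, 73, 59])  # new problem-solving test
    established_math = np.array([58, 60, 75, 50, 82, 64, 70, 61])  # established math test
    reading_comp     = np.array([72, 55, 60, 68, 58, 74, 65, 70])  # reading comprehension test

    # Convergent evidence: the new score should correlate strongly with the
    # established math measure. Discriminant evidence: its correlation with
    # reading comprehension should be noticeably weaker.
    convergent_r   = np.corrcoef(new_math_score, established_math)[0, 1]
    discriminant_r = np.corrcoef(new_math_score, reading_comp)[0, 1]
    print(f"convergent r = {convergent_r:.2f}, discriminant r = {discriminant_r:.2f}")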

Evidence Based on Consequences of Testing

This form of evidence demonstrates the extent to which the consequences of using scores are congruent with the proposed uses of the assessment. An intended consequence of a placement exam score, for example, is appropriate placement into introductory courses so that all students have the best opportunity to succeed; evidence would need to be gathered to show that the scores correspond to success in the course. Unintended consequences, such as decreased motivation or reduced intention to persist in a major, could also occur for students who score poorly on the initial exam. Instructors can gather evidence based on the consequences of testing by examining whether scores on their assessments relate to the intended future outcomes and by monitoring for unintended effects.
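
As a hedged sketch with hypothetical records, one simple check of intended consequences is whether the placement decision (or the score itself) corresponds to success in the course students were placed into; unintended consequences such as reduced motivation generally require survey or interview data rather than scores alone.

    import pandas as pd

    # Hypothetical records: placement exam score, the course each student was
    # placed into, and whether the student passed that course.
    records = pd.DataFrame({
        "placement_score": [35, 42, 58, 61, 74, 80, 49, 67],
        "placed_course":   ["intro", "intro", "intro", "regular",
                            "regular", "regular", "intro", "regular"],
        "passed_course":   [1, 1, 1, 1, 1, 1, 0, 1],
    })

    # Pass rates by placement decision: if placement is working as intended,
    # students should succeed in the course their score placed them into.
    print(records.groupby("placed_course")["passed_course"].mean())

    # The direct score-outcome relationship can also be examined.
    print(records["placement_score"].corr(records["passed_course"]))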

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC.

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1974). Standards for educational and psychological tests and manuals. Washington, DC.

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1966). Standards for educational and psychological tests and manuals. Washington, DC.

Kane, M. (2013). The argument-based approach to validation. School Psychology Review, 42(4), 448–457.

Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Washington, DC: National Council on Measurement in Education and the American Council on Education.

Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Washington, DC: American Council on Education and National Council on Measurement in Education.

Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35(11), 1012–1027.