Yale Center for Teaching and Learning

Developing Reliable Student Assessments

Reliability refers to how well a score represents an individual’s true ability. Reliability is important to ensure that assessments accurately measure student knowledge. Because reliability is a property of the scores rather than of the instrument itself, a test or rubric cannot, strictly speaking, be deemed reliable or unreliable. Reliable scores provide more useful feedback to students about their performance and to instructors about the effectiveness of their teaching. There are many reasons that a score may not perfectly represent an individual’s ability. For instance, test anxiety, distractions in the testing environment, or guessing can cause discrepancies between a score and an individual’s actual ability. While some of these factors cannot be completely eliminated, instructors can improve reliability when designing assessments, grading student work, and analyzing student performance on individual test items or criteria. Several methods are commonly used to estimate reliability.

Examples of Reliability Measures: 

  • Inter-rater – Two separate individuals evaluate and score a subject’s test, essay, or performance, and the scores from the two raters are correlated. The correlation coefficient can be used as an estimate of reliability. However, several other statistics can also be calculated to compare the scores from two raters; for instance, Cohen’s kappa accounts for the amount of agreement that could occur between the two raters purely by chance.
  • Test-Retest – Individuals take the same test on separate occasions, and the two sets of scores are correlated. The correlation coefficient is the estimate of reliability. Because individuals learn from tests, consideration needs to be given to the amount of time between administrations.
  • Parallel Forms – Two equivalent tests (measuring the same concepts, knowledge, skills, abilities, etc.) are given to the same group of individuals, and the scores are correlated. The correlation coefficient is the estimate of reliability. Unfortunately, it can be very difficult to design two truly equivalent tests.
  • Split-Half – One test is divided into two sets of items. An individual’s score on one half of the test is correlated with their score on the other half. However, a test can be split in many different ways (e.g., even versus odd items, first half versus second half), and the choice of split will influence the correlation coefficient.
  • Cronbach’s Alpha – Cronbach’s alpha is generally interpreted as the mean of all possible split-half coefficients. It is the most commonly reported measure of reliability when analyzing Likert-type scales or multiple choice tests. An alpha above .7 is typically considered acceptable. Cronbach’s alpha can be calculated in Excel or any other statistical software package; a minimal sketch follows this list.
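
To make the last two measures concrete, the following is a minimal sketch in Python (the 20-student, 10-item score matrix is simulated purely for illustration, and the function names split_half_correlation and cronbach_alpha are simply descriptive choices) showing how a split-half correlation and Cronbach’s alpha could be computed from a students-by-items table of 0/1 scores. A spreadsheet or any statistical package can be used in the same way.

    import numpy as np

    def split_half_correlation(scores):
        """Correlate each student's total on the odd-numbered items with their total on the even-numbered items."""
        odd_totals = scores[:, 0::2].sum(axis=1)
        even_totals = scores[:, 1::2].sum(axis=1)
        return np.corrcoef(odd_totals, even_totals)[0, 1]

    def cronbach_alpha(scores):
        """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)."""
        k = scores.shape[1]
        item_variances = scores.var(axis=0, ddof=1)
        total_variance = scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Simulated data: 20 students by 10 items, scored 0 (incorrect) or 1 (correct),
    # generated so that stronger students tend to answer more items correctly.
    rng = np.random.default_rng(0)
    ability = rng.normal(size=(20, 1))
    scores = (rng.normal(size=(20, 10)) < ability).astype(float)

    print("Split-half (odd/even) correlation:", round(split_half_correlation(scores), 2))
    print("Cronbach's alpha:", round(cronbach_alpha(scores), 2))

Note that the split-half value depends on how the items are divided, which is one reason Cronbach’s alpha is the more commonly reported statistic.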

Recommendations

Reliability can be increased in a number of ways. If the evaluation is an essay or a performance-based task:

  • Design a rubric – Rubrics help the evaluator(s)/grader(s) focus on the same criteria across all submissions; see the Center’s separate resource on rubric design for more information.
  • Grade item by item – If students complete multiple essays or problem sets, evaluate/grade the first essay/problem for all students before grading the second essay/problem. This allows the evaluator/grader to keep only one set of criteria in mind at a time and minimizes the chance that fatigue or changes in mood differentially affect any one student’s score.
  • Grade anonymously – When possible, do not look at students’ names before evaluating/grading. Every grader/evaluator possesses some biases, which can positively or negatively affect an individual student’s score. For instance, if a student is a hard worker in class, an instructor may be more lenient when grading an essay from that student. Grading anonymously minimizes the effect that many of these biases have on the grading process.
  • Train graders – If multiple graders are being used, it is important to train the graders on how to use the rubric or evaluation/grading criteria. Sample essays or performances should be provided to graders. Additionally, for each essay or problem, a subset of submissions should be independently scored by multiple graders. Inter-rater reliability can be calculated on that subset (a minimal sketch follows this list), and the graders should discuss any discrepancies before grading the rest of the submissions.
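
As one way of checking agreement on such a double-scored subset, here is a minimal sketch in Python (the rubric levels and the two raters’ scores are hypothetical) that computes simple percent agreement alongside Cohen’s kappa, which corrects for the agreement expected to occur by chance.

    from collections import Counter

    def percent_agreement(rater_a, rater_b):
        """Proportion of submissions that the two raters scored identically."""
        matches = sum(a == b for a, b in zip(rater_a, rater_b))
        return matches / len(rater_a)

    def cohens_kappa(rater_a, rater_b):
        """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
        n = len(rater_a)
        observed = percent_agreement(rater_a, rater_b)
        counts_a, counts_b = Counter(rater_a), Counter(rater_b)
        chance = sum((counts_a[c] / n) * (counts_b[c] / n) for c in set(rater_a) | set(rater_b))
        return (observed - chance) / (1 - chance)

    # Hypothetical rubric levels (1-4) assigned by two graders to the same ten essays.
    rater_a = [4, 3, 3, 2, 4, 1, 2, 3, 4, 2]
    rater_b = [4, 3, 2, 2, 4, 1, 3, 3, 4, 2]

    print("Percent agreement:", percent_agreement(rater_a, rater_b))
    print("Cohen's kappa:", round(cohens_kappa(rater_a, rater_b), 2))

Low agreement on the subset is a signal to revisit the rubric or the grader training before the remaining submissions are scored.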

If the evaluation consists of multiple choice tests or Likert-type items:

  • Design the assessment using a table of specifications:
    • A table of specifications outlines the content that is covered in a test or assessment. It typically consists of three main components: first, a list of topics that are covered on the assessment; second, a classification or taxonomy (e.g., Bloom’s taxonomy) that describes the types of questions that are on the exam; and third, the number of questions that correspond to each content area and classification.
Sample Table of Specifications: Using Components of Bloom’s Taxonomy
Topic or Content Area | Multiple choice questions measuring recall | Multiple choice questions measuring application | Multiple choice questions measuring evaluation | Total Number of Questions
Chemical Reactions | Q 1, 6, 7 | Q 12, 14, 17, 19 | Q 21, 24, 26, 29, 30, 35, 38, 39 | 15
Thermodynamics | Q 2, 3, 8, 9 | Q 11, 15, 18 | Q 22, 25, 31 | 10
Chemical Equilibrium | Q 4, 5, 10 | Q 13, 16, 20 | Q 23, 27, 28, 32, 33, 34, 36, 37, 40 | 15
Total Number of Questions | 10 | 10 | 20 | 40
  • A table of specifications also allows subscales to be created when multiple concepts are being tested. For instance, separate reliability coefficients can be calculated for the items that test the first unit and the items that test the second unit (a minimal sketch follows below). In addition, a table of specifications provides detailed feedback to students and the instructor about the content that has been learned.
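
As an illustration, the sketch below (in Python, with simulated student scores; the blueprint dictionary simply restates the item numbers from the sample table above) shows how a table of specifications can be used to slice a score matrix by topic and report per-topic subscores, and how each slice could then feed a separate reliability calculation.

    import numpy as np

    # Blueprint mapping each topic to its item numbers (1-indexed, matching the sample table above).
    blueprint = {
        "Chemical Reactions":   [1, 6, 7, 12, 14, 17, 19, 21, 24, 26, 29, 30, 35, 38, 39],
        "Thermodynamics":       [2, 3, 8, 9, 11, 15, 18, 22, 25, 31],
        "Chemical Equilibrium": [4, 5, 10, 13, 16, 20, 23, 27, 28, 32, 33, 34, 36, 37, 40],
    }

    def topic_slices(scores, blueprint):
        """Split a students-by-items 0/1 score matrix into one sub-matrix per topic."""
        return {topic: scores[:, [q - 1 for q in items]] for topic, items in blueprint.items()}

    def topic_subscores(scores, blueprint):
        """Each student's number of correct answers within each topic, for content-level feedback."""
        return {topic: block.sum(axis=1) for topic, block in topic_slices(scores, blueprint).items()}

    # Simulated 0/1 scores for 25 students on the 40-item exam.
    rng = np.random.default_rng(1)
    scores = (rng.random((25, 40)) < 0.7).astype(float)

    for topic, totals in topic_subscores(scores, blueprint).items():
        print(topic, "- mean subscore:", round(totals.mean(), 1), "out of", len(blueprint[topic]))

    # Each per-topic sub-matrix can also be passed to a Cronbach's alpha function
    # (such as the one sketched earlier) to obtain a separate reliability coefficient per unit.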

  • Conduct item-level diagnostics to improve the test. Please note that some testing software can provide the data described below in the form of a report; a minimal sketch of the same calculations follows this list.
    • Cronbach’s alpha – When calculating Cronbach’s alpha, it is possible to determine which items are lowering reliability (often reported as the alpha that would result if an item were deleted). Those items can then be revised or removed to increase the reliability of the score.
    • Item difficulty – The percentage of students who answered an item correctly. Items that are too difficult negatively impact reliability, and items that are too easy do not detect differences between higher- and lower-performing students.
    • Item discrimination – Examines how well an item discriminates between high-performing and low-performing students. Items that do not perform as expected (that is, when higher-performing students are not more likely than lower-performing students to answer correctly) negatively impact reliability.
    • Distractor analysis – Determine which distractors students (or students at different performance levels) select. Any distractor that is never or rarely selected should be changed; if students are able to eliminate answer choices, they have a higher probability of guessing the correct answer without understanding the content.
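
The following is a minimal sketch in Python of these diagnostics for a 0/1-scored multiple choice test (the answer key, the letter responses, and the function names are hypothetical; many testing platforms report the same statistics automatically).

    import numpy as np
    from collections import Counter

    def cronbach_alpha(scores):
        """Cronbach's alpha, using the same formula as in the earlier sketch."""
        k = scores.shape[1]
        return (k / (k - 1)) * (1 - scores.var(axis=0, ddof=1).sum() / scores.sum(axis=1).var(ddof=1))

    def item_difficulty(scores):
        """Proportion of students answering each item correctly (one value per item)."""
        return scores.mean(axis=0)

    def item_discrimination(scores):
        """Correlation between each item and the total score on the remaining items (item-rest correlation)."""
        return np.array([
            np.corrcoef(scores[:, j], np.delete(scores, j, axis=1).sum(axis=1))[0, 1]
            for j in range(scores.shape[1])
        ])

    def alpha_if_deleted(scores):
        """Cronbach's alpha recomputed with each item removed in turn."""
        return np.array([cronbach_alpha(np.delete(scores, j, axis=1)) for j in range(scores.shape[1])])

    def distractor_counts(responses, item):
        """How often each answer choice was selected for one item; responses holds letters such as 'A'-'D'."""
        return Counter(responses[:, item].tolist())

    # Simulated data: an answer key, each student's selected letter for 20 items, and the
    # resulting 0/1 score matrix (in practice these come from the actual test administration).
    rng = np.random.default_rng(2)
    key = rng.choice(list("ABCD"), size=20)
    responses = rng.choice(list("ABCD"), size=(30, 20))
    scores = (responses == key).astype(float)

    print("Difficulty of item 1:", round(item_difficulty(scores)[0], 2))
    print("Discrimination of item 1:", round(item_discrimination(scores)[0], 2))
    print("Alpha if item 1 were deleted:", round(alpha_if_deleted(scores)[0], 2))
    print("Choices selected on item 1:", distractor_counts(responses, 0))

Items with very low or very high difficulty, near-zero or negative discrimination, an alpha-if-deleted value higher than the overall alpha, or distractors that almost no one selects are the usual candidates for revision.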

References

Cronbach LJ. (1951). Coefficient alpha and the internal structure of tests. Psychometrika 16: 297-334.

Guttman L. (1945). A basis for analyzing test-retest reliability. Psychometrika 10: 255-282.

Gwet KL. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters. Advanced Analytics, LLC.

Malouff J. (2008). Bias in grading. College Teaching 56(3): 191-192.

Murphy KR & Davidshofer CO. (1988). Psychological testing: Principles and applications. Prentice Hall: Englewood Cliffs, NJ.

Osterlind SJ. (2006). Modern measurement: Theory, principles, and applications of mental appraisal. Pearson: Upper Saddle River, NJ.