Test-Retest Reliability – Methods, Formula and Examples

Definition:

Test-retest reliability is a measure used in research and psychometrics to assess the consistency or stability of a measurement instrument over time. It specifically examines whether the same results are obtained when the same individuals or objects are measured on two separate occasions.

In test-retest reliability, the same test or measure is administered to a group of participants on two different occasions, with a certain time interval between the administrations. The scores or measurements obtained from the two administrations are then correlated to determine the extent of agreement or consistency between them.

Also see Reliability

Test-Retest Reliability Methods

There are several methods to assess test-retest reliability, and the choice of method depends on the nature of the measurement instrument and the research context. Here are some common methods for assessing test-retest reliability:

Pearson correlation coefficient:

This method measures the linear relationship between two sets of scores obtained from two administrations of the same measurement instrument. The Pearson correlation coefficient ranges from -1 to 1, where 0 indicates no correlation and 1 indicates a perfect positive correlation.

Intraclass correlation coefficient (ICC):

This method is commonly used for measures with continuous scores, such as scales or questionnaires. It estimates the degree of agreement between two sets of scores obtained from the same participants at different time points. The ICC ranges from 0 to 1, where higher values indicate better test-retest reliability.

Cohen’s kappa:

This method is commonly used for categorical measures, such as ratings or classifications. It measures the degree of agreement between two sets of ratings obtained from the same participants at different time points. Cohen’s kappa ranges from -1 to 1, where values close to 1 indicate good test-retest reliability.
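
As a minimal sketch, kappa for two administrations of a categorical measure can be computed with scikit-learn (the pass/fail ratings below are invented for illustration):

```python
# Agreement between two administrations of a pass/fail classification
# for the same ten participants (hypothetical data).
from sklearn.metrics import cohen_kappa_score

time1 = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
time2 = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

kappa = cohen_kappa_score(time1, time2)
print(f"Cohen's kappa: {kappa:.2f}")
```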

Bland-Altman plot:

This graphical method allows researchers to visually inspect the agreement between two sets of measurements. It plots the difference between each pair of scores against their average, and limits of agreement (conventionally the mean difference ± 1.96 standard deviations of the differences) are drawn to show the range within which most differences are expected to fall.
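
A minimal sketch of a Bland-Altman plot with matplotlib, using invented paired measurements:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical paired measurements from two sessions.
time1 = np.array([12.1, 14.3, 9.8, 11.5, 13.0, 10.2, 15.1, 12.7])
time2 = np.array([11.8, 14.9, 10.1, 11.0, 13.4, 9.9, 15.6, 12.2])

diffs = time1 - time2
means = (time1 + time2) / 2
bias = diffs.mean()
loa = 1.96 * diffs.std(ddof=1)  # half-width of the limits of agreement

plt.scatter(means, diffs)
plt.axhline(bias)                        # mean difference (bias)
plt.axhline(bias + loa, linestyle="--")  # upper limit of agreement
plt.axhline(bias - loa, linestyle="--")  # lower limit of agreement
plt.xlabel("Mean of Time 1 and Time 2")
plt.ylabel("Difference (Time 1 - Time 2)")
plt.title("Bland-Altman plot")
plt.show()
```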

Standard error of measurement (SEM):

This method estimates the amount of error associated with the measurement instrument. It is typically calculated from the standard deviation of the observed scores and the test-retest reliability coefficient (see the formula below), and it reflects how much an individual’s observed score is likely to vary around their true score due to random measurement error.

Test-Retest Reliability Formula

The formula for calculating test-retest reliability depends on the specific statistical method used. Common formulas are as follows:

Pearson correlation coefficient (r)

The Pearson correlation coefficient measures the strength and direction of the linear relationship between two sets of scores obtained from the same participants at different time points.

The formula for calculating the Pearson correlation coefficient is:

r = (Σ(Xi – X̄)(Yi – Ȳ)) / (√(Σ(Xi – X̄)²) * √(Σ(Yi – Ȳ)²))

Where:

  • Xi and Yi are the scores of participant i at Time 1 and Time 2, respectively.
  • X̄ and Ȳ are the means of the scores at Time 1 and Time 2, respectively.
  • Σ denotes summation, summing over all participants.

The resulting value of r ranges from -1 to 1, where 0 indicates no correlation, 1 indicates a perfect positive correlation, and -1 indicates a perfect negative correlation.
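
As a minimal sketch, the formula translates directly into NumPy (the Time 1 and Time 2 scores are invented for illustration):

```python
import numpy as np

x = np.array([10, 12, 9, 15, 11, 13, 8, 14], dtype=float)  # Time 1 scores
y = np.array([11, 12, 10, 14, 12, 13, 9, 15], dtype=float)  # Time 2 scores

# Direct translation of the formula above.
r = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
)
print(f"r = {r:.3f}")

# Sanity check against NumPy's built-in correlation.
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```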

Intraclass correlation coefficient (ICC)

The intraclass correlation coefficient is commonly used for assessing the agreement or consistency between two sets of scores obtained from the same participants at different time points. The ICC can be calculated using various formulas, depending on the specific model chosen (e.g., one-way random effects, two-way random effects, two-way mixed effects).

One common formula, based on a one-way random effects model (often denoted ICC(1,1)), is:

ICC = (MSb – MSw) / (MSb + (k – 1) * MSw)

Where:

  • MSb is the mean square between subjects (variance due to differences between participants).
  • MSw is the mean square within subjects (variance due to differences within participants).
  • k is the number of measurements or time points.

The resulting value of ICC ranges from 0 to 1, where higher values indicate better test-retest reliability.
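
A minimal sketch of this calculation in NumPy, assuming a small invented data set with participants in rows and the two time points in columns:

```python
import numpy as np

# Hypothetical scores: rows are participants, columns are Time 1 and Time 2.
scores = np.array([
    [22.0, 23.5],
    [30.1, 29.4],
    [18.7, 19.2],
    [25.3, 26.0],
    [27.8, 27.1],
])
n, k = scores.shape
grand_mean = scores.mean()
subject_means = scores.mean(axis=1)

# Mean square between subjects: variability of participant means.
ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
# Mean square within subjects: variability of repeated scores around each participant's mean.
ms_within = np.sum((scores - subject_means[:, None]) ** 2) / (n * (k - 1))

icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(f"ICC = {icc:.3f}")
```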

Standard Error of Measurement (SEM)

The Standard Error of Measurement estimates the amount of error associated with individual scores on a measurement instrument. It provides a measure of the precision of individual scores. SEM is calculated from the standard deviation of the observed scores (SD) and the reliability coefficient (rxx) of the instrument, for which the test-retest correlation is often used:

SEM = SD * √(1 – rxx)

The SEM provides an estimate of the range within which an individual’s true score is likely to fall.
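
A minimal sketch with invented numbers, including the conventional 95% band around an observed score:

```python
import math

sd = 10.0    # standard deviation of observed scores (hypothetical)
r_xx = 0.85  # test-retest reliability coefficient (hypothetical)

sem = sd * math.sqrt(1 - r_xx)
print(f"SEM = {sem:.2f}")

# Approximate 95% band around an observed score of 100.
observed = 100.0
print(f"95% band: {observed - 1.96 * sem:.1f} to {observed + 1.96 * sem:.1f}")
```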

Proportional Reduction in Error (PRE)

The Proportional Reduction in Error compares the amount of prediction error with and without using the instrument’s earlier scores. In the test-retest context, it quantifies how much better Time 2 scores can be predicted from Time 1 scores than from the Time 2 mean alone.

The formula for PRE is:

PRE = (E1 – E2) / E1

Where:

  • E1 is the prediction error (e.g., the sum of squared errors) when each Time 2 score is predicted by the overall Time 2 mean.
  • E2 is the prediction error when Time 2 scores are predicted from the Time 1 scores, for example by linear regression.

PRE ranges from 0 to 1, with higher values indicating a greater reduction in error and better test-retest reliability. When linear regression is used, PRE equals the squared Pearson correlation (r²).
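
A minimal sketch using linear regression to predict Time 2 from Time 1 (invented scores), which also illustrates the PRE = r² identity:

```python
import numpy as np

x = np.array([10, 12, 9, 15, 11, 13, 8, 14], dtype=float)  # Time 1 scores
y = np.array([11, 12, 10, 14, 12, 13, 9, 15], dtype=float)  # Time 2 scores

# E1: error when every Time 2 score is predicted by the Time 2 mean.
e1 = np.sum((y - y.mean()) ** 2)

# E2: error after predicting Time 2 from Time 1 by least-squares regression.
slope, intercept = np.polyfit(x, y, 1)
e2 = np.sum((y - (slope * x + intercept)) ** 2)

pre = (e1 - e2) / e1
print(f"PRE = {pre:.3f}")
assert np.isclose(pre, np.corrcoef(x, y)[0, 1] ** 2)  # PRE equals r squared
```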

Confidence Interval for Reliability

It is common to estimate the confidence interval around the obtained reliability coefficient to provide a range of plausible values. The confidence interval accounts for sampling variability and provides a measure of uncertainty around the estimated reliability coefficient.

The formula for calculating the confidence interval depends on the specific method used to estimate reliability. For example, if the intraclass correlation coefficient (ICC) was calculated using a two-way random effects model, the confidence interval can be estimated using statistical software or formulas specific to the ICC.
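
For a Pearson test-retest correlation, one standard approach is the Fisher z-transformation; a minimal sketch with invented values:

```python
import math

r = 0.85  # observed test-retest correlation (hypothetical)
n = 50    # number of participants (hypothetical)

# Fisher z-transform: z is approximately normal with SE = 1 / sqrt(n - 3).
z = math.atanh(r)
se = 1 / math.sqrt(n - 3)

lower = math.tanh(z - 1.96 * se)
upper = math.tanh(z + 1.96 * se)
print(f"95% CI for r: [{lower:.2f}, {upper:.2f}]")
```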

Test-Retest Reliability Examples

Here are a few examples of test-retest reliability in different contexts:

  • Psychological Assessment: Suppose a researcher wants to assess the test-retest reliability of a depression questionnaire. They administer the questionnaire to a group of participants and then readminister it to the same participants after a two-week interval. The researcher calculates the Pearson correlation coefficient between the scores obtained at the two time points. A high correlation coefficient (e.g., r = 0.85) indicates good test-retest reliability, suggesting that the questionnaire consistently measures depression symptoms over time.
  • Physical Measurements: In a study examining the test-retest reliability of a hand grip strength measurement, participants are assessed on two occasions, separated by one week. The researcher records the maximum grip strength in kilograms during each measurement session. The intraclass correlation coefficient (ICC) is then computed to determine the agreement between the two measurements. A high ICC value (e.g., ICC = 0.90) suggests strong test-retest reliability, indicating that the hand grip strength measurement is consistent and stable over time.
  • Educational Assessment: A researcher aims to evaluate the test-retest reliability of a mathematics proficiency test. The test is administered to a group of students, and the same test is readministered to the same group two months later. The researcher calculates Cohen’s kappa to measure the agreement between the students’ classifications (e.g., proficient, not proficient) at the two time points. A high kappa value (e.g., kappa = 0.80) indicates good test-retest reliability, suggesting consistent categorization of students’ mathematics proficiency.
  • Neuroimaging Measures: In a study investigating the test-retest reliability of a functional magnetic resonance imaging (fMRI) task, participants perform a cognitive task during two separate scanning sessions, with a one-week interval. The researcher analyzes the correlation between brain activation patterns elicited by the task across the two sessions. A high correlation coefficient, such as in the spatial patterns of brain activity (e.g., r = 0.75), suggests good test-retest reliability, indicating that the fMRI task consistently captures the neural responses it intends to measure.

When to use Test-Retest Reliability

Test-retest reliability is useful in several situations where researchers and practitioners want to assess the stability and consistency of a measurement instrument over time. Here are some scenarios where test-retest reliability is commonly employed:

  • Psychometric Evaluation: Test-retest reliability is often used in the development and validation of psychological and educational assessment instruments, such as questionnaires, scales, or tests. It helps determine whether the instrument consistently measures the intended construct over time. By administering the instrument to the same group of participants on two separate occasions, researchers can assess the degree of agreement between the measurements and establish the instrument’s stability.
  • Longitudinal Studies: In longitudinal research designs, where data is collected from the same participants at multiple time points, test-retest reliability can be valuable. By assessing the reliability of the measurement instrument across different time intervals, researchers can ensure that the observed changes or differences in scores are not solely due to measurement error. It helps determine whether the observed changes in scores reflect true changes in the construct being measured.
  • Clinical and Diagnostic Assessments: Test-retest reliability is crucial in clinical settings when evaluating the stability of diagnostic assessments or measures used for monitoring treatment progress. By assessing the reliability of measures over time, clinicians can track changes in symptoms or functioning and determine the effectiveness of interventions. Test-retest reliability provides an indication of the consistency of the assessment tool and helps clinicians make reliable and valid interpretations of the results.
  • Program Evaluation: When assessing the effectiveness of an intervention program, researchers often collect data before and after the program implementation. Test-retest reliability can be used to examine the stability of outcome measures, such as self-report surveys or performance assessments, to ensure that observed changes are not solely due to measurement error. It helps determine whether the changes observed in the outcome measures can be attributed to the program’s impact.
  • Quality Control: In industries or settings where measurement instruments are used for quality control purposes, test-retest reliability ensures consistency in the measurement process. By periodically assessing the stability of the measurement instrument, organizations can monitor any potential drift or changes in the measurement system and take corrective actions if necessary.

Importance of Test-Retest Reliability

The importance of test-retest reliability can be understood in the following ways:

  • Consistency and Stability: Test-retest reliability provides an indication of the consistency and stability of the measurement over time. If a test or assessment is reliable, it should produce similar results when administered to the same individuals at different points in time, assuming that the underlying construct being measured has not changed. This reliability ensures that the measurement is not affected by random or transient factors and provides a more accurate representation of the true score.
  • Validity: Test-retest reliability is closely linked to the concept of validity, which refers to the extent to which a test measures what it is intended to measure. If a test is not reliable, it is unlikely to be valid because a measurement cannot be valid if it is not consistent. By establishing test-retest reliability, researchers can gather evidence to support the validity of the measurement tool or assessment.
  • Error Detection: Test-retest reliability helps identify sources of error in measurements. Any factors that introduce inconsistency or variability in test scores over time can be detected through this reliability analysis. It allows researchers to determine whether variations in the scores are due to true changes in the construct being measured or if they are caused by measurement error, such as test administration differences, participant mood, or environmental factors.
  • Monitoring Change: Test-retest reliability is essential when researchers or practitioners aim to assess changes in individuals or groups over time. By using a reliable measurement instrument, they can determine whether observed changes are genuine or if they are influenced by measurement error. This reliability enables the detection of true changes in variables, such as changes in psychological states, academic performance, or treatment effects.
  • Comparative Research: Test-retest reliability is valuable when comparing groups or conditions. If the measurement instrument is not reliable, observed differences between groups may be due to measurement inconsistencies rather than genuine differences. Reliable measurements provide a solid foundation for comparing different populations, interventions, or experimental conditions, allowing researchers to draw more accurate conclusions from their data.

Limitations of Test-Retest Reliability

Limitations of Test-Retest Reliability are as follows:

  • Time Interval: The time interval between the two test administrations is critical in test-retest reliability. If the time interval is too short, individuals may remember their previous responses, leading to artificially inflated correlations. On the other hand, if the time interval is too long, the stability of the construct being measured may change, making the test-retest reliability less meaningful. Determining an appropriate time interval that balances these considerations can be challenging.
  • Practice Effects: Test-retest reliability assumes that there are no practice effects, meaning that individuals’ performance or responses do not improve simply because they have taken the test before. However, in some cases, individuals may become more familiar with the test format or content, leading to improved performance during the retest. This can artificially inflate the reliability estimate and may not accurately reflect the stability of the construct being measured.
  • Carryover Effects: Conversely, test-retest reliability may be affected by carryover effects, where the experience of taking the initial test influences individuals’ responses during the retest. For example, individuals may become fatigued, bored, or unmotivated during the retest, leading to decreased performance or different responses. These carryover effects can undermine the reliability of the measurement.
  • Contextual Factors: Test-retest reliability assumes that the testing conditions and context remain constant across the two administrations. However, contextual factors, such as changes in the environment, instructions, or test administrators, can introduce variability that affects individuals’ responses. Any changes in these factors between test administrations can compromise the reliability estimate.
  • Individual Differences: Test-retest reliability assumes that individuals’ characteristics, traits, or behaviors remain relatively stable over time. However, some constructs may inherently exhibit variability or change due to developmental processes, learning, or life events. For example, attitudes, preferences, or beliefs may evolve over time, affecting individuals’ responses during the retest. The stability of the construct being measured should be considered when interpreting test-retest reliability estimates.
  • Regression to the Mean: Test-retest reliability can be influenced by regression to the mean. This phenomenon suggests that individuals with extreme scores on the first administration are likely to score closer to the average on the second administration because of random fluctuation. As a result, apparent changes among initially extreme scorers may reflect chance rather than real change, which can distort the test-retest correlation and its interpretation.
  • Sample Characteristics: Test-retest reliability estimates can be influenced by the characteristics of the sample being tested. For example, if the sample consists of individuals with high levels of stability in the construct being measured, the reliability estimate may be higher than it would be in a more diverse sample. It is essential to consider the generalizability of the reliability estimate to the intended population or target group.

About the author

Muhammad Hassan

Researcher, Academic Writer, Web developer