“Can a questionnaire be reliable but not valid?”
In my work as an IQA, I’ve often encountered questions like the one above regarding the ‘validity’ of evidence for a learner’s portfolio, so I felt this was a topic worth covering in more detail, particularly for less experienced on-programme assessors.
Validity and reliability are (together with fairness) considered core principles of high-quality assessment. Though these two qualities are often spoken about as a pair, it is important to note that an assessment can be reliable (i.e., have replicable results) without necessarily being valid (i.e., accurately measuring the skills it is intended to measure), but an assessment cannot be valid unless it is also reliable.
Validity is arguably the most important criterion for the quality of a test. The term refers to how well a test measures what it is supposed to measure. Valid assessments produce data that can be used to inform educational decisions at multiple levels, from improving training provision and effectiveness, to evaluating assessors’ impact, to individual learner gains and performance.
However, validity is not a property of the test itself; rather, it is the degree to which certain conclusions drawn from the test results can be considered “appropriate and meaningful.” The validation process involves assembling evidence to support the use and interpretation of test scores, based on the concepts the test is designed to measure, known as constructs.
If a test does not measure all the skills within a construct, the conclusions drawn from the results may not accurately reflect the learner’s knowledge, threatening the test’s overall validity.
To be considered valid, “an assessment should be a good representation of the knowledge and skills it intends to measure,” and to maintain that validity for a wide range of learners, it should also be both “accurate in evaluating students’ abilities” and reliable “across testing contexts and scorers.” (Source)
On a test with high validity the items will be closely linked to the test’s intended focus. For many certification and professional licensure tests this means that the items will be highly related to a specific job or occupation. If a test has poor validity then it does not measure the job-related content and competencies it ought to. When this is the case, there is no justification for using the test results for their intended purpose.
Factors Impacting Validity
Before looking at how validity is measured and at the different types of validity, it is important to understand how external and internal factors can affect it.
A learner’s literacy level can have an impact on the validity of an assessment. For example, if a learner struggles to understand what a question is asking, a test will obviously not be an accurate assessment of what the learner truly knows about a subject. Educators and assessors should, therefore, confirm that an assessment is pitched at the learner’s reading level.
Learner self-efficacy can also impact the validity of an assessment. If learners have low self-efficacy – that is, weak beliefs in their own abilities in the particular area being tested – they will typically perform worse. Their own doubts hinder their ability to accurately demonstrate knowledge and comprehension.
A learner’s anxiety levels are also a factor to be aware of. Learners with high ‘test anxiety’ will underperform due to emotional and physiological factors, which can lead to a misrepresentation of their levels of knowledge and ability.
The types of evidence that can be used to evaluate validity – including the evidence recorded for apprenticeship portfolios – include:
- Evidence of alignment, such as a report from a technically sound independent study documenting alignment between the assessment and its test blueprint, and between the blueprint and the government’s standards;
- Evidence of the validity of using results from the assessments for their primary purposes, such as a discussion of validity in a technical report that states the purposes of the assessments, intended interpretations, and uses of results;
- Evidence that scores are related to external variables as expected, such as reports of analyses that demonstrate positive correlations with 1) external assessments that measure similar constructs, 2) trainers’ judgments of learner readiness, or 3) academic characteristics of test takers.
Types of Validity
There are several ways to estimate the validity of a test, including content validity, concurrent validity, construct validity, and predictive validity, each explained below. The “face validity” of a test is sometimes also mentioned.
While there are several types of validity, the most important type for most certification and licensure programmes is probably that of content validity. Content validity is a logical process where connections between the test items and the job-related tasks are established.
If a thorough test development process was followed, a job analysis was properly conducted, an appropriate set of test specifications were developed, and item writing guidelines were carefully followed, then the content validity of the test is likely to be very high.
Content validity is typically estimated by gathering a group of subject matter experts (SMEs) together to review the test items. Specifically, these SMEs are given the list of content areas specified in the test blueprint, along with the test items intended to be based on each content area. The SMEs are then asked to indicate whether or not they agree that each item is appropriately matched to the content area indicated.
Any items that the SMEs identify as being inadequately matched to the test blueprint, or flawed in any other way, are either revised or dropped from the test.
Assessors should ask: Do assessment items/components adequately and representatively sample the content area(s) to be measured?
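The SME review above can also be given a simple numerical summary. One widely used statistic (not mentioned in every programme, but well established) is Lawshe’s content validity ratio (CVR), which scores each item by how many panellists rate it ‘essential’; the panel size and vote counts below are purely hypothetical:

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's CVR = (n_e - N/2) / (N/2), ranging from -1 to +1.

    n_essential: SMEs who rated the item 'essential'
    n_experts:   total SMEs on the panel
    """
    half = n_experts / 2
    return (n_essential - half) / half

# Hypothetical panel of 10 SMEs reviewing three items:
essential_votes = {"item_1": 9, "item_2": 5, "item_3": 2}
for item, votes in essential_votes.items():
    cvr = content_validity_ratio(votes, 10)
    verdict = "keep" if cvr > 0 else "revise or drop"
    print(f"{item}: CVR = {cvr:+.1f} ({verdict})")
```

Items whose CVR falls at or below zero (fewer than half the panel rating them essential) are natural candidates for the revise-or-drop step described above.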
Concurrent validity measures how well a new test compares to a well-established test. It can also refer to the practice of testing two groups at the same time, or asking two different groups of people to take the same test.
Once the tests have been scored, the relationship is estimated between the examinees’ known status as either masters or non-masters and their classification as masters or non-masters (i.e., pass or fail) based on the test. This type of validity provides evidence that the test is classifying examinees correctly. The stronger the correlation is, the greater the concurrent validity of the test is.
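As a rough sketch of that correlation, the agreement between examinees’ known status and the test’s pass/fail classification can be summarised with a phi coefficient computed from a 2×2 table; all counts below are invented for illustration:

```python
import math

def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi coefficient for a 2x2 classification table.

    a: known masters who passed       b: known masters who failed
    c: known non-masters who passed   d: known non-masters who failed
    """
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denominator

# Hypothetical results for 40 examinees of known status:
phi = phi_coefficient(a=18, b=2, c=4, d=16)
print(f"phi = {phi:.2f}")  # values nearer +1 indicate stronger concurrent validity
```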
A construct is a hypothetical concept, used most often in the field of psychology, that forms part of the theories attempting to explain human behaviour, e.g. intelligence and creativity. This type of validity tries to answer the question: “How can the test score be explained psychologically?”
Construct validity consists of obtaining evidence to support whether the observed behaviours in a test are (some) indicators of the construct. The validation process is one of continuous reformulation and refinement, because a construct can never be fully demonstrated.
Example construct validation process when designing an assessment:
- Based on the theory held at the time of the test, the examiner deduces certain hypotheses about the expected behaviour of people who get different test scores.
- Next, they gather data that confirms or refutes those hypotheses.
- Taking into account the gathered data, they decide whether the theory adequately explains the results. If that isn’t the case, they review the theory and repeat the process until they get a more accurate explanation. (Adapted from here.)
Assessors should ask: Do assessments and the assessment system measure the content they purport to measure?
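One minimal, purely illustrative sketch of the hypothesis-testing loop above is a ‘known-groups’ comparison: if the construct is real, a group the theory expects to score highly should in fact outscore a group it expects to score low. The group names and scores here are invented:

```python
from statistics import mean

def known_groups_check(expected_high: list[float], expected_low: list[float]) -> bool:
    """Return True when the observed means support the theory's
    prediction that expected_high should outscore expected_low."""
    return mean(expected_high) > mean(expected_low)

# Hypothetical scores: experienced practitioners vs novices on the same test
experienced = [78, 85, 90, 82, 88]
novices = [55, 62, 70, 58, 65]
if known_groups_check(experienced, novices):
    print("Data consistent with the construct hypothesis")
else:
    print("Hypothesis not supported - revise the theory and retest")
```

If the check fails, the loop in the steps above continues: the theory (or the test) is revised and the data gathered again.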
Another statistical approach to validity is predictive validity. This approach is similar to concurrent validity, in that it measures the relationship between examinees’ performances on the test and their actual status as masters or non-masters. However, with predictive validity, it is the relationship of test scores to an examinee’s future performance – whether or not they have mastery of the content – that is estimated. In other words, predictive validity considers the question, “How well does the test predict examinees’ future status as masters or non-masters?”
For this type of validity, the correlation that is computed is between the examinees’ classifications as master or non-master based on the test and their later performance, perhaps on the job. This type of validity is especially useful for test purposes such as selection or admissions.
Assessors should ask: How well do assessment instruments predict how well candidates will do in future situations?
Like content validity, face validity is determined by a review of the items and not through the use of statistical analyses. Unlike content validity, face validity is not investigated through formal procedures and is not determined by subject matter experts. Instead, anyone who looks over the test, including examinees/candidates and other stakeholders, may develop an informal opinion as to whether or not the test is measuring what it is supposed to measure.
While it is clearly of some value to have the test appear to be valid, face validity alone is insufficient for establishing that the test is measuring what it claims to measure. A well developed examination programme will include formal studies into other, more substantive types of validity.
In summary, the validity of a test is the most critical indicator of test quality because, without sufficient validity, test scores have no meaning. The evidence you collect and document about the validity of your test is also your best legal defence should the examination programme ever be challenged in a court of law.
While there are several ways to estimate validity, for many certification and professional examination programmes the most important type of validity to establish is content validity.