A Comprehensive Guide to Select Assessments for Evaluating Disorders and Disabilities


Whether you’re a graduate student or a clinician with decades of experience, you’ve faced the decision of which assessments to use to evaluate a client’s presenting concerns.

Why would you evaluate an assessment in the first place?

While you use certain assessments at the recommendation of trusted colleagues or perhaps as holdovers from your graduate school days, there are several reasons why you may have to evaluate unfamiliar assessments, including:

  1. The assessment you currently use to evaluate a certain domain suffers from poor psychometric properties, including faulty classification (e.g., overdiagnosing, underdiagnosing)
  2. The assessment you currently use is difficult to administer, score, or interpret
  3. A completely new or revised assessment has been published for your area of practice and you want to determine whether it will be a good fit for the population you serve
  4. You’re evaluating an individual with a presenting problem for which you currently do not have an assessment

Considerations for test selection

Regardless of your reason for evaluating an assessment, you’ll want a trusted framework. That way, you’ll be able to extract the most important information from the assessment manual without spending hours slogging through details that are less pertinent to you.

Although each of the following considerations warrants its own article, together they will help you decide whether a given assessment is the right fit for you and your population:

  1. Validity

The basic question: Does this assessment actually measure what I intend to evaluate?

Typically, published assessments will include some reference to the construct they measure in their title. For example, the Social Responsiveness Scale, Second Edition (SRS-2) measures—you guessed it—social responsiveness.

However, there is often much more to a given assessment than meets the eye. Specifically, an assessment may measure multiple related constructs, or various areas within a construct.

Information about what constructs an assessment measures is usually discussed in the introductory chapter of its manual.

That said, what a publisher claims an assessment measures should be evaluated with greater scrutiny.

In other words, data contained in the manual should offer support that the assessment measures what the publisher says it does, and doesn’t measure what it’s not supposed to. Collectively, this is referred to as the test’s validity.

Now, discussing validity opens a whole different can of worms because there are multiple aspects to it, including:

  • Construct validity: The degree to which an assessment relates to an underlying theory
    • Convergent validity: Ideally, scores from one assessment should be associated with scores from another assessment measuring the same construct.
      • Example: Scores from one assessment of adaptive behavior should be associated with scores from a different assessment of adaptive behavior.
    • Divergent (or discriminant) validity: Ideally, scores from one assessment should not be associated with unrelated constructs.
      • Example: An assessment of attention-deficit/hyperactivity disorder (ADHD) should not be strongly associated with an assessment of hearing.
  • Content validity: An examination of whether an assessment includes appropriate content to measure the intended construct
    • Face validity: Assessment items should appear to measure what they intend to.
      • Example: Items in an assessment intended to measure autism spectrum disorder (ASD) should appear to measure social communication and restricted and repetitive behaviors.
  • Criterion validity: The degree to which an assessment is associated with established measures or outcomes (i.e., the criteria)
    • Concurrent validity: Scores from an assessment should be associated with scores from another assessment of the same construct collected at the same time.
      • Example: An assessment of reading fluency and comprehension should yield scores that are highly associated with the student’s current reading grades.
    • Predictive validity: Scores from an assessment should be associated with scores from another assessment of the same construct collected in the future.
      • Example: Scores from an assessment of reading fluency and comprehension should predict (i.e., be associated with) future reading grades.

Where information on validity can typically be found in a manual: The Psychometric Properties chapter
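Convergent and divergent validity are typically reported in manuals as correlations between score sets. The sketch below shows the basic arithmetic in Python using invented standard scores; none of these numbers come from any published assessment.

```python
import numpy as np

# Hypothetical standard scores for 8 examinees on three measures.
# All values are illustrative, not drawn from any real assessment.
adaptive_a = np.array([72, 85, 90, 101, 98, 110, 88, 95])    # adaptive behavior, test A
adaptive_b = np.array([75, 82, 94, 99, 101, 107, 85, 97])    # adaptive behavior, test B
hearing    = np.array([100, 96, 104, 99, 101, 97, 103, 98])  # unrelated construct

def pearson_r(x, y):
    """Pearson correlation coefficient between two score arrays."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

convergent = pearson_r(adaptive_a, adaptive_b)  # same construct: expect a high r
divergent  = pearson_r(adaptive_a, hearing)     # unrelated construct: expect r near zero

print(f"Convergent validity r = {convergent:.2f}")
print(f"Divergent validity  r = {divergent:.2f}")
```

With these made-up data, the two adaptive behavior measures correlate strongly while the hearing measure does not, which is the pattern a manual should demonstrate.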

  2. Reliability

The basic question: Will this assessment give me consistent results?

Imagine you visit a local Italian restaurant for dinner. Some days, your preferred pasta dish tastes like the best thing you’ve ever eaten. Other days, you leave the restaurant wishing you’d stayed home and cooked for yourself because the pasta wasn’t very good.

If the restaurant consistently produced incredible pasta, it would be reliable. On the other hand, an inconsistent restaurant would not be reliable.

Of course, your chosen assessments should be reliable. Specifically, you’ll want to know that an assessment will produce the same result (within a small margin of error) each time you administer it. This is referred to as test-retest reliability.

However, there is more than one aspect of reliability to evaluate. To explain the other most commonly discussed aspect, let’s return to the Italian restaurant example.

Suppose, after deciding not to frequent the first Italian restaurant, you begin dining at another. Each time you visit, the pasta tastes absolutely perfect to your palate.

Naturally, you invite your best friend to join you for a future visit. Unfortunately, your friend disagrees about the food’s quality. This lack of agreement between you and your friend reflects poor interrater reliability (a.k.a. interrater agreement).

With any assessment, you’ll want to determine the degree to which scores differ based on who administered the assessment, or who provided ratings.

In the case of different clinicians administering the same assessment, the higher the interrater reliability coefficient, the better. On the other hand, when it comes to different raters (e.g., parent vs. teacher), scores naturally differ.

However, variable scores are not necessarily a bad thing, as different raters contribute contextual specificity. In other words, reported behaviors differ less due to interrater disagreement and more due to the different contexts in which raters observe the behaviors (e.g., home vs. school).

Where information on reliability can typically be found in a manual: The Psychometric Properties chapter
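Both forms of reliability come down to agreement statistics: test-retest reliability is often reported as a correlation between two administrations, and agreement between two raters on categorical judgments is commonly summarized with Cohen’s kappa, which corrects raw agreement for chance. A minimal sketch with invented scores and ratings:

```python
import numpy as np

def pearson_r(x, y):
    """Test-retest reliability: correlation between two administrations."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: chance-corrected agreement between two raters'
    categorical ratings."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    observed = np.mean(r1 == r2)  # proportion of exact agreement
    # Agreement expected by chance, from each rater's marginal proportions
    expected = sum(np.mean(r1 == c) * np.mean(r2 == c)
                   for c in np.union1d(r1, r2))
    return (observed - expected) / (1 - expected)

# Hypothetical scores from two administrations of the same assessment
time1 = [85, 90, 78, 102, 95, 110, 88, 99]
time2 = [88, 91, 80, 100, 97, 108, 85, 101]
print(f"Test-retest r = {pearson_r(time1, time2):.2f}")

# Hypothetical binary "concern present" ratings from two raters
parent  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
teacher = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(f"Cohen's kappa = {cohens_kappa(parent, teacher):.2f}")
```

Note that a modest kappa between a parent and a teacher is not automatically a flaw in the instrument, for the contextual reasons discussed above.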

  3. Sampling Method

The basic question: Has this test been administered to enough individuals in the population I serve?

The answer, of course, is much more complex.

The assessment’s standardization sample should contain roughly the same percentage of individuals from various groups as those groups are represented in the population. Therefore, at minimum, test publishers aim to collect data from a nationally representative sample with regard to:

  • Age
  • Gender
  • Race/ethnicity
  • Socioeconomic status (SES; typically measured as the highest educational degree obtained by either parent)
  • Region (Northeast, Midwest, South, West)
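To see what “nationally representative” means in practice, here is an illustrative check of one stratification variable, comparing a hypothetical sample’s regional breakdown against target population percentages. The labels follow U.S. Census region conventions, and all counts and percentages are placeholders, not actual census figures.

```python
# Hypothetical target percentages (e.g., census-based) and sample counts
# for one stratification variable. All numbers are illustrative.
population_pct = {"Northeast": 17.0, "Midwest": 21.0, "South": 38.0, "West": 24.0}
sample_counts  = {"Northeast": 170, "Midwest": 205, "South": 385, "West": 240}

n = sum(sample_counts.values())
for region, target in population_pct.items():
    actual = 100 * sample_counts[region] / n
    # A well-stratified sample keeps each difference small
    print(f"{region:9s} target {target:5.1f}%  sample {actual:5.1f}%  diff {actual - target:+.1f}%")
```

Manuals usually present this comparison as a table for each stratification variable, often crossed with age band.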

Once the data are collected, test publishers aim to demonstrate that there are no meaningful (i.e., statistically significant) score differences where none are expected. However, if there are score differences between, say, males and females, steps can be taken to mitigate them, such as creating separate norms.
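Separate norms work by referencing each examinee’s raw score against their own group’s distribution before converting to a standard score. A minimal sketch, with invented group means and standard deviations (the group labels and values are hypothetical, not from any published norm table):

```python
# Hypothetical raw-score norms by group; all values are illustrative.
norms = {
    "male":   {"mean": 48.0, "sd": 9.0},
    "female": {"mean": 52.0, "sd": 8.0},
}

def standard_score(raw, group, norms, mean=100, sd=15):
    """Convert a raw score to a standard score (mean 100, SD 15)
    using the examinee's own group norms."""
    ref = norms[group]
    z = (raw - ref["mean"]) / ref["sd"]  # relative standing within the group
    return round(mean + sd * z)

# The same raw score maps to different standard scores under separate norms,
# because each group's distribution differs.
print(standard_score(57, "male", norms))
print(standard_score(57, "female", norms))
```

Here a raw score of 57 yields a higher standard score under the male norms, because that group’s (invented) mean is lower; this is exactly the effect separate norms are designed to produce.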

Of course, these five points cannot capture the full diversity of individuals (racial, cultural, ability level, etc.) with whom the assessment will be used. For example, there is incredible racial and cultural diversity within the “Asian/Pacific Islander” subgroup, and it would be impossible to account for this level of heterogeneity.

Still, there are good reasons to collect additional data from individuals who belong to certain special populations to demonstrate the assessment’s validity with those groups.

For example, if one primary goal of an adaptive behavior assessment is to identify individuals who may have intellectual disability (ID), it’s imperative that the publisher collect data from individuals with substantiated ID to demonstrate the assessment’s effectiveness in discriminating between individuals with and without the condition.

This same approach to collecting validity samples transcends clinical groups and may be applied to individuals with different racial backgrounds, as well as levels of language proficiency, hearing ability, and many others.
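An assessment’s ability to discriminate between a clinical group and a comparison group is usually summarized with sensitivity and specificity at a given cutoff score. The sketch below uses entirely invented scores, group labels, and cutoff to show how those two statistics are computed:

```python
import numpy as np

# Hypothetical: does a low adaptive behavior score flag examinees with
# substantiated ID (1) versus those without (0)? All data are illustrative.
has_id = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
scores = np.array([62, 68, 71, 59, 78, 96, 88, 101, 73, 92])
cutoff = 75  # flag scores at or below the cutoff (invented value)

flagged = scores <= cutoff
sensitivity = np.mean(flagged[has_id == 1])   # flagged ID cases / all ID cases
specificity = np.mean(~flagged[has_id == 0])  # unflagged non-ID cases / all non-ID cases
print(f"Sensitivity = {sensitivity:.2f}, Specificity = {specificity:.2f}")
```

Low sensitivity corresponds to underdiagnosing and low specificity to overdiagnosing, the faulty classification problem mentioned at the start of this article; manuals often report both across several candidate cutoffs.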

Where information on sampling can typically be found in a manual: The Development and Standardization chapter

  4. Functionality

The basic question: Is this assessment easy to use, and does it offer the resources I need to make my job and life easier?

This area is the most difficult to evaluate objectively because much of it depends on the clinician’s needs and preferences.

That said, there are several important considerations when evaluating an assessment’s functionality, including:

  • Administration
    • Is the assessment easily administered by the clinician?
    • Is the assessment easily understood and completed by the rater?
    • Is the assessment available in different languages?
    • Can the test be effectively administered to individuals with special needs, such as deafness or blindness?
  • Scoring
    • What scores (i.e., information) do I actually get from the assessment?
    • Does scoring take a long time?
    • Is the scoring process clear?
    • Do I have to score by hand, or are there electronic scoring options?
  • Reporting
    • Does the assessment offer score interpretation?
    • Is there any way to monitor an individual’s progress?
    • Can I empirically compare different raters’ scores?
  • Intervention Resources
    • Does the assessment include specific intervention guidance for areas of concern?

Where information on functionality can typically be found in a manual: The Introduction, Administration and Scoring, and Interpretation and Intervention chapters.

For more information regarding clinical assessments, visit https://www.wpspublish.com/

Content provided by: Shirag Shemmassian, Ph.D.