PSY 525

Complications in the field of psychology

Constructs are not well defined nor are they directly observable (intelligence)

Compare to the problem of measuring something like brain size (assuming that it has some relation to intelligence)

How would this be done? Similar issues arise with every construct defined in the DSM (note the lack of agreement even between different versions of the DSM) and in other diagnostic criteria such as the International Classification of Diseases-10 (ICD-10)

Why should we assess?

What do we gain? What is the cost? How should it be done? What should be assessed?

First assignment is a 2-page paper on the first organizing question (think about this)

The functional role of assessment for patients

To provide a functional analysis of the patient (“what” they can and cannot do, with less emphasis on “why”)

To direct treatment (one must know what is wrong in order to select an intervention)

What interventions/therapies have you learned (or are you learning), and what are the indicators for implementing those interventions?

Your assessments count!

See Exhibit 1-1 (pp. 2-3): Daniel Hoffman v. Board of Education of the City of New York

Reports are used by mental health workers, administrators, courts, etc.

- Words can be misleading (be clear)
- IQs can change
- Different tests may provide different IQs
- Base decisions on multiple tests
- Use appropriate tests
- Review previous findings/testing

Types of Assessment (pp. 4-5)

- Screening
- Focused
- Diagnostic
- Counseling and Rehabilitation
- Progress Evaluation
- Problem-solving

Four Pillars of Assessment

1. norm-referenced tests*
2. interviews
3. observations
4. informal assessment procedures

*Tests: a) same item content, b) same administration procedure, and c) same scoring criteria (i.e., must be standardized to be considered tests)

Steps in the assessment process

1. Review referral
2. Decide whether to accept it
3. Obtain relevant background info
4. Consider influence of relevant others
5. Observe client in multiple settings
6. Select/administer appropriate test battery
7. Interpret the assessment
8. Develop/select intervention strategies
9. Write report
10. Meet with examinee
11. Follow up and re-evaluate (see Dawes)

Clinical assessment/judgment (Dawes et al., 1989)

Although the literature is replete with criticisms of standardized assessments, how does the more informal version (i.e., clinical judgment) do relative to actuarial models?

How is your clinical judgment? Do you expect that it will improve with training? How would you stack up compared to someone with no training who is just given instructions to follow? See Dawes et al., 1989

Assessment patterns – what is assessed (Lubin et al., 1985)

Most commonly used tests have varied over time and setting

Today, a wide variety of tests are employed, representing many diverse perspectives (from behavioral to psychodynamic)

Compare educational vs. inpatient vs. counseling settings

The tests employed are driven by the referral

Psychometrics

Teach individuals about 5 tests, or teach them how to evaluate tests? This is the familiar choice between feeding someone a fish and teaching them to fish.

Psychometrics allows for an understanding of what makes a test effective, how to evaluate tests, how to create good ones, and how to extend this process to other methods of evaluation

This is where science (the process of doing research) and clinical work overlap

Scaling

- Categorical (nominal) – named categories
- Ordinal – order, but unequal intervals
- Interval – equal intervals but no true 0 point
- Ratio – equal intervals and a true 0 point

Interval and ratio scales are the only scales that technically allow for the calculation of a mean, SD, and most parametric statistics

The type of scale will dictate the type of statistics that can be used

E.g., a nominal scale should only use the mode as a measure of central tendency
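As a rough sketch of how scale type constrains the choice of statistic, the lookup below pairs each scale with the measures of central tendency conventionally considered permissible. The names `PERMITTED` and `central_tendency` and the sample data are illustrative assumptions, not part of the course materials.

```python
import statistics

# Measures of central tendency conventionally permitted at each scale level.
PERMITTED = {
    "nominal": ["mode"],
    "ordinal": ["mode", "median"],
    "interval": ["mode", "median", "mean"],
    "ratio": ["mode", "median", "mean"],
}


def central_tendency(values, scale):
    """Return only the summaries that are defensible for the given scale."""
    functions = {
        "mode": statistics.mode,
        "median": statistics.median,
        "mean": statistics.fmean,
    }
    return {name: functions[name](values) for name in PERMITTED[scale]}


# Hypothetical data: diagnostic labels (nominal) and test scores (interval).
print(central_tendency(["MDD", "GAD", "MDD", "PTSD"], "nominal"))
print(central_tendency([85, 100, 100, 115, 130], "interval"))
```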

Tools of the trade

You must be very familiar (for this class and, ultimately, the licensing exam) with the following concepts – they will be reviewed in greater detail in the readings

- Measures of central tendency – mean, mode, median: how to calculate them, the strengths and weaknesses of each, when to use them
- Measures of variability – SD & various ranges: how to calculate them, the strengths and weaknesses of each, when to use them
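The measures named above can be computed with Python's standard `statistics` module. This is a minimal sketch on made-up scores (the function name `describe` and the data are assumptions); the single extreme score is included to show the mean being pulled away from the median.

```python
import statistics


def describe(scores):
    """Summarize central tendency and variability for a list of scores."""
    return {
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),
        "sd": statistics.stdev(scores),  # sample SD (n - 1 in the denominator)
        "range": max(scores) - min(scores),
    }


# Hypothetical scores with one extreme value (180).
summary = describe([85, 90, 100, 100, 105, 110, 180])
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this example the outlier raises the mean well above the median and inflates both the SD and the range, which is one reason the median is often preferred for skewed distributions.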

Tools – continued

How do the measures of central tendency and variability relate to one another? Are there some that shouldn’t be used together?

Understanding the normal curve and probability theory (see overhead of percentages). This information is necessary in order to interpret any assessment results

Why is variability important? (between, not within, individuals)
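The normal-curve percentages can be reproduced with `statistics.NormalDist`. The sketch below is illustrative only; the helper names are assumptions, and the mean of 100 and SD of 15 are assumed because that metric is commonly used for IQ-type scores, not because the notes specify it.

```python
from statistics import NormalDist


def percent_within(k):
    """Percentage of a normal distribution lying within k SDs of the mean."""
    standard = NormalDist()  # mean 0, SD 1
    return 100 * (standard.cdf(k) - standard.cdf(-k))


def percentile_rank(score, mean=100, sd=15):
    """Percentage of the norm group expected to score at or below `score`."""
    return 100 * NormalDist(mean, sd).cdf(score)


for k in (1, 2, 3):
    print(f"within {k} SD: {percent_within(k):.2f}%")
print(f"percentile rank of 115: {percentile_rank(115):.1f}")
```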

Reliability

Reliability – consistency between raters (see Cohen’s kappa), between parallel versions of the same test, within the one test (split-half, Cronbach’s alpha), & from one administration to another (test-retest)
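Inter-rater consistency can be illustrated with Cohen's kappa, which corrects observed agreement for the agreement expected by chance. The sketch below assumes two raters assigning one category per case; the function name and ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each assign one category per case."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of cases")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    # Undefined (division by zero) if both raters use a single identical category.
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten cases.
a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
b = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
print(round(cohens_kappa(a, b), 3))
```

Here the raters agree on 80% of cases, but because 52% agreement would be expected by chance, kappa is noticeably lower than the raw agreement rate.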

How does this relate to standardization? How does this relate to variability? How does this relate to validity? How does this relate to the accuracy of measurement?
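For internal consistency, Cronbach's alpha compares the sum of the item variances with the variance of the total score. The following is a minimal sketch using the standard formula; the function name and item data are made up for illustration.

```python
import statistics


def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding the same respondents' scores."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [statistics.variance(item) for item in item_scores]
    # Total score per respondent = sum across items.
    totals = [sum(person) for person in zip(*item_scores)]
    total_variance = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: three items answered by five respondents.
items = [
    [2, 4, 3, 5, 1],
    [3, 5, 3, 4, 2],
    [2, 5, 4, 5, 1],
]
print(round(cronbach_alpha(items), 3))
```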

Standard error of measurement (SEM) = SD × √(1 – the test’s reliability)

Possible range of SEM is 0 (perfect reliability) to the test’s SD (zero reliability)

The smaller the SEM the better? Why?
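The SEM formula can be worked through directly. The sketch below uses assumed values (SD = 15, reliability = .96), and the 95% band around the observed score is a common convention added here for illustration rather than something stated in the notes.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def confidence_band(observed, sd, reliability, z=1.96):
    """Approximate band around an observed score (z = 1.96 for about 95%)."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem


# Assumed example values: SD = 15 and reliability = .96.
print(standard_error_of_measurement(15, 0.96))
print(confidence_band(100, 15, 0.96))
```

With these values the SEM is about 3 points, so an observed score of 100 is best read as a band of roughly 94 to 106 rather than as a single exact number.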

Reliability and error

Rating errors

Constant (leniency, severity, tendency to the mean), halo effects, contrast (with previous subject or oneself), proximity (an item’s location on the printed page can result in ratings similar to nearby items), most-recent-performance, and/or inadequate information errors

These can be minimized with more raters, exact instructions, intense training, frequent evaluation and recalibration

Reliability and error (p. 2)

Scale calibration

More items are typically needed to achieve high reliability, but there are exceptions

Guttman approach involves ordering items in terms of their level of difficulty (ascending)

How does one determine level of difficulty?

This approach assumes that once a specified number of items are missed, the more difficult items that follow will also be missed, so there is no need to administer them

Cost of this approach? Problems?
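The discontinue logic described above can be sketched as a simple loop. Two assumptions are made here that the notes do not state: the misses must be consecutive, and the cutoff is three. The function name and response pattern are also hypothetical.

```python
def administer(responses_by_difficulty, discontinue_after=3):
    """Score items ordered easiest to hardest, stopping after a run of misses.

    responses_by_difficulty: list of booleans (True = item passed).
    Returns (raw_score, items_administered).
    """
    raw_score = 0
    administered = 0
    consecutive_misses = 0
    for passed in responses_by_difficulty:
        administered += 1
        if passed:
            raw_score += 1
            consecutive_misses = 0
        else:
            consecutive_misses += 1
            if consecutive_misses >= discontinue_after:
                break  # remaining (harder) items are assumed to be failed
    return raw_score, administered


# Hypothetical examinee who would have passed the last two items.
pattern = [True, True, True, False, True, False, False, False, True, True]
print(administer(pattern))          # score with the discontinue rule
print(sum(pattern), len(pattern))   # score if every item were given
```

The example shows one cost of the approach: testing stops after eight items with a raw score of 4, although this examinee would have earned 6 had all ten items been administered.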

Reliability and error (p. 3)

Coefficient of determination, or R-squared, is used to determine the proportion of variance in one variable that can be accounted for by a second variable (the predictor)
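With a single predictor, R-squared is simply the squared Pearson correlation. A minimal sketch on made-up data follows; the function names and values are illustrative assumptions.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def coefficient_of_determination(x, y):
    """Proportion of variance in y accounted for by x (single predictor)."""
    return pearson_r(x, y) ** 2


# Hypothetical predictor and criterion scores.
predictor = [1, 2, 3, 4, 5]
criterion = [2, 4, 5, 4, 5]
print(round(coefficient_of_determination(predictor, criterion), 2))
```

A correlation of about .77 here accounts for 60% of the variance, a reminder that even fairly strong correlations leave a good deal unexplained.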

Criterion contamination – when one knows information that makes it impossible to do a fair test of criterion validity (e.g., race of the skulls). Must conduct blind ratings

Reliability and error (p. 4)

Base rates represent an extremely important source of information and are often ignored (e.g., Rosenhan’s famous study of pseudopatients who claimed to hear voices and were admitted to psychiatric hospitals)

Why is it easier to predict behaviors or outcomes that occur at a base rate near 50%?
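One way to approach the base-rate question is to compare a test's accuracy with the accuracy of simply guessing the more common outcome. The sketch below assumes a hypothetical test with sensitivity and specificity of .90; the function names are illustrative.

```python
def positive_predictive_value(base_rate, sensitivity, specificity):
    """Probability that a positive result is a true positive (Bayes' rule)."""
    true_positives = base_rate * sensitivity
    false_positives = (1 - base_rate) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


def accuracy_of_guessing_majority(base_rate):
    """Hit rate from always predicting the more common outcome."""
    return max(base_rate, 1 - base_rate)


# Hypothetical test: sensitivity = specificity = .90, so it is 90% accurate overall.
for base_rate in (0.50, 0.10, 0.01):
    ppv = positive_predictive_value(base_rate, 0.90, 0.90)
    guess = accuracy_of_guessing_majority(base_rate)
    print(f"base rate {base_rate:.2f}: PPV = {ppv:.2f}, majority guess = {guess:.2f}")
```

At a 50% base rate the test clearly beats guessing (.90 vs. .50). At a 1% base rate, always predicting "no" is right 99% of the time, and most positive results from the test are false positives.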

Reliability and error (p. 5)

All measurement represents an estimate of whatever is being assessed. Therefore statistics are needed to help make such estimates (inferences)

Statistical power is crucial to decision making

- Alpha = Type I error, or the probability of rejecting the null when it is true
- Beta = Type II error, or the probability of failing to reject the null when it is in fact false
- Power = 1 – beta, or the probability of correctly rejecting a false null

Parametric vs. nonparametric (few or no distributional assumptions for the data)
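The relation among alpha, beta, and power can be made concrete with a simple case. The sketch below assumes a one-sided z-test with a known SD and a standardized effect size; this is a textbook simplification chosen for illustration, and the function name and example values are assumptions.

```python
import math
from statistics import NormalDist


def power_one_sided_z(effect_size, n, alpha=0.05):
    """Power of a one-sided z-test for a standardized effect size and sample size n."""
    standard = NormalDist()
    critical = standard.inv_cdf(1 - alpha)  # cutoff that fixes the Type I error rate
    return 1 - standard.cdf(critical - effect_size * math.sqrt(n))


# Hypothetical medium effect (d = 0.5) at several sample sizes.
for n in (10, 25, 50):
    power = power_one_sided_z(0.5, n)
    print(f"n = {n}: power = {power:.2f}, beta = {1 - power:.2f}")
```

When the true effect is zero, the function returns alpha itself, which is the Type I error rate by definition.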

Minimizing error

Standardization – refers to the consistency in applying methods

- Implications for testing
- Costs of violating standardization (weigh such a decision very carefully, as there are major costs)

Use of proper norms – when can the norm group deviate from those to whom it is applied? What constitutes an effective norm group?

Systematic and random error

- Difficulties in detection and correction

An applied look at the SEM

In November of 2000, we tried to elect a president.

What happens when the margin of victory is smaller than the margin of error (SEM) in counting (the latter being approx 1 in 7,000)?

Impossible to ever know who really won

Issue of what constitutes a “vote” (removal of chad, depression of chad, intent to vote, etc.)

What margin of victory is a real victory? (significant difference is determined by the standard error of measurement)
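The comparison between margin of victory and counting error can be sketched with simple arithmetic. Only the 1-in-7,000 error rate comes from the notes; the vote totals and margins are hypothetical, and treating the expected number of miscounted ballots as the margin of error is a deliberate simplification.

```python
def margin_exceeds_counting_error(total_votes, margin, error_rate=1 / 7000):
    """Compare a margin of victory with the expected number of miscounted ballots.

    Returns (margin_is_larger, expected_miscounts).
    """
    expected_miscounts = total_votes * error_rate
    return margin > expected_miscounts, expected_miscounts


# Hypothetical election: 6,000,000 ballots decided by 500 votes.
larger, miscounts = margin_exceeds_counting_error(6_000_000, 500)
print(f"expected miscounts: {miscounts:.0f}; margin exceeds error: {larger}")
```

With these made-up figures roughly 857 ballots would be expected to be miscounted, more than the 500-vote margin, so the count alone could not establish the winner.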

Validity

Validity – is the test doing what you think it’s doing?

Face validity is important for lay people (for them to believe the test is valid). Other advantages/disadvantages of face validity?

Content, construct, predictive, convergent, discriminant.

Internal/external validity (trade-off?)

Factor analysis: What is it? (validity?)

Assignment: Using point form, briefly describe the key events of the television show “The Apprentice”

What themes emerge? How many different themes? Are the different themes related? This is a qualitative FA

Factor analysis represents a method of organizing & reducing data into latent (not directly assessed) constructs.

This is a mathematical rather than a conceptual (qualitative) organization of the data

Can be exploratory (no a priori theory) or confirmatory (compares data to a theory or previous data)

FA in APA journals: 70s = 4%, 80s = 9%, 90s = 21%

Factor analysis: Why do it?

Data reduction

- conserve df
- minimize problems of multicollinearity
- models for computing composite scores & item parcels

Scale construction & revision (improve psychometrics)

- empirical validation (or revision) of theoretical models
- arrangement of items (pos & neg loadings)
- relative importance of different items; how central each is to the latent construct (how to do item selection?)

Factor(s) must be replicable, generalizable & interpretable

Mean comparisons are irrelevant if factor structures differ, e.g., typical study comparing males & females, patients to non-patients

Factor analysis: How to do it?

Start with multi-item measures (min 3 items/construct) on a ratio or interval scale

EFA (Exploratory Factor Analysis)

Ns can range from a minimum of 5 subjects per item to the ideal 10:1 ratio, though this depends on loadings and communality (1 – uniqueness). Min. = 100

EFA will determine the number of factors to extract

- Varies with the number of latent constructs and the number of items (under vs. over extraction)
- Simulated sets of random data will still result in the emergence of factor(s), so check the scree plot of factors and their eigenvalues to find the descending linear trend (see p. 291).

Eigenvalue? The amount of the variance explained by each vector. Standardize items to z-scores (M=0, Var=1); the sum of the variances = # items.

- Item loadings = the item’s correlation with the vector on which it loads.
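The claim that the standardized variances sum to the number of items is easy to verify. In this minimal sketch the item names and scores are made up, and the population SD is used so that each z-scored item has a variance of exactly 1.

```python
import statistics


def z_scores(values):
    """Standardize scores to mean 0 and variance 1 (population SD)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(value - mean) / sd for value in values]


def total_standardized_variance(items):
    """Sum of the variances of the z-scored items."""
    return sum(statistics.pvariance(z_scores(scores)) for scores in items.values())


# Hypothetical item responses from six people.
items = {
    "tired": [1, 3, 4, 2, 5, 3],
    "sad": [2, 2, 5, 1, 4, 3],
    "sleep": [5, 1, 2, 4, 3, 3],
}
print(round(total_standardized_variance(items), 6), "for", len(items), "items")
```

Because the eigenvalues partition this total, an eigenvalue can be read as the number of items' worth of variance that a factor explains.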

When do factors dip below eigenvalues for a random data set with same N? (p. 291)

How many latent factors should emerge?

Scree plots, 50% variance rule & your theory (do not use the 1.0 eigenvalue default)
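To see why the 1.0 eigenvalue default is risky, the sketch below generates purely random item data and estimates the largest eigenvalue of its correlation matrix by power iteration. Everything here is an illustrative assumption (sample size, number of items, seed, function names), and power iteration stands in for the full eigen-decomposition a statistics package would perform.

```python
import random
import statistics


def correlation_matrix(columns):
    """Pearson correlations among item columns (each column = one item)."""
    n = len(columns[0])
    standardized = []
    for column in columns:
        mean = statistics.fmean(column)
        sd = statistics.pstdev(column)
        standardized.append([(x - mean) / sd for x in column])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in standardized]
            for zi in standardized]


def largest_eigenvalue(matrix, iterations=500):
    """Estimate the largest eigenvalue of a symmetric matrix by power iteration."""
    size = len(matrix)
    vector = [1.0] * size
    for _ in range(iterations):
        product = [sum(row[j] * vector[j] for j in range(size)) for row in matrix]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    product = [sum(row[j] * vector[j] for j in range(size)) for row in matrix]
    return sum(v * p for v, p in zip(vector, product))  # Rayleigh quotient


def random_item_data(n_subjects, n_items, seed=0):
    """Uncorrelated normal scores: no latent factor exists by construction."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(n_subjects)] for _ in range(n_items)]


matrix = correlation_matrix(random_item_data(n_subjects=100, n_items=6))
print(round(largest_eigenvalue(matrix), 2))
```

Even though no latent construct exists in this data, the first eigenvalue comes out above 1.0, which is why eigenvalues from real data are better judged against those from random data of the same size.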

Low item loadings (.35 or <) typically represent error variance and will not replicate in an independent sample, so avoid factors made up of such item loadings

As items are added, more factors emerge, but this is not just an artifact of the number of items. It could be that new factors are emerging…

e.g., I often feel tired, I am rarely sad, I cry often, I never smile, I rarely sleep, Often I am not hungry

2 possible factors emerge assessing the latent constructs of depression and timeframe

How to do it? Oblique vs. Orthogonal rotations

You must determine the relation between all of the factors (assuming there is more than 1 factor). This should be based on a theoretical rationale

Orthogonal (statistically independent/unrelated)

Items with high loadings on F1 have near 0 loadings on F2 = simple structure

Oblique (statistically dependent); factors are allowed to inter-correlate

Orthogonal rotations – advantage is that it is more easily interpretable, though it may not fit well with the data (if the latent constructs are not independent)

Oblique rotations – advantage is that it can account for more of the data (especially if the latent constructs are not independent), though it is more difficult to interpret

How to interpret an EFA?

Examine the number of factors and the number of rotations needed for the factor structure to converge

Low loadings (< .35) are likely to be error variance

Factors with few items are likely to be spurious factors

Will the factor structure replicate in a second independent sample?

This is essential, especially when the initial EFA was truly exploratory

Emergent factors will capitalize on chance associations in the data set (i.e., the same type I errors observed whenever conducting numerous analyses)

Now that you have a factor structure, what next?

What is CFA?

CFA is a powerful statistical technique that allows one to define a model and then determine how well the data set matches the predicted model (using chi-square and several fit indices) – can test entire model simultaneously

The predicted model can be theoretically derived or empirically derived (see EFA findings), though if it is the latter it MUST be tested on a different sample to allow for cross-validation

With large samples, randomly split the data (using random ID selection) into two equal sections
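A random split by participant ID might look like the sketch below. The function name, the seed, and the ID range are assumptions; fixing the seed simply makes the split reproducible.

```python
import random


def split_sample(ids, seed=525):
    """Randomly divide participant IDs into two halves for cross-validation."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    # With an odd N, the second half receives the extra participant.
    return shuffled[:half], shuffled[half:]


# Hypothetical sample of 400 participants: one half for the EFA, one for the CFA.
efa_ids, cfa_ids = split_sample(range(1, 401))
print(len(efa_ids), len(cfa_ids))
```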

Minimum N for CFA = 200. Findings are more robust (stable) as N increases

Assessing the fit of the model in CFA

Compare predicted model to observed data using the chi-square statistic (the smaller the better = no sig. difference between observed and expected)

Nested modeling – compare the fit of different models to each other

Law of parsimony = all multifactor models must fit the data at a level that is sig. better than a one-factor model (calculate the chi-square difference)
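The chi-square difference test for nested models can be sketched as follows. The fit values are hypothetical, and the small table of .05 critical values (for 1 to 3 degrees of freedom) is hard-coded only to keep the example free of external libraries.

```python
# Chi-square critical values at alpha = .05 for small degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square_difference(chi_simple, df_simple, chi_complex, df_complex):
    """Compare a simpler (e.g., one-factor) model with a nested multifactor model.

    Returns (delta_chi, delta_df, complex_fits_significantly_better).
    """
    delta_chi = chi_simple - chi_complex
    delta_df = df_simple - df_complex
    if delta_df not in CRITICAL_05:
        raise ValueError("this sketch only tabulates critical values for df 1 to 3")
    return delta_chi, delta_df, delta_chi > CRITICAL_05[delta_df]


# Hypothetical results: one-factor model vs. two-factor model.
print(chi_square_difference(chi_simple=180.0, df_simple=54,
                            chi_complex=95.0, df_complex=53))
```

If the difference does not exceed the critical value, parsimony favors the one-factor model.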

Indices of fit also used to evaluate the fit of all models – based on model chi-square, null model chi-square, and df.

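Comparative fit index (CFI) = $1 - [(\chi^2_m - df_m)/(\chi^2_n - df_n)]$

Bentler-Bonett index (BBI) = $(\chi^2_n - \chi^2_m)/\chi^2_n$

Here the subscript $m$ refers to the tested model and $n$ to the null model.

Delta2 (small Ns), TLI (all require fit > .90), RMSR (0-.05).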
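The two indices can be computed directly from the model and null-model chi-squares. This sketch follows the formulas as given in these notes (software typically also bounds the CFI between 0 and 1); the function names and fit values are hypothetical.

```python
def comparative_fit_index(chi_model, df_model, chi_null, df_null):
    """CFI = 1 - (chi_model - df_model) / (chi_null - df_null)."""
    return 1 - (chi_model - df_model) / (chi_null - df_null)


def bentler_bonett_index(chi_model, chi_null):
    """BBI = (chi_null - chi_model) / chi_null."""
    return (chi_null - chi_model) / chi_null


# Hypothetical output: tested model chi-square = 85 (df = 50),
# null model chi-square = 900 (df = 66).
cfi = comparative_fit_index(85.0, 50, 900.0, 66)
bbi = bentler_bonett_index(85.0, 900.0)
print(f"CFI = {cfi:.3f}, BBI = {bbi:.3f}, both above .90: {cfi > 0.90 and bbi > 0.90}")
```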

Factor analysis and construct validity

No longer acceptable to publish a scale without considering its factorial structure

If your scale is supposed to assess one construct, then this issue can be empirically evaluated (EFAs & CFAs).

Ultimately, one can only test the number of factors and how they relate to one another, not the actual content of the factors (inferred from item content)

If the theory for the construct and the FA do not correspond, then there are two alternatives:

1) The underlying theoretical constructs may not be correctly specified (your theory is wrong)

2) The theory may be adequate, but the scale used to assess it is not

Examples?

Organization of constructs

Factor structures – factor analysis (FA) and confirmatory factor analysis (CFA)

- Differences between these procedures
- Meaning of eigenvalues (extraction)
- Rotations (e.g., oblique vs. orthogonal)

How are the constructs inter-related?

- Organizational and explanatory power
- Theoretical and (not vs) empirical decisions

What is the construct of intelligence? How do you define it? How have others defined it?

This definition will determine how the tests are constructed, administered, and interpreted

What is intelligence and do modern IQ tests measure it? – write paper on this

Various conflicting views on this (e.g., Gould suggests that we can’t measure it and don’t with our current tests, whereas Boring suggests that it IS what intelligence tests measure)

See handout of definitions

Intellectual Assessment

The few tests that we will focus on are those that are commonly used by psychologists, those that are psychometrically sound, and those that you need to know in order to do your job.

Assessing LD with the WAIS-III/IV

Individuals with LD in reading and math generally exhibit IQ scores in the average range.

Index scores are, however, noteworthy:

- VCI scores tend to be 7-13 points higher than WMI scores (e.g., VCI is 15 or more points higher than WMI for almost 42% of those with reading disabilities).
- POI scores are approx. 7 points higher than PSI scores for all LD individuals (e.g., POI scores are at least 15 points higher than PSI scores for almost 31% of those with LDs).

Intelligence testing in problem populations

The Leiter was developed to evaluate cognitive functioning (i.e., intelligence) in individuals who are deaf-mute or otherwise nonverbal.

It can also be used with clients from other cultures who do not verbalize (or verbalize only minimally) in English or in their native language

Neuropsychological Evaluation

Head trauma is the primary cause of closed head injuries in the population (adolescents and adults)

Closed head injuries cause more widespread injuries and usually result in a period of lost consciousness.

Amnesia usually results (anterograde and retrograde)

Duration of anterograde amnesia is the best predictor of degree of injury and probability of recovery
