PSY 525
Jan 01, 2016
Complications in the field of psychology
Constructs are neither well defined nor directly observable (e.g., intelligence)
Compare to the problem of measuring something like brain size (assuming that it has some relation to intelligence)
How would this be done?
Similar issues arise with every construct defined in the DSM (note the lack of agreement even between different versions of the DSM) and in other diagnostic criteria such as the International Classification of Diseases-10 (ICD-10)
Why should we assess?
What do we gain? What is the cost? How should it be done? What should be assessed?
First assignment is a 2-page paper on the first organizing question, "Why should we assess?" (think about this)
The functional role of assessment for patients
To provide a functional analysis of the patient (“what” they can and cannot do, with less emphasis on “why”)
To direct treatment (one must know what is wrong in order to select an intervention)
What interventions/therapies have you learned (or are you learning), and what are the indicators for implementing those interventions?
Your assessments count!
See Exhibit 1-1 (pp. 2-3): Daniel Hoffman v. Board of Education of the City of New York
Reports are used by mental health workers, administrators, courts, etc.
Words can be misleading (be clear)
IQs can change
Different tests may provide different IQs
Base decisions on multiple tests
Use appropriate tests
Review previous findings/testing
Types of Assessment (pp. 4-5)
Screening
Focused
Diagnostic
Counseling and Rehabilitation
Progress Evaluation
Problem-solving
Four Pillars of Assessment
1. norm-referenced tests*
2. interviews
3. observations
4. informal assessment procedures
*Tests: a) same item content, b) same administration procedure, and c) same scoring criteria (i.e., a measure must be standardized to be considered a test)
Steps in the assessment process
1. Review referral
2. Decide whether to accept it
3. Obtain relevant background info
4. Consider influence of relevant others
5. Observe client in multiple settings
6. Select/administer appropriate test battery
7. Interpret the assessment
8. Develop/select intervention strategies
9. Write report
10. Meet with examinee
11. Follow up and re-evaluate (see Dawes)
Clinical assessment/judgment (Dawes et al., 1989)
Although the literature is replete with criticisms of standardized assessments, how does the more informal version (i.e., clinical judgment) do relative to actuarial models?
How is your clinical judgment?
Do you expect that it will improve with training?
How would you stack up compared to someone with no training who is just given instructions to follow?
See Dawes et al., 1989
Assessment patterns – what is assessed (Lubin et al., 1985)
Most commonly used tests have varied over time and setting
Today, a wide variety of tests are employed, representing many diverse perspectives (from behavioral to psychodynamic)
Compare educational vs. inpatient vs. counseling settings
Tests employed driven by referral
Psychometrics
Teach individuals about 5 tests, or teach them how to evaluate tests?
The "give someone a fish vs. teach them to fish" idea.
Psychometrics allow for an understanding of what makes a test effective, how to evaluate them, how to create good ones, and how to extend this process to other methods of evaluation
This is where science (the process of doing research) and clinical work overlap
Scaling
Categorical (nominal) – named categories
Ordinal – order, but unequal intervals
Interval – equal intervals but no true 0 point
Ratio – equal intervals and a true 0 point
Interval and ratio scales are the only ones that technically allow for the calculation of a mean, SD, and most parametric statistics
The type of scale will dictate the type of statistics that can be used
E.g., a nominal scale should only use the mode as a measure of central tendency
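A minimal sketch of how the scale type constrains the choice of central tendency. The data below are hypothetical and invented purely for illustration; the pairing of statistic to scale follows the conventional rule of thumb rather than anything specific to the readings.

```python
from statistics import mean, median, mode

# Hypothetical data, invented for illustration only.
diagnoses = ["MDD", "GAD", "MDD", "PTSD", "MDD"]  # categorical (nominal) labels
class_ranks = [1, 2, 2, 3, 5]                      # ordinal ranks
iq_scores = [85, 100, 100, 115, 130]               # interval-level scores

# Nominal data: categories have no order, so only the mode is meaningful.
nominal_center = mode(diagnoses)

# Ordinal data: the median respects order without assuming equal intervals.
ordinal_center = median(class_ranks)

# Interval (and ratio) data: equal intervals make the mean meaningful as well.
interval_center = mean(iq_scores)

print("Mode of diagnoses:", nominal_center)
print("Median class rank:", ordinal_center)
print("Mean IQ:", interval_center)
```

Taking the mean of the diagnosis labels is not even computable, which is the point: the scale limits the statistics available.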
Tools of the trade
You must be very familiar (for this class and, ultimately, the licensing exam) with the following concepts – they will be reviewed in greater detail in the readings
Measures of central tendency – mean, mode, median
How to calculate them, the strengths and weaknesses of each, when to use them
Measures of variability – SD & various ranges
How to calculate them, the strengths and weaknesses of each, when to use them
Tools – continued 1
How do the measures of central tendency and variability relate to one another?
Are there some that shouldn’t be used together?
Understanding the normal curve and probability theory (see overhead of percentages)
This information is necessary in order to interpret any assessment results
Why is variability important? (between, not within, individuals)
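The normal-curve percentages referred to above can be reproduced with a short sketch. It assumes a deviation-IQ style scale (mean 100, SD 15); that scale and the helper names `proportion_within` and `percentile_rank` are illustrative assumptions, not taken from the course materials.

```python
from statistics import NormalDist

# Assumed scale for illustration: mean 100, SD 15.
iq = NormalDist(mu=100, sigma=15)


def proportion_within(k_sd: float) -> float:
    """Proportion of a normal population within +/- k_sd SDs of the mean."""
    low = iq.mean - k_sd * iq.stdev
    high = iq.mean + k_sd * iq.stdev
    return iq.cdf(high) - iq.cdf(low)


def percentile_rank(score: float) -> float:
    """Percentage of the population scoring at or below the given score."""
    return 100 * iq.cdf(score)


for k in (1, 2, 3):
    print(f"Within +/-{k} SD: {100 * proportion_within(k):.1f}%")
print(f"Percentile rank of a score of 115: {percentile_rank(115):.1f}")
```

A score is only interpretable relative to this between-person spread: 115 means "one SD above the mean", which is roughly the 84th percentile on the assumed scale.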
Reliability
Reliability – consistency between raters (see Cohen’s kappa), between parallel versions of the same test, within the one test (split-half, Cronbach’s alpha), & from one administration to another
How does this relate to standardization?
How does this relate to variability?
How does this relate to validity?
How does this relate to the accuracy of measurement?
Standard error of measurement (SEM) = SD × square root of (1 – the test’s reliability)
Possible range of SEM is 0 (perfect reliability) to the test’s SD (zero reliability)
The smaller the SEM the better? Why?
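A worked sketch of the SEM formula. The SD of 15 and reliability of .91 are hypothetical values chosen so the arithmetic is easy to check, and the band around the observed score is a common simplification (a stricter approach centers the band on the estimated true score).

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed: float, sd: float, reliability: float, z: float = 1.96):
    """Approximate band around an observed score (z = 1.96 gives about 95%)."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem


# Hypothetical test: SD = 15, reliability = .91, so SEM = 15 * sqrt(.09) = 4.5
sem = standard_error_of_measurement(15, 0.91)
low, high = score_band(100, 15, 0.91)
print(f"SEM = {sem:.2f}; 95% band for an observed score of 100: {low:.1f} to {high:.1f}")
```

The two limiting cases match the stated range: a reliability of 1 gives an SEM of 0, and a reliability of 0 gives an SEM equal to the SD.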
Reliability and error
Rating errors
Constant (leniency, severity, tendency to the mean), halo effects, contrast (with previous subject or oneself), proximity (an item’s location on the printed page can result in ratings similar to nearby items), most-recent-performance, and/or inadequate information errors
These can be minimized with more raters, exact instructions, intense training, frequent evaluation and recalibration
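One way to check whether training and recalibration are working is to track agreement between raters. The sketch below computes Cohen's kappa for two raters; the ratings are hypothetical, and the function name `cohens_kappa` is just an illustrative choice.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: proportion of cases given the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal proportions.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of 10 cases ("y" = symptom present, "n" = absent).
rater_a = ["y", "y", "y", "y", "y", "n", "n", "n", "n", "n"]
rater_b = ["y", "y", "y", "y", "n", "n", "n", "n", "n", "y"]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 80% of cases, but half of that agreement is expected by chance, so kappa is about .60 rather than .80.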
p. 2 – reliability and error
Scale calibration
More items are typically needed to achieve high reliability, but there are exceptions
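The link between test length and reliability is usually illustrated with the Spearman-Brown prophecy formula. The notes do not name this formula, so treat it as a supplementary sketch; it assumes the added items are parallel to the existing ones, which is one reason exceptions occur in practice.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by length_factor.

    Assumes the added items are parallel to (as good as) the existing ones.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical scale with reliability .60
for factor in (0.5, 1, 2, 4):
    print(f"{factor}x length -> predicted reliability {spearman_brown(0.60, factor):.2f}")
```

Doubling the hypothetical scale raises the predicted reliability from .60 to .75, while halving it drops the prediction to about .43.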
Guttman approach involves ordering items in terms of their level of difficulty (ascending)
How does one determine level of difficulty?
This approach assumes that once a specified number of items are missed, the more difficult items that follow will also be missed, so there is no need to administer them
Cost of this approach? Problems?
p. 3 – reliability and error
Coefficient of determination, or R-squared, is the proportion of variance in one variable that can be accounted for by a second variable (the predictor)
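A small sketch of the coefficient of determination for a single predictor, where R-squared is simply the squared Pearson correlation. The scores are hypothetical and the variable names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical predictor (test score) and criterion (outcome rating).
test_scores = [1, 2, 3, 4, 5]
criterion = [2, 1, 4, 3, 5]

r = pearson_r(test_scores, criterion)
r_squared = r ** 2
print(f"r = {r:.2f}, R-squared = {r_squared:.2f}")
```

With these numbers r = .80, so the predictor accounts for 64% of the variance in the criterion and leaves 36% unexplained.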
Criterion contamination – when one knows information that makes it impossible to do a fair test of criterion validity (e.g., race of the skulls)
Must conduct blind ratings
p. 4 – reliability and error
Base rates represent an extremely important source of information and are often ignored (e.g., Rosenhan’s famous study of pseudopatients who claimed to hear voices and were admitted to psychiatric hospitals)
Why is it easier to predict behaviors or outcomes that occur at a base rate near 50%?
p. 5 – reliability and error
All measurement represents an estimate of whatever is being assessed. Therefore statistics are needed to help make such estimates (inferences)
Statistical power is crucial to decision making
Alpha = probability of a Type I error, i.e., rejecting the null when it is true
Beta = probability of a Type II error, i.e., failing to reject the null when it is in fact false (power = 1 – beta)
Parametric vs. nonparametric (few or no distributional assumptions for the data)
Minimizing error
Standardization – refers to the consistency in applying methods
Implications for testing
Costs of violating standardization (weigh such a decision very carefully, as there are major costs)
Use of proper norms – when can the norm group deviate from those to whom it is applied?
What constitutes an effective norm group?
Systematic and random error
Difficulties in detection and correction
An applied look at the SEM
In November of 2000, we tried to elect a president.
What happens when the margin of victory is smaller than the margin of error (SEM) in counting (the latter being approx 1 in 7,000)?
Impossible to ever know who really won
Issue of what constitutes a “vote” (removal of chad, depression of chad, intent to vote, etc.)
What margin of victory is a real victory? (significant difference is determined by the standard error of measure)
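The logic can be made concrete with some rough arithmetic. Only the 1-in-7,000 counting error rate comes from the notes above; the number of ballots and the margin below are hypothetical round figures chosen for illustration, not official results.

```python
COUNTING_ERROR_RATE = 1 / 7000   # approximate rate cited in the notes

# Hypothetical figures for illustration only (not official counts).
ballots_cast = 6_000_000
margin_of_victory = 500


def expected_miscounts(ballots: int, error_rate: float) -> float:
    """Expected number of ballots affected by counting error."""
    return ballots * error_rate


miscounts = expected_miscounts(ballots_cast, COUNTING_ERROR_RATE)
margin_exceeds_error = margin_of_victory > miscounts

print(f"Expected miscounted ballots: {miscounts:.0f}")
print(f"Margin of victory: {margin_of_victory}")
print("Margin exceeds counting error:", margin_exceeds_error)
```

Under these assumed figures, roughly 857 ballots are expected to be miscounted, which is more than the 500-vote margin, so a recount could plausibly reverse the outcome. The same reasoning applies to a difference between two test scores that is smaller than the SEM.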
Validity
Validity – is the test doing what you think it’s doing?
Face validity is important for lay people (for them to believe the test is valid). Other advantages/disadvantages of face validity?
Content, construct, predictive, convergent, discriminant.
Internal/external validity (trade-off?)
Factor analysis: What is it? (validity?)
Assignment: Using point form, briefly describe the key events of the television show “The Apprentice”
What themes emerge? How many different themes? Are the different themes related? Qualitative FA
Factor analysis represents a method of organizing & reducing data into latent (not directly assessed) constructs.
This is a mathematical rather than a conceptual (qualitative) organization of the data
Can be exploratory (no a priori theory) or confirmatory (compares data to a theory or previous data)
FA in APA journals: 70s = 4%, 80s = 9%, 90s = 21%
Factor analysis: Why do it?
Data reduction
conserve df
minimize problems of multicollinearity
Models for computing composite scores & item parcels
Scale construction & revision (improve psychometrics)
empirical validation (or revision) of theoretical models
arrangement of items (pos & neg loadings)
relative importance of different items; how central each is to the latent construct (how to do item selection?)
Factor(s) must be replicable, generalizable & interpretable
Mean comparisons are irrelevant if factor structures differ, e.g., typical study comparing males & females, patients to non-patients
Factor analysis: How to do it?
Start with multi-item measures (min 3 items/construct) on a ratio or interval scale
EFA (Exploratory Factor Analysis)
Ns can range from 5 subjects per item minimum to the ideal 10:1 ratio, though this depends on loadings and communality (1 – uniqueness). Min. = 100
EFA will determine the number of factors to extract
Varies with the number of latent constructs and the number of items (under- vs. over-extraction)
Simulated sets of random data will still result in the emergence of factor(s), so check the scree plot of factors and their eigenvalues to find the descending linear trend (see p. 291).
Eigenvalue? The amount of the variance explained by each vector. Standardize items to z-scores (M=0, Var=1), so that the sum of the variances = # items.
Item loadings = the item’s correlation with the vector on which it loads.
When do factors dip below eigenvalues for a random data set with same N? (p. 291)
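To make eigenvalues and loadings concrete, the sketch below uses a deliberately simple special case: four hypothetical items, standardized to unit variance, that all intercorrelate at .50. For such an equal-correlation matrix the eigenvalues have a known closed form, so no statistics package is needed; real data require a numerical eigendecomposition, and the function names here are illustrative.

```python
import math


def equicorrelation_matrix(n_items: int, r: float):
    """Correlation matrix in which every pair of items correlates r."""
    return [[1.0 if i == j else r for j in range(n_items)] for i in range(n_items)]


def equicorrelation_eigenvalues(n_items: int, r: float):
    """Closed-form eigenvalues: one large root, then (n_items - 1) equal small roots."""
    return [1 + (n_items - 1) * r] + [1 - r] * (n_items - 1)


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


n_items, r = 4, 0.5  # hypothetical scale
R = equicorrelation_matrix(n_items, r)
eigenvalues = equicorrelation_eigenvalues(n_items, r)

# The first eigenvector weights every item equally (unit length).
v1 = [1 / math.sqrt(n_items)] * n_items
# Check the defining property R v = lambda v for the first root.
Rv1 = mat_vec(R, v1)
is_eigenvector = all(math.isclose(x, eigenvalues[0] * w) for x, w in zip(Rv1, v1))

# Loading = item's correlation with the factor = sqrt(eigenvalue) * eigenvector weight.
loading = math.sqrt(eigenvalues[0]) * v1[0]
variance_explained = eigenvalues[0] / n_items

print("Eigenvalues:", eigenvalues, "sum =", sum(eigenvalues))
print(f"First-factor loading of each item: {loading:.2f}")
print(f"Variance explained by first factor: {variance_explained:.1%}")
```

The eigenvalues (2.5, 0.5, 0.5, 0.5) sum to 4, the number of items, and the first factor accounts for 62.5% of the total variance. A scree plot of these values would show one clear factor followed by a flat tail.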
How many latent factors should emerge?
Scree plots, 50% variance rule & your theory (do not use the 1.0 eigenvalue default)
Low item loadings (.35 or <) typically represent error variance and will not replicate in an independent sample, so avoid factors made up of such item loadings
As items are added, more factors emerge, but this is not just an artifact of the number of items. It could be that new factors are emerging…
e.g., I often feel tired, I am rarely sad, I cry often, I never smile, I rarely sleep, Often I am not hungry
2 possible factors emerge assessing the latent constructs of depression and timeframe
How to do it? Oblique vs. Orthogonal rotations
You must determine the relation between all of the factors (assuming there is more than 1 factor)
This should be based on a theoretical rationale
Orthogonal (statistically independent/unrelated)
Items with high loadings on F1 are near 0 loading on F2 = simple structure
Oblique (stat. dependent); Fs allowed to inter-correlate
Orthogonal rotations – advantage is that it is more easily interpretable, though it may not fit well with the data (if the latent constructs are not independent)
Oblique rotations – advantage is that it can account for more of the data (especially if the latent constructs are not independent), though it is more difficult to interpret
How to interpret an EFA?
Examine the number of factors and the number of rotations needed for the factor structure to converge
Low loadings (< .35) are likely to be error variance
Factors with few items are likely to be spurious factors
Will the factor structure replicate in a second independent sample?
This is essential, especially when the initial EFA was truly exploratory
Emergent factors will capitalize on chance associations in the data set (i.e., the same type I errors observed whenever conducting numerous analyses)
Now that you have a factor structure, what next?
What is CFA?
CFA is a powerful statistical technique that allows one to define a model and then determine how well the data set matches the predicted model (using chi-square and several fit indices) – can test entire model simultaneously
The predicted model can be theoretically derived or empirically derived (see EFA findings), though if it is the latter it MUST be tested on a different sample to allow for cross-validation
With large samples, randomly split the data (using random ID selection) into two equal sections
Minimum N for CFA = 200. Findings are more robust (stable) as N increases
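A minimal sketch of the random split by subject ID, so that one half can be used for the EFA and the other held out for the CFA. The ID range, the seed, and the function name `split_sample` are hypothetical choices for illustration.

```python
import random


def split_sample(subject_ids, seed=0):
    """Randomly split subject IDs into two halves (exploration, validation)."""
    ids = list(subject_ids)
    # A seeded generator makes the split reproducible and reportable.
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return sorted(ids[:half]), sorted(ids[half:])


# Hypothetical sample of 400 participants with IDs 1..400
efa_ids, cfa_ids = split_sample(range(1, 401), seed=525)
print(len(efa_ids), "subjects for the EFA;", len(cfa_ids), "held out for the CFA")
```

Fixing the seed matters here: the split should be decided once, before any analysis, so that the validation half is never used to tune the exploratory model.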
Assessing the fit of the model in CFA
Compare predicted model to observed data using the chi-square statistic (the smaller the better = no sig difference between observed and expected)
Nested modeling – compare the fit of different models to each other
Law of parsimony = all multifactor models must fit the data at a level that is sig. better than a one factor model (calculate the chi-square difference)
Indices of fit also used to evaluate the fit of all models – based on model chi-square, null model chi-square, and df.
Comparative fit index (CFI) = $1 - [(\chi^2_m - df_m) / (\chi^2_n - df_n)]$, where m = the tested model and n = the null model
Bentler-Bonett index (BBI) = $(\chi^2_n - \chi^2_m) / \chi^2_n$
Delta2 (small Ns), TLI (all require fit > .90), RMSR (0-.05).
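The two indices can be computed directly from the model and null-model chi-square values and their df. The input values below are hypothetical. The sketch also truncates negative misfit at zero, which is how the CFI is conventionally bounded between 0 and 1; that step is an addition to the simpler formula in the notes.

```python
def comparative_fit_index(chi2_model, df_model, chi2_null, df_null):
    """CFI = 1 - (chi2_m - df_m) / (chi2_n - df_n), bounded to the 0-1 range."""
    model_misfit = max(chi2_model - df_model, 0.0)
    null_misfit = max(chi2_null - df_null, model_misfit)
    if null_misfit == 0.0:
        return 1.0  # neither model shows misfit beyond its df
    return 1.0 - model_misfit / null_misfit


def bentler_bonett_index(chi2_model, chi2_null):
    """BBI (normed fit index) = (chi2_n - chi2_m) / chi2_n."""
    return (chi2_null - chi2_model) / chi2_null


# Hypothetical output from a CFA program
chi2_m, df_m = 85.0, 50     # tested model
chi2_n, df_n = 900.0, 66    # null (independence) model

cfi = comparative_fit_index(chi2_m, df_m, chi2_n, df_n)
bbi = bentler_bonett_index(chi2_m, chi2_n)
print(f"CFI = {cfi:.3f}, BBI = {bbi:.3f}")
```

With these values both indices clear the .90 threshold (CFI of about .958, BBI of about .906). The CFI is the higher of the two because it credits the model for its degrees of freedom.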
Factor analysis and construct validity
No longer acceptable to publish a scale without considering its factorial structure
If your scale is supposed to assess one construct, then this issue can be empirically evaluated (EFAs & CFAs).
Ultimately, one can only test the number of factors and how they relate to one another, not the actual content of the factors (inferred from item content)
If the theory for the construct and the FA do not correspond, then there are two alternatives:
1) The underlying theoretical constructs may not be correctly specified (your theory is wrong)
2) The theory may be adequate, but the scale used to assess it is not
Examples?
Organization of constructs
Factor structures – factor analysis (FA) and confirmatory factor analysis (CFA)
Differences between these procedures
Meaning of eigenvalues (extraction)
Rotations (e.g., oblique vs. orthogonal)
How are the constructs inter-related?
Organizational and explanatory power
Theoretical and (not vs) empirical decisions
What is the construct of intelligence? How do you define it? How have others defined it?
This definition will determine how the tests are constructed, administered, and interpreted
What is intelligence and do modern IQ tests measure it? – write paper on this
Various conflicting views on this (e.g., Gould suggests that we can’t measure it and don’t with our current tests, whereas Boring suggests that it IS what intelligence tests measure)
See handout of definitions
Intellectual Assessment
A few tests that we will focus on are those that are commonly used by psychologists, those that are psychometrically sound, and those that you need to know in order to do your job.
Assessing LD with the WAIS-III/IV
Individuals with LD in reading and math generally exhibit IQ scores in the average range.
Index scores are, however, noteworthy:
VCI scores tend to be 7-13 points higher than WMI scores (e.g., VCI is 15 or more points higher than WMI for almost 42% of those with reading disabilities).
POI scores are approx. 7 points higher than PSI scores for all LD individuals (e.g., POI scores are at least 15 points higher than PSI scores for almost 31% of those with LDs).
Intelligence testing in problem populations
The Leiter was developed to evaluate cognitive functioning (i.e., intelligence) in individuals who are deaf-mute or otherwise nonverbal.
It can also be used with clients from other cultures who do not verbalize (or verbalize only minimally) in English or in their native language
Neuropsychological Evaluation
Head trauma is the primary cause of closed head injuries in the population (adolescents and adults)
Closed head injuries cause more widespread injuries and usually result in a period of lost consciousness.
Amnesia usually results (anterograde and retrograde)
Duration of anterograde amnesia is the best predictor of degree of injury and probability of recovery