The response category labeling effect: How the wording of labels affects response distributions in Likert data

Bert Weijters, Maggie Geuens, Hans Baumgartner


Research questions

• Do the labels attached to scale categories influence response behavior?
• What mechanism(s) can account for this response category labeling effect?
• Are there moderators of this effect?
• What are the implications of the response category labeling effect for cross-cultural research?


The importance of category labels

I try to avoid foods that are high in cholesterol.

________ ________ ________ ________ ________
strongly disagree | disagree | neither agree nor disagree | agree | strongly agree

versus

________ ________ ________ ________ ________
completely disagree | disagree | neither agree nor disagree | agree | completely agree


The intensity hypothesis

• Label intensity refers to the perceived degree of (dis)agreement implied by the label;
• More intense labels represent more extreme positions, which are endorsed less often (e.g., agree vs. strongly agree; superior vs. very good);
• Even more subtle adverbial modifiers (e.g., strongly vs. completely agree) may influence response behavior;
• There is prior evidence that different adverbs are associated with different intensities (e.g., Cliff 1959; Smith et al. 2009), but little evidence that different adverbs lead to differential category endorsement;


The fluency hypothesis

• Research on processing fluency shows that the meta-cognitive experience of ease of processing affects judgment and decision making:
  – perceptions of the truth value of statements (e.g., Unkelbach 2007);
  – liking for objects and events (e.g., Reber, Schwarz, and Winkielman 2004);
• Repeated statements are more likely to be rated as true (Unkelbach 2007) and repetition increases liking, as suggested by the mere exposure effect (e.g., Bornstein 1989), in part because repetition makes stimuli more familiar and contributes to greater processing fluency;
• Words vary in how often they are encountered, and high frequency words are processed more fluently;
• If scale labels are more commonly used in everyday language and are thus easier to process, this may increase the likelihood that the corresponding response option on the rating scale is selected;


Two alternative hypotheses to explain the effect of response category labels

Intensity hypothesis:

H1: Response categories are endorsed less frequently if their labels are more intense.

Fluency hypothesis:

H2: Response categories are endorsed more frequently if their labels are more fluent.


Verbal ability as a moderator of the fluency effect

When people process information more carefully or are highly experienced, their actual thoughts, not the ease of generating them, play a more decisive role;

Verbal ability (as a form of language expertise) may moderate the fluency effect;

We posit that for respondents who tend to use words in a precise manner and who make fine-grained distinctions as to the exact meaning and implications of words, fluency will be less important as a cue in selecting a response;

Study 1: Scaling intensity and fluency

• Do different methods for scaling the intensity and fluency of response category labels lead to similar results?
  If the intensity or fluency of scale labels is to have a reliable effect on responses to questionnaires, consistent differences in the perceived intensity and fluency of category labels should emerge across respondents.
• Can we identify endpoint labels that vary significantly in intensity and fluency for use in subsequent studies?
  We need two labels that imply contradictory responses under the intensity and fluency hypotheses.


Study 1 (cont’d)

• Label intensity
  – Direct ratings of intensity (0 = neutral; 10 = 100% agreement)
  – Pairwise comparisons of intensity (“Which expression indicates the stronger sense of agreement?”)
• Label fluency
  – Direct ratings of fluency (0 = we never use this term in day-to-day speech; 10 = we use this term very often in day-to-day speech)
  – Pairwise comparisons of fluency (“Which expression is more commonly used in day-to-day speech?”)
  – Lexical decision task (press a button labeled ‘end category label’ or ‘not an end category label’ for six endpoint labels and five non-endpoint labels)
  – Word frequency counts in corpora of texts (Google hits, available for specific word combinations in particular countries and languages)
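The slides do not say how the pairwise comparisons were turned into scale values. One simple possibility, shown here purely as an assumption, is to count how often a respondent chooses each label over the others. With six labels there are 15 pairs, so each score lies between 0 and 5 and the six scores sum to 15. The label keys and the example respondent below are hypothetical.

```python
from itertools import combinations

# Short keys for the six Dutch endpoint labels (illustrative identifiers only).
LABELS = ["sterk", "zeer", "zeker", "uitgesproken", "helemaal", "volledig"]


def win_counts(choices):
    """Count how often each label was chosen across all pairwise comparisons.

    `choices` maps each unordered pair (a frozenset of two labels) to the
    label the respondent picked as more intense (or more fluent).
    """
    counts = {label: 0 for label in LABELS}
    for a, b in combinations(LABELS, 2):
        winner = choices[frozenset((a, b))]
        counts[winner] += 1
    return counts


# Hypothetical respondent who always picks the label listed later in LABELS.
example_choices = {frozenset((a, b)): b for a, b in combinations(LABELS, 2)}
scores = win_counts(example_choices)
print(scores)
```

Averaging such counts over respondents gives one mean score per label. Whether the authors used this rule or a different scaling model is not stated in the slides.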


Study 1: Method

Sample 1: 83 undergraduates; pairwise comparisons of intensity and fluency of six endpoint labels;

Sample 2: 112 respondents (mean age 32.03, 66% female) from an online panel; direct ratings of intensity and fluency on 11-point scales;

Sample 3: 125 undergraduates (57% female); lexical decision task;


Study 1: Results

Dutch label | Free translation | Intensity: paired comparison | Intensity: direct rating | Fluency: paired comparison | Fluency: direct rating | Fluency: response latency | Fluency: Google hits (in millions)
Sterk (on)eens | Strongly (dis)agree | 0.94 (.13) | 8.89a (.15) | 1.14 (.11) | 4.62b (.27) | 1011.53a (55.01) | 11.70
Zeer (on)eens | Very much (dis)agree | 1.43 (.12) | 8.60a (.18) | 1.65 (.12) | 3.49a (.24) | 1002.93a (54.86) | 38.20
Zeker (on)eens | Certainly (dis)agree | 2.11 (.11) | 8.78a (.20) | 2.40 (.10) | 6.05c (.28) | 989.72a (54.61) | 35.90
Uitgesproken (on)eens | Distinctly (dis)agree | 2.98 (.18) | 9.57b (.22) | 1.18 (.13) | 3.81b (.28) | 1021.87a (54.73) | 2.16
Helemaal (on)eens | Fully (dis)agree | 3.72 (.13) | 10.54c (.12) | 4.24 (.08) | 9.62d (.16) | 724.88b (55.83) | 33.60
Volledig (on)eens | Completely (dis)agree | 3.82 (.12) | 10.56c (.10) | 4.39 (.08) | 9.59d (.18) | 672.73b (55.21) | 67.10


Study 1: Results (cont’d)

• For intensity, the correlation of the means obtained from the paired comparison and direct rating tasks is .92;
• The correlations of the means derived from the four fluency methods range from .66 to .97, with an average of r = .84;
• Thus, there is considerable consistency in respondents’ judgments of the perceived intensity and fluency of different category labels;
• ‘Sterk eens’ (strongly agree) consistently emerged as one of the least intense and least fluent labels, while ‘volledig eens’ (completely agree) surfaced as one of the most intense and most fluent labels;
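The cross-method agreement for intensity can be checked with the six label means from the Study 1 results table. The sketch below computes a plain Pearson correlation between the paired-comparison means and the direct-rating means. The variable names are illustrative, and the calculation uses only the rounded means shown on the slide.

```python
from math import sqrt

# Mean intensity scores for the six Dutch endpoint labels (Study 1 results table),
# in the order: sterk, zeer, zeker, uitgesproken, helemaal, volledig.
paired_comparison = [0.94, 1.43, 2.11, 2.98, 3.72, 3.82]
direct_rating = [8.89, 8.60, 8.78, 9.57, 10.54, 10.56]


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


r_intensity = pearson_r(paired_comparison, direct_rating)
print(round(r_intensity, 2))  # prints 0.92
```

The result matches the reported correlation of .92 between the two intensity methods.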


Study 2

Direct test of the intensity and fluency hypotheses:

The endorsement rate for a high intensity and high fluency label should be relatively low if the intensity hypothesis is true, and it should be relatively high if the fluency hypothesis is true.

Preliminary test of whether the intensity/fluency of labels affects predictive validity.


Measuring response distributions

A major challenge is to measure differences in response distributions that are not item-specific and independent of substantive content;

To do this, we need to observe patterns of responses across heterogeneous items (i.e., items that do not share common content but have the same response format):

Deliberately designed scales consisting of heterogeneous items (Greenleaf 1992)

Random samples of items from scale inventories (Weijters, Geuens & Schillewaert 2010)
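A content-free measure of this kind can be sketched as a simple count per respondent: how many of the heterogeneous items received an endpoint response. The function names and the example answers below are hypothetical; the slides do not show the authors' scoring code.

```python
def count_endpoint_responses(responses, scale_points=5):
    """Count answers in the lowest or highest category of the rating scale."""
    return sum(1 for r in responses if r in (1, scale_points))


def count_top_category(responses, scale_points=5):
    """Count answers in the highest (extreme positive) category only."""
    return sum(1 for r in responses if r == scale_points)


# Hypothetical answers of one respondent to 10 unrelated items on a 5-point scale.
respondent = [5, 3, 1, 4, 5, 2, 3, 5, 4, 1]

n_endpoint = count_endpoint_responses(respondent)
n_top = count_top_category(respondent)
print(n_endpoint, n_top)  # prints 5 3
```

Because the items share no common content, a high count is taken to reflect how the respondent uses the response format and not what the respondent thinks about any one topic.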


Study 2: Method

161 Dutch-speaking respondents (mean age 31.27, 67% female) from a university panel were randomly assigned to two versions of an online questionnaire:
□ Endpoint labels of ‘completely (dis)agree’
□ Endpoint labels of ‘strongly (dis)agree’

Four sections:
□ 6 attitudinal items, one of which was “I love to go out for dinner”;
□ 10 heterogeneous items from unrelated scales (e.g., “I am a sensitive person”, “Financial security is important to me”), rated on 5-point scales;
□ Direct ratings of the intensity and fluency of six end labels (100-point scale for intensity, 11-point scale for fluency);
□ Behavioral measure of choice between five different vouchers worth 15 EUR (cinema, book, restaurant, theatre, gym);


Study 2: Results

The findings support the fluency hypothesis:

Label | Intensity | Fluency | Mean endorsement of the extreme positive category
Strongly agree | 75.47 | 3.20 | 2.47
Completely agree | 93.63 | 8.22 | 3.61

(p<.001 based on a Poisson regression)

Logistic regression of choice of restaurant voucher on label, attitude toward going out for dinner, and interaction indicates a significant interaction: predictive validity is better for ‘strongly agree’ than ‘completely agree’.
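The size of the Study 2 difference in mean endorsement can be expressed as a rate ratio. In a Poisson regression with a log link and a single binary predictor, the maximum likelihood slope equals the log of the ratio of the two group means. The sketch below assumes the reported regression compares the two label versions without covariates, which the slides do not confirm.

```python
from math import log

# Mean number of extreme positive responses per label version (Study 2 results).
mean_strongly = 2.47
mean_completely = 3.61

# Ratio of expected counts: 'completely agree' relative to 'strongly agree'.
rate_ratio = mean_completely / mean_strongly

# Slope a Poisson regression would estimate for a 0/1 label dummy
# (assumption: no other predictors in the model).
slope = log(rate_ratio)

print(round(rate_ratio, 2), round(slope, 2))
```

Under that assumption, respondents who saw ‘completely agree’ chose the extreme positive category a little under one and a half times as often as those who saw ‘strongly agree’.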


Study 3

Replication of the fluency effect with a sample drawn from the general population;

Literacy as a potential moderator;


Study 3: Method

369 Dutch-speaking panel members (mean age 45.8, 50% female) of an online market research agency in Belgium were randomly assigned to two versions of an online questionnaire:
• Endpoint labels of ‘completely (dis)agree’
• Endpoint labels of ‘strongly (dis)agree’

Questionnaire:
• 16 heterogeneous items based on Greenleaf (1992), rated on 5-point scales;
• Pairwise comparisons of four endpoint labels in terms of intensity and fluency (strongly, completely, fully, and absolutely);
• Literacy measure: “I do a lot of reading” and “I prefer activities that don’t require a lot of reading” (strongly associated with having a higher education);


Study 3: Results

The findings support the fluency hypothesis:

Label | Intensity | Fluency | Mean endorsement of the extreme categories
Strongly agree | .74 | .65 | 2.66
Completely agree | 2.02 | 2.18 | 3.05

Fluency effect occurs primarily for respondents with lower literacy;

(p<.05 based on a Poisson regression)


[Figure: The moderating effect of literacy on the fluency effect. Number of endpoint responses (vertical axis, 0 to 4) for ‘Sterk (on)eens’ and ‘Volledig (on)eens’, shown separately for low-literacy and high-literacy respondents.]


Study 4: Method

271 Dutch-speaking panel members (mean age 39.2, 51% female) of an online market research agency in Belgium were randomly assigned to two versions of an online questionnaire:
• Endpoint labels of ‘completely (dis)agree’
• Endpoint labels of ‘strongly (dis)agree’

Questionnaire:
• 10 heterogeneous items, rated on 5-point scales;
• Pairwise comparisons of six response category labels in terms of intensity and fluency;
• Antonym test as a measure of verbal ability (4 items); antonym test strongly associated with having a higher education;


Study 4: Results

Manipulation checks:

Label | Intensity | Fluency
Strongly agree | 1.93 | 1.67
Completely agree | 3.44 | 4.26

Fluency effect occurs primarily for respondents low in verbal ability (significant interaction, with significant simple main effect for low verbal ability respondents);


[Figure: The moderating effect of verbal ability on the fluency effect. Number of endpoint responses (vertical axis, 0 to 5) for ‘Sterk (on)eens’ and ‘Volledig (on)eens’, shown separately for respondents low and high in verbal ability.]


Implications of the category labeling effect for cross-cultural research

• Response category labels can affect findings in a single-language context (e.g., estimation of population parameters, meta-analytic comparisons), but they are particularly important in cross-cultural research, where labels have to be translated;
• Two types of translation:
  – Literal
  – Idiomatic
• Some authors have emphasized the need to choose scale anchors that are equal in intensity (e.g., Harzing 2006), and prior research has demonstrated that supposedly similar terms may differ in intensity across languages (e.g., definitely vs. bestimmt; see Smith et al. 2009);
• Translated adverbial modifiers may also differ in fluency;


Schematic representation of the translation process

(based on Bassetti and Cook 2011)


Study 5: Method

• Approx. 200 English- or French-speaking respondents in each of five regions (nationality/language combinations) of North America and Europe;
• Six endpoint labels in each language;
• 16 heterogeneous items from Greenleaf (1992), rated on 5-point scales;
• Pairwise comparisons of the six labels plus “agree” or “d’accord” in terms of intensity and fluency;


Study 5: Method (cont’d)

Respondents by language and country:

Language | France | USA | Canada | UK | Total
French | 227 | 0 | 203 | 0 | 430
English | 0 | 185 | 196 | 187 | 568
Total | 227 | 185 | 399 | 187 | 998

Endpoint labels by questionnaire version:

Version | English | French
1 | Strongly agree | Fortement d'accord
2 | Completely agree | Complètement d'accord
3 | Extremely agree | Extrêmement d'accord
4 | Definitely agree | Définitivement d'accord
5 | Fully agree | Entièrement d'accord
6 | Very much agree | Tout à fait d'accord


Study 5: Results

Intensity and fluency ratings by region

Note: Correlation between the fluency ratings and the natural logarithm of the number of Google hits was at least .88.


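Study 5: Results

Multilevel model estimates

Dependent variable | Independent variable | B | SE | t | p
Individual extreme responding (level 1) | Fluency | -.015 | .029 | -.512 | .609
Individual extreme responding (level 1) | Intensity | .020 | .017 | 1.213 | .225
Group-level extreme responding (level 2) | Fluency | .164 | .065 | 2.532 | .011
Group-level extreme responding (level 2) | Intensity | -.130 | .132 | -.979 | .327
Group-level extreme responding (level 2) | Language = French | .054 | .087 | .621 | .535
Group-level extreme responding (level 2) | Country = USA | .114 | .100 | 1.139 | .255
Group-level extreme responding (level 2) | Country = France | .002 | .076 | .026 | .979
Group-level extreme responding (level 2) | Country = UK | -.004 | .096 | -.040 | .968
Group-level extreme responding (level 2) | Intercept term | .995 | .060 | 16.486 | .000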
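To read the multilevel estimates, it helps to see how a t value maps onto a two-sided p value. The sketch below uses a standard normal approximation. This is an assumption, since the slides do not state the reference distribution, although it reproduces the reported p values for the fluency effects to three decimals. The function name is illustrative.

```python
from math import erfc, sqrt


def two_sided_p(t):
    """Two-sided p value for a test statistic under a standard normal reference."""
    return erfc(abs(t) / sqrt(2))


# Reported t values for the two fluency effects in the multilevel model.
t_fluency_level1 = -0.512   # individual extreme responding
t_fluency_level2 = 2.532    # group-level extreme responding

p_level1 = two_sided_p(t_fluency_level1)
p_level2 = two_sided_p(t_fluency_level2)
print(round(p_level1, 3), round(p_level2, 3))
```

Under this approximation only the group-level fluency effect falls below the conventional .05 threshold, in line with the table.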


Study 6

• Demonstration that fluency is a viable determinant of extreme responding differences between regions in an international survey;
• Illustration of how to construct and use relative measures of fluency and extreme responding based on secondary data only;


Study 6: Method

• 13,520 respondents from 17 European regions;
• 16 heterogeneous items based on Greenleaf (1992);
• Use of fully labeled 7-point response scales;
• Fluency: relative measure of fluency as the natural logarithm of the ratio of the number of Google hits for the 7th category (strongly agree) to the number of Google hits for the 6th category (agree);
• Endorsement: relative endorsement of the 7th vs. the 6th response category (natural logarithm).
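Both relative measures in Study 6 are log ratios, so they can be sketched with one helper. All numbers below are hypothetical placeholders for a single region. The slides do not list the actual hit counts or category frequencies, and the function name is illustrative.

```python
from math import log


def log_ratio(numerator, denominator):
    """Natural logarithm of the ratio of two positive quantities."""
    return log(numerator / denominator)


# Hypothetical Google hit counts for one region's 7th and 6th category labels.
hits_category7 = 2_000_000    # e.g., the local equivalent of 'strongly agree'
hits_category6 = 40_000_000   # e.g., the local equivalent of 'agree'
relative_fluency = log_ratio(hits_category7, hits_category6)

# Hypothetical numbers of responses falling in the 7th and 6th categories.
count_category7 = 1_500
count_category6 = 4_500
relative_endorsement = log_ratio(count_category7, count_category6)

print(round(relative_fluency, 2), round(relative_endorsement, 2))
```

A value of zero means the two categories are equally fluent (or equally endorsed), and negative values mean the 7th category trails the 6th. Regressing relative endorsement on relative fluency across regions then gives the kind of slope reported in the results.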


Sample descriptive statistics Pan-European study (Study 7 and 8)

Region, language | N | % female | M age | SD age
Belgium, Dutch | 644 | 51% | 41.0 | 11.1
Belgium, French | 371 | 51% | 40.5 | 11.7
UK, English | 908 | 56% | 41.8 | 11.3
Germany, German | 993 | 50% | 39.3 | 11.0
Hungary, Hungarian | 1003 | 51% | 38.3 | 11.8
Slovakia, Slovakian | 1063 | 50% | 38.2 | 12.1
Poland, Polish | 802 | 37% | 32.2 | 11.0
Netherlands, Dutch | 1046 | 50% | 40.8 | 11.4
France, French | 1000 | 51% | 39.4 | 11.9
Spain, Spanish | 934 | 50% | 37.8 | 10.5
Romania, Romanian | 970 | 50% | 37.9 | 11.5
Turkey, Turkish | 914 | 43% | 32.5 | 9.4
Italy, Italian | 939 | 50% | 39.0 | 10.6
Switzerland, French | 303 | 51% | 42.5 | 9.7
Switzerland, German | 606 | 48% | 43.5 | 9.4
Switzerland, Italian | 50 | 56% | 32.9 | 8.7
Sweden, Swedish | 974 | 49% | 39.9 | 11.3
Total | 13520 | 49% | 38.7 | 11.4


Study 6: Results

Note: Standardized regression slope of .67 (p<.01, R2=.45)


Discussion: Summary of findings

• Response category labels that are more commonly used (i.e., that are more fluent) lead to higher endorsement of their associated response categories;
• Respondents do not simply scale response categories along an intensity dimension and then map their latent response to the best-matching category, but they are also influenced by the fluency of the labels;
• The effect of fluency is more pronounced for respondents who are lower in literacy and verbal ability;
• The problem may be particularly serious in cross-cultural research when different languages are used;


Implications for multilingual survey research

□ Translations usually imply a trade-off between the attempt to be literal and the attempt to be idiomatic;
□ Optimize equivalence: use response category labels that are equally fluent in different languages (rather than literal translations or words with equal intensity);
  e.g., ‘Strongly agree’ is most commonly used in scales, but may not have valid equivalents in some other languages. ‘Completely agree’ seems to be a viable alternative.

Label | Fluency | ERS%
Completely agree | 1.24 | 18.8%
Tout à fait d’accord | 1.22 | 19.2%