Page 1
7/25/2019 Word Frequency for British English
http://slidepdf.com/reader/full/word-frequency-for-british-english 1/36
1
SUBTLEX-UK:
A new and improved word frequency database for British English
Walter J. B. van Heuven1, Pawel Mandera2, Emmanuel Keuleers2, Marc Brysbaert2,3
1 University of Nottingham, UK
2 Ghent University, Belgium
3 Swansea University
Keywords: Word frequency, visual word recognition, Zipf scale
Running head: SUBTLEX-UK
Address: Dr. Walter van Heuven
School of Psychology
University of Nottingham
University Park
Nottingham, NG7 2RD
Phone: +44 115 8466363
Fax: +44 115 9515324
Email: walter.vanheuven@nottingham.ac.uk
Abstract
We present word frequencies based on subtitles of British television programs. We show
that the SUBTLEX-UK word frequencies explain more of the variance in the lexical decision
times of the British Lexicon Project than the word frequencies based on the British National
Corpus and the SUBTLEX-US frequencies. In addition to the word form frequencies, we also
present measures of contextual diversity, part-of-speech specific word frequencies, word
frequencies in children's programs, and word bigram frequencies, giving researchers of British
English access to the full range of norms recently made available for other languages. Finally,
we introduce a new measure of word frequency, the Zipf scale, which we hope will stop the
current misunderstandings of the word frequency effect.
SUBTLEX-UK:
A new and improved word frequency database for British English
Word frequency arguably is the most important variable in word recognition research
(Brysbaert, Buchmeier, Conrad, Jacobs, Bölte, & Böhl, 2011a). Words that are often
encountered are processed faster than words that are rarely encountered. Figure 1 shows
the course of the word frequency effect. It includes mean standardised reaction times
(z-values) for samples of 1000 words going from an average frequency of .06 per million words
(a log10 value of -1.2) to an average frequency of nearly 1,000 per million words (a log10
value of nearly 3.0). The reaction times come from the English Lexicon Project (ELP; red
circles; Balota, Yap, Cortese, Hutchison, Kessler, Loftis, Neely, Nelson, Simpson, & Treiman,
2007) and the British Lexicon Project (BLP; blue circles; Keuleers, Lacey, Rastle, & Brysbaert,
2012), which contain lexical decision times to over 40 thousand words of American English
(ELP) or over 28 thousand monosyllabic and disyllabic words of British English (BLP). The
word frequencies come from the British National Corpus (BNC; available at
http://www.kilgarriff.co.uk/bnc-readme.html; checked on May 13, 2013), a 100 million word
collection of samples of mostly written and some spoken language from a wide range of
sources, collected between 1991 and 1994 and designed to represent a wide cross-section of
British English at that time. Another database of word frequency norms often used for
British English is the CELEX lexical database (Baayen, Piepenbrock, & Gulikers, 1995), based
on a corpus of 17.9 million words assembled along the same criteria as the BNC.
- - - - - - - - - - - - - - - - -
Insert Figure 1 about here
- - - - - - - - - - - - - - - - -
Research in American English and other languages has suggested that word frequencies
based on film and television subtitles are better predictors of word processing times than
word frequencies based on books and other written sources (Brysbaert et al., 2011a;
Brysbaert, Keuleers, & New, 2011b; Brysbaert & New, 2009; Cai & Brysbaert, 2010; Cuetos,
Glez-Nosti, Barbón, & Brysbaert, 2011; Dimitropoulou, Duñabeitia, Avilés, Corral, &
Carreiras, 2010; Ferrand, New, Brysbaert, Keuleers, Bonin, Méot, Augustinova, & Pallier,
2010; Keuleers, Brysbaert, & New, 2010; New, Ferrand, Veronis, & Pallier, 2007). This is an
important finding, because the more variance can be explained by word frequency the fewer
other variables are needed to account for word processing times. Brysbaert and Cortese
(2011), for example, found that word familiarity did not explain much extra variance in
lexical decision times to monosyllabic English words when the SUBTLEX-US subtitle
frequency measure was used (Brysbaert & New, 2009) instead of a commonly used,
outdated frequency measure based on a small corpus of written sources (Kučera & Francis,
1967).
Although word frequency estimates based on American subtitles can be used (and have
been used) in British word recognition research, some precision is lost, because some words
have a different spelling (e.g., labor vs. labour) or a different meaning (e.g., biscuits, pants)
in the two languages. The divergences between American and British word usage imply that
British researchers should limit their research to the words fully shared among the languages
if they use American subtitle frequencies. Else, their findings risk overestimating the impact
of non-frequency variables, such as age-of-acquisition, word familiarity, word length, or
similarity to other words. Suboptimal frequency estimates also increase the risk of stimulus
selection errors. This will be the case when words must be selected on the basis of frequency
information (e.g., words having different numbers of closely resembling words, so-called
orthographic neighbours, with higher frequencies) or when words of different conditions
must be matched on frequency (e.g., highly emotional words vs. neutral words).
To address the limitations that researchers working with British English are confronted with,
we decided to collect subtitle-based UK word frequency norms. In addition, because we
were able to directly capture the subtitles from a variety of television programs, for the first
time we also collected subtitle frequencies from channels specifically aimed at children.
Below we describe the collection of the data, the summary statistics calculated, and the first
validation studies we ran.
Method
Corpus collection. In line with UK regulations, since 2008 the British Broadcasting
Corporation (BBC) subtitles all scheduled programs on its main channels, to help the hearing
impaired.1 These subtitles are not broadcasted through the main channel, but can be
superimposed on the program by those who wish so (e.g., by using Teletext). To have the
widest possible range of language input, we collected the words and word pairs of the
subtitles from nine channels (BBC1-4, BBC News, BBC Parliament, BBC HD, CBeebies, and
CBBC) broadcasted over a period of three years (January 2010 - December 2012). Of these
channels, BBC1 is the most popular and extensive (aimed at all types of audiences). The
other channels have more limited hours. Of further interest is that the CBeebies channel is
meant for preschool children (0-6 years) and the CBBC channel for primary school children
(6-12 years). This allowed us to compile frequency norms for these groups.
1 On the basis of anecdotal evidence we can add that these subtitles are also appreciated by viewers with
English as second language.
Notwithstanding the provisions relating to 'fair dealing' provided under section 29 Copyright,
Designs & Patents Act 1988, the full textual content of the relevant subtitles was not stored
or reproduced for the purpose of this research. A count of individual words and consecutive
words was undertaken, obtainable from public transmissions. The method employed does
not detract from or otherwise undermine the value of this evaluative work.
Text cleaning. The broadcasts were cleaned semi-automatically for doubles (program
repeats) and subtitle-related information not broadcasted to the viewers. Also the parts of
the subtitles not related to the conversation were eliminated (e.g., the words "silence" or
"thunder" to describe the ongoing scene; these are usually presented in uppercase, a
different font or colour in the subtitle). After the cleaning we obtained a total of 201.7
million words, coming from 45,099 different broadcasts. This is larger than the other existing
subtitle corpora (Brysbaert & New, 2009; Cai & Brysbaert, 2010; Cuetos et al., 2011;
Dimitropoulou et al., 2010; Keuleers et al., 2010)2, and allowed us to calculate more precise
Parts-of-Speech dependent frequencies and word bigrams.
2 Brysbaert and New (2009) reported that the word type frequencies themselves show little difference once the
corpus contains 30 million words, a finding that was replicated in the present analyses.
Word frequency measures
Word frequency counts. A first decision to be made was what to do with hyphenated words.
In British English words are often hyphenated when they function as adjectives. So, a potion
that saves lives can be described as "a life-saving potion". This phrase could be counted as
consisting of three word types (a, life-saving, potion) or four word types (a, life, saving,
potion). The problem was particularly relevant for the BBC subtitles, because nearly one out
of four word types contained a hyphen in the first analysis of the data. The vast majority of
these hyphenated entries were of low frequency (less than 100 observations on a total of
200 million words). Because there are no a priori considerations about how to handle this
finding (also because there is quite some individual variability in the use of hyphens;
Kuperman & Bertram, 2013), we decided to use a pragmatic criterion and looked at which
word frequencies correlated most with the 28 thousand lexical decision times of the BLP
(Keuleers et al., 2012). As this clearly favoured the dehyphenated word frequencies (a
difference in variance explained of 5%), we decided to dehyphenate the data before
counting the words.3
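The two counting options for hyphenated words can be made concrete with a small sketch. This is not the authors' actual pipeline: the tokenizer pattern and the function name `count_word_types` are assumptions chosen only to illustrate the difference between counting "life-saving" as one type or as two.

```python
import re
from collections import Counter


def count_word_types(text, dehyphenate=True):
    """Count word types in a text, optionally splitting hyphenated words."""
    text = text.lower()
    if dehyphenate:
        # Treat hyphens as word boundaries: "life-saving" -> "life", "saving".
        text = text.replace("-", " ")
    # Hypothetical tokenizer: runs of letters, apostrophes and hyphens.
    tokens = re.findall(r"[a-z'-]+", text)
    return Counter(tokens)


phrase = "a life-saving potion"
hyphenated = count_word_types(phrase, dehyphenate=False)   # three word types
dehyphenated = count_word_types(phrase, dehyphenate=True)  # four word types
```

With real subtitle data, either set of counts could then be correlated with lexical decision times to see which counting rule predicts behaviour better.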
The dehyphenated subtitles resulted in a total of 332,987 different word types for a total of
201,712,237 tokens. Of these, 31,368 types were in the CBeebies subtitles with a total of
5,860,275 tokens, and 70,755 types were in the CBBC subtitles with a total of 13,644,165
tokens. Because the vast majority of words observed in a single broadcast were typos and
other nonword-like structures (like "aaaarrrrgh" or ""), we decided to take out all
entries observed in a single broadcast only. This reduced the number of types to 159,235
with a total token count of 201,335,638 for the complete corpus, 5,848,083 for the CBeebies
subcorpus (27,236 types), and 13,612,278 for the CBBC subcorpus (58,691 types).
3 Dehyphenation also occurs in automatic text parsers, such as CLAWS and the Stanford parser (to be described
later). Because the Stanford parser dehyphenates more words than CLAWS, the outcome of this parser
outperformed that of CLAWS on the raw corpus, but no longer on the dehyphenated corpus.
A standardised frequency measure: The Zipf scale. Although the frequency counts are the
most versatile measure (as will become clear later, when we calculate all types of derived
measures), they have one big disadvantage. The interpretation of the frequency measure
depends on the size of the corpus. Therefore, authors have looked for a standardised
frequency measure, an index with the same interpretation across all corpora collected.
Thus far, the most popular standardised frequency measure has been frequency per million
words (fpmw). It is the frequency measure we made available in our previous work on
subtitle frequencies as well. However, we increasingly noticed that this measure leads to an
incorrect understanding of the word frequency effect.
Because their corpus contained only 1 million words, the lowest value in the word
frequencies made available by Kučera & Francis (1967) was 1 fpmw. This contributed to the
assumption that 1 fpmw is the lowest possible frequency. Obviously, this is no longer the
case for larger corpora. As it happens, about 80% of the word types in SUBTLEX-UK have a
frequency of less than 1 fpmw (i.e., less than 200 occurrences in all broadcasts). Second, as
shown in Figure 1, nearly half of the word frequency effect is situated below 1 fpmw and
there is very little difference above 10 fpmw. The frequency effect of lexical decision times
between .1 fpmw and 1 fpmw is equal to or larger than the effect between 1 fpmw and 10
fpmw. A logarithmic transformation of frequency measures, as is routinely performed,
alleviates this problem. However, the logarithms of fpmw become negative for frequencies
lower than 1 (as again shown in Figure 1), which uninformed users tend to avoid. Because of
these properties, fpmw as a standardized measure puts users on the wrong foot.
To make the word frequency effect easier to understand, one needs a scale with the
following properties:
1. It should be a logarithmic scale (e.g., like the decibel scale of sound loudness).
2. It should have relatively few points, without negative values (e.g., like a typical
Likert rating scale, from 1 to 7).
3. The middle of the scale should separate the low-frequency words from the high-frequency
words.
4. The scale should have a straightforward unit.
Once we know what the scale should look like, it is not so difficult to come up with a good
transformation. In particular, when we take the log10 of the frequency per billion words
(rather than fpmw), the scale fulfils the first three requirements. To meet the last
requirement, we propose to call the new scale the Zipf scale, after the American linguist
George Kingsley Zipf (1902-1950) who first thoroughly analysed the regularities of word
frequency distribution and formulated a law (Zipf, 1949) which was later named after him.
The unit then becomes the Zipf.
The Zipf scale is a logarithmic scale, like the decibel scale of sound intensity, and roughly
goes from 1 (very low frequency words) to 6 (very high frequency content words) or 7 (a few
function words, pronouns, and verb forms like "have"). The calculation of Zipf values is easy
as it equals log10(frequency per billion words) or log10(frequency per million words) + 3. So,
a Zipf value of 1 corresponds to words with frequencies of 1 per 100 million words, a Zipf
value of 2 corresponds to words with frequencies of 1 per 10 million words, a Zipf value of 3
corresponds to words with frequencies of 1 per million words, and so on.
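The conversion between fpmw and Zipf values described above amounts to one line of arithmetic. The following minimal sketch shows it in both directions; the function names `zipf_from_fpmw` and `fpmw_from_zipf` are illustrative and not part of the database.

```python
import math


def zipf_from_fpmw(fpmw):
    """Zipf value = log10(frequency per million words) + 3."""
    return math.log10(fpmw) + 3.0


def fpmw_from_zipf(zipf):
    """Inverse conversion: frequency per million words for a Zipf value."""
    return 10 ** (zipf - 3.0)


# Frequencies of 1 per 100 million, 1 per 10 million and 1 per million words
# correspond to Zipf values of 1, 2 and 3.
scale_points = [zipf_from_fpmw(f) for f in (0.01, 0.1, 1.0)]
```

This version has no correction for words with a frequency of 0, for which the logarithm is undefined.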
Table 1 summarises the information. It also helps to clear one more misunderstanding about
word frequencies among psycholinguists, namely that words with frequencies below 1 fpmw
are too uncommon to be known. There are hundreds of derived and inflected word forms
and even lemmas with frequencies of lower than .1 fpmw that are perfectly known, as can
be seen in Table 1. Content words rarely have a Zipf value higher than 6, so that for most
practical research purposes, the Zipf scale will be a scale from 1 to 6 with the tipping point
from low-frequency to high-frequency between 3 and 4.
- - - - - - - - - - - - - - - - -
Insert Table 1 about here
- - - - - - - - - - - - - - - - -
One more addition that is of interest for the Zipf scale is the possibility to include words with
frequency counts of 0 (i.e., words not observed in the corpus). Although these words are less
common in large corpora, they are by no means absent. Such words pose a problem for the
Zipf scale as a result of the logarithmic transformation (given that the logarithm of 0 is minus
infinity). In a recent review Diependaele and Brysbaert (2013) concluded that the best way
to deal with 0 word frequencies is the Laplace transformation. Rather than working with the
raw frequency counts, one works with the frequency counts + 1. This means that all
frequency values are (slightly) elevated. The proper application of the algorithm also implies
that the theoretical size of the corpus is a little larger than the actual size, because one is
leaving room for N unobserved word types with frequency 1. N is the number of word types
in the frequency list. So, for the full corpus the Laplace transformation assumes that there
are 159,235 unobserved word types extra in the language, all with a frequency of 1.
In practice, the following equation is needed to calculate the Zipf values on the basis of the
frequency counts of the total corpus:
\[ \mathrm{Zipf} = \log_{10}\left(\frac{\mathrm{frequency\_count} + 1}{201.336 + 0.159}\right) + 3.0 \]
The values in the denominator are the size of the corpus in millions plus the number of word
types in millions. Specifically, the Zipf value of an unobserved word type will be:
\[ \mathrm{Zipf} = \log_{10}\left(\frac{0 + 1}{201.336 + 0.159}\right) + 3.0 = 0.696 \]
The Zipf value of a word type observed once in the complete corpus will be .997; that of a
word observed 10 times will be 1.737, and so on.
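The Laplace-corrected calculation can be sketched in a few lines. The function name `zipf_laplace` and its parameter names are assumptions made for illustration; the corpus size and the number of word types (both in millions) are the values given in the text for the complete corpus.

```python
import math


def zipf_laplace(count, tokens_millions, types_millions):
    """Zipf value with Laplace transformation.

    count: raw frequency count of the word (0 for unobserved words)
    tokens_millions: corpus size in millions of tokens
    types_millions: number of word types in the frequency list, in millions
    """
    return math.log10((count + 1) / (tokens_millions + types_millions)) + 3.0


# Complete corpus: 201.336 million tokens, 0.159 million word types.
FULL = {"tokens_millions": 201.336, "types_millions": 0.159}

unobserved = round(zipf_laplace(0, **FULL), 3)      # 0.696
observed_once = round(zipf_laplace(1, **FULL), 3)   # 0.997
observed_ten = round(zipf_laplace(10, **FULL), 3)   # 1.737
```

The same function applies to a subcorpus once its own token and type counts are passed in.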
To calculate the Zipf values for the CBeebies corpus, we have to use the following equation:
\[ \mathrm{Zipf} = \log_{10}\left(\frac{\mathrm{frequency\_count}_{\mathrm{CBeebies}} + 1}{5.848 + 0.027}\right) + 3.0 \]
For the CBBC subcorpus the equation is
\[ \mathrm{Zipf} = \log_{10}\left(\frac{\mathrm{frequency\_count}_{\mathrm{CBBC}} + 1}{13.612 + 0.059}\right) + 3.0 \]
Specifically, this means that words with a 0 frequency in the CBeebies corpus get a Zipf value
of 2.231; those with a 0 frequency in the CBBC corpus get a Zipf value of 1.864. The higher
values for unobserved word types are due to the smaller sizes of the corpora and also mean
that one should be sensible in their use. There is no point in blindly using these values for all
missing words in the lists, as one assumes that the missing words are known to preschoolers
(CBeebies) or primary school children (CBBC). As we will see below, this may be one reason
why the childhood frequencies are not correlating very well with the lexical decision times of
the British Lexicon Project when calculated across all words.
To give readers a better feeling for the Zipf scale, Table 2 tabulates the summary statistics of
the Zipf values used in two classic word frequency studies in British English (Monsell, Doyle,
& Haggard, 1989; Morrison & Ellis, 1995). Two interesting observations can be made. First,
the standard deviations of the Zipf values are similar for the high and the low frequency
words (as they should be), whereas for fpmw the standard deviations are considerably larger
in the conditions with high frequency words than in the conditions with low frequency
words. Second, we see that in both studies the low frequency words had Zipf values above 3,
because the researchers derived their frequency estimates from the Kučera and Francis list
and considered 1 fpmw as the lower end of the frequency range. With the availability of
more refined word frequency measures, we hope that in the future more use will be made
of words with Zipf values below 3. As Figure 1 indicates, this is a sensible thing to do, as in
this range the word frequency effect is at its strongest. Furthermore, about 80% of the word
types in SUBTLEX-US have Zipf values below 3 (i.e., below 1 fpmw). So, there is much more
choice at the low end of the distribution than at the high end. In our current estimate,
low-frequency words ideally have a mean Zipf value at (or below) 2.5 and high-frequency words
have a mean Zipf value of 4.5.
- - - - - - - - - - - - - - - - -
Insert Table 2 about here
- - - - - - - - - - - - - - - - -
Contextual diversity. Adelman, Brown, and Quesada (2006; see also Adelman & Brown,
2008; Perea, Soares, & Comesaña, 2013; Yap, Tan, Pexman, & Hargreaves, 2011) argued that
not so much the frequency of occurrence of a word matters, but the number of contexts in
which the word appears. Words only encountered in a small number of contexts (say, a word
with a frequency of 100 occurring in one or two television episodes) will be more difficult to
process than equally frequent words encountered in a variety of contexts (e.g., a word with
a frequency count of 100 used in 80 different broadcasts). A good proxy for contextual
diversity (CD) is the number of television programs/films (or the percentage of
programs/films) in which the word appears. Brysbaert and New (2009) indeed observed that
log(CD) explained up to 4% of variance more in lexical decision times than log(frequency).
Part of the advantage was methodological, however. Two factors were involved. First, the
effect of log(CD) on RTs is more linear than the effect of log(frequency), which becomes flat
for high frequency words, as can be seen in Figure 1. When non-linear regression analysis
was used, the difference between CD and frequency became smaller than 2%. Another part
of the difference was due to the fact that some words occurred with very high frequency in a
few films because they were the names of main characters (e.g., archer, bay, brown). The CD
statistic is less influenced by these instances than the frequency statistic.
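The difference between a frequency count and a contextual diversity count can be sketched as follows. The data layout (one token list per broadcast) and the name `contextual_diversity` are assumptions for illustration; the toy broadcasts are made up.

```python
from collections import Counter


def contextual_diversity(broadcasts):
    """Return (frequency, cd_count, cd_percent) for an iterable of token lists.

    frequency: total number of occurrences of each word
    cd_count: number of broadcasts in which each word occurs at least once
    cd_percent: cd_count expressed as a percentage of all broadcasts
    """
    frequency = Counter()
    cd_count = Counter()
    n_broadcasts = 0
    for tokens in broadcasts:
        n_broadcasts += 1
        frequency.update(tokens)
        # A word counts once per broadcast, however often it is repeated.
        cd_count.update(set(tokens))
    cd_percent = {w: 100.0 * c / n_broadcasts for w, c in cd_count.items()}
    return frequency, cd_count, cd_percent


# Made-up example: "archer" is frequent only because it is a character name.
toy = [["archer", "archer", "archer", "the"], ["the", "bay"], ["the"]]
frequency, cd_count, cd_percent = contextual_diversity(toy)
```

In this toy example "archer" and "the" have the same frequency (3), but "archer" occurs in one broadcast and "the" in all three.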
Still, the CD measure seems to have added value. Therefore, we provide this information for
the different corpora we used (full corpus, CBeebies, CBBC). The values are available both as
the total number of television programs in which the word occurred, and the percentage of
television programs in which the word was encountered. As indicated above, the total
number of broadcasts in the complete corpus was 45,099. The number of broadcasts in
CBeebies was 4,847; in CBBC it was 4,848.4
4 The reason why these numbers are very similar is that both channels have a similar rotation of programs with
repeats after a rather short period of time.
Part-of-Speech (PoS) dependent frequencies. For many purposes it is good to know what
roles words play in sentences and the relative frequencies of these roles (Brysbaert, New, &
Keuleers, 2012). This enables researchers interested in nouns, for instance, to limit their
stimulus materials to words that are always (or mostly) used as nouns. It also allows
researchers to know whether an inflected word is used more often as an adjective (e.g.,
appalling) or as a verb (e.g., played). This is important information to decide which words to
include in rating studies (e.g., Kuperman, Stadthagen-Gonzalez, & Brysbaert, 2012).
PoS frequencies can only be obtained after the corpus has been parsed (i.e., the sentences
broken down into their constituent parts) and tagged (i.e., the words given their correct
part-of-speech in the sentence). For a long time this was virtually impossible given the
amount of work involved. However, the development of automatic PoS taggers has made it
possible to get a reasonably good (though not perfect) outcome in reasonable time and at an
affordable price. For a long time, the CLAWS tagger developed at the University of Lancaster
was the golden standard (available at http://ucrel.lancs.ac.uk/claws/, checked on May 1,
2013). It was used for the BNC corpus and we also used it for our SUBTLEX-US corpus
(Brysbaert et al., 2012). However, in recent years the Stanford tagger (initial version:
Toutanova, Klein, Manning, & Singer, 2003; latest update available at
http://www-nlp.stanford.edu/software/lex-parser.shtml, checked on May 1, 2013) has become a
worthy competitor. As it happens, the outcome of the first analyses with the Stanford tagger
correlated more with the BLP word processing times than the outcome of the CLAWS tagger.
As indicated in footnote 3, this was due to the fact that the Stanford tagger is more
consistent in dehyphenating words than CLAWS. When the subtitles were cleared of
hyphens before running the taggers, both gave comparable output.
Another advantage of the Stanford software5 is that it gives the most likely lemma
associated with an inflected form. The lemmatisation is based on an algorithm developed by
Minnen, Carroll, and Pearce (2001). It works on two main principles. First, it looks up
whether a word form is present in the dictionary. If so, then the associated lemma can be
read out. If a word is lacking, the most likely lemma is allocated on the basis of rules and
pattern comparisons (e.g., the most likely lemma of the stimulus "martialisations", identified
as a noun, is "martialisation"; and the most likely lemma of the stimulus "Martialis",
identified as a name, is "Martialis"). As discussed at greater length in Brysbaert et al. (2012),
the outcome of these algorithms is not 100% correct6 and, hence, should always be checked
by the user, certainly for low frequency words. However, they are a big step forward (with
accuracy estimates of 97% and higher) and, therefore, are provided in our database. More
precisely, we give information about the most frequent PoS associated with each word type,
the frequency of this PoS and the lemma associated with it, next to all the parts-of-speech
associated with the word type and their respective frequencies. Because of the
lemmatisation and because the output was as good as that of CLAWS, the data presented in
the SUBTLEX-UK database are based on the Stanford parser and tagger. Figure 2 gives an
example of the output. All frequencies are given as raw frequency counts based on the
entire corpus, because this value is the most informative to calculate derived statistics from
(e.g., the percentage use as the dominant PoS).
5 A disadvantage of the Stanford tagger is that in its default mode it Americanizes the spellings of the words. So,
one must be careful to change this when one is working with British spellings.
6 A notorious example is "horsefly", which both CLAWS and Stanford parse as an adverb (arguably because the
word is not in the program's lexicon, so that too much reliance is put on the end letters -ly). Ironically, Stanford
does correctly classify "horseflies" as a noun associated with the lemma "horsefly" (presumably because the
end letters -lies are more likely to be associated with plural nouns than with other parts-of-speech).
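A derived statistic such as the percentage use as the dominant PoS follows directly from the raw counts. The sketch below assumes the PoS counts of one word type are available as a dictionary; the function name `dominant_pos`, the tag labels and the counts are made up for illustration and do not reflect the database's actual column layout.

```python
def dominant_pos(pos_counts):
    """Return (dominant PoS, its count, its percentage of all uses)."""
    total = sum(pos_counts.values())
    pos, count = max(pos_counts.items(), key=lambda item: item[1])
    return pos, count, 100.0 * count / total


# Hypothetical raw counts for one word type.
example_counts = {"Verb": 900, "Adjective": 100}
pos, count, percent = dominant_pos(example_counts)
```

A researcher interested in verbs could then keep only the word types whose dominant PoS is a verb and whose percentage exceeds a chosen cut-off.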
- - - - - - - - - - - - - - - - -
Insert Figure 2 about here
- - - - - - - - - - - - - - - - -
Bigram frequencies. Because extra information can be obtained from word combinations
(Arnon & Snider, 2010; Baayen, Milin, Filipovic Durdevic, Hendrix, & Marelli, 2011;
Siyanova-Chanturia, Conklin, & van Heuven, 2011), we also collected word bigram frequencies in the
entire corpus (i.e., the frequency with which word pairs were observed). This resulted in
over 1.5 million lines of consecutive word pairs observed in the corpus. For each pair we give
information about the number of times it was observed, the symbols written between the
words (space, punctuation mark, hyphen, ...) and their respective frequencies. This makes it
possible for everyone to calculate interesting additional metrics. For instance, it allowed us
to add the 8 hyphenated words with a frequency count of more than 100 (fpmw = .5) to
the database.7 It also allowed us to warn researchers when a compound word is more likely
to be written as two separate words than as a single word (for instance, the word "makeup"
is observed 308 times in the subtitles (Zipf = 3.18), but the spellings "make-up" and "make
up" have a combined frequency of 8,998, making "makeup" a bad choice for a low frequency
word).
7 These frequencies were not subtracted from the frequencies of the individual words, under the assumption
that the component words of a hyphenated word get co-activated upon seeing the hyphenated word.
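The bigram table described above, with the symbols found between the two words of each pair, can be approximated with a short sketch. The tokenizer pattern, the function name `bigram_separators` and the example sentence are assumptions for illustration only.

```python
import re
from collections import Counter, defaultdict

WORD = re.compile(r"[a-z']+")


def bigram_separators(text):
    """Map each consecutive word pair to a Counter of the separators seen."""
    # Split the text into alternating runs of word and non-word characters.
    parts = re.findall(r"[a-z']+|[^a-z']+", text.lower())
    table = defaultdict(Counter)
    for i in range(len(parts) - 2):
        left, sep, right = parts[i], parts[i + 1], parts[i + 2]
        if WORD.fullmatch(left) and WORD.fullmatch(right):
            table[(left, right)][sep] += 1
    return table


# Made-up example text.
table = bigram_separators("Her make-up is fine. I make up stories; make-up artist.")
make_up = table[("make", "up")]        # hyphen seen twice, space seen once
combined = sum(make_up.values())       # combined frequency of both spellings
```

Comparing such a combined count with the frequency of the closed spelling ("makeup") gives the kind of warning mentioned in the text.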
Correlations with lexical decision measures
Given the ease with which word frequencies can be collected nowadays, it is important to
check whether a new frequency measure adds something extra to the existing ones. On the
basis of previous research, we can expect this to be the case given the superiority of
subtitle-based frequency estimates, but still it is good to test this explicitly, also to make sure no
calculation errors have been made. The most interesting dataset is the BLP (Keuleers et al.,
2012), which provides lexical decision reaction times and accuracy measures of British
students for over 28 thousand monosyllabic and disyllabic words. The main competitors to
the SUBTLEX-UK word frequencies are the BNC frequencies, the CELEX frequencies, and the
SUBTLEX-US frequencies. Words not observed in a corpus were assigned a frequency of 0
and log frequencies were the Zipf values (with Laplace transformation). The Laplace
transformation was also used for the CD measure.
Table 3 shows the results for the accuracy data. As expected the SUBTLEX-UK frequencies
outperform the other measures, more so for the CD measure than for the Zipf measure.
Because of the large number of observations, the differences are all highly significant. For
instance, the t-value of the Hotelling-Williams test (Steiger, 1980)8 of the difference in
correlation with SUBTLEX-UK (Zipf) and BNC (Zipf) equals 16.8 (df = 28,282, p < .001). In
terms of percentage variance explained, the difference is nearly 3%, which is high given that
many variables explain less than 1% of variance, once the effects of word frequency, word
length and similarity to other words are partialled out (Brysbaert & Cortese, 2011; Brysbaert
et al., 2011a; Kuperman et al., 2012).
8 An easy introduction to the test and an Excel file to calculate the exact values is available on the website
http://crr.ugent.be/archives/546.
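For readers who want to run this kind of comparison themselves, the following sketch implements Williams' t statistic for two dependent correlations that share one variable, in the form given by Steiger (1980). It is offered as an illustration under that assumption, not as the authors' own script; the function name `williams_t` and the example correlations are made up.

```python
import math


def williams_t(r12, r13, r23, n):
    """Williams' t for the difference between r12 and r13 (df = n - 3).

    r12: correlation between the criterion (e.g., RT) and predictor A
    r13: correlation between the criterion and predictor B
    r23: correlation between predictors A and B
    n: number of observations (e.g., words)
    """
    # Determinant of the 3 x 3 correlation matrix.
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_mean = (r12 + r13) / 2
    numerator = (n - 1) * (1 + r23)
    denominator = 2 * ((n - 1) / (n - 3)) * det + r_mean**2 * (1 - r23) ** 3
    t = (r12 - r13) * math.sqrt(numerator / denominator)
    return t, n - 3


# Made-up correlations, for illustration only.
t_value, df = williams_t(r12=0.60, r13=0.50, r23=0.80, n=1000)
```

The degrees of freedom equal the number of words minus 3, so with large word lists even small differences between correlations reach significance.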
- - - - - - - - - - - - - - - - -
Insert Table 3 about here
- - - - - - - - - - - - - - - - -
Interestingly, the correlations with the childhood frequencies are much lower, in particular
the correlation with the CBeebies frequencies (preschool children). Two reasons for this are
the smaller sizes of the corpora (including the many missing words not known to children
but given rather high Zipf estimates) and the fact that the overall SUBTLEX-UK frequencies
include the subtitles from CBeebies and CBBC television programs (almost 10% of the total
SUBTLEX-UK).
Table 4 shows the correlations for the reaction times (RTs) to the words. Because RTs are
only interesting when the words are known, we set percentage accuracy to 66% (N =
20,55[?]). Very much the same picture appears, with superior performance for the
SUBTLEX-UK measures (CD slightly more so than Zipf).
- - - - - - - - - - - - - - - - -
Insert Table 4 about here
- - - - - - - - - - - - - - - - -
mae sure t$at t$e $i#$er crrelatins between &UB-E/UK and t$e B-P measures t$an
between &UB-E/U& and B-P were due t lan#ua#e cn#ruency and nt t t$e better
Page 19
7/25/2019 Word Frequency for British English
http://slidepdf.com/reader/full/word-frequency-for-british-english 19/36
1;
(uality ! &UB-E/UK verall, we ran similar analyses ! t$e E-P data, w$ic$ were cllected
n 0merican students. 0s can be seen in able 6, t$e di!!erence between &UB-E/UK and
&UB-E/U& indeed $as t d wit$ di!!erences in wrd use between t$e tw lan#ua#es
rat$er t$an wit$ t$e in$erent (ualities ! t$e !re(uency lists. W$ereas t$e &UB-E/UK
!re(uencies are better !r t$e Britis$ B-P data @see ables 3 and 5D, t$e &UB-E/U& data are
better !r t$e 0merican E-P data @able 6D.
/ / / / / / / / / / / / / / / / /
?nsert able 6 abut $ere
/ / / / / / / / / / / / / / / / /
>rrelatins wit$ t$e >$ildrens Printed Wrd atabase @>PWD
$e best e:istin# Britis$ database ! wrd !re(uencies !r c$ildren is t$e >$ildrens Printed
Wrd atabase @>PWF available at $tt*'www.esse:.ac.u*syc$l#yc*wdF c$eced n
May 21, 2C13D. ?t includes t$e !re(uencies wit$ w$ic$ 12,1;3 di!!erent wrd ty*es are
bserved in 1C11 bs @;;6,;2 tensD !r 6/; year ld c$ildren in t$e UK @Mastersn,
&tuart, i:n, -ve=y, 2C1CD. We culd dwnlad data !r ;86; wrd ty*es !rm t$e
database, ;126 ! w$ic$ were als in t$e &UB-E/UK list @t$e nes nt in t$e list were
mainly #enitive !rms, $y*$enated !rms, and numbersD. able 8 #ives t$e crrelatins
between l# >PW !re(uencies and varius &UB-E/UK !re(uencies !r t$e ;126 s$ared
wrd ty*es. 0s can be seen, t$e crrelatins are reasnably $i#$, in *articular wit$ t$e
>Beebies wrd !re(uencies. $e Htellin#/Williams test indicated si#ni!icant di!!erences
between t$e >Beebies !re(uencies and t$e t$er !re(uencies @e.#., di!!erence between
>Beebies and >BB>, t@;122D 16.8, * V .CC1D. $is cn!irms t$at t$e &UB-E/UK c$ildren
Page 20
7/25/2019 Word Frequency for British English
http://slidepdf.com/reader/full/word-frequency-for-british-english 20/36
2C
!re(uencies are an interestin# additin t t$e >PW !re(uencies and can be used t study
!re(uency tra=ectries !rm c$ild$d t adult$d; @-t Bnin, 2C13D.
- - - - - - - - - - - - - - - - -
Insert Table 6 about here
- - - - - - - - - - - - - - - - -
⁹ SUBTLEX-UK frequencies not including childhood frequencies can easily be obtained by subtracting the
CBeebies and CBBC frequency counts from the total frequency counts.
Discussion
In this paper we presented a new database of word frequencies for British English, based on
television subtitles. On the basis of our previous research, we expected that these
frequencies would better predict word processing performance than word frequencies
based on written sources (in particular, the British National Corpus). This indeed turned out
to be the case when we tried to predict the lexical decision times and accuracies of the
British Lexicon Project (Tables 3 and 4). The British subtitle frequencies also predicted the
BLP data better than the American subtitle frequencies, but they were inferior in
accounting for the ELP data, in line with the observation that word usage is not completely the
same in British and American English. The extra variance accounted for amounted to 3-5%,
which is considerable given that many variables explain less than 1% of the variance once
the effects of word frequency, length, and similarity to other words are partialed out
(Brysbaert & Cortese, 2011; Brysbaert et al., 2011a; Kuperman et al., 2012).
While analysing the findings, we were once again struck by how misleading the standardised
word frequency measure fpmw (frequency per million words) is for understanding the word
frequency effect. Therefore, we proposed an alternative, the Zipf scale, which is better
suited to the use of word frequencies in psychological research. This scale goes from slightly
less than 1 to slightly more than 7 and can easily be interpreted as follows: values of 3 and
less are low-frequency words, values of 4 or more are high-frequency words. Words not in
SUBTLEX-UK get a Zipf value of .696 when the frequencies are based on the complete
corpus, 1.864 when the CBBC frequencies are used, and 2.231 when the CBeebies
frequencies are used. The differences in minimal values are caused by the differences in
corpus size and agree with the fact that missing words of interest in CBeebies or CBBC are
likely to be more familiar than words not found in the entire corpus.
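The arithmetic behind the Zipf scale can be sketched in a few lines. The conversion log10(fpmw) + 3 is consistent with Table 1 (fpmw of 1 corresponds to Zipf 3, fpmw of 10,000 to Zipf 7). The add-one (Laplace) smoothing shown for raw counts is an assumption about the exact form, and the corpus sizes in the loop are illustrative round numbers, not the actual SUBTLEX-UK figures. All function names are hypothetical.

```python
import math


def zipf_from_fpmw(fpmw):
    """Zipf value for a frequency per million words: log10(fpmw) + 3."""
    return math.log10(fpmw) + 3


def zipf_from_count(count, corpus_tokens, corpus_types):
    """Zipf value from a raw count with add-one (Laplace) smoothing.

    Assumption: the smoothed frequency per million words is
    (count + 1) / ((corpus_tokens + corpus_types) / 1e6).
    """
    smoothed_fpmw = (count + 1) / ((corpus_tokens + corpus_types) / 1_000_000)
    return zipf_from_fpmw(smoothed_fpmw)


def frequency_band(zipf):
    """Rule of thumb from the text: 3 or less is low, 4 or more is high."""
    if zipf <= 3:
        return "low"
    if zipf >= 4:
        return "high"
    return "medium"


# Illustrative corpus sizes only: a word with count 0 gets a higher
# minimal Zipf value in the smaller corpus than in the larger one.
for name, tokens, types in [("large", 200_000_000, 300_000),
                            ("small", 6_000_000, 30_000)]:
    print(name, round(zipf_from_count(0, tokens, types), 3))
```

The loop illustrates why the minimal Zipf value differs between corpora: the smaller the corpus, the larger the smoothed frequency per million assigned to an unseen word.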
In addition to the word frequencies, the new database offers other information, which will
allow British researchers to do cutting-edge investigations. These are:
- Part-of-Speech related frequencies, which make it possible for researchers to better
control their stimulus materials,
- A measure of contextual diversity (CD), which is particularly interesting to predict
which words will be known and which not (compare Tables 3 and 4),
- Word frequencies in materials aimed at very young (preschool) and young (primary
school) children,
- Information about word bigrams.
Availability
The SUBTLEX-UK data are available in three easy to use files. The first one (SUBTLEX-UK_all)
is a 332,988 x 15 matrix containing information of all word types (including numbers)
encountered in the dehyphenated subtitles. The 15 columns give information about:
- The spelling of the word type (Spelling),
- The number of times the word has been counted in all subtitles (Freq),
- The number of times the word started with a capital (CapitFreq),
- The percentage of broadcasts containing the word type in all subtitles (CD),
- The number of broadcasts containing the word in all subtitles (CDCount),
- The most frequent part-of-speech of the word (DomPoS),
- The number of times this dominant PoS was observed (DomPosFreq),
- The lemma associated with the dominant PoS (DomLemmaPos),
- The number of times this lemma was observed in all subtitles (DomLemmaPosFreq),
- The summed frequencies of all the times this lemma was observed irrespective of the
PoS (DomLemmaPosTotalFreq),
- All parts-of-speech taken by the word type (AllPos),
- The respective frequencies of these PoS (AllPosFreq),
- And the associated lemma information (AllLemmaPos, AllLemmaPosFreq,
AllLemmaPosTotalFreq).
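The relation between the Freq, CDCount, and CD columns can be illustrated with a toy computation. The sketch below assumes that each broadcast is available as a list of lowercased tokens; the function name and the four toy broadcasts are hypothetical, and the actual processing pipeline (tokenisation, dehyphenation, PoS tagging) is not reproduced.

```python
from collections import Counter


def count_word_types(broadcasts):
    """Return {word: (Freq, CDCount, CD)} for a list of tokenised broadcasts."""
    freq = Counter()      # total occurrences across all broadcasts
    cd_count = Counter()  # number of broadcasts containing the word
    for tokens in broadcasts:
        freq.update(tokens)
        cd_count.update(set(tokens))  # each word counts once per broadcast
    n_broadcasts = len(broadcasts)
    return {
        word: (freq[word], cd_count[word], 100 * cd_count[word] / n_broadcasts)
        for word in freq
    }


# Toy data, for illustration only
toy_broadcasts = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
    ["a", "cat", "purred"],
    ["on", "we", "go"],
]
table = count_word_types(toy_broadcasts)
print(table["the"])
print(table["cat"])
```

In the toy data "the" occurs more often than "cat", yet both appear in the same number of broadcasts, which is the distinction between word frequency and contextual diversity.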
The second file (SUBTLEX-UK) contains more information about the 160,022 word types
(159,235 single words and 787 hyphenated words) which are observed in more than one
broadcast and which only contain letter information (i.e., no digits or non-alphanumerical
symbols). This file is the file most psycholinguistic researchers will want to use. It has 27
columns, containing:
- The word type,
- The frequency counts in all subtitles, the CBeebies subtitles, the CBBC subtitles, and
the British National Corpus,
- The Zipf values associated with the various frequencies,
- The CD counts and percentages in the three SUBTLEX corpora,
- The dominant PoS, its associated lemma, and their frequencies,
- All the PoS and frequencies of the word,
- The frequency of the word starting with a capital,
- Whether the lowercase spelling of the word type was accepted by a UK word spell
checker (UK), a US word spell checker (US), both spell checkers (UKUS), or none ()¹⁰.
This is an interesting column when words must be selected and one wants to avoid
the inclusion of names or other uninteresting entries.
- Whether the entry contains a hyphen (cf. the 787 added entries with hyphens),
- Whether the entry has another homophonic entry. This is interesting to find
homophones, but also to make sure selected low frequency words do not have a
higher frequency spelling alternative.
- Whether or not the word type has been encountered as a bigram in the subtitles,
- The frequency of the bigram (summed across all types of intervening symbols, in
particular blank spaces, punctuation marks, and hyphens).
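The columns listed above lend themselves to simple stimulus selection. The following sketch is only an illustration under stated assumptions: the field names (`word`, `zipf`, `spellcheck`, `hyphen`, `homophone`) are hypothetical stand-ins for the actual column headers, the toy rows and their values are made up, and excluding every entry with a homophonic twin is a deliberately conservative simplification of the advice to avoid higher frequency spelling alternatives.

```python
def select_low_frequency_words(entries, max_zipf=3.0):
    """Pick low-frequency candidate stimuli from rows of the main file."""
    selected = []
    for entry in entries:
        if entry["zipf"] > max_zipf:
            continue  # not a low-frequency word
        if entry["spellcheck"] not in ("UK", "UKUS"):
            continue  # not accepted by the UK spell checker
        if entry["hyphen"] or entry["homophone"]:
            continue  # skip hyphenated forms and entries with a homophonic twin
        selected.append(entry["word"])
    return selected


# Made-up rows with hypothetical field names and values
toy_entries = [
    {"word": "dumpling", "zipf": 3.0, "spellcheck": "UKUS", "hyphen": False, "homophone": False},
    {"word": "color", "zipf": 2.8, "spellcheck": "US", "hyphen": False, "homophone": False},
    {"word": "muffin", "zipf": 4.0, "spellcheck": "UKUS", "hyphen": False, "homophone": False},
    {"word": "well-being", "zipf": 2.9, "spellcheck": "UKUS", "hyphen": True, "homophone": False},
    {"word": "bridal", "zipf": 2.7, "spellcheck": "UKUS", "hyphen": False, "homophone": True},
]
print(select_low_frequency_words(toy_entries))
```

A real selection would read the rows from the downloaded file and would typically add further constraints, such as word length or dominant part-of-speech.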
¹⁰ The speller was the MS Office 2007 spellchecker, augmented with a list of lemmas one of the authors (MB) is
compiling.
Finally, the third file (SUBTLEX-UK_bigrams) contains information about word pairs. Because
this file has nearly 2 million lines of information, it cannot be made available as an Excel file
(although we have such a file with all entries observed 12 times or more). Each line contains
information about word 1 and word 2, the frequency of the combination, the CD count of
the combination, and which symbols were found between the two words with which
frequencies. This is important information when researchers want to include transition
probabilities in their investigations, or when expressions (e.g., object names, particle verbs)
consist of two words.
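As an illustration of how bigram counts feed into transition probabilities, the sketch below computes the forward transition probability P(word 2 | word 1) as the bigram frequency divided by the frequency of word 1. The dictionaries, the counts in them, and the function name are hypothetical; a real analysis would read the counts from the bigram file and the word frequency file.

```python
def transition_probabilities(bigram_freq, word_freq):
    """Forward transition probability P(w2 | w1) = freq(w1 w2) / freq(w1)."""
    return {
        (w1, w2): count / word_freq[w1]
        for (w1, w2), count in bigram_freq.items()
        if word_freq.get(w1, 0) > 0  # skip pairs whose first word has no count
    }


# Made-up counts, for illustration only
toy_word_freq = {"fish": 200, "ice": 100}
toy_bigram_freq = {
    ("fish", "fingers"): 50,
    ("ice", "cream"): 80,
    ("ice", "age"): 10,
}
probs = transition_probabilities(toy_bigram_freq, toy_word_freq)
for pair, p in sorted(probs.items()):
    print(pair, p)
```

Backward transition probabilities can be obtained in the same way by dividing by the frequency of word 2 instead.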
The files are available as supplementary materials to the present article. They can also be
downloaded from our websites (http://crr.ugent.be, or
http://www.psychology.nottingham.ac.uk/subtlex-uk), where we also intend to
make them available as searchable online databases.
References
Adelman, J. S., & Brown, G. D. (2008). Modeling lexical decision: The form of frequency and
diversity effects. Psychological Review, 115(1), 214-227.
Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity, not word
frequency, determines word naming and lexical decision times. Psychological Science,
17, 814-823.
Arnon, I., & Snider, N. (2010). More than words: Frequency effects for multi-word phrases.
Journal of Memory and Language, 62(1), 67-82.
Baayen, R. H., Milin, P., Filipovic Durdevic, D., Hendrix, P., & Marelli, M. (2011). An
amorphous model for morphological processing in visual comprehension based on
naive discriminative learning. Psychological Review, 118, 438-482.
Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX lexical database [CD-ROM].
Philadelphia: University of Pennsylvania, Linguistic Data Consortium.
Balota, D. A., Yap, M. J., Cortese, M. J., Hutchison, K. A., Kessler, B., Loftis, B., Neely, J. H.,
Nelson, D. L., Simpson, G. B., & Treiman, R. (2007). The English Lexicon Project.
Behavior Research Methods, 39, 445-459.
Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A. M., Bölte, J., & Böhl, A. (2011a). The
word frequency effect: A review of recent developments and implications for the
choice of frequency estimates in German. Experimental Psychology, 58, 412-424.
Brysbaert, M., & Cortese, M. J. (2011). Do the effects of subjective frequency and age of
acquisition survive better word frequency norms? Quarterly Journal of Experimental
Psychology, 64, 545-559.
Brysbaert, M., & Diependaele, K. (2013). Dealing with zero word frequencies: A review of the
existing rules of thumb and a suggestion for an evidence-based choice. Behavior
Research Methods, 45, 422-430.
Brysbaert, M., Keuleers, E., & New, B. (2011b). Assessing the usefulness of Google Books’
word frequencies for psycholinguistic research on word processing. Frontiers in
Psychology, 2:27.
Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of
current word frequency norms and the introduction of a new and improved word
frequency measure for American English. Behavior Research Methods, 41, 977-990.
Brysbaert, M., New, B., & Keuleers, E. (2012). Adding Part-of-Speech information to the
SUBTLEX-US word frequencies. Behavior Research Methods, 44, 991-997.
Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based
on film subtitles. PLoS ONE, 5, e10729.
Cuetos, F., Glez-Nosti, M., Barbon, A., & Brysbaert, M. (2011). SUBTLEX-ESP: Spanish word
frequencies based on film subtitles. Psicologica, 32, 133-143.
Dimitropoulou, M., Duñabeitia, J. A., Avilés, A., Corral, J., & Carreiras, M. (2010). Subtitle-
based word frequencies as the best estimate of reading behavior: The case of Greek.
Frontiers in Psychology, 1:218, 1-12.
Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Meot, A., Augustinova, M., &
Pallier, C. (2010). The French Lexicon Project: Lexical decision data for 38,840 French
words and 38,840 pseudowords. Behavior Research Methods, 42, 488-496.
Keuleers, E., Brysbaert, M., & New, B. (2010). SUBTLEX-NL: A new frequency measure for
Dutch words based on film subtitles. Behavior Research Methods, 42, 643-650.
Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project: Lexical
decision data for 28,730 monosyllabic and disyllabic English words. Behavior
Research Methods, 44, 287-304.
Kučera, H., & Francis, W. (1967). Computational analysis of present-day American English.
Providence, RI: Brown University Press.
Kuperman, V., & Bertram, R. (2013). Moving spaces: Spelling alternation in English noun-
noun compounds. Language and Cognitive Processes, (ahead-of-print), 1-28.
Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings
for 30 thousand English words. Behavior Research Methods, 44, 978-990.
Lété, B., & Bonin, P. (2013). Does frequency trajectory influence word identification? A cross-
task comparison. The Quarterly Journal of Experimental Psychology, 66(5), 973-1000.
Masterson, J., Stuart, M., Dixon, M., & Lovejoy, S. (2010). Children’s printed word database:
Continuities and changes over time in children’s early reading vocabulary. British
Journal of Psychology, 101(2), 221-242.
Minnen, G., Carroll, J., & Pearce, D. (2001). Applied morphological processing of English.
Natural Language Engineering, 7(3), 207-223.
Monsell, S., Doyle, M. C., & Haggard, P. N. (1989). Effects of frequency on visual word
recognition tasks - Where are they? Journal of Experimental Psychology: General,
118, 43-71.
Morrison, C. M., & Ellis, A. W. (1995). Roles of word frequency and age of acquisition in word
naming and lexical decision. Journal of Experimental Psychology: Learning, Memory,
and Cognition, 21(1), 116-133.
New, B., Brysbaert, M., Veronis, J., & Pallier, C. (2007). The use of film subtitles to estimate
word frequencies. Applied Psycholinguistics, 28, 661-677.
Perea, M., Soares, A. P., & Comesaña, M. (2013). Contextual diversity is a main determinant
of word identification times in young readers. Journal of Experimental Child
Psychology. (ahead of print publication)
Siyanova-Chanturia, A., Conklin, K., & van Heuven, W. J. B. (2011). Seeing a Phrase “Time and
Again” Matters: The Role of Phrasal Frequency in the Processing of Multiword
Sequences. Journal of Experimental Psychology-Learning Memory and Cognition,
37(3), 776-784.
Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological
Bulletin, 87, 245-251.
Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech
tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of
the North American Chapter of the Association for Computational Linguistics on
Human Language Technology-Volume 1 (pp. 173-180). Association for Computational
Linguistics.
Yap, M. J., Tan, S. E., Pexman, P. M., & Hargreaves, I. S. (2011). Is more always better? Effects
of semantic richness on lexical decision, speeded pronunciation, and semantic
classification. Psychonomic Bulletin & Review, 18(4), 742-750.
Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Cambridge,
Massachusetts: Addison-Wesley.
Figure 1: The word frequency effect. Mean standardized lexical decision times (z-scores) for
samples of 1000 words as a function of log10 word frequency per million words. The red
circles represent data from the English Lexicon Project (Balota et al., 2007); the blue circles
data from the British Lexicon Project (Keuleers et al., 2012). Word frequencies are based on
the 100 million words British National Corpus (available at http://www.natcorp.ox.ac.uk).
Source: Keuleers et al., 2012, Figure 4.
Figure 2: Screenshot of the PoS analysis. For each word type (in the column ‘Spelling’), the
most frequent PoS is determined, the associated lemma, the number of times this PoS is
observed in all SUBTLEX-UK subtitles, the total frequency of the lemma in the subtitles, all
parts-of-speech associated with the word type, and the frequencies of these parts-of-speech
in all subtitles. From this figure, we see that according to the Stanford tagger the word type
“finalise” is used mostly (164 times) as a verb (associated with the lemma “finalise”), but also
occasionally (6 times) as a noun. The total frequency of the verb lemma “finalise” (which also
includes the frequencies of the word types “finalises”, “finalised”, and “finalising”) is 466.
Table 1: The Zipf scale of word frequency
The Zipf scale is a word frequency scale going from 1 to 7. Words with Zipf values of 3 or
lower are low-frequency words; words with Zipf values of 4 and higher are high-frequency
words. Examples are based on the SUBTLEX-UK word frequencies.
Zipf value   fpmw     Examples
1            .01      antifungal, bioengineering, farsighted, harelip, proofread
2            .1       airstream, doorkeeper, neckwear, outsized, sunshade
3            1        beanstalk, cornerstone, dumpling, insatiable, perpetrator
4            10       dirt, fantasy, muffin, offensive, transition, widespread
5            100      basically, bedroom, drive, issues, period, spot, worse
6            1,000    day, great, other, should, something, work, years
7            10,000   and, for, have, I, on, the, this, that, you
Table 2: Frequencies used in two classical studies of the word frequency effect, both when
expressed as frequency per million words and as Zipf values. Means and standard deviations
(between brackets). Frequencies based on SUBTLEX-UK.
                                          Fpmw              Zipf
Monsell et al. (1989, Experiments 1-2)
  Low frequency words (N = 48)            2.12 (2.22)       3.15 (.39)
  Medium frequency words (N = 48)         15.40 (10.81)     4.09 (.29)
  High frequency words (N = 48)           84.65 (62.66)     4.8 (.40)
Morrison & Ellis (1995)
  Low frequency words (N = 24)            6.52 (4.61)       3.66 (.44)
  High frequency words (N = 24)           166.03 (168.4)    5.07 (.37)
  Early acquired words (N = 24)           33.49 (34.8)      4.34 (.44)
  Late acquired words (N = 24)            9.91 (16.5)       3.63 (.55)
Table 3: Correlations between the various frequency measures and the BLP accuracy data (N =
28,285). The upper part shows the correlations. The lower part shows the percentages of
variance accounted for by non-linear regression analyses (lm-procedure in R, restricted cubic
splines with 4 knots).
                 SUBTLEX-UK   SUBTLEX-UK_CD   SUBTLEX-US   BNC    Celex   CBeebies   CBBC
Accuracy         .600         .628            .55          .564   .553    .390       .535
SUBTLEX-UK                    .992            .881         .898   .858    .725       .88
SUBTLEX-UK_CD                                 .877         .904   .866    .702       .86
SUBTLEX-US                                                 .830   .830    .705       .851
BNC                                                               .92     .633       .789
Celex                                                                     .642       .8
CBeebies                                                                             .821
Percentage of variance accounted for by non-linear regression analysis (splines, rcs function
in R with 4 knots)
SUBTLEX-UK (Zipf)         40.4%
SUBTLEX-UK (log(CD+1))    47.1%
SUBTLEX-US (Zipf)         35.7%
BNC (Zipf)                35.9%
Celex (Zipf)              34.6%
Table 4: Correlations between the various frequency measures and the BLP RT data (N =
20,55). The upper part shows the correlations. The lower part shows the percentages of
variance accounted for by non-linear regression analyses (lm-procedure in R, restricted cubic
splines with 4 knots).
                 SUBTLEX-UK   SUBTLEX-UK_CD   SUBTLEX-US   BNC     Celex   CBeebies   CBBC
RT               -.664        -.64            -.645        -.638   -.624   -.535      -.642
SUBTLEX-UK                    .991            .885         .900    .862    .727       .893
SUBTLEX-UK_CD                                 .88          .906    .869    .701       .880
SUBTLEX-US                                                 .822    .828    .698       .84
BNC                                                                .93     .611       .771
Celex                                                                      .626       .762
CBeebies                                                                              .817
Percentage of variance accounted for by non-linear regression analysis (splines, rcs function
in R with 4 knots)
SUBTLEX-UK (Zipf)         46.1%
SUBTLEX-UK (log(CD+1))    47.1%
SUBTLEX-US (Zipf)         43.3%
BNC (Zipf)                42.2%
Celex (Zipf)              40.7%
Table 5: Percentages of variance accounted for by the various frequency measures in the ELP
data.
                     Accuracy_LDT   RT_LDT        RT_nam
                     (N = 40,468)   (N = 33,99)   (N = 33,99)
SUBTLEX-US (Zipf)    20.5%          36.7%         26.0%
SUBTLEX-US (CD)      22.3%          37.2%         26.1%
SUBTLEX-UK (Zipf)    19.0%          34.8%         24.2%
SUBTLEX-UK (CD)      20.5%          34.8%         24.2%
Table 6: Correlations of the SUBTLEX-UK frequencies with the CPWD word frequencies (all
values log transformed after Laplace transformation; N = 9,125 word types shared between
both lists).
                     SUBTLEX-UK (Zipf)   CBeebies (Zipf)   CBBC (Zipf)
CPWD                 .664                .756              .690
SUBTLEX-UK (Zipf)                        .734              .925
CBeebies (Zipf)                                            .803