On Some Methodological Issues of CADS
Language in Politics in Slavic-speaking Countries
Václav Cvrček
CADS and KWA
Corpus-assisted Discourse Studies (CADS)
Use of corpora in discourse analysis
▶ goal: text/discourse interpretation
▶ reduce researcher's bias (Baker 2012)
▶ identification of prominent topics (⇐ prominent words)
▶ keywords identification and analysis
Methods
Method of identification of prominent words
1. raw or relative frequency of words in a text/corpus
2. thematic concentration (TC)
3. keywords (KWs)
⇒ starting point for the interpretation
Note on thematic concentration
Popescu–Altmann (2006)
Discussion on thematic concentration
J. David et al.: Slovo a text v historickém kontextu. Host. 2013
Features and consequences of thematic concentration
▶ TC = identification based on the frequency distribution of units within a text
▶ no reference corpus is required
▶ "interpretation without the interpreter" × different readers ⇒ different interpretations
Keywords
Keywords and KWA
Keywords
▶ homonymous term (!)
▶ words with higher relative frequency in a text
▶ based on comparison with reference corpus
▶ significance testing: χ² test, log-likelihood (G) test, Fisher test
Keywords: Words which appear in a text or corpus that are statistically significantly more frequent than would be expected by chance when compared to a corpus which is larger or of equal size.
Keyword analysis (KWA)
Romeo and Juliet vs. all Shakespeare plays (Scott–Tribble 2006)
AH          DEATHLY    MARRIED    SLAIN
ART         EARLY      MERCUTIO   THEE
BACK        FRIAR      MONTAGUE   THOU
BANISHED    JULIET     MONUMENT   THURSDAY
BENVOLIO    JULIET'S   NIGHT      THY
CAPULET     KINSMAN    NURSE      TORCH
CAPULETS    LADY       O          TYBALT
CAPULET'S   LAWRENCE   PARIS      TYBALT'S
CELL        LIGHT      POISON     VAULT
CHURCHYARD  LIPS       ROMEO      VERONA
COUNTY      LOVE       ROMEO'S    WATCH
DEAD        MANTUA     SHE        WILT
Methodological issues of KWA
1. KW identification and the question of KWs ranking
2. Role of reference corpus
How to measure keyness
Keywords identification
How do we usually proceed?
1. count frequency of each word in a target text – most frequent words are the, of, was…
2. compare it with a frequency of the same word in a reference corpus
3. use statistical tests: χ², log-likelihood or Fisher to find out if the difference is significant
4. interpret top X most significant keywords
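The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the talk: it uses the two-term log-likelihood formula common in corpus linguistics, a critical value of 10.83 (χ² with 1 degree of freedom, p < 0.001), and toy token lists. The function names `log_likelihood` and `keywords` are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(fq_target, fq_ref, size_target, size_ref):
    """Two-term log-likelihood (G) statistic for one word."""
    total = fq_target + fq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    ll = 0.0
    # A zero observed frequency contributes nothing (0 * log 0 is taken as 0).
    if fq_target > 0:
        ll += fq_target * math.log(fq_target / expected_target)
    if fq_ref > 0:
        ll += fq_ref * math.log(fq_ref / expected_ref)
    return 2 * ll


def keywords(target_tokens, ref_tokens, critical=10.83, top=10):
    """Steps 1-4: count, compare, test, and return the top candidates."""
    target, ref = Counter(target_tokens), Counter(ref_tokens)
    n_target, n_ref = len(target_tokens), len(ref_tokens)
    found = []
    for word, fq in target.items():
        # Keep only words that are relatively MORE frequent in the target.
        if fq / n_target <= ref[word] / n_ref:
            continue
        ll = log_likelihood(fq, ref[word], n_target, n_ref)
        if ll > critical:  # assumed threshold: p < 0.001 at 1 d.f.
            found.append((word, ll))
    # Ranking by the test statistic is the usual "top X" practice.
    return sorted(found, key=lambda pair: pair[1], reverse=True)[:top]


# Toy data (hypothetical): "peace" is over-represented in the target.
toy_target = ["peace"] * 30 + ["the"] * 70
toy_reference = ["peace"] * 10 + ["the"] * 990
print([word for word, _ in keywords(toy_target, toy_reference)])  # → ['peace']
```

Note that the last line of `keywords` ranks by the significance statistic itself, which is exactly the practice questioned in the following slides.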
Significance versus relevance
Gabrielatos, C. & Marchi, A. (2012): there is a difference between (statistical) significance and (linguistic) relevance (effect size)
Metrics used to calculate keyness
▶ significance – level of certainty we have that the difference exists (N.B. χ² test is asymptotically true)
▶ relevance – importance of the difference (for interpretation)
▶ crucial for the top X approach:
  1. identification of KWs – statistical tests
  2. ranking of KWs – task for a different metric
Metric for keyness
Gabrielatos, C. & Marchi, A. (2012): ProcDiff
\[ \mathrm{ProcDiff} = \frac{\mathrm{RelFq(Target)} - \mathrm{RelFq(Reference)}}{\mathrm{RelFq(Reference)}} \times 100 \]
But what if RelFq(Reference) = 0?
A. Kilgarriff's (2009) Simple maths approach: add X (= 1, 10…)
\[ \mathrm{ratio} = \frac{\mathrm{RelFq(Target)} + X}{\mathrm{RelFq(Reference)} + X} \]
Different values of X yield different results
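A small sketch may make both problems concrete. It assumes relative frequencies expressed per million tokens and uses two made-up words; the function names `proc_diff` and `simple_maths_ratio` are hypothetical and are not taken from the cited papers.

```python
def proc_diff(rel_target, rel_ref):
    """ProcDiff as defined above; raises ZeroDivisionError if rel_ref is 0."""
    return (rel_target - rel_ref) / rel_ref * 100


def simple_maths_ratio(rel_target, rel_ref, x=1.0):
    """Ratio of relative frequencies after adding the constant X to both."""
    return (rel_target + x) / (rel_ref + x)


# Hypothetical relative frequencies (per million tokens): (target, reference).
rare_word = (10.0, 0.0)        # absent from the reference corpus
common_word = (1000.0, 200.0)  # frequent in both corpora

for x in (1.0, 100.0):
    rare = simple_maths_ratio(*rare_word, x=x)
    common = simple_maths_ratio(*common_word, x=x)
    winner = "rare word" if rare > common else "common word"
    print(f"X = {x:>5}: rare = {rare:.2f}, common = {common:.2f} -> {winner} ranks first")
```

With X = 1 the rare word comes out on top; with X = 100 the order flips, so the choice of X effectively decides whether rare or common words dominate the ranking.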
DIN coefficient
Variation on the Sørensen–Dice coefficient¹:
\[ \mathrm{DIN} = 100 \times \frac{\mathrm{RelFq(Target)} - \mathrm{RelFq(Reference)}}{\mathrm{RelFq(Target)} + \mathrm{RelFq(Reference)}} \]
▶ values of DIN
  ▶ -100 (= when a word is present only in the reference corpus)
  ▶ 0 (= when a word occurs equally in target and reference corpus)
  ▶ 100 (= when a word is present only in the target corpus)
▶ represents the proportion of the difference of relative frequencies to their mean (× 50)
▶ no zeroes in the denominator × identical value of DIN for words appearing in a target text only
▶ useful for ranking of KWs (not for their identification!)
¹ cf. Hofland–Johansson (1982).
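The boundary values listed above follow directly from the formula. A minimal sketch, assuming relative frequencies are given as plain numbers in the same unit for both corpora (the function name `din` and the explicit error for a word absent from both corpora are choices made here for illustration):

```python
def din(rel_target, rel_ref):
    """DIN coefficient: 100 * (target - reference) / (target + reference)."""
    if rel_target == 0 and rel_ref == 0:
        # Assumption: a word missing from both corpora is not a candidate at all.
        raise ValueError("word occurs in neither corpus")
    return 100 * (rel_target - rel_ref) / (rel_target + rel_ref)


print(din(0.0, 5.0))     # only in the reference corpus → -100.0
print(din(3.0, 3.0))     # equally frequent in both     → 0.0
print(din(5.0, 0.0))     # only in the target corpus    → 100.0
print(din(5000.0, 0.0))  # also 100.0: target-only words are indistinguishable
```

The last two calls show the drawback named on the slide: every word that occurs only in the target text gets the same value, regardless of how frequent it is.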
Example values
Size of a target corpus: 1,000,000
Size of a reference corpus: 1,000,000
Fq(target): 5500
Fq(reference): 5000
LL = 23.82 ⇒ p < 0.001
The difference is highly significant, but…
\[ \mathrm{DIN} = 100 \times \frac{0.55 - 0.5}{0.55 + 0.5} = 4.76 \]
…almost irrelevant (the effect size of the difference is negligible)
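The two figures on this slide can be checked with a short script. It assumes the two-term log-likelihood formula commonly used in corpus linguistics, which reproduces the LL value given above; the function names are hypothetical.

```python
import math


def log_likelihood(fq_target, fq_ref, size_target, size_ref):
    """Two-term log-likelihood (G) statistic; both frequencies must be > 0."""
    total = fq_target + fq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    return 2 * (fq_target * math.log(fq_target / expected_target)
                + fq_ref * math.log(fq_ref / expected_ref))


def din(rel_target, rel_ref):
    """DIN coefficient on relative frequencies."""
    return 100 * (rel_target - rel_ref) / (rel_target + rel_ref)


size_target = size_ref = 1_000_000
fq_target, fq_ref = 5500, 5000

ll = log_likelihood(fq_target, fq_ref, size_target, size_ref)
effect = din(fq_target / size_target, fq_ref / size_ref)

print(f"LL  = {ll:.2f}")      # LL  = 23.82 (above 10.83, i.e. p < 0.001)
print(f"DIN = {effect:.2f}")  # DIN = 4.76
```

DIN is unchanged whether the relative frequencies are entered as proportions (0.0055, 0.005) or as percentages (0.55, 0.5), because the scaling factor cancels in the fraction.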
Needle in a Haystack Project
Suitable data for testing the limits of KWA
▶ presidential New Year's addresses (NYA) of Gustáv Husák (1975–1989)
▶ presumed to be flat, ritualistic and monotonous, full of clichés – perfect for testing limits of keyword analysis (KWA)
▶ same author, same genre/situation × time (and topic)
▶ manageable size of texts (1500 tokens per speech)
▶ reference corpus: Totalita – 15 mil. words (1952–1977) of written Czech; communist newspaper
http://kwords.korpus.cz
Difference between LL and DIN (Dice)
[Figure: All KWs; x-axis: Dice (rank), y-axis: Log-likelihood (rank)]
Example 1: Grammatical words
Keywords from all Husák's New Year's Addresses
[Figure: All NYA (gram. words highlighted); x-axis: Dice (rank), y-axis: Log-likelihood (rank); points are labelled with the Czech keywords, and grammatical words are highlighted (e.g. forms of náš "our", aby, které, pro, a, s)]
Example 2: Topical words
našimi ("our", Inst. pl.) × rest of the lemma náš ("our", all cases)
[Figure: All NYA (gram. words only); x-axis: Dice (rank), y-axis: Log-likelihood (rank); highlighted forms of náš: našim, našeho, naší, náš, našich, našimi, naše, našemu, naši, našem]
Reference corpus in KWA
What does reference corpus affect?
size: bigger reference corpus ⇒ more KWs
composition: different reference corpora represent different readers
▶ balanced corpus ∼ general reader
▶ specialized corpus ∼ specific reader (e.g. from the past, with specific background…)
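One mechanism behind the size effect can be illustrated numerically. The sketch below is an assumption-laden toy, not a result from the talk: it holds a word's relative frequency in the reference corpus fixed (1 per 1,000 tokens, a made-up rate), keeps the target text at 1,500 tokens with 10 occurrences, and only varies the reference corpus size. It again assumes the two-term log-likelihood formula.

```python
import math


def log_likelihood(fq_target, fq_ref, size_target, size_ref):
    """Two-term log-likelihood (G) statistic; both frequencies must be > 0."""
    total = fq_target + fq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    return 2 * (fq_target * math.log(fq_target / expected_target)
                + fq_ref * math.log(fq_ref / expected_ref))


TARGET_SIZE = 1500  # roughly one speech
TARGET_FQ = 10      # hypothetical frequency of the word in the speech


def ll_for_reference_size(ref_size):
    """LL when the word keeps a rate of 1 per 1,000 tokens in the reference."""
    ref_fq = ref_size // 1000  # hypothetical constant relative frequency
    return log_likelihood(TARGET_FQ, ref_fq, TARGET_SIZE, ref_size)


for ref_size in (15_000, 1_500_000, 15_000_000):
    print(f"{ref_size:>10,} tokens: LL = {ll_for_reference_size(ref_size):.2f}")
```

Under these assumptions the statistic grows with the size of the reference corpus even though the relative frequencies stay the same, so more words can pass a fixed significance threshold.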
Husák: Influence of the reference corpora
What happens if we compare texts to different RefCs?
▶ the inventory of KWs does not differ substantially
▶ the difference is in ranking (prominence of KWs – DIN)
Historical reader (Totalita)
→ genre differences
▶ Modal verbs: want, can
▶ Verbs: 1. sg./pl.
Contemporary reader (SYN2010)
→ connected with historical events
▶ ideology
▶ archaisms, historism
Detailed comparison – 3 thematic groups
Cold war: mír, míru, mírova, mírove, míroveho, mírovemu, mírovou, mírovy, mírovych, mírovymi, mírumilovne, mírumilovnych, mírumilovnym; napetí; odzbrojení, vyzbroje, zbrojení, zbrojením, ozbrojenych
Collective possession: nas, nase, naseho, nasem, nasemu, nasi, nasí, nasich, nasim, nasím, nasimi
Ideo markers: socialismu, socialismus, socialisticka, socialisticke, socialistickeho, socialistickem, socialistickemu, socialistickou, socialisticky, socialistickych, socialistickym, socialistickymi; komunismu, komuniste, komunistu, ksc, komunistum, komunisty, komunisticka, komunisticke, komunistickym
Cold war
[Figure: Cold War KWs in SYN-KWA and TOT-KWA; x-axis: Year (1975–1989), y-axis: DIN; series: SYN-KWA, TOT-KWA]
Fidler–Cvrček (forthcoming)
Collective possession
[Figure: KWs "our" in SYN-KWA and TOT-KWA; x-axis: Year (1975–1989), y-axis: DIN; series: SYN-KWA, TOT-KWA]
Fidler–Cvrček (forthcoming)
Ideological markers
[Figure: Ideological markers KWs in SYN-KWA and TOT-KWA; x-axis: Year (1975–1989), y-axis: DIN; series: SYN-KWA, TOT-KWA]
Fidler–Cvrček (forthcoming)
Conclusions
Ranking of keywords
▶ statistical significance ≠ relevance
▶ the effect size of the difference is as important as the significance
Role of reference corpus
▶ different reference corpora can be used to model different readings of the same text
▶ the difference is in the sensitivity (suppressed or increased) to certain topics
▶ genre matters!
References
▶ Baker, P. (2012): Acceptable bias? Using corpus linguistics with critical discourse analysis. Critical Discourse Studies 9(3): 247–256.
▶ David, J. et al. (2013): Slovo a text v historickém kontextu. Host.
▶ Fidler, M. – Cvrček, V. (forthcoming): A data-driven analysis of reader viewpoints: Reconstructing the historical reader using keyword analysis.
▶ Gabrielatos, C. – Marchi, A. (2012): Keyness: appropriate metrics and practical issues. CADS International Conference, Bologna, Italy (www.gabrielatos.com/Presentations.htm).
▶ Hofland – Johansson (1982): Word frequencies in British and American English. Bergen: The Norwegian Computing Centre for the Humanities.
▶ Kilgarriff, A. (2009): Simple maths for keywords. Proc. Corpus Linguistics. Liverpool, UK (http://ucrel.lancs.ac.uk/publications/cl2009/171_FullPaper.doc).
▶ Popescu, I. – Altmann, G. (2006): Some aspects of word frequencies. Glottometrics 13, p. 23–46.
▶ Scott, M. – Tribble, C. (2006): Textual patterns: Keyword and corpus analysis in language education. Amsterdam: Benjamins.
Thank you for your attention!