(Semi-) Big Data Corpora: New Challenges and New Solutions for Corpus Linguists
Tobias Gärtner
July 20th, 2016
Content
1. Introduction
2. Technical Prerequisites
3. Multivariate Procedures
New methods in corpus linguistics
Let’s assume:
Your corpus contains 7,834 texts with 4,441,087 tokens (ICLE & ICNALE)
You want to analyse the present perfect (+ continuous) active
Additionally, you assume an influence of several other variables
You don’t have an army of PhD students and student assistants
Why do we need to apply new methods?
Thus:
Manual counting is tedious
Manual counting is error-prone
Manual counting is expensive (in both time and money)
and you still need to analyse your findings
Why do we need to apply new methods?
Univariate statistical procedures become misleading once several variables are to be assessed simultaneously
Cross tables become complex and confusing because they grow exponentially with the number of variables
ANOVAs and linear regressions return misleading (actually plain wrong) results for non-normally distributed count data
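To illustrate how quickly cross tables grow, here is a minimal Python sketch. The function name `crosstab_cells` and the choice of variables are illustrative assumptions; the level counts (2 genders, 3 majors, 19 country groups including the native-speaker baseline) are read off the categories in the results table of this talk.

```python
from math import prod


def crosstab_cells(levels):
    """Number of cells in a full cross table with the given level counts."""
    return prod(levels)


# Hypothetical example: gender (2) x major (3) x country group (19).
print(crosstab_cells([2, 3, 19]))      # 114 cells
# Every additional binary variable doubles the table.
print(crosstab_cells([2, 3, 19, 2]))   # 228 cells
```

With 7,834 texts, a table of this size already leaves many cells with very few observations.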
Data Sources
Sources
  Internal
    Texts
      BGSU1043
      BGSU1089
      . . .
    Database
      ICLE Database
      ICNALE Database
  External
    MRC Psycholinguistic Database
    Academic Word List
Possible independent variables to analyse
[Diagram; group headings: Surface Structures, Psychological Variables, Degree of Academic Writing, Social Variables]
Variables shown: N Words, N Sentences, Avg. Sentence Length, Avg. Word Length, N Functional Words, N Lexical Words, N Active Sentences, N Passive Sentences, Negation, Subordinate Sentences, Type-Token Ratio, Flesch-Kincaid Reading-Ease, Log(N Words), Academic Word List Score, Mean Cohesion (POS), Sd Cohesion (POS), Mean Cohesion (Word), Sd Cohesion (Word), Reader Visibility, Writer Visibility, N “not”, Mean N Phonemes, Mean N Letters, Mean N Syllables, Kucera-Francis Frequency, Age of Acquisition, Familiarity, Concreteness, Imagery, Mean Paivio Meaningfulness, Mean Colorado Meaningfulness, Gender, Age, Years of Tuition, Years at University, Major, Academic Genre, N Words, Country
Figure 1: Overview of the independent variables
Machine readable texts with meta information
[Pipeline diagram; node labels: Plain Text (ICLE, ICNALE), POS-Tagger, Parser, koRpus, LSA, pscl]
Figure 2: From plain text via part-of-speech-tagged XML files to an R data frame
The present perfect as a syntactical tree
(a) Present Perfect
S
  VP
    VBP  have
    ...
    VP
      VBN  Participle
(b) Present Perfect Progressive
S
  VP
    VBP  have
    ...
    VP
      VBN  been
      VP
        VBG  Gerund
Figure 3: Syntactical trees using the Penn Treebank Tag Set
The present perfect in XQuery
    for $i in collection("corpus")//eTree[@Label="S"]/eTree[@Label="VP"]
    where $i/eTree/@Label="VBP"
      and $i/eTree/eLeaf/@Text="have"
      and $i/eTree/@Label="VP"
      and $i/eTree/eTree/@Label="VBN"
      and not($i/eTree/eTree/eLeaf/@Text="been")
    return base-uri($i)
Listing 1: XQuery code for the present perfect
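The same tree pattern can be sketched in Python with the standard library's ElementTree. This is an illustration only, not the code used for the talk: the element and attribute names (`eTree`, `eLeaf`, `Label`, `Text`) are taken from Listing 1, while the sample XML, the function names, and the decision to count matches (instead of returning the document URI) are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical parsed corpus file: one present perfect ("have seen") and
# one present perfect progressive ("have been working").
SAMPLE = """
<eTree Label="ROOT">
  <eTree Label="S">
    <eTree Label="VP">
      <eTree Label="VBP"><eLeaf Text="have"/></eTree>
      <eTree Label="VP">
        <eTree Label="VBN"><eLeaf Text="seen"/></eTree>
      </eTree>
    </eTree>
  </eTree>
  <eTree Label="S">
    <eTree Label="VP">
      <eTree Label="VBP"><eLeaf Text="have"/></eTree>
      <eTree Label="VP">
        <eTree Label="VBN"><eLeaf Text="been"/></eTree>
        <eTree Label="VP"><eTree Label="VBG"><eLeaf Text="working"/></eTree></eTree>
      </eTree>
    </eTree>
  </eTree>
</eTree>
"""


def is_present_perfect(vp):
    """Mirror the 'where' clause of Listing 1 for one S/VP node."""
    children = vp.findall("eTree")
    grandchildren = vp.findall("eTree/eTree")
    has_vbp = any(c.get("Label") == "VBP" for c in children)
    has_have = any(leaf.get("Text") == "have" for leaf in vp.findall("eTree/eLeaf"))
    has_inner_vp = any(c.get("Label") == "VP" for c in children)
    has_vbn = any(g.get("Label") == "VBN" for g in grandchildren)
    has_been = any(leaf.get("Text") == "been"
                   for leaf in vp.findall("eTree/eTree/eLeaf"))
    return has_vbp and has_have and has_inner_vp and has_vbn and not has_been


def count_present_perfect(xml_text):
    """Count VP nodes directly below an S node that match the pattern."""
    root = ET.fromstring(xml_text)
    candidates = root.findall(".//eTree[@Label='S']/eTree[@Label='VP']")
    return sum(1 for vp in candidates if is_present_perfect(vp))


print(count_present_perfect(SAMPLE))
```

The second sentence is excluded by the `been` check, just as in the XQuery version, so only the plain present perfect is counted.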
Collinearity
[Flow diagram; node labels: Corpus, Database, Tense and Aspect, Independent Variables, Factor Analysis, Variance Inflation Factor, Gauß Elimination, Linearly Independent Independent Variables, Multi-Collinearity]
Figure 4: Procedure to avoid (multi-)collinearity
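One of the checks named in the diagram, the variance inflation factor (VIF), can be sketched in a few lines of Python. This is a minimal illustration and not the procedure used for the talk: it relies on the fact that the VIFs are the diagonal entries of the inverted correlation matrix of the predictors, and it inverts that matrix with Gauss-Jordan elimination. All function names and the toy data are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    corr = [[1.0 if i == j else pearson(a, b) for j, b in enumerate(columns)]
            for i, a in enumerate(columns)]
    inverse = invert(corr)
    return [inverse[j][j] for j in range(len(columns))]


# Toy predictors (hypothetical values), correlated at r = 0.8.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
print(vif([x1, x2]))
```

A singular correlation matrix raises an error here, which corresponds to the case where one predictor is an exact linear combination of the others.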
Linear Regression
[Histogram; x-axis: PP and PPP Constructions per Text (0 to 20), y-axis: Frequency (0 to 3500)]
Figure 5: Histogram of the present perfect active and the present perfect progressive active in ICLE & ICNALE
Linear Regressions
Three reasons why a linear regression is not a smart idea:
1. The dependent variable is by no means normally distributed (Anderson-Darling test: p < 2.2e−16)
2. The distribution is heavily skewed (skewness = 2.70988)
3. The dependent variable is count data, i.e. consists solely of integers
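The skewness check in point 2 can be sketched as follows. The talk does not say which estimator produced the value 2.70988, so this minimal example assumes the simple moment-based one (third central moment divided by the 1.5th power of the second); the function name and the toy counts are hypothetical.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Toy per-text counts with many zeros and a long right tail (hypothetical).
toy_counts = [0] * 45 + [1] * 20 + [2] * 15 + [3] * 10 + [5] * 10
print(skewness(toy_counts))          # positive: right-skewed
print(skewness([1, 2, 3, 4, 5]))     # symmetric data give 0
```

A clearly positive value, as reported above, means that a few texts with many constructions pull the right tail far away from the bulk of the data.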
Count Data Regressions
So why not use classical count data regressions?
3534 out of 7834 texts (≈ 45%) do not contain the required construction!
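Whether such a share of zeros is "too many" for a classical count model can be judged by comparing it with the share a Poisson distribution with the same mean would predict, exp(−mean). The sketch below shows the idea; only the figures 3534 and 7834 come from the talk, while the function name and the toy counts are hypothetical.

```python
from math import exp


def zero_shares(counts):
    """Observed share of zeros and the share implied by a Poisson model."""
    n = len(counts)
    observed = counts.count(0) / n
    poisson_implied = exp(-sum(counts) / n)
    return observed, poisson_implied


# Share of texts without the construction, as reported above.
print(round(3534 / 7834, 3))

# Toy counts (hypothetical): 45% zeros, mean 1.3 constructions per text.
toy_counts = [0] * 45 + [1] * 20 + [2] * 15 + [3] * 10 + [5] * 10
observed, implied = zero_shares(toy_counts)
print(observed, round(implied, 3))
```

In the toy data the observed share of zeros is well above the Poisson-implied share, which is the pattern usually described as excess zeros.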
Excess Zero Problem
There are several regression types that deal with the excess zero problem:
Hurdle models
Ordinal models
Zero-inflated models
Excess Zero Problem
Model                                   AIC-Value   Drop in AIC in %
Linear                                   30072.76   Reference Model
Poisson                                  23627.33   -21.43
Negative Binomial                        22244.58   -26.03
Ordinal                                  22270.14   -25.94
Hurdle (Poisson)                         22901.65   -23.85
Hurdle (Neg.Binomial)                    22799.20   -24.19
Zero-Infl. (Poisson/Logit)               22848.17   -24.02
Zero-Infl. (Poisson/Probit)              22861.85   -23.98
Zero-Infl. (Poisson/Cauchit)             22807.94   -24.16
Zero-Infl. (Poisson/Compl.Log.-Log.)     22833.85   -24.07
Zero-Infl. (Neg.Bin./Logit)              22132.78   -26.40
Zero-Infl. (Neg.Bin./Probit)             22133.28   -26.40
Zero-Infl. (Neg.Bin./Cauchit)            22135.46   -26.39
Zero-Infl. (Neg.Bin./Compl.Log.-Log.)    22135.61   -26.39
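The last column can be reproduced with a one-line calculation. The sketch assumes that the drop is the relative change against the linear reference model, (AIC − AIC_ref) / AIC_ref × 100, which matches the Poisson and negative binomial rows; the function name is hypothetical.

```python
def aic_drop_percent(aic, reference):
    """Relative change in AIC against a reference model, in percent."""
    return round((aic - reference) / reference * 100, 2)


LINEAR_AIC = 30072.76  # reference model from the table

print(aic_drop_percent(23627.33, LINEAR_AIC))  # Poisson: -21.43
print(aic_drop_percent(22244.58, LINEAR_AIC))  # Negative Binomial: -26.03
```

A lower AIC indicates a better trade-off between fit and model complexity, so the more negative the drop, the stronger the case for that model over the linear regression.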
Zero-Inflated Neg. Bin./Probit Regression
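\[
P(Y_i = y_i) =
\begin{cases}
\omega_i + (1 - \omega_i)\,(1 + \alpha\lambda_i^{c})^{-\lambda_i^{1-c}/\alpha}, & y_i = 0 \\[6pt]
(1 - \omega_i)\,\dfrac{\Gamma(y_i + \lambda_i^{1-c}/\alpha)}{y_i!\,\Gamma(\lambda_i^{1-c}/\alpha)}\,(1 + \alpha\lambda_i^{c})^{-\lambda_i^{1-c}/\alpha}\,(1 + \lambda_i^{-c}/\alpha)^{-y_i}, & y_i > 0
\end{cases}
\]
where $\omega_i = \Phi(x_i'\beta)$ (probit link; $\Phi$ is the standard normal distribution function)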
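A zero-inflated negative binomial probability mass function with a probit-linked zero-inflation probability can be sketched in Python as follows. The parameter names mirror the symbols on this slide (λ, α, ω, c); the function names, the default c = 1, and all numeric values are assumptions for illustration, and the code is not the estimation routine used for the talk (the pipeline figure lists the R package pscl).

```python
from math import exp, lgamma, log
from statistics import NormalDist


def zinb_pmf(y, lam, alpha, omega, c=1.0):
    """P(Y = y) for a zero-inflated negative binomial count."""
    size = lam ** (1.0 - c) / alpha  # lambda^(1-c) / alpha
    # Log of the plain negative binomial probability with mean lam.
    log_nb = (lgamma(y + size) - lgamma(y + 1) - lgamma(size)
              - size * log(1.0 + alpha * lam ** c)
              - y * log(1.0 + lam ** (-c) / alpha))
    nb = exp(log_nb)
    # Zeros come from two sources: the inflation part and the count part.
    return omega + (1.0 - omega) * nb if y == 0 else (1.0 - omega) * nb


def probit_omega(linear_predictor):
    """Zero-inflation probability under a probit link."""
    return NormalDist().cdf(linear_predictor)


omega = probit_omega(-0.5)  # hypothetical linear predictor
probs = [zinb_pmf(y, lam=2.0, alpha=0.5, omega=omega) for y in range(500)]
print(sum(probs))           # close to 1
print(probs[0])             # probability of a text with no construction
```

Working on the log scale with `lgamma` avoids overflow in the gamma functions for larger counts.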
Average Marginal Effects
\[
\mathrm{AME}_i = \beta_i \, \frac{1}{n} \sum_{k=1}^{n} \phi(x_k'\beta) \tag{1}
\]
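Equation (1) translates almost directly into code: evaluate the standard normal density at each observation's linear predictor, average, and scale every coefficient by that average. The sketch below assumes a plain list-of-rows design matrix with the intercept as first column; the function name, coefficients, and data are hypothetical.

```python
from statistics import NormalDist, fmean


def average_marginal_effects(beta, rows):
    """AME_i = beta_i * mean_k(phi(x_k' beta)) for a probit model."""
    density = NormalDist().pdf
    scale = fmean([density(sum(b * x for b, x in zip(beta, row)))
                   for row in rows])
    return [b * scale for b in beta]


# Hypothetical coefficients and observations (first column = intercept).
beta = [0.5, -1.0]
rows = [[1.0, 0.2], [1.0, 1.5], [1.0, -0.3]]
print(average_marginal_effects(beta, rows))
```

Because the averaged density is the same for every coefficient, all AMEs are the coefficients multiplied by one common factor. In Table 1 the AME column is roughly 0.13 times the probit coefficient in each row (for example 0.163 / 1.2515), which is what such a shared factor would produce.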
The Results I
Table 1: Zero-Inflated Negative Binomial Regression with Probit Link
Dependent variable: Present Perfect and Present Perfect Continuous

                             Zero-Inflated Neg. Bin.  Zero-Inflated Binomial
                             with Log. link           with Probit link
                             β-Coef.    S.E.          β-Coef.     AME      S.E.
(Intercept)                   0.8308  0.4779  .        1.2515   0.163    2.5103
Combined Text Length          0.3174  0.0424  ***     -1.1624  -0.151    0.2856  ***
Text Length Relations        -0.0370  0.0392          -0.5867  -0.076    0.2852  *
Type-Token Ratio             -0.1541  0.4098          -6.3566  -0.828    1.5727  ***
Cohesion (µ)                 -0.1653  0.3487           1.5112   0.197    1.5061
Cohesion (σ)                 -0.8735  0.7246          -3.4195  -0.445    2.7403
Academic Word List           -0.0108  0.3589          -0.8702  -0.113    2.3809
Reader/Writer Visibility 1   -0.0096  0.0230          -0.3583  -0.047    0.2306
The Results II
                             β-Coef.    S.E.          β-Coef.     AME      S.E.
Reader/Writer Visibility 2   -0.0851  0.0361  *       -0.9777  -0.127    0.6859
Grade (truncated)            -0.0055  0.0159           0.1541    0.02    0.0966
Years of English Tuition     -0.0066  0.0079          -0.0139  -0.002    0.0350
Age (truncated)              -0.0189  0.0058  **      -0.0246  -0.003    0.0491
Female                        0.0624  0.0387           0.5358    0.07    0.3035  .
  Default = Male
Social Sciences              -0.3159  0.1340  *        0.1415   0.018    0.2818
Science                      -0.3562  0.1249  **       0.0200   0.003    0.3054
  Default = Humanities
MRC Length                    0.0981  0.0386  *        0.2061   0.027    0.1863
MRC Frequency                -0.0437  0.0237  .       -0.1096  -0.014    0.1360
MRC Concreteness             -0.0005  0.0217           0.1852   0.024    0.0834  *
Belgium                       0.4296  0.1772  *       -0.5692  -0.074    3.0118
Botswana                     -0.1483  0.1929          -0.2078  -0.027    1.1476
Bulgaria                      0.4849  0.1795  **      -0.1022  -0.013    1.5073
Czech Republic                0.1927  0.1966           1.1815   0.154    1.3433
ESL                          -0.3318  0.1930  .        0.9345   0.122    1.3812
Finland                       0.6938  0.1808  ***      0.3254   0.042    1.4068
Greater China                -0.1601  0.1844           0.6470   0.084    1.2497
The Results III
                             β-Coef.    S.E.          β-Coef.     AME      S.E.
Germanic                      0.2204  0.1815           1.3361   0.174    1.1574
IndoThai                     -0.8334  0.2687  **       0.3768   0.049    1.3308
Italy                         0.5099  0.1883  **      -0.1500   -0.02    1.8934
Japan                        -0.1109  0.1875           0.8262   0.108    1.2544
Korea                        -0.6147  0.2900  *        1.1176   0.146    1.2690
Norway                        0.7539  0.1711  ***     -3.8321  -0.499  136.0929
Poland                        0.2089  0.1857           0.7670     0.1    1.7382
Russia                        0.1917  0.2009           1.5672   0.204    1.1829
Spain                         0.6052  0.1860  **       0.4621    0.06    1.4366
Sweden                        0.7369  0.1701  ***     -3.5256  -0.459  147.8183
Turkey                        0.0886  0.2032           1.0227   0.133    1.4696
  Default = NS
Log(theta)                    0.8715  0.0720  ***
Note: Significance levels: . p<0.1; * p<0.05; ** p<0.01; *** p<0.001