Man - Machine - Gene
Predicting cognitive ability, non-cognitive traits and educational attainment from teacher
assessments, short essays and the genome
Tobias Wolfram∗ Felix C. Tropf†
∗ Bielefeld University, Faculty of Sociology, Universitätsstraße 25, 33615 Bielefeld, Germany
† École Nationale de la Statistique et de L’administration Économique, Center for Research in Economics and Statistics, 91764 Palaiseau, France
Abstract
To what extent can nonstandard types of data predict psychological and social outcomes over the life
course? We leverage a unique British dataset to study the predictive utility of short essays written at
age 11 and genetic polymorphisms. Using state-of-the-art methods from natural language processing and
genomics, we find that both approaches predict cognitive ability, non-cognitive traits and educational
attainment with, in part, impressive precision: Performance based on the text samples (up to 61%, 9% and
25%) mirrors that of teacher evaluations (up to 66%, 19% and 29%) obtained at the same age. Prediction from
genetic data is overall substantial, but measurably smaller (up to 17%, 5% and 19%). Combining all three sources
of data explains 38% of variation in educational attainment and 70% in cognitive ability, approaching
test-retest reliability of benchmark intelligence tests. We conclude that in order to improve predictive
performance in the social and behavioral sciences, more attention should be paid to nonstandard data
sources.
Prediction, the ability to assert that certain changes will be accompanied by or lead to other changes, lies
at the very heart of the scientific endeavour (Popper, 1962, p. 339). This is no less true of the social and
behavioral sciences: The task of forecasting individual and collective behavior from the circumstances of the
present and past has been argued for by philosophers of science since the earlier days of the field, beginning
with the concept of social prediction (Kaplan, 1940) and the conclusion that the causal explanation of a
social phenomenon must also serve as the basis of its future prediction (Hempel and Oppenheim, 1948).
However, such appeals found no lasting resonance in disciplinary practice until recently: Over the
past few years numerous behavioral and social scientists have begun to argue for the importance of and potential
for prediction in their respective fields (Watts, 2014; Kleinberg et al., 2015; Cranmer and Desmarais, 2017;
Yarkoni and Westfall, 2017; Hofman et al., 2021), a development in part driven by the increased availability
of data, computational power and sophistication of machine learning algorithms (Rahal et al., 2021).
Such approaches can be especially beneficial when focusing on predicting life outcomes, allowing for the
identification of individuals at elevated risk with respect to various socially relevant variables (Chandler et al.,
2011; Kleinberg et al., 2015; Berk et al., 2019) in order to intervene. They furthermore might lead to a
better understanding of the rigidity of life trajectories and outcomes (Salganik et al., 2020), including family
planning, professional careers (Geyer et al., 2006), longevity and health (Lleras-Muney, 2005) as well as values
and beliefs (Weakliem, 2002) and the risk of receiving a criminal conviction (Lochner and Moretti, 2004).
For this, sociologists and economists stress the importance of educational attainment, as it has been shown to
be a central stratum of life course trajectories and careers. Raising the level of educational attainment has
long been a major goal of policymakers, and the extent to which education works as a “great equalizer”
is often viewed as a central measure in determining the degree of meritocracy within a society (Bernardi and
Ballarino, 2016), leading to sizable proportions of GDP being invested into educational institutions
(OECD, 2021b, p. 244).
Nevertheless, in developed societies, educational attainment is completed relatively late in life: On average
across OECD countries, more than half of 18-24 year-olds and even 16% of 25-29 year-olds
have not yet finished their education (OECD, 2021b, p. 54), limiting its potential for forecasting purposes as
an early warning sign of detrimental development and making it necessary to find measures predictive of life
outcomes that are available earlier and therefore allow for intervention and educational streamlining (Hart,
2016). Furthermore, it is questionable to what extent educational attainment might be just a proxy
for cognitive ability and non-cognitive traits of the individual, with little value in itself (Caplan, 2019).
A commonly used alternative, e.g. in the case of academic tracking (Baeriswyl et al., 2011), are teacher
assessments, as it can be assumed that the judgment of a pedagogically trained professional who spends
significant amounts of time with a student on a daily basis is capable of correctly evaluating the student's potential from
an early age on. Indeed, meta-analyses indicate that teacher assessments show moderate to high correlations
with important traits like academic achievement, cognitive ability, creativity and social skills (Urhahne and
Wijnia, 2021). However, they might also exhibit potential biases related to factors like sex, class or race
(Campbell, 2015).
Due to such drawbacks, debate on a new measure of children’s abilities, educational attainment and subsequent
life course outcomes has recently gained traction: Genetic data (Hart, 2016; Plomin, 2018; Morris et al.,
2020). While birth weight as an early biological proxy has been introduced to economic research a while ago,
twin studies pointed towards strong genetic influences on educational achievement (Krapohl et al., 2014),
attainment (Silventoinen et al., 2020) and other socioeconomic factors (Marks, 2017). Today, more and more
studies can attribute these effects directly to actual variants in the genome (e.g. Lee et al., 2018), allowing for
the direct prediction of education and other outcomes from genotyped data (Selzam et al., 2017; von Stumm
et al., 2020), in theory starting from the point of conception.
Furthermore, there is growing evidence for the existence of a completely different, early available measure
that contains valuable information about the individual: Textual data. Alvero et al. (2021) show that content
and style of an essay are related to household income and SAT scores. Writing samples allow for modest
prediction of personality facets (Fast and Funder, 2008; Cutler et al., 2021), mental health (Rodriguez et al.,
2010), cognitive ability (Abramov and Yampolskiy, 2019) and educational achievement (Cöltekin, 2020).
Rapid recent advances in deep learning based natural language processing (e.g. Brown et al., 2020) that have
not been utilized in any of the aforementioned studies imply that the ceiling for text-based prediction of life
outcomes might not have been reached yet.
In this paper, we contrast, for the first time, the predictive power of teacher assessments (man),
deep-learning-based prediction from textual data (machine), and genomic data (gene) with respect to cognitive and
non-cognitive traits and educational attainment. In addition, we analyze the role of cognitive, non-cognitive
and discriminatory factors contributing to the different prediction techniques.
In order to achieve this goal, we rely on a unique data source: The National Child Development Study (NCDS,
Power and Elliott, 2006), an ongoing British birth cohort study started in 1958. At age 11, participants were
requested to write an essay of roughly 250 words on the theme “Imagine you are 25.” At the same
time, teachers were asked to give an assessment of the respondents’ abilities and behaviors. Eventually, in
2002, blood samples of participants were collected and later genotyped (see Materials & Methods).
Both essays and genotyping results confront us with the challenges of unstructured big data and require
extensive feature engineering. We therefore leverage state-of-the-art deep learning language models (Vaswani
et al., 2017) and over 500 lexicographic metrics to create more than 1000-dimensional numerical representations
of each essay. Likewise, we reduce the complexity of our genetic data, which encompasses more than 35 million
single nucleotide polymorphisms (SNPs): By utilizing publicly available summary statistics of genome-wide
association studies (GWAS) we use different subsets of all available SNPs to construct polygenic scores (PGS)
for a set of 33 curated traits, likely to be associated with our outcomes of interest in a multi-polygenic score
approach to trait prediction (Krapohl et al., 2018). In contrast, only 22 items are used for teacher evaluations
(see SI for a detailed list). The covariates thus constructed are used as cross-validated input to an ensembled
array of established machine learning algorithms in the SuperLearner framework (Van der Laan et al., 2007;
Polley and Van Der Laan, 2010) to maximize prediction while guarding against overfitting.¹
Figure 1: Overview of the research design: Genes, essays and teacher assessments are used to construct covariates as input to an ensembling algorithm (SuperLearner) that is used to predict cognitive ability, non-cognitive traits and educational attainment.
Following this design (also schematically displayed in Figure 1), our research is highly relevant to various
current debates in the social and behavioral sciences and the broader societal discourse: First, the ability of
computational social scientists to predict individual life outcomes in general has been questioned (Turkheimer
and Waldron, 2000; Salganik et al., 2020). By looking at a broad range of variables measured at different
points in time and multiple sets of innovative predictors, as well as by using best practices from the field of
machine learning, our work presents an important additional test of this gloomy prospect.
¹ All these aspects are described in more detail in the Materials & Methods section as well as in the SI.
Second, behavioral geneticists suggest that genetic factors are predictive of various life outcomes (Polderman
et al., 2015), including psychological ones as well as educational attainment. Twin studies suggest that genes
explain up to 50% of individual differences, while out-of-sample predictions based on molecular data remain
smaller at around 17% (Okbay et al., 2022). Given the apparently consistent ability of genetic variants
to predict education across Western populations (Rietveld et al., 2013), and the interpretation of non-genetic
variation as unsystematic (Plomin, 2011) or luck (Jencks et al., 1972) and therefore unpredictive, the use of
genes as incremental information for college admission is even being debated (Plomin, 2018; Harden, 2021). Our
direct comparison of genomic prediction to textual data and teacher assessments provides an important
benchmark in debates on the applicability of genetic predictors in an educational context (Hart, 2016; Plomin,
2018; Morris et al., 2020).
Finally, the juxtaposition of “man” against “machine” has a long tradition (Jones, 2013), and progress in the
fields of robotics and artificial intelligence (Silver et al., 2018; Brown et al., 2020), not only in manual
but also in cognitive tasks and health care, is causing a surge of “automation anxiety” (Feigenbaum and
Gross, 2020) in developed countries. In this sense, AI-based automation of teaching is a key issue in current
debates on technology and education (Selwyn, 2021), as a growing industry is already dedicated to the task
of automating essay scoring (Foltz et al., 2020). At the same time, automated essay evaluation might be
considered an opportunity, not only to support teachers but also to obtain neutral assessments in contrast to
human evaluation (Alvero et al., 2021).
In the following, we first investigate the general predictive power of man, machine and gene for cognitive
ability as well as for non-cognitive traits, which have been highlighted as complementary for the educational success
of students, during childhood, at or close to the time at which the essays were written and the
teachers provided their assessments of the students’ abilities (age 11). Next, we quantify the predictability of
educational attainment at age 33 by all three predictors separately, incrementally, and jointly, and compare
those results with alternative measures from the scientific literature. Finally, in particular in the context of
the man and machine comparison, we investigate both predictive measures of the assessments and the mediating
and confounding factors, respectively, of their association with educational attainment.
Results
Essays, Teacher Assessments and Genes predict Cognitive and non-cognitive Traits
First, we find quite substantial predictive power of both teacher evaluation and our essay-based deep learning
(DL) algorithm: For cognitive ability at age 11, 66% (64%-68% over all cross-validation folds, see Figure 2
A) of individual differences in a general factor of cognitive ability can be predicted by teacher assessments,
and 61% (58%-64% over all cross-validation folds) by essay-based DL. More specifically, for reading (Man =
Comparable prediction of Man, Machine, Genes and cognitive, non-cognitive skills.
All three approaches predict educational attainment (see Fig. 2 B). Again, teacher assessment shows the
highest prediction (29%), followed by DL (25%) and PGS predictions (19%). Combining teachers’ evaluation
and the essay-based prediction only marginally improves the human prediction by 7% (2 percentage points,
pp), but additional information from the teacher improves the DL prediction by about one quarter (6 pp). Polygenic
prediction adds substantially to man (24%; 7 pp) and DL prediction (28%; 7 pp). Incrementally to the two
other approaches, each of man (19%; 6 pp), machine (6%; 2 pp) and gene (23%; 7 pp) adds information to the
prediction.
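The relation between the relative gains and the percentage-point gains quoted above can be checked with a short calculation. The sketch below is only illustrative: the single-source and joint R² values are the rounded figures from the text, while the three pairwise values (31, 36, 32) are back-calculated from the reported percentage-point gains and are therefore an assumption, not estimates reported by the authors. The dictionary keys and the helper `gain` are hypothetical names.

```python
# R^2 in percent. Single-source and joint values are quoted in the text;
# the pairwise values are back-calculated from the reported pp gains (assumption).
r2 = {
    "man": 29, "machine": 25, "gene": 19,
    "man+machine": 31, "man+gene": 36, "machine+gene": 32,
    "all": 38,
}

def gain(base, combined):
    """Return (percentage-point gain, relative gain in percent, rounded)."""
    pp = r2[combined] - r2[base]
    return pp, round(100 * pp / r2[base])

# Essay added to teacher assessment, and teacher assessment added to essay.
essay_on_teacher = gain("man", "man+machine")
teacher_on_essay = gain("machine", "man+machine")

# Each source added incrementally to the other two.
man_incremental = gain("machine+gene", "all")
machine_incremental = gain("man+gene", "all")
gene_incremental = gain("man+machine", "all")
```

Under these assumed pairwise values, the same 7 pp gain from polygenic scores corresponds to a larger relative gain over the essay baseline than over the teacher baseline, simply because the essay baseline is lower.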
The joint overall prediction of 38% is indeed comparable to the prediction based on cognitive abilities measured
at age 11 as well as non-cognitive abilities (Fig. 2 C). The figure also includes birth weight as a predictor,
since it has long served as a biological proxy in the social sciences associated with positive life course
outcomes, as well as body height, which is an example from social psychology explaining socio-economic
success (Stulp, 2013). However, these measures explain only 1% and 3% of the variance in education. Even one of the
key sociological predictors of educational attainment, parental education, is only able to explain 12% of the
variance.
Figure 2: A: 10-Fold Cross-Validated Predictive R² from SuperLearner models based on essays, teacher assessments and genomic data for cognitive ability and non-cognitive traits. Whiskers mark lowest and highest R² over all CV folds. B: Improvement of predictive R² for combining Essays, Teacher Assessments and Polygenic Scores compared to each of the three Baseline-Predictions. C: 10-Fold Cross-Validated predictive R² from SuperLearner models based on Cognitive Ability, non-cognitive Traits, Birthweight and Height for highest attained education at age 33. Whiskers mark lowest and highest R² over all CV folds.
Cognitive Ability, Noncognitive Traits and Parental SES mediate the Prediction of Essays,
Teacher Assessments and Genes on Educational Attainment.
What pathways drive the association from Man, Machine and Gene to educational attainment? Next to
race, parental socio-economic status (SES) and sex are potential discriminatory factors in school assessment
(Campbell, 2015), beyond the cognitive and non-cognitive skills which teachers are supposed to evaluate. They
might also be picked up by our DL algorithm from textual cues in the essays. Furthermore, recent research
has shown that polygenic scores for educational attainment capture both direct as well as indirect effects,
the latter potentially being confounded by parental SES (Wang et al., 2021). To better understand the
signals picked up by our nonstandard predictors, we fit a multiple mediation model including cognitive ability,
(TAALES, Kyle et al., 2018) and sentiment (SEANCE, Crossley et al., 2017), 31 metrics of readability and
grammatical and typographical error/word-ratios.
Genomic Data
For multiple subsets of NCDS participants, genotyped data was available. We combined all available genomic
data into a single file and restricted the available SNPs to those in common with the 1000 Genomes reference
panel. The final sample contains 37 772 588 variants on 6437 individuals. Using PRSice2 (Choi and O’Reilly,
2019), we then applied publicly available summary statistics of genome-wide association studies to construct
polygenic scores for a set of 33 curated traits (see appendix) likely to show associations with the outcomes in
question, which span the realms of cognition, mental health, personality, physical composure, social behavior
and substance abuse.
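At its core, a polygenic score is a weighted sum of a person's effect-allele counts, with weights taken from GWAS summary statistics. The following is a minimal sketch of that core idea only. It is not the PRSice2 pipeline, which additionally performs steps such as clumping and p-value thresholding; the SNP identifiers, effect sizes and the function name `polygenic_score` are hypothetical.

```python
def polygenic_score(allele_counts, effect_sizes):
    """Weighted sum of effect-allele counts (0, 1 or 2) over SNPs present in both inputs."""
    shared = allele_counts.keys() & effect_sizes.keys()
    return sum(effect_sizes[snp] * allele_counts[snp] for snp in shared)

# Hypothetical GWAS effect sizes and one hypothetical genotype, for illustration only.
gwas_effects = {"rs_a": 0.10, "rs_b": -0.05, "rs_c": 0.20}
person = {"rs_a": 2, "rs_b": 1, "rs_d": 0}  # rs_c not genotyped, rs_d not in the GWAS

score = polygenic_score(person, gwas_effects)  # 0.10 * 2 + (-0.05) * 1, about 0.15
```

Restricting the sum to shared SNPs mirrors, in a simplified way, the restriction to variants that are available in both the genotype data and the summary statistics.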
Analytical Strategy
SuperLearner
To guarantee that we extract the maximum predictive validity from the three data sources of interest (teacher
assessments, essays, genomic data), we use a so-called SuperLearner (Van der Laan et al., 2007; Polley and
Van Der Laan, 2010) approach, an ensembling algorithm that estimates the performance of a selection of
machine learning models for predicting the analyzed outcomes based on cross-validation. From these, the
SuperLearner determines a weighted average based on their performance. We run the SuperLearner with
nested cross-validation, using 10 folds in both the inner and the outer loop, with the MSE as the cost function,
using the SuperLearner package in R.
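The weighting step can be illustrated with a deliberately reduced sketch in Python. It assumes only two candidate learners (the training mean and a one-predictor linear regression), a single cross-validation loop, and a closed-form weight that minimises the cross-validated MSE of a convex combination. All function names and the toy data are hypothetical. The analysis described above uses the SuperLearner package in R with more learners and an additional outer loop for performance estimation, so this shows the general idea rather than the authors' implementation.

```python
def fit_mean(x, y):
    """Baseline learner: always predicts the training mean of y."""
    m = sum(y) / len(y)
    return lambda xi: m

def fit_linear(x, y):
    """Simple least-squares regression of y on a single predictor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx if sxx else 0.0
    return lambda xi: my + slope * (xi - mx)

def out_of_fold(fit, x, y, k):
    """Cross-validated predictions: each point is predicted by a model that never saw it."""
    preds = [0.0] * len(x)
    for fold in range(k):
        train = [i for i in range(len(x)) if i % k != fold]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(x), k):  # indices held out in this fold
            preds[i] = model(x[i])
    return preds

def ensemble_weight(y, a, b):
    """Weight w in [0, 1] on predictions a (1 - w on b) minimising the squared error."""
    num = sum((yi - bi) * (ai - bi) for yi, ai, bi in zip(y, a, b))
    den = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(1.0, max(0.0, num / den)) if den else 0.5

# Hypothetical toy data with an exact linear trend.
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 for xi in x]
w_linear = ensemble_weight(y, out_of_fold(fit_linear, x, y, 5), out_of_fold(fit_mean, x, y, 5))
```

On this toy data the linear learner predicts the held-out points essentially perfectly, so it receives (almost) all of the weight; with noisier data the weight would be shared between the learners.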
Input Algorithms
We use a diverse range of state-of-the-art machine-learning models as input to the SuperLearner: Extreme
Gradient Boosting, as implemented in the package xgboost (Chen et al., 2015), RandomForest, as implemented
in the package ranger (Wright and Ziegler, 2015), a shallow neural network, as implemented in the package
nnet, SVM with a Radial Basis Kernel, as implemented in the function ksvm (package kernlab), a simple linear regression model
and the mean of the outcome in the training set. As an extremely large number of variables are extracted
from the essays in the form of word embeddings, etc., the LASSO (implemented in glmnet) is used as a
pre-screening algorithm to limit computation time.
Evaluation Metric
We focus on $R^2_{\text{Holdout}}$ (or predictive $R^2$) as our main evaluation criterion (Salganik et al., 2020), which can
be viewed as the predictive equivalent of the well-known coefficient of determination $R^2$. It rescales the squared
error of a model’s prediction by the squared error when the prediction is only based on the mean value in the
training set. For each cross-validation fold we compute
$$R^2_{\text{Holdout}} = 1 - \frac{\sum_{i \in \text{Holdout}} (y_i - \hat{y}_i)^2}{\sum_{i \in \text{Holdout}} (y_i - \bar{y}_{\text{Training}})^2}.$$
Analogous to $R^2$, a value of $R^2_{\text{Holdout}} = 1$ implies a perfect prediction, while $R^2_{\text{Holdout}} \leq 0$ indicates that the
model performs no better than using just the training-set mean for prediction.²
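As a minimal illustration of this metric, the sketch below computes the predictive R² for a single holdout fold in plain Python. The function name `r2_holdout` and the toy numbers are hypothetical and are not taken from the study.

```python
def r2_holdout(y_holdout, y_pred, y_train):
    """Predictive R^2: 1 - SSE(model) / SSE(training-mean baseline), both on the holdout fold."""
    train_mean = sum(y_train) / len(y_train)
    sse_model = sum((y - p) ** 2 for y, p in zip(y_holdout, y_pred))
    sse_baseline = sum((y - train_mean) ** 2 for y in y_holdout)
    return 1.0 - sse_model / sse_baseline

# Hypothetical toy values, not data from the study.
y_train = [1.0, 2.0, 3.0]    # training mean is 2.0
y_holdout = [2.0, 4.0]

perfect = r2_holdout(y_holdout, [2.0, 4.0], y_train)   # predictions equal the outcomes
baseline = r2_holdout(y_holdout, [2.0, 2.0], y_train)  # predicting the training mean
worse = r2_holdout(y_holdout, [5.0, 0.0], y_train)     # predictions far from the outcomes
```

Because the baseline uses the mean of the training set rather than of the holdout fold, the value is not bounded below and can become strongly negative when a few predictions are far off.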
² Five extreme outliers (> 10 SD) were identified in a single fold of a single trait (neuroticism) and set to the mean prediction. This does not influence results in any qualitative way but changes a misleading and not interpretable minimum fold $R^2_{\text{Holdout}}$ of -216 to 0.0024.
References
Polina Shafran Abramov and Roman V Yampolskiy. Automatic IQ estimation using stylometric methods. In
Handbook of Research on Learning in the Age of Transhumanism, pages 32–45. IGI Global, 2019.
AJ Alvero, Sonia Giebel, Ben Gebre-Medhin, Anthony Lising Antonio, Mitchell L Stevens, and Benjamin W
Domingue. Essay content and style are strongly related to household income and SAT scores: Evidence
from 60,000 undergraduate applications. Science Advances, 7(42):eabi9031, 2021.
Franz Baeriswyl, Christian Wandeler, and Ulrich Trautwein. «Auf einer anderen Schule oder bei einer anderen
Lehrkraft hätte es für’s Gymnasium gereicht»: Eine Untersuchung zur Bedeutung von Schulen und Lehrkräften
für die Übertrittsempfehlung. Zeitschrift für Pädagogische Psychologie, 25(1):39–47, 2011.
Richard Berk. Machine Learning Risk Assessments in Criminal Justice Settings.
Springer, 2019.
Fabrizio Bernardi and Gabriele Ballarino. Education, Occupation and Social Origin: A Comparative Analysis
of the Transmission of Socio-Economic Inequalities. Edward Elgar Publishing, 2016.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.
Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
Tammy Campbell. Stereotyped at seven? Biases in teacher judgement of pupils’ ability and attainment.
Journal of Social Policy, 44(3):517–547, 2015.
Gary L Canivez and Marley W Watkins. Long-term stability of the wechsler intelligence scale for Chil-284