Explaining ‘learning’ in MOOCs: A regression
analysis of the empirical data from the first
edX courses.
Navroop K. Sahdev
Erasmus Mundus Masters EPOG
University of Paris XIII
April 12, 2015
Abstract
The current research explores factors that explain learning outcomes in Massive Open
Online Courses (MOOCs) by using regression analysis where ‘certification’ and ‘grade’
are recognized as the key learning indicators. Given the diversity of registrants that
engage in MOOCs with an equally diverse set of motivations, a single indicator can be
counterproductive in understanding learning outcomes of MOOCs. The study
therefore analyzes two sets of models, one for each indicator, in which the outcome is
regressed on an individual registrant’s highest level of educational attainment and
engagement in the course. The theoretical
explanation rests on the notion of ‘absorptive capacity’1, which plays a key role in
imparting learning. At the same time, the study finds a clear positive impact
of active engagement in the course on learning outcomes. The data used is
from edX, on its first 17 courses.
Keywords
MOOCs, learning, regression.
Contents
1. Introduction
2. The research question
3. Methodology
4. Limitations of the study
5. The dataset
6. The variables
1 The understanding of the idiosyncratic nature of knowledge, derived from the Arrovian view of its
limited appropriability, non-divisibility and stickiness, has led to a large body of work in Knowledge Economics which emphasizes the capacity of the recipient (whether an individual or a firm) to absorb and generate new knowledge. Without opening a new research agenda on ‘absorptive capacity’, the current research uses the term in this general sense. For further study, see Schumpeter (1942), Nelson (1959), Penrose (1959), Arrow (1962), Polanyi (1958 and 1966) and Antonelli (2007).
7. How can learning be measured? A survey of the relevant literature.
8. Descriptive Statistics
9. Generalized Linear Model (GLM) Regression
a. The choice of the model
b. Regression results
10. Conclusion
11. References
12. Appendix: Source code in R
Introduction
Commentators around the world, in media and academia alike, have taken strong
positions on the impact that Massive Open Online Courses (MOOCs) can have on the
delivery of higher education. Some call them a revolution that will put universities out
of business; others call them a failed phenomenon (as they did not displace
universities after all) or simply shrug them off as no substitute for conventional
education delivered by universities. The key question in analyzing the impact of
MOOCs is, of course, whether or not they are an efficient way of learning. Answering it
invariably requires measuring learning outcomes from MOOC data. The
question is a critical one: in the age of digitalization, knowledge and expertise
imparted online may serve as the key to the eventual democratization of education,
while imparting critical skills to registrants.
The research question
The current paper seeks to find out which variables play a key role in learning outcomes
in MOOCs. The paper identifies four independent variables and two dependent variables
from the available dataset. The theoretical justification for each is given in the
sections that follow.
Methodology
Regression analysis is undertaken to analyze the impact of previous educational
attainment and engagement in the course on the final grade and certification of registrants.
Various regression models are fitted to understand these dynamics. The software used
for descriptive statistics, regression analysis and graphical presentations is R, with the
exception of some plots which were generated through other means. It should be noted
that the data is already ‘trimmed’: the outliers (in this case, individuals with
unusually high activity in any course) have already been removed2. This serves as one
of the key strengths of the data, as extreme values are not expected to distort the
results, hence yielding more robust estimates.
2 This has primarily been done to avoid easy identification of these individuals (which is possible through
their posts on discussion forums and social media) as the data is governed by the Family Educational
Rights and Privacy Act (FERPA), which protects the privacy of registrant records.
Limitations of the study
The only dataset available on MOOCs (used for this study) should be considered a
subset at best, despite its large size, of the courses available during the first year of edX,
which limits its use for drawing broader conclusions. A key feature of the dataset
is that the courses under study are largely graduate-level courses; hence registrants
with only undergraduate training are expected to have lower engagement levels and
weaker eventual learning outcomes in the course. This is confirmed by the regression results.
Similarly, the explanatory variables explain very little of the learning outcomes
(certification and grades) for registrants who had a Doctorate as their highest degree,
again consistent with the fact that the courses on which the data is available are largely
graduate level (probably rendering doctoral registrants overqualified for them).
The dataset
The dataset used is the first of its kind: the “HarvardX-MITx Person-Course Dataset
AY2013”, jointly released by Harvard and MIT in May 2014, is the first (de-identified) data
available on MOOCs. The dataset consists of 641,138 observations of 20 variables. It is
organized at the level of one row per person, per course. This means, for example, that if
one person is enrolled in three MITx or HarvardX courses during the period covered by the
dataset, that person has three rows associated with their user ID (The
HarvardX-MITx Person-Course Dataset AY2013, 2014). A total of 17 courses were
provided on the edX platform.
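The one-row-per-person, per-course layout can be illustrated with a minimal Python sketch. It is only an illustration of the structure described above (the analysis itself was carried out in R); the field names `userid` and `course_id`, the user IDs and the course codes are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical person-course rows: the IDs and course codes are invented
# for illustration and do not come from the dataset.
rows = [
    {"userid": "u001", "course_id": "course_A", "certified": 0},
    {"userid": "u001", "course_id": "course_B", "certified": 1},
    {"userid": "u001", "course_id": "course_C", "certified": 0},
    {"userid": "u002", "course_id": "course_A", "certified": 0},
]

def rows_per_person(person_course_rows):
    """Count how many person-course rows each user ID has."""
    counts = defaultdict(int)
    for row in person_course_rows:
        counts[row["userid"]] += 1
    return dict(counts)

counts = rows_per_person(rows)
print(counts)
```

A registrant enrolled in three courses therefore contributes three observations, which is why the number of observations exceeds the number of distinct registrants.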
The variables3
There are two kinds of variables in the dataset: administrative (variables that come from
the edX system or are computed by the research team) and user-provided (variables that
come from questions asked by edX of the registrant at the time of registration).
The following is a brief summary of the relevant variables for the current analysis:
1. viewed: administrative, binary; anyone who accessed ‘courseware’ tab within the
edX platform of the course.
2. explored: administrative, binary; anyone who accessed at least half of the
chapters in the courseware.
3. certified: administrative, binary; anyone who earned a certificate.
4. country: mix of administrative (computed from IP address) and user-provided
(filled in from registrant address).
5. edu: user-provided, highest level of education completed. Possible values: ‘less
than Secondary’, ‘Secondary’, ‘Bachelors’, ‘Masters’, ‘Doctorate’.
6. grade: administrative, final grade in the course, ranges from 0 to 1.
7. nevents: administrative, number of interactions with the course.
8. ndays_act: administrative, number of unique days registrant interacted with the
course.
9. nplay_video: administrative, number of play video events within the course.
10. nchapters: administrative, number of chapters with which the registrant
interacted.
3 Source: The HarvardX-MITx Person-Course Dataset AY2013 documentation, May 27, 2014.
Registrants have to take separate assessments for certification, which are independent
of the video lectures/courseware they access. Accordingly, the course metrics
(nchapters, nevents and ndays_act) are independent measures of a registrant’s
overall activity, including the number of clicks while undertaking assessments and, more
importantly, a registrant’s activity in the social forums. The paper argues that course
metrics are important explanatory variables for the learning outcomes of MOOCs.
Haggard (2013) argues that the literacies and skills required to benefit from
MOOCs are in fact very specific, and that existing educational curricula may be unsuited
for them. This poses interesting questions about the sustainability, equality and
accessibility of MOOCs. The obvious next question is: Are MOOCs a positive
disruption to Higher Education after all? The prior question a researcher has to
answer in order to assess the impact of MOOCs on institutions of higher education as
well as on learners is: What are MOOCs able to achieve? An in-depth analysis of the
learning MOOCs foster is accordingly undertaken in the following section.
How can learning be measured? A survey of the relevant literature.
The aim of this paper is to analyze the variables that explain learning outcomes in
online learning platforms like MOOCs. This in turn is critical in order to understand the
added value of MOOCs as the newest platform for learning. For measurement
purposes, two variables that account for learning outcomes, ‘grade’ and ‘certification’,
are available to the researcher. The quest for better and more comprehensive indicators
is an ongoing one. In particular, for the very first 17 courses analyzed here,
indicators of registrant motivation (whether registrants want to earn a
certificate or simply plan to learn new content by accessing the video lectures),
assessment exercises embedded within the video lectures and other such advanced
metrics had not yet been developed. Hence, it remains a challenging task to account for
different registrant motivations for measurement purposes. There is as yet no agreed
satisfactory system of measurement for assessing the quality of MOOCs from the
learners’ point of view (Haggard, 2013). One therefore has to rely on the grade
and certification data available.
Within the literature on Economics of Education and Sociology, a vast body of work is
available on whether or not grades are a good measure of learning in the traditional
delivery of education. Marks et al. (2010) undertake a study combining instructor-specific
grades with a common, centrally graded, high-stakes post-test, which
together form ‘student learning’. This measure in turn is combined with student-specific
course evaluations. They conclude that “clarity of the instructor, use of
supplementary material and overall course experience are the strongest predictors of
knowledge acquisition” (Marks, 2010). Hence, a combination of standard tests and
an accounting of registrant heterogeneity can provide the key to measuring learning in
education.
Hadsell and MacDermott, in their “Faculty Perceptions of Grades: Results from a
National Survey of Economics Faculty”, report that 41 per cent of
economics faculty agreed or strongly agreed that ‘Students’ concern about grades often
interferes with learning in my classroom.’ The implication is that grades do not have to
be removed completely, simply de-emphasized (Hadsell, 2009).
Online education (and distance education) presents the added complication of low
retention rates, which arise because education at very low or no cost provides
little incentive for registrants to retain interest in the courses. Once again, registrant
motivation becomes the most important variable for deciphering these patterns. The
registrants who choose to attend a MOOC tend to be both independent and flexible
(Huxley G. a., 2014).
For the purposes of the current study, the working assumption is that learning can be
measured to a good extent by grades and certification, as long as adequate attention is
paid to registrants whose motivations for using MOOCs are other than certification.
This opens a fertile area of research, which newer and more comprehensive data
on MOOCs can readily support.
Descriptive Statistics
To give an overall sense of the dataset, this section provides a brief overview of the
descriptive statistics of key indicators, accompanied by visual depictions wherever
relevant.4
4 For more detailed descriptive statistics, refer to the report “HarvardX and MITx: The First Year of Open
Online Courses Fall 2012-Summer 2013”.
The dataset identifies four subpopulations of interest within each course, to account for
the differences in registrant activity:
a. Only registered: Registrants who never accessed the courseware.
b. viewed: Non-certified registrants who access the courseware, accessing less
than half of the chapters.
c. explored: Non-certified registrants who access more than half of the available
chapters in the courseware.
d. certified: Registrants who earn a certificate in the course.
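The four subpopulations can be derived from the binary flags viewed, explored and certified. The following Python sketch is one possible reading of the definitions above, not the procedure used by the dataset's authors; in particular, the precedence order (certified first, then explored, then viewed) and the function name `subpopulation` are assumptions made here so that the groups stay mutually exclusive.

```python
def subpopulation(viewed, explored, certified):
    """Assign a registrant to one of four mutually exclusive groups.

    Certification takes precedence, then 'explored', then 'viewed';
    this ordering is an assumption of the sketch.
    """
    if certified:
        return "certified"
    if explored:
        return "explored"
    if viewed:
        return "viewed"
    return "only registered"

# Hypothetical flag combinations, given as (viewed, explored, certified).
examples = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 0, 1)]
labels = [subpopulation(*flags) for flags in examples]
print(labels)
```

Under this ordering, a registrant who earned a certificate without accessing half of the chapters is still counted as certified.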
Figure 1: Four Mutually Exclusive and Exhaustive Categories of Course
Registrants
Source: HarvardX and MITx: ‘The First Year of Open Online Courses’ report, 2014.
Figure 1 shows that, out of the total number of registrants, those who viewed the course
form a subset. Among the registrants who viewed the courseware, two further subsets,
explored and certified, can be seen. There is a significant overlap between the two,
but there are also registrants who only explored the course
without getting certified, and others who got certified without ever exploring the
courseware (though they might have viewed it).
Figure 2: Distribution of highest level of educational achievement, by course.
Source: HarvardX and MITx: ‘The First Year of Open Online Courses’ report, 2014.
Another crucial variable under study is the highest level of educational attainment of
registrants. Figure 2 depicts educational levels in individual courses providing a bigger
picture view of the registrants’ level of educational achievement, as reported by them
during registration.
Figure 3: The top 25 countries, by numbers of registrants, for all HarvardX and
MITx registrants.
Source: HarvardX and MITx: ‘The First Year of Open Online Courses’ report, 2014.
A third important variable (which is not used in the regression analysis) is the geographical
location of the registrants. Since the courses were offered in English, countries with a
large number of English speakers are expected to attract a greater number of registrants;
this is an obvious bias in the data. The U.S. accounts for 28% of the total registrants.
Similarly, India, which has a huge English-speaking population, accounts for 13.2% of the
total number of registrants.
Figure 4: Distributions of course activity (in terms of the percentage of chapters
accessed) and course grades (for grades above 1%, linearly adjusted across
courses to a common certification cutoff of 60%).
Source: HarvardX and MITx: ‘The First Year of Open Online Courses’ report, 2014.
Figure 4 demonstrates how course activity metrics (like the number of events/clicks,
number of chapters accessed and number of active days) have great explanatory power in
understanding learning outcomes. The bottom left and the top right quadrants stand out
in terms of the density of registrants, as a low grade is associated with a low level of course
activity (percentage of chapters explored in this case) and a high grade is associated with
high course activity.
These broad statistics prepare the ground for the regression analysis undertaken in the
following section to understand the factors affecting ‘learning’ in depth.
Generalized Linear Model (GLM) Regression
This section undertakes regression analysis for an in-depth understanding of the factors
on which learning outcomes, measured in this study by certification and grade, depend. A
total of 8 models are presented.
In words, the regression analysis regresses learning outcomes on previous education
and course metrics:
Learning outcome = f (previous education + course metrics)
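For the binary outcome ‘certified’, this relationship can be written out more explicitly. The expression below is a sketch under the assumption of the probit link described later in this section; the coefficient symbols and the dummy-variable coding of edu are notation introduced here for illustration and do not appear in the original analysis.

```latex
\Pr(\text{certified}_i = 1)
  = \Phi\!\left(
      \beta_0
      + \sum_{k} \gamma_k \,\text{edu}_{ik}
      + \beta_1 \,\text{nevents}_i
      + \beta_2 \,\text{nchapters}_i
      + \beta_3 \,\text{ndays\_act}_i
    \right)
```

Here \Phi is the standard normal cumulative distribution function, edu_{ik} equals 1 if registrant i reports education level k (with one level omitted as the reference category) and 0 otherwise, and the remaining terms are the three course metrics.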
The theoretical justification of this model rests on the literature on education and
knowledge acquisition where absorptive capacity plays a key role in acquiring new
knowledge. This is reflected through previous highest level of educational attainment.
Additionally, engagement in the course forms an intuitive control variable
(comprising three variables, in fact). Course engagement builds greater absorptive
capacity by imparting new learning through video lectures and knowledge sharing
through discussion forums and social media, and hence enters the model as an important
set of control variables.
Ordinary Least Squares regression is typically used to analyze linear models of
continuous variables. However, the dependent variable ‘certified’ is a binary
variable and ‘grade’ is measured on a probability-like scale (between 0 and 1) in the
current study. Hence, a Generalized Linear Model (GLM) is used rather than an ordinary
Linear Model (LM). Residual deviance serves as the measure of goodness of fit for these
probit regressions: the lower the residual deviance, the better the model
fits the data.
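To make the goodness-of-fit criterion concrete, the following Python sketch computes the residual deviance for a binary outcome, which for ungrouped 0/1 data equals minus twice the log-likelihood of the fitted probabilities. It is an illustration only (the paper's models were fitted in R); the function names, the outcomes and the linear predictors are hypothetical and are not estimates from the paper.

```python
import math
from statistics import NormalDist

def probit_probability(linear_predictor):
    """Map a linear predictor to a probability via the standard normal CDF."""
    return NormalDist().cdf(linear_predictor)

def residual_deviance(outcomes, fitted_probabilities):
    """Residual deviance for binary outcomes: -2 times the log-likelihood."""
    log_likelihood = 0.0
    for y, p in zip(outcomes, fitted_probabilities):
        log_likelihood += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -2.0 * log_likelihood

# Hypothetical outcomes and linear predictors, not estimates from the paper.
certified = [0, 0, 1, 1]
weak_fit = [probit_probability(eta) for eta in (-0.2, 0.1, 0.0, 0.3)]
strong_fit = [probit_probability(eta) for eta in (-1.5, -1.0, 1.2, 1.8)]

deviance_weak = residual_deviance(certified, weak_fit)
deviance_strong = residual_deviance(certified, strong_fit)
print(deviance_weak > deviance_strong)
```

The second set of fitted probabilities sits closer to the observed outcomes, so its deviance is smaller; this is the sense in which a declining residual deviance indicates a better fit.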
Regression models, results and interpretation.
Table 1 summarizes the results of the first set of models, with ‘certified’ explained by
various independent variables. A quick look at the table shows that only one category
of the ‘edu’ variable is significant (***) across all models: Masters; meanwhile nevents,
nchapters and ndays_act are significant throughout. The groups of registrants with
Bachelors and Secondary education become significant explanatory variables only
when course metrics are controlled for (Model 2.3 and 2.4). The declining residual
deviance points towards progressively better fits, with Model 4 as the best fit. The intercept
captures the combined impact of previous level of educational achievement and
engagement in the course and is not very useful for the current analysis.
Model 1: certified ~ edu
Model 2: certified ~ edu + nevents
Model 3: certified ~ edu + nevents + nchapters
Model 4: certified ~ edu + nevents + nchapters + ndays_act
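The four specifications are nested: each adds one course metric to the previous one. A small Python sketch can generate them programmatically; the helper name `nested_formulas` is hypothetical, and the output is only the formula text, not a fitted model.

```python
def nested_formulas(outcome, predictors):
    """Build formula strings that add one predictor at a time."""
    formulas = []
    for count in range(1, len(predictors) + 1):
        right_hand_side = " + ".join(predictors[:count])
        formulas.append(f"{outcome} ~ {right_hand_side}")
    return formulas

predictors = ["edu", "nevents", "nchapters", "ndays_act"]
certified_models = nested_formulas("certified", predictors)
for number, formula in enumerate(certified_models, start=1):
    print(f"Model {number}: {formula}")
```

Calling the same helper with "grade" as the outcome would give a parallel set of four formulas, assuming the second set of models mirrors the first; that assumption is not confirmed by the text above.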