Explaining ‘learning’ in MOOCs: A regression analysis of the empirical data from the first edX courses

Navroop K. Sahdev

Erasmus Mundus Masters EPOG

University of Paris XIII

April 12, 2015

Abstract

This research explores the factors that explain learning outcomes in Massive Open Online Courses (MOOCs) using regression analysis, with ‘certification’ and ‘grade’ taken as the key learning indicators. Given the diversity of registrants who engage in MOOCs, with an equally diverse set of motivations, relying on a single indicator can be counterproductive for understanding learning outcomes. The study therefore analyzes two sets of models, one for each indicator, each regressed on an individual registrant’s highest level of educational attainment and engagement in the course. The theoretical explanation rests on the notion of ‘absorptive capacity’1, which plays a key role in imparting learning. At the same time, the study finds a clear positive impact of active engagement in the course on learning outcomes. The data come from edX and cover its first 17 courses.

Keywords

MOOCs, learning, regression.

1 The understanding of the idiosyncratic nature of knowledge, from the Arrovian account of its limited appropriability, non-divisibility and stickiness, has led to a large body of work in Knowledge Economics which emphasizes the absorptive capacity of the recipient, whether an individual or a firm, to absorb and generate new knowledge. Without opening a new research agenda on ‘absorptive capacity’, the current research uses the term in line with this general understanding of the notion. For further study, see Schumpeter (1942), Nelson (1959), Penrose (1959), Arrow (1962), Polanyi (1958 and 1966) and Antonelli (2007).

Contents

1. Introduction
2. The research question
3. Methodology
4. Limitations of the study
5. The dataset
6. The variables
7. How can learning be measured? A survey of the relevant literature.
8. Descriptive Statistics
9. Generalized Linear Model (GLM) Regression
a. The choice of the model
b. Regression results
10. Conclusion
11. References
12. Appendix: Source code in R

Introduction

Commentators around the world, in media and academia alike, have taken strong positions on the impact that Massive Open Online Courses (MOOCs) can have on the delivery of higher education. Some call them a revolution that will put universities out of business; others call them a failed phenomenon (since they did not displace universities after all) or simply shrug them off as no substitute for conventional, university-delivered education. The key question in analyzing the impact of MOOCs is whether they are an efficient way of learning. Answering it requires measuring learning outcomes from MOOC data. The question is a critical one: in the age of digitalization, knowledge and expertise imparted online may serve as the key to the eventual democratization of education, while imparting critical skills to registrants.

The research question

The current paper seeks to identify which variables play a key role in learning outcomes in MOOCs. It identifies four independent variables and two dependent variables from the available dataset. The theoretical justification for each is set out in the sections that follow.

Methodology

Regression analysis is undertaken to analyze the impact of previous educational attainment and engagement in the course on the final grade and certification of registrants. Various regression models are fitted to understand these dynamics. The software used for descriptive statistics, regression analysis and graphical presentation is R, with the exception of some plots which were generated through other means. It should be noted that the data is already ‘trimmed’: outliers (in this case, individuals with unusually high activity in any course) have already been removed2. This is one of the key strengths of the data, as extreme values are not expected to unduly influence the results, which should yield more robust estimates.

2 This has primarily been done to avoid easy identification of these individuals (which is possible through their posts on discussion forums and social media), as the data is governed by the Family Educational Rights and Privacy Act (FERPA), which protects the privacy of registrant records.

Limitations of the study

The only dataset available on MOOCs (used for this study) should be considered at best a subset, albeit a large one, of the courses available during the first year of edX, which limits its use for drawing broader conclusions. A key feature of the dataset is that the courses under study are largely graduate-level courses; registrants with only undergraduate training are therefore expected to have lower engagement levels and weaker eventual learning outcomes. This is confirmed by the regression results. Similarly, the explanatory variables explain very little of the learning outcomes (certification and grades) for registrants who hold a Doctorate as their highest degree, again consistent with the fact that the courses covered by the data are largely graduate level (probably rendering doctoral registrants overqualified for them).

The dataset

The dataset used is the first of its kind: the “HarvardX-MITx Person-Course Dataset AY2013”, jointly released by Harvard and MIT in May 2014, is the first (de-identified) data available on MOOCs. It consists of 641,138 observations of 20 variables, at the level of one row per person, per course. This means, for example, that if one person is enrolled in three MITx or HarvardX courses during the period covered by the dataset, that person has three rows associated with their user ID (The HarvardX-MITx Person-Course Dataset AY2013, 2014). A total of 17 courses were provided on the edX platform.
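
The one-row-per-person-per-course layout can be illustrated with a small sketch in R. The toy data frame below is invented for illustration only; the column names userid_DI and course_id are assumed to match the released dataset, and the user IDs and course labels are hypothetical placeholders.

```r
# Toy person-course table (invented values; column names are assumptions).
pc <- data.frame(
  userid_DI = c("u1", "u1", "u1", "u2"),
  course_id = c("course_A", "course_B", "course_C", "course_B"),
  stringsAsFactors = FALSE
)

# One row per person per course: each (user, course) pair appears once.
stopifnot(!any(duplicated(pc[, c("userid_DI", "course_id")])))

# Number of rows (course registrations) associated with each user ID.
rows_per_user <- table(pc$userid_DI)
print(rows_per_user)
```

A registrant enrolled in three courses, such as the hypothetical user u1 here, contributes three rows, so counts of rows and counts of distinct persons differ.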

The variables3

There are two kinds of variables in the dataset: administrative (variables that come from the edX system or are computed by the research team) and user-provided (variables that come from questions asked by edX of the registrant at the time of registration).

The following is a brief summary of the variables relevant to the current analysis:

1. viewed: administrative, binary; anyone who accessed the ‘courseware’ tab within the edX platform for the course.
2. explored: administrative, binary; anyone who accessed at least half of the chapters in the courseware.
3. certified: administrative, binary; anyone who earned a certificate.
4. country: mix of administrative (computed from IP address) and user-provided (filled in from registrant address).
5. edu: user-provided, highest level of education completed. Possible values: ‘less than Secondary’, ‘Secondary’, ‘Bachelors’, ‘Masters’, ‘Doctorate’.
6. grade: administrative, final grade in the course, ranges from 0 to 1.
7. nevents: administrative, number of interactions with the course.
8. ndays_act: administrative, number of unique days the registrant interacted with the course.
9. nplay_video: administrative, number of play video events within the course.
10. nchapters: administrative, number of chapters with which the registrant interacted.

3 Source: The HarvardX-MITx Person-Course Dataset AY2013 documentation, May 27, 2014.

Registrants have to take separate assessments for certification, which are independent of the video lectures and courseware they access. Accordingly, the course metrics (nchapters, nevents and ndays_act) are independent measures of a registrant’s overall activity, including the number of clicks while undertaking assessments and, more importantly, a registrant’s activity in the social forums. The paper argues that course metrics are important explanatory variables for the learning outcomes of MOOCs.

Haggard (2013) argues that the literacies and skills required to benefit from MOOCs are in fact very specific, and that existing educational curricula may be unsuited to them. This poses interesting questions about the sustainability, equality and accessibility of MOOCs. The obvious next question is: are MOOCs a positive disruption to higher education after all? Before answering it, a researcher assessing the impact of MOOCs on institutions of higher education as well as on learners has to ask: what are MOOCs able to achieve? An in-depth analysis of the learning that MOOCs foster is accordingly undertaken in the following section.

How can learning be measured? A survey of the relevant literature.

The aim of this paper is to analyze the variables that explain learning outcomes on online learning platforms such as MOOCs. This in turn is critical for understanding the added value of MOOCs as the newest platform for learning. For measurement purposes, two variables that account for learning outcomes, ‘grade’ and ‘certification’, are available to the researcher. The quest for better and more comprehensive indicators is an ongoing one. For the very first 17 courses analyzed here in particular, measures of registrant motivation (whether registrants want to earn a certificate or simply plan to learn new content by accessing the video lectures), assessment exercises embedded within the video lectures and other such advanced metrics had not yet been developed. Hence, it remains a challenging task to account for different registrant motivations for measurement purposes. There is as yet no agreed, satisfactory system of measurement for assessing the quality of MOOCs from the learners’ point of view (Haggard, 2013). One therefore has to rely on the available grade and certification data.

Within the literature on the Economics of Education and Sociology, a vast body of work addresses whether grades are a good measure of learning in the traditional delivery of education. Marks et al. (2010) combine instructor-specific grades with a common, centrally graded, high-stakes post-test, which together form ‘student learning’. This measure is in turn combined with student-specific course evaluations. They conclude that “clarity of the instructor, use of supplementary material and overall course experience are the strongest predictors of knowledge acquisition” (Marks, 2010). Hence, a combination of standard tests, while accounting for registrant heterogeneity, can provide the key to measuring learning in education.

Hadsell and MacDermott, in “Faculty Perceptions of Grades: Results from a National Survey of Economics Faculty”, report that 41 per cent of economics faculty agreed or strongly agreed with the statement ‘Students’ concern about grades often interferes with learning in my classroom.’ The implication is that grades do not have to be removed completely, simply de-emphasized (Hadsell, 2009).

Online education (and distance education) presents the added complication of low retention rates, which arise because the cost of education is very low or zero and so provides little incentive for registrants to retain interest in the courses. Once again, registrant motivation becomes the most important variable for deciphering these patterns. The registrants who choose to attend a MOOC tend to be both independent and flexible (Huxley, 2014).

For the purposes of the current study, the working assumption is that learning can be measured to a good extent by grades and certification, as long as adequate attention is paid to registrants whose motivations for using MOOCs are other than certification. This opens a fertile area of research in which newer and more comprehensive data on MOOCs can be of ready aid.

Descriptive Statistics

To give an overall sense of the dataset, this section provides a brief overview of the descriptive statistics of key indicators, accompanied by visual depictions wherever relevant.4

4 For more detailed descriptive statistics, refer to the report “HarvardX and MITx: The First Year of Open Online Courses Fall 2012-Summer 2013”.

The dataset identifies four subpopulations of interest within each course, to account for

the differences in registrant activity:

a. Only registered: Registrants who never accessed the courseware.

b. viewed: Non-certified registrants who access the courseware, accessing less

than half of the chapters.

c. explored: Non-certified registrants who access more than half of the available

chapters in the courseware.

d. certified: Registrants who earn a certificate in the course.

Figure 1: Four Mutually Exclusive and Exhaustive Categories of Course

Registrants

Source: HarvardX and MITx: ‘The First Year of Open Online Courses’ report, 2014.

Figure 1 shows that, out of the total number of registrants, those who viewed the course form a subset. Among the registrants who viewed the courseware, two further subsets, explored and certified, can be seen. There is a significant overlap between the two, but there are also registrants who only explored the course without getting certified, and still others who got certified without ever exploring the courseware (though they might have viewed it).
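
The four subpopulations listed above can be derived from the three binary flags. The following is a minimal sketch in R, assuming the flags are coded 0/1 and that certification takes precedence over the other categories; the function name classify_registrant and the toy rows are hypothetical.

```r
# Assign each registrant to one of four mutually exclusive categories.
# Precedence (an assumption based on the category definitions):
# certified > explored > viewed > only registered.
classify_registrant <- function(viewed, explored, certified) {
  ifelse(certified == 1, "certified",
    ifelse(explored == 1, "explored",
      ifelse(viewed == 1, "viewed", "only registered")))
}

# Toy flags, invented for illustration.
flags <- data.frame(
  viewed    = c(0, 1, 1, 1, 1),
  explored  = c(0, 0, 1, 0, 1),
  certified = c(0, 0, 0, 1, 1)
)
flags$category <- classify_registrant(flags$viewed, flags$explored,
                                      flags$certified)
print(table(flags$category))
```

The fourth toy row is a registrant who is certified without having explored the courseware, the case noted in the description of Figure 1.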

Figure 2: Distribution of highest level of educational achievement, by course.

Another crucial variable under study is the highest level of educational attainment of registrants. Figure 2 depicts educational levels in individual courses, giving a broad view of registrants’ educational achievement as reported by them during registration.

Figure 3: The top 25 countries, by numbers of registrants, for all HarvardX and

MITx registrants.

A third important variable (which is not used in the regression analysis) is the geographical location of the registrants. Since the courses were offered in English, countries with a large number of English speakers are expected to attract a greater number of registrants; this is an obvious bias in the data. The U.S. accounts for 28% of the total registrants. Similarly, India, which has a huge English-speaking population, accounts for 13.2% of the total number of registrants.

Figure 4: Distributions of course activity (in terms of the percentage of chapters

accessed) and course grades (for grades above 1%, linearly adjusted across

courses to a common certification cutoff of 60%).


Figure 4 illustrates how course activity metrics (such as the number of events/clicks, the number of chapters accessed and the number of active days) have great explanatory power in understanding learning outcomes. The bottom-left and top-right quadrants stand out in terms of the density of registrants: low grades are associated with a low level of course activity (the percentage of chapters explored, in this case) and high grades with high course activity.

These broad statistics prepare the ground for the regression analysis undertaken in the following section to understand in depth the factors affecting ‘learning’.

Generalized Linear Model (GLM) Regression

This section undertakes regression analysis for an in-depth understanding of the factors on which learning outcomes, measured in this study by certification and grade, depend. A total of 8 models are presented.

In words, the regression analysis regresses learning outcomes on previous education

and course metrics:

Learning outcome = f (previous education + course metrics)

The theoretical justification of this model rests on the literature on education and knowledge acquisition, in which absorptive capacity plays a key role in acquiring new knowledge. This is reflected in the previous highest level of educational attainment. Additionally, engagement in the course forms an intuitive control (comprising three variables, in fact). Course engagement builds greater absorptive capacity by imparting new learning through video lectures and knowledge sharing through discussion forums and social media, and the course metrics therefore enter the regression equations as important control variables.

Specifically,

Independent variables: ‘edu’, ‘nevents’, ‘ndays_act’, ‘nchapters’.

Dependent variables: ‘certified’, ‘grade’.
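
The eight model specifications follow mechanically from these two lists. As a rough sketch in R, the nested formulas can be generated by adding one predictor at a time for each outcome; the order of addition is taken from the model lists accompanying Tables 1 and 2, and the list names used here are hypothetical.

```r
# Predictors in the order they are added, and the two learning outcomes.
predictors <- c("edu", "nevents", "nchapters", "ndays_act")
outcomes <- c("certified", "grade")

# Build four nested formulas per outcome: edu only, then edu + nevents, etc.
model_formulas <- list()
for (y in outcomes) {
  for (k in seq_along(predictors)) {
    name <- paste0(y, "_model", k)
    model_formulas[[name]] <- reformulate(predictors[1:k], response = y)
  }
}

print(model_formulas[["grade_model4"]])
```

Each formula can then be passed to glm() together with the data and the chosen family.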

The choice of the model

Ordinary Least Squares regression is suited to linear models with continuous dependent variables. However, the dependent variable ‘certified’ is binary, and ‘grade’ is measured as a proportion (between 0 and 1) in the current study. Hence, a Generalized Linear Model (GLM) is used rather than an ordinary Linear Model (LM). Residual deviance serves as the measure of goodness of fit for these probit regressions: the lower the residual deviance, the better the model fits the data.
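
One practical point about specifying such a model in R: glm() called without a family argument, as in the appendix, fits a Gaussian model with an identity link, so a probit for a binary outcome needs the family stated explicitly. The sketch below uses simulated data (all values invented, and only one course metric for brevity) to show an explicit probit fit and the residual deviance of two nested models.

```r
set.seed(1)
n <- 500

# Simulated registrants; values are invented for illustration only.
sim <- data.frame(
  edu = factor(sample(c("Secondary", "Bachelors", "Masters"), n, replace = TRUE)),
  nchapters = rpois(n, 5)
)
# Simulated outcome: certification probability rises with chapters accessed.
sim$certified <- rbinom(n, 1, pnorm(-2 + 0.3 * sim$nchapters))

# Default call: no family given, so R uses gaussian with identity link.
default_fit <- glm(certified ~ edu, data = sim)
print(family(default_fit)$family)

# Explicit probit fits for two nested models.
fit_edu <- glm(certified ~ edu, data = sim,
               family = binomial(link = "probit"))
fit_full <- glm(certified ~ edu + nchapters, data = sim,
                family = binomial(link = "probit"))

# Residual deviance of each model (lower indicates a closer fit).
print(c(edu_only = deviance(fit_edu), with_chapters = deviance(fit_full)))
```

Because the models are nested, residual deviance cannot increase as predictors are added, so comparisons of deviance are most informative alongside the change in degrees of freedom.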

Regression models, results and interpretation.

Table 1 summarizes the results of the first set of models, with ‘certified’ explained by various independent variables. A quick look at the table shows that only one category of the ‘edu’ variable, Masters, is significant (***) across all models, while nevents, nchapters and ndays_act are significant throughout. The groups of registrants with Bachelors and Secondary education become significant explanatory variables only when course metrics are controlled for (Model 2.3 and 2.4). The declining residual deviance points towards better fits, with model 4 as the best fit. The intercept corresponds to the reference category of ‘edu’ with all included course metrics equal to zero, and is not very useful for the current analysis.

Model 1: certified ~ edu
Model 2: certified ~ edu + nevents
Model 3: certified ~ edu + nevents + nchapters
Model 4: certified ~ edu + nevents + nchapters + ndays_act

Table 1: ‘certified’ as the learning outcome

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Table 2 summarizes the results of the second set of models, with ‘grade’ regressed on ‘edu’, ‘nevents’, ‘nchapters’ and ‘ndays_act’, added in progression. Again, only one category of the ‘edu’ variable, Masters, is significant (***) across all models, while nevents, nchapters and ndays_act are significant throughout. The declining residual deviance points towards better fits, with model 4 as the best fit.

Notice that model 4 of this set has the lowest residual deviance among all models5. This suggests that grade is best explained when the previous level of education is up to a Masters degree, while controlling for all course metrics that reflect registrant engagement in the courseware (nevents, nchapters and ndays_act). Again, the groups of registrants with Bachelors and Secondary education become significant explanatory variables only when course metrics are controlled for (Model 2.3 and 2.4). This can be interpreted as follows: Secondary and Bachelors registrants who take an active role in the course are able to achieve higher grades through this training. It is a notable result that controlling for course metrics is what renders the Bachelors and Secondary education categories significant.

The results are arguably less intuitive in that the previous level of educational achievement is not a linear explanatory variable. That is, a higher level of education does not translate into higher grades (with education entered as a categorical variable). On the contrary, only registrants with a Masters as their highest level of educational attainment are well placed in terms of their absorptive capacity, with or without controlling for course metrics. The coefficient is in fact negative for those who hold a doctorate after controlling for course metrics.
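
To see why a categorical ‘edu’ imposes no linear ordering, it helps to look at how R expands a factor into indicator columns. The sketch below is illustrative only: the level order, and hence the reference category, is an assumption, since the paper does not state which level R used as the baseline.

```r
# Education levels; the ordering (and so the reference level) is assumed.
edu_levels <- c("Less than Secondary", "Secondary", "Bachelors",
                "Masters", "Doctorate")
edu <- factor(edu_levels, levels = edu_levels)

# With R's default treatment contrasts, the first level is absorbed into
# the intercept and every other level gets its own 0/1 indicator column.
design <- model.matrix(~ edu)
print(colnames(design))
print(design)
```

Each indicator carries its own coefficient, measured relative to the reference level, so the estimate for Masters can be positive and significant while that for Doctorate is negative.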

Model 1: grade ~ edu
Model 2: grade ~ edu + nevents
Model 3: grade ~ edu + nevents + nchapters
Model 4: grade ~ edu + nevents + nchapters + ndays_act

Table 2: ‘grade’ as the learning outcome

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

5 This might also explain why the HarvardX and MITx report ‘The First Year of Open Online Courses’ chooses to rely on ‘grade’ as a more important variable than ‘certified’, as depicted in Figure 4. Statistically, this model is the best fit.

Conclusion

The study confirms the maintained hypothesis that grade and certification are explained by previous levels of educational achievement and the level of engagement of registrants in the course. While higher levels of engagement are associated with higher grades and certification, the level of education of registrants shows a particular pattern. Registrants who have a Masters as their highest level of educational achievement are the most significant group of learners in the dataset, even when their course engagement is not controlled for. Econometrically, registrants with Bachelors and Secondary education as their highest level of educational achievement are the second most significant set of learners, conditional on their engagement in the course. In other words, learning is fostered in undergraduate students through engagement in the course, and this is reflected in their grades and certification. Course engagement matters.

As is clear from the models presented above, the usual intuition that learning outcomes are a linear function of educational achievement does not apply. Instead, learning outcomes are significantly related to attainment up to Masters level, but beyond that (doctorate) they are not significantly explained by previous levels of education.

Secondly, the course metrics (number of events, counted from tracking logs; number of chapters accessed; and number of unique days active) are all significant indicators of learning outcomes. This sends a strong message that engagement in the course matters in explaining learning outcomes. This is good news for MOOC providers, as it suggests that the content delivered through these online platforms adds value for the registrants who take them. Even though there are exceptional cases of registrants faring well in the assessments without much engagement in the courses, these cases seem to represent a very small share of the total. On average, course metrics are very important explanatory variables for the learning outcomes, grade and certification.

Future research may study the different motivations of registrants in using MOOCs (which are now captured at registration through a preliminary question) and integrate this heterogeneity into the analysis of learning outcomes. The availability of such data is the key to beginning to analyze the impact of MOOCs on the current higher education market and their role in knowledge acquisition.

References

Barber, M. a. (2013). An avalanche is coming: Higher education and the revolution ahead. Institute for Public Policy Research.

Brynjolfsson, E. a. (2011). Goodbye Pareto Principle, Hello Long Tail: The effect of search costs on the concentration of product sales. Management Science, 32.

Dellarocas, C. (2013). Money models for MOOCs: Considering new business models for massive open online courses. Communications of the ACM.

Globalization and Higher Education. (2002). UNESCO's First Global Forum on International Quality Assurance, Accreditation and the Recognition of Qualifications in Higher Education. Paris: UNESCO.

Hadsell, L. a. (2009). Faculty Perceptions of Grades: Results from a National Survey of Economics Faculty. International Review of Economics Education, 20.

Haggard, S. (2013). The Maturing of the MOOC. London: Department for Business, Innovation and Skills.

Ho, A. a. (2014). HarvardX and MITx: The first year of open online courses. HarvardX Research Committee and the Office of Digital Learning at MIT.

Huxley, G. a. (2014). An Economic Model of Learning Styles. Bristol: The Centre for Market and Public Organisation.

Huxley, G. (n.d.). An Economic Model of Learning Styles. p. 51.

Liyanagunawardena, T. (2013). MOOCs: A Systematic Study of the Published Literature 2008-2012. The International Review of Research in Open and Distance Learning, 26.

Marks, M. a. (2010). Do Course Evaluations reflect student learning? Evidence from a Pre-test/Post test setting. 29.

Norton, A. (2013). The Unbundling and Re-bundling of Higher Education. 10.

The HarvardX-MITx Person-Course Dataset AY2013. (2014, May).

Vardi, M. Y. (2012, November). Will MOOCs Destroy Academia? Communications of the ACM.

Appendix 1: Source code in R

# Import the dataset in RStudio via Tools > Import Dataset > From Text File,
# as a data frame named 'data'.

# Extract the variables used in the analysis.
viewed <- data$viewed
explored <- data$explored
certified <- data$certified
country <- data$final_cc_cname_DI
edu <- data$LoE_DI
grade <- data$grade
nevents <- data$nevents
ndays_act <- data$ndays_act
nchapters <- data$nchapters

# Note: glm() is called without a 'family' argument throughout, so each
# model is fitted with R's default family (gaussian, identity link).

# First set of models: 'certified' as the learning outcome.
model1.1 <- glm(certified ~ edu)
summary(model1.1)

model1.2 <- glm(certified ~ edu + nevents)
summary(model1.2)

model1.3 <- glm(certified ~ edu + nevents + nchapters)
summary(model1.3)

model1.4 <- glm(certified ~ edu + nevents + nchapters + ndays_act)
summary(model1.4)

# Second set of models: 'grade' as the learning outcome.
model2.1 <- glm(grade ~ edu)
summary(model2.1)

model2.2 <- glm(grade ~ edu + nevents)
summary(model2.2)

model2.3 <- glm(grade ~ edu + nevents + nchapters)
summary(model2.3)

model2.4 <- glm(grade ~ edu + nevents + nchapters + ndays_act)
summary(model2.4)
