
Using machine learning to identify the most at-risk students in physics classes

Jie Yang,1 Seth DeVore,1 Dona Hewagallage,1 Paul Miller,1

Qing X. Ryan,2 and John Stewart1,*

1Department of Physics and Astronomy, West Virginia University, Morgantown, West Virginia 26506, USA

2Department of Physics and Astronomy, California State Polytechnic University, Pomona, California 91768, USA

(Received 28 July 2020; accepted 29 September 2020; published 28 October 2020)

Machine learning algorithms have recently been used to predict students' performance in an introductory physics class. The prediction model classified students as those likely to receive an A or B or students likely to receive a grade of C, D, F or withdraw from the class. Early prediction could better allow the direction of educational interventions and the allocation of educational resources. However, the performance metrics used in that study become unreliable when used to classify whether a student would receive an A, B, or C (the ABC outcome) or if they would receive a D, F or withdraw (W) from the class (the DFW outcome) because the outcome is substantially unbalanced, with between 10% and 20% of the students receiving a D, F, or W. This work presents techniques to adjust the prediction models and alternate model performance metrics more appropriate for unbalanced outcome variables. These techniques were applied to three samples drawn from introductory mechanics classes at two institutions (N = 7184, 1683, and 926). Applying the same methods as the earlier study produced a classifier that was very inaccurate, classifying only 16% of the DFW cases correctly; tuning the model increased the DFW classification accuracy to 43%. Using a combination of institutional and in-class data improved DFW accuracy to 53% by the second week of class. As in the prior study, demographic variables such as gender, underrepresented minority status, first-generation college student status, and low socioeconomic status were not important variables in the final prediction models.

DOI: 10.1103/PhysRevPhysEducRes.16.020130

I. INTRODUCTION

Physics courses, along with other core science and mathematics courses, form key hurdles for science, technology, engineering, and mathematics (STEM) students early in their college career. Student success in these classes is important to improving STEM retention; the success of students traditionally underrepresented in STEM disciplines in the core classes may be a limiting factor in increasing inclusion in STEM fields. Physics education research (PER) has developed a wide range of research-based instructional materials and practices to help students learn physics [1]. Research-based instructional strategies have been demonstrated to increase student success and retention [2]. While some of these strategies are easily implemented for large classes, others have substantial implementation costs. Further, no class could implement all possible research-based strategies, and some may be more appropriate for some subsets of students than for others. One method to better distribute resources to the students who would benefit the most is to identify at-risk students early in physics classes. The effective identification of students at risk in physics classes and the efficacious uses of this classification represent a promising new research strand in PER.

The need for STEM graduates continues to increase at a rate that is outstripping STEM graduation rates across American institutions. A 2012 report from the President's Council of Advisors on Science and Technology [3] identified the need to increase graduation of STEM majors to avoid a projected shortfall of one million STEM job candidates over the next decade. Improving STEM retention has long been an important area of investigation for science education researchers [4-11]. Targeting interventions to students at risk in core introductory science and mathematics courses taken early in college offers one potential mechanism to improve STEM graduation rates. In recent years, educational data mining has become a prominent method of analyzing student data to inform course redesign and to predict student performance and persistence [12-16].

*[email protected]

Published by the American Physical Society under the terms of the Creative Commons Attribution 4.0 International license. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.


The current study investigates the application of machine learning algorithms to identify at-risk students. Machine learning and data science as a whole are growing explosively in many segments of the economy as these new methods are used to make sense of and exploit the exponentially growing data collected in an increasingly online world. These methods are also being adapted to understand and improve educational data systems. It seems likely that this process will accelerate in the near future as universities, in a challenging financial climate, attempt to retain as many students as possible. We argue that PER should both help shape the construction of retention models of physics students and explore their most effective and most ethical use. The following summarizes the prior study applying education data mining (EDM) techniques in physics classes, provides an overview of EDM, and more specifically an overview of the use of EDM for grade prediction.

A. Prior study: Study 1

This study extends the results of Zabriskie et al. [17], which will be referred to as study 1 in this work. Study 1 used institutional data such as ACT scores and college GPA (CGPA) as well as data collected within a physics class such as homework grades and test scores to predict whether a student would receive an A or B in the first and second semester of a calculus-based physics class at a large university. The study used both logistic regression and random forests to classify students. Random forest classification using only institutional variables was 73% accurate for the first semester class. This accuracy increased to 80% by the fifth week of the class when in-class variables were included. The logistic regression and random forest classification algorithms generated very similar results. Study 1 chose to predict A and B outcomes, rather than the more important A, B, and C outcomes, partially because the sample was significantly unbalanced. Sample imbalance makes classification accuracy more difficult to interpret. Study 1 investigated the effect of a number of demographic variables [gender, underrepresented minority (URM) status, and first-generation status] on grade prediction and found they were not important to grade classification. These groups (women, underrepresented minority students, and first-generation students) were very underrepresented in the sample studied; it was unclear to what extent the low importance of the demographic variables was caused by the demographic imbalance of the sample.

B. Research questions

This study seeks to extend the application of machine learning algorithms to predict whether a student will earn a D or F or withdraw (W) from a physics class. In particular, we explore the following research questions.

RQ1: How can machine learning algorithms be applied to predict an unbalanced outcome in a physics class?

RQ2: Does classification accuracy differ for underrepresented groups in physics? If so, how and why does it differ?

RQ3: How can the results of a machine learning analysis be applied to better understand and improve physics instruction?

C. Educational data mining

Educational data mining can be described as the use of statistical, machine learning, and traditional data mining methods to draw conclusions from large educational datasets while incorporating predictive modeling and psychometric modeling [16]. In a 2014 meta-analysis of 240 EDM articles by Peña-Ayala, 88% of the studies were found to use a statistical and/or machine learning approach to draw conclusions from the data presented. Of these studies, 22% analyzed student behavior, 21% examined student performance, and 20% examined assessments [18]. Peña-Ayala also found that classification was the most common method used in EDM, applied in 42% of all analyses, with clustering used in 27%, and regression used in 15% of studies.

Educational data mining encompasses a large number of statistical and machine learning techniques, with logistic regression, decision trees, random forests, neural networks, naive Bayes, support vector machines, and K-nearest neighbor algorithms commonly applied [19]. Peña-Ayala's [18] analysis found 20% of studies employed Bayes theorem and 18% decision trees. Decision trees and random forests are among the more commonly used techniques in EDM. We use these techniques to investigate our research questions and explore ways to assess the success of machine learning algorithms. More information on the fundamentals of these and other machine learning techniques is readily available through a number of machine learning texts [20,21].

D. Grade prediction and persistence

While EDM is used for a wide array of purposes, it has often been used to examine student performance and persistence. One survey by Shahiri et al. summarized 30 studies in which student performance was examined using EDM techniques [22]. Neural networks and decision trees were the two most common techniques used in studies examining student performance, with naive Bayes, K-nearest neighbors, and support vector machines used in some studies. A study by Huang and Fang examined student performance on the final exam for a large-enrollment engineering course using measurements of college GPA, performance in three prerequisite math classes as well as Physics 1, and student performance on in-semester examinations [23]. They analyzed the data using a large number of techniques commonly used in EDM and found relatively little difference in the accuracy of the resulting models. Study 1 also found little difference in the performance of machine learning algorithms in predicting physics grades.


Another study examining an introductory engineering course by Marbouti et al. used an array of EDM techniques to predict student grade outcomes of C or better [24]. They used in-class measures of student performance including homework, quiz, and exam 1 scores and found that logistic regression provided the highest accuracy at 94%. A study by Macfadyen and Dawson attempted to identify students at risk of failure in an introductory biology course [25]. Using logistic regression they were able to identify students failing (defined as having a grade of less than 50%) with 81% accuracy. With the goal of improving STEM retention, many universities are taking a rising interest in using EDM techniques for grade and persistence prediction in STEM classes [26].

The use of machine learning techniques in physics classes has only begun recently. In addition to study 1, random forests were used in a 2018 study by Aiken et al. to predict student persistence as physics majors and identify the factors that are predictive of students either remaining physics majors or becoming engineering majors [27].

II. METHODS

A. Sample

This study used three samples drawn from the introductory calculus-based physics classes at two institutions. Samples 1 and 2 were collected in the introductory, calculus-based mechanics course (Physics 1) taken by physical science and engineering students at a large Eastern land-grant university (Institution 1) serving approximately 21 000 undergraduate students. The general university undergraduate population had ACT scores ranging from 21 to 26 (25th-75th percentile) [28]. The overall undergraduate demographics were 80% White, 4% Hispanic, 6% international, 4% African American, 4% students reporting two or more races, 2% Asian, and other groups each with 1% or less [28].

Sample 1 was drawn from institutional records and includes all students who completed Physics 1 from 2000 to 2018, for a sample size of 7184. Over the period studied, the instructional environment of the course varied widely, and as such, the result for this sample may be robust to pedagogical variations. Prior to the Spring 2011 semester, the course was presented traditionally with multiple instructors teaching largely traditional lectures and students performing cookbook laboratory exercises. In Spring 2011, the department implemented a learning assistant (LA) program [29] using the Tutorials in Introductory Physics [30]. In Fall 2015, the program was modified because of a funding change, with LAs assigned to only a subset of laboratory sections. The tutorials were replaced with open source materials [31] which lowered textbook costs to students and allowed full integration of the research-based materials with laboratory activities.

Sample 2 was collected from the Fall 2016 to the Spring 2019 semester, when the instructional environment was stable, for a sample size of 1683. The same institutional data were collected and the sample also included a limited number of in-class performance measures: clicker average, homework average, Force and Motion Conceptual Evaluation (FMCE) pretest score, FMCE pretest participation, and the score on in-semester examinations. A more detailed explanation of these variables will be provided in the next section.

Sample 3 was collected at a primarily undergraduate and Hispanic-serving university (Institution 2) with approximately 26 000 students in the western U.S. Fifty percent of the general undergraduate population had ACT scores in the range 19 to 27. The demographics of the general undergraduate population were 46% Hispanic, 21% Asian, 16% White, 6% international, 4% two or more races, 3% African American, 3% unknown, with other races 1% or less [28]. The sample was collected in the introductory calculus-based mechanics class for all four quarters of the 2017 calendar year. This class also primarily serves physical science and engineering students. The course was taught in multiple sections each quarter with multiple different instructors. The pedagogical style varied greatly, with some instructors giving traditional lectures and some teaching using active-learning methods.

B. Variables

The variables used in this study were drawn from institutional records and from data collected within the classes and are shown in Table I. Two types of variables were used: two-level dichotomous variables and continuous variables. A few variables require additional explanation. The variable CalReady measures the student's math readiness. Calculus 1 is a prerequisite for Physics 1. For the vast majority of students in Physics 1, the student's four-year degree plans assume the student enrolls in Calculus 1 their first semester at the university. These students are considered "math ready." A substantial percentage of the students at Institution 1 are not math ready. The variable STEMCls captures the number of STEM classes completed before the start of the course studied. STEM classes include mathematics, biology, chemistry, engineering, and physics classes.

For all samples, demographic information was also collected from institutional records. Students were considered first generation if neither of their parents completed a four-year degree. A student was classified as an underrepresented minority student (URM) if they identified as Hispanic or reported a race other than White or Asian. Gender was also collected from university records; for the period studied gender was recorded as a binary variable. While not optimal, this reporting is consistent with the use of gender in most studies in PER; for a more nuanced discussion of gender and physics, see Traxler et al. [32].

For sample 2, in-class data were also available on a weekly basis. These data included clicker scores (given for participation points), homework averages, test scores, and a conceptual pretest score (PreScore) using the FMCE [33]. Students not in attendance on the day the FMCE was given received a zero; whether students completed the FMCE was captured by the dichotomous variable (PreTaken), which is one if the test was taken, zero otherwise.

For sample 3, socioeconomic status (SES) was measured by whether the students qualified for a federal Pell grant. A student is eligible for a Pell grant if their family income is less than $50 000 U.S. dollars; however, most Pell grants are awarded to students with family incomes less than $20 000 [34].

C. Random forest classification models

This work employs the random forest machine learning algorithm to predict students' final grade outcomes in introductory physics. Random forests are one of many machine learning classification algorithms. Study 1 reported that most machine learning algorithms had similar performance when predicting physics grades. A classification algorithm seeks to divide a dataset into multiple classes. This study will classify students as those who will receive an A, B, or C (ABC students) and students who will receive a D or F or withdraw (W) (DFW students).

To understand the performance of a classification algorithm, the dataset is first divided into test and training datasets. The training dataset is used to develop the classification model, to train the classifier. The test dataset is then used to characterize the model performance. The classification model is used to predict the outcome of each student in the test dataset; this prediction is compared to the actual outcome. Section II D discusses performance metrics used to characterize the success of the classification algorithm. For this work, 50% of the data were included in the test dataset and 50% in the training dataset. This split was selected to maintain a substantial number of underrepresented students in both the test and training datasets.

The random forest algorithm uses decision trees, another machine learning classification algorithm. Decision trees work by splitting the dataset into two or more subgroups based on one of the model variables. The variable selected for each split is chosen to divide the dataset into the two most homogeneous subsets of outcomes possible, that is, subsets with a high percentage of one of the two classification outcomes. The variable and the threshold for the variable represent the decision for each node in the tree. For example, one node may split the dataset using the criterion (the decision) that a student's college GPA is less than 3.2. The process continues by splitting the subsets, forming the decision tree, until each node contains only one of the two possible outcomes. Decision trees are less susceptible to multicollinearity than many statistical methods common in PER such as linear regression [35].

Random forests extend the decision tree algorithm by growing many trees instead of a single tree.

TABLE I. Full list of variables. The sample columns mark the samples in which each variable was available.

Variable                 Sample 1  Sample 2  Sample 3  Type         Description

Institutional variables
Gender                      ×         ×         ×      Dichotomous  Does the student identify as a man or a woman?
URM                         ×         ×         ×      Dichotomous  Does the student identify as an underrepresented minority?
FirstGen                    ×         ×         ×      Dichotomous  Is the student a first-generation college student?
CalReady                    ×         ×                Dichotomous  Is the student ready for calculus?
SES                                             ×      Dichotomous  Does the student qualify for a Pell grant?
CmpPct                      ×         ×                Continuous   Percentage of credit hours attempted that were completed.
CGPA                        ×         ×         ×      Continuous   College GPA at the start of the course.
STEMCls                     ×         ×                Continuous   Number of STEM classes completed at the start of the course.
HrsCmp                      ×         ×                Continuous   Total credit hours earned at the start of the course.
HrsEnroll                   ×         ×                Continuous   Current credit hours enrolled at the start of the course.
HSGPA                       ×         ×         ×      Continuous   High school GPA.
ACTM                        ×         ×         ×      Continuous   ACT or SAT mathematics percentile score.
ACTV                        ×         ×                Continuous   ACT or SAT verbal percentile score.
APCredit                    ×         ×                Continuous   Number of credit hours received from AP courses.
TransCrd                    ×         ×                Continuous   Number of credit hours received from transfer courses.

In-class variables
Clicker                               ×                Continuous   Average clicker score graded for participation.
Homework                              ×                Continuous   Homework average.
TestAve                               ×                Continuous   Average for the first or the first and second exam.
Pretest participation                 ×                Dichotomous  Was the pretest taken?
Pretest score                         ×                Continuous   FMCE pretest score.


The "forest" of decision trees is used to classify each instance in the data; each tree "votes" on the most probable outcome. The decision threshold determines what fraction of the trees must vote for the outcome for the outcome to be selected as the overall prediction of the random forest. Random forests use bootstrapping to prevent one variable from being obscured by another variable. Bootstrapping is a statistical method where multiple random subsets of a dataset are created by sampling with replacement. Individual trees are grown on Z subsamples generated by sampling the training dataset with replacement, using a subset of size m = √k of the variables, where k is the number of independent variables in the model [36]. This method ensures the trees are not correlated and that the stronger variables do not overwhelm weaker variables [20]. The "randomForest" package in "R" was used for the analysis. The Supplemental Material contains an example of random forest code in R [37].
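As a concrete illustration of this workflow (a minimal sketch, not the authors' code, which is in the Supplemental Material [37]), the following R fragment fits a random forest on a 50/50 train/test split. The data frame phys and its factor outcome column (levels "ABC" and "DFW") are hypothetical placeholders standing in for the institutional and in-class variables of Table I.

    # Minimal sketch, not the authors' code: fit a random forest on a 50/50 split.
    # `phys` is a hypothetical data frame with a factor column `outcome`
    # (levels "ABC", "DFW") and the predictor columns of Table I.
    library(randomForest)

    set.seed(1)
    n     <- nrow(phys)
    train <- sample(n, floor(0.5 * n))     # 50% training, 50% test
    rf <- randomForest(outcome ~ ., data = phys[train, ],
                       ntree = 500,        # number of trees grown
                       importance = TRUE)  # keep permutation importance for later
    # mtry defaults to roughly sqrt(k) predictors tried at each split for classification.

    # Default prediction: the class receiving the majority of the tree votes.
    pred <- predict(rf, newdata = phys[-train, ])
    table(Predicted = pred, Actual = phys$outcome[-train])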

D. Performance metrics

The confusion matrix [38] as shown in Table II summarizes the results of a classification algorithm and is the basis for calculating most model performance metrics. To construct the confusion matrix, the classification model developed from the training dataset is used to classify students in the test dataset. The confusion matrix categorizes the outcome of this classification.

For classification, one of the dichotomous outcomes is selected as the positive result. In the current study, we use the DFW outcome as the positive result. This choice was made because some of the model performance metrics focus on the positive results and we feel that most instructors would be more interested in accurately identifying students at risk of failure.

From the confusion matrix, many performance metrics can be calculated. Study 1 reported the overall classification accuracy, the fraction of correct predictions, shown in Eq. (1):

overall accuracy = (TN + TP) / N_test,   (1)

where N_test = TP + TN + FP + FN is the size of the test dataset.

The true positive rate (TPR) and the true negative rate (TNR) characterize the rate of making accurate predictions of either the DFW or the ABC outcome. The DFW accuracy is the fraction of the actual DFW cases that are classified as DFW in the test dataset:

DFW accuracy = TPR = TP / (TP + FN).   (2)

ABC accuracy is the fraction of the actual ABC cases that are classified as ABC:

ABC accuracy = TNR = TN / (TN + FP).   (3)

DFW accuracy is called "sensitivity" or "recall" in machine learning; ABC accuracy is "specificity."

ABC and DFW accuracy can be adjusted by changing the strictness of the classification criteria. If the model classifies even marginally at-risk cases as DFW, it will probably classify most actual DFW cases as DFW, producing a high DFW accuracy. It will also make a lot of mistakes; the DFW precision or the positive predictive value (PPV) captures the rate of making correct predictions and is defined as the fraction of the DFW predictions which are correct:

DFW precision = PPV = TP / (TP + FP).   (4)

DFW precision is called "precision" or "positive predictive value" in machine learning.

This study seeks models that balance DFW accuracy and precision; however, the correct balance for a given application must be selected based on the individual features of the situation. If there is little cost and no risk to an intervention, then optimizing for higher DFW accuracy might be the correct choice to identify as many DFW students as possible. If the intervention is expensive or carries risk, optimizing the DFW precision so that most students who are given the intervention are actually at risk might be more appropriate.

Beyond simply evaluating the overall performance of a classification algorithm, we would like to establish how much better the algorithm performs than pure guessing. For example, sample 1 is substantially unbalanced between the DFW and ABC outcomes with 88% of the students receiving an A, B, or C. If a classification method guessed that all students would receive an A, B, or C, then the classifier would have an overall accuracy of 0.88; therefore, overall accuracy would not be a useful metric to characterize model performance in this case.

In order to provide a more complete picture of model performance, additional performance metrics were explored. Cohen's kappa κ measures agreement among observers [39], correcting for the effect of pure guessing, as

κ = (p_0 − p_e) / (1 − p_e),   (5)

where p_0 is the observed agreement and p_e is agreement by chance.

TABLE II. Confusion matrix.

                     Actual negative       Actual positive
Predicted negative   True negative (TN)    False negative (FN)
Predicted positive   False positive (FP)   True positive (TP)
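As a minimal sketch of Eqs. (1)-(5), the metrics can be computed directly from a 2x2 table arranged as in Table II, continuing the hypothetical forest and data of the earlier sketch:

    # Sketch: performance metrics of Eqs. (1)-(5) from a confusion matrix.
    cm <- table(Predicted = pred, Actual = phys$outcome[-train])   # as in Table II
    TN <- cm["ABC", "ABC"]; FN <- cm["ABC", "DFW"]
    FP <- cm["DFW", "ABC"]; TP <- cm["DFW", "DFW"]
    n_test <- sum(cm)

    overall_accuracy <- (TN + TP) / n_test   # Eq. (1)
    dfw_accuracy     <- TP / (TP + FN)       # Eq. (2), sensitivity/recall
    abc_accuracy     <- TN / (TN + FP)       # Eq. (3), specificity
    dfw_precision    <- TP / (TP + FP)       # Eq. (4), positive predictive value

    # Cohen's kappa, Eq. (5): observed agreement corrected for chance agreement.
    p0    <- (TN + TP) / n_test
    pe    <- ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / n_test^2
    kappa <- (p0 - pe) / (1 - pe)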


Fit criteria have been developed for κ, with κ less than 0.2 as poor agreement, 0.2-0.4 fair agreement, 0.4-0.6 moderate agreement, 0.6-0.8 good agreement, and 0.8-1.0 excellent agreement between observers [40].

The receiver operating characteristic (ROC) curve (originally developed to evaluate radar) plots the true positive rate against the false positive rate (FPR). The area under the curve (AUC) is a measure of the model's discrimination between the two outcomes; AUC is the integrated area under the ROC curve. For a classifier that uses pure guessing, the ROC curve is a straight line between (0,0) and (1,1) and the AUC is 0.5. An AUC of 1.0 represents perfect discrimination [38,41]. Hosmer et al. [41] suggest an AUC threshold of 0.80 for excellent discrimination.
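One way to obtain the ROC curve and AUC in R is the pROC package; this is an assumption for illustration, since the paper does not state which package was used. A sketch, reusing the hypothetical objects from the earlier fragments:

    # Sketch: ROC curve and AUC from predicted DFW probabilities.
    # pROC is assumed here; the paper does not name the package it used.
    library(pROC)

    prob_dfw <- predict(rf, newdata = phys[-train, ], type = "prob")[, "DFW"]
    roc_obj  <- roc(response = phys$outcome[-train], predictor = prob_dfw,
                    levels = c("ABC", "DFW"))   # controls = ABC, cases = DFW
    auc(roc_obj)    # area under the ROC curve
    plot(roc_obj)   # plot the ROC curve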

E. Model tuning and validation

We will find that the random forest classification models have poor performance predicting whether a student will receive a D, F, or W using the default parameters of the model. To improve performance, the models are tuned by adjusting the decision threshold. The imbalance of both the outcome variable and some of the demographic variables must also be investigated to verify that the models are valid and the conclusions are reliable. This process is described in detail in the Supplemental Material [37].
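A minimal sketch of this threshold adjustment follows, under the same hypothetical objects as above; the full tuning and validation procedure, including the over- and undersampling checks, is in the Supplemental Material [37].

    # Sketch: sweep the decision threshold on the fraction of trees voting "DFW"
    # and pick the value that balances DFW accuracy and DFW precision.
    vote_dfw   <- predict(rf, newdata = phys[-train, ], type = "vote")[, "DFW"]
    y          <- phys$outcome[-train]
    thresholds <- seq(0.05, 0.95, by = 0.01)

    gap <- sapply(thresholds, function(thr) {
      pred_thr      <- ifelse(vote_dfw >= thr, "DFW", "ABC")
      dfw_accuracy  <- mean(pred_thr[y == "DFW"] == "DFW")   # Eq. (2) on actual DFW cases
      dfw_precision <- mean(y[pred_thr == "DFW"] == "DFW")   # Eq. (4) on DFW predictions
      abs(dfw_accuracy - dfw_precision)                      # want these roughly equal
    })
    thr_star <- thresholds[which.min(gap)]
    # Equivalently, cutoff = c(1 - thr_star, thr_star) (ordered by the outcome levels)
    # can be passed to randomForest() so the forest applies the tuned threshold itself.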

III. RESULTS

General descriptive statistics are shown in Tables III and IV for samples 1 and 3, respectively. The descriptive statistics for sample 2 are similar to sample 1 and are presented in the Supplemental Material [37]. The dichotomous outcome variable divides each sample into two subsets with different academic characteristics. The dichotomous independent variables further divide the subsets defined by the outcome variables. The overall demographic composition of the sample is shown for each sample in the Supplemental Material [37].

A. Classification models

To explore the classification of DFW students, multiple classification models were constructed for each sample.

TABLE III. Descriptive statistics for sample 1. All values are the mean ± the standard deviation.

                        N      Physics grade   ACT math (%)   HSGPA        CGPA
Overall                 7184   2.70 ± 1.3      79 ± 14        3.71 ± 0.5   3.18 ± 0.5
ABC students            6337   3.05 ± 0.8      80 ± 14        3.75 ± 0.4   3.25 ± 0.5
DFW students            847    0.05 ± 0.9      73 ± 15        3.43 ± 0.5   2.65 ± 0.5
Women                   1270   2.83 ± 1.2      79 ± 14        3.94 ± 0.4   3.38 ± 0.5
Men                     5914   2.67 ± 1.3      79 ± 14        3.66 ± 0.5   3.14 ± 0.5
URM                     388    2.42 ± 1.3      73 ± 17        3.53 ± 0.5   3.03 ± 0.6
Not URM                 6796   2.71 ± 1.3      80 ± 14        3.72 ± 0.5   3.19 ± 0.5
First generation        815    2.66 ± 1.3      77 ± 15        3.72 ± 0.5   3.15 ± 0.5
Not first generation    6369   2.70 ± 1.3      80 ± 14        3.71 ± 0.5   3.18 ± 0.5

TABLE IV. Descriptive statistics for sample 3. All values are the mean ± the standard deviation.

                        N     Physics grade   SAT math (%)   HSGPA        CGPA
Overall                 926   2.34 ± 1.2      75 ± 18        3.66 ± 0.4   3.10 ± 0.6
ABC students            740   2.83 ± 0.8      77 ± 17        3.70 ± 0.3   3.20 ± 0.5
DFW students            186   0.39 ± 0.5      68 ± 19        3.49 ± 0.4   2.70 ± 0.5
Women                   259   2.21 ± 1.2      71 ± 19        3.70 ± 0.3   3.13 ± 0.5
Men                     667   2.39 ± 1.2      77 ± 17        3.64 ± 0.4   3.09 ± 0.6
URM                     396   2.13 ± 1.3      68 ± 19        3.64 ± 0.4   3.02 ± 0.6
Not URM                 530   2.49 ± 1.2      81 ± 14        3.67 ± 0.3   3.16 ± 0.5
First generation        440   2.18 ± 1.2      70 ± 19        3.63 ± 0.4   3.03 ± 0.6
Not first generation    486   2.49 ± 1.2      80 ± 15        3.68 ± 0.3   3.16 ± 0.6
Low SES                 351   2.26 ± 1.2      71 ± 19        3.65 ± 0.4   3.06 ± 0.6
Not low SES             575   2.39 ± 1.2      78 ± 16        3.67 ± 0.3   3.12 ± 0.6


To allow comparison, each model was tuned so that the DFW accuracy and DFW precision were approximately equal. Table V shows the overall model fit for all samples. Each sample is discussed separately.

1. Sample 1

Sample 1 was first analyzed using the default decision threshold for the randomForest package in R, where 50% of the trees must vote for the outcome to be selected. This was the threshold used in study 1. This result is shown as the "Default" model in Table V. The model has very poor DFW accuracy, with only 16% of the DFW students identified. It also has fairly poor κ and AUC. This poor performance results from the unbalanced DFW outcome, where only 12% of the students receive a D, F, or W. This model was tuned to produce the "Overall" model by adjusting the decision threshold as shown in the Supplemental Material [37]. A threshold of 32% of trees voting for the DFW classification produced the Overall model, which balanced DFW accuracy and precision. This model substantially improved DFW accuracy to 43% at the expense of lower DFW precision and had substantially better κ and AUC; κ = 0.36 represented fair agreement; however, the AUC value of 0.68 was well below Hosmer's threshold of 0.80 for excellent discrimination.

The classification model constructed on the full training dataset was then used to classify each demographic subgroup in the test dataset to determine if a model trained on a sample composed predominantly of majority students would be accurate for other students. The κ and AUC of the models classifying women, URM students, and first-generation students were very similar. Some, but not extreme, variation was measured for DFW accuracy and precision. The overall classifier had lower DFW accuracy for women and higher accuracy for URM students (with corresponding changes in precision). This may indicate that it would be productive to tune the models separately for different demographic groups.

Finally, the model labeled "Restricted" was constructed using only a subset of variables similar to those available for sample 3. Sample 3 contained institutional variables that are commonly supplied with a demographic data request to institutional records; sample 1 also included variables such as STEMCls, which may be of particular interest for prediction of the outcomes of physics students, and variables such as the percentage of classes completed that may be of particular importance in DFW classification. As one might expect, the restricted model using fewer variables performed more weakly than the overall model, with DFW accuracy reduced by 7%.

2. Sample 2

Sample 2 contained the same institutional variables as sample 1, but also included in-class data such as homework grades and clicker grades which were available on a weekly basis.

TABLE V. Model performance parameters. Values represent the mean ± the standard deviation.

Model                               Overall accuracy   DFW accuracy   ABC accuracy   DFW precision   κ             AUC

Sample 1 (N = 7184)
Default                             0.89 ± 0.00        0.16 ± 0.02    0.98 ± 0.00    0.57 ± 0.04     0.21 ± 0.02   0.57 ± 0.01
Overall                             0.87 ± 0.01        0.43 ± 0.02    0.93 ± 0.01    0.44 ± 0.02     0.36 ± 0.02   0.68 ± 0.01
Female students                     0.90 ± 0.01        0.38 ± 0.05    0.96 ± 0.01    0.49 ± 0.06     0.37 ± 0.05   0.67 ± 0.03
URM students                        0.80 ± 0.02        0.48 ± 0.07    0.86 ± 0.02    0.40 ± 0.06     0.32 ± 0.06   0.67 ± 0.04
First-generation students           0.87 ± 0.01        0.44 ± 0.06    0.92 ± 0.01    0.42 ± 0.06     0.35 ± 0.05   0.68 ± 0.03
Restricted                          0.85 ± 0.01        0.36 ± 0.02    0.91 ± 0.01    0.36 ± 0.02     0.28 ± 0.02   0.64 ± 0.01

Sample 2 (N = 1683)
Institutional                       0.90 ± 0.01        0.50 ± 0.05    0.95 ± 0.01    0.50 ± 0.04     0.45 ± 0.04   0.73 ± 0.02
In-class only week 1                0.88 ± 0.01        0.37 ± 0.05    0.94 ± 0.02    0.38 ± 0.05     0.31 ± 0.04   0.65 ± 0.02
Institutional and in-class week 1   0.91 ± 0.01        0.53 ± 0.05    0.95 ± 0.01    0.53 ± 0.04     0.48 ± 0.04   0.74 ± 0.02
In-class only week 2                0.89 ± 0.01        0.42 ± 0.05    0.94 ± 0.01    0.43 ± 0.05     0.36 ± 0.04   0.68 ± 0.02
Institutional and in-class week 2   0.91 ± 0.01        0.56 ± 0.05    0.95 ± 0.01    0.55 ± 0.04     0.51 ± 0.04   0.76 ± 0.02
In-class only week 5                0.92 ± 0.01        0.54 ± 0.06    0.95 ± 0.01    0.54 ± 0.05     0.49 ± 0.04   0.74 ± 0.03
Institutional and in-class week 5   0.93 ± 0.01        0.59 ± 0.05    0.96 ± 0.01    0.60 ± 0.05     0.55 ± 0.04   0.78 ± 0.04
In-class only week 8                0.94 ± 0.01        0.66 ± 0.05    0.96 ± 0.01    0.65 ± 0.05     0.62 ± 0.04   0.81 ± 0.03
Institutional and in-class week 8   0.94 ± 0.01        0.68 ± 0.05    0.97 ± 0.01    0.68 ± 0.04     0.65 ± 0.04   0.82 ± 0.02

Sample 3 (N = 926)
Overall                             0.74 ± 0.02        0.37 ± 0.05    0.84 ± 0.03    0.37 ± 0.03     0.21 ± 0.04   0.61 ± 0.02
Female students                     0.70 ± 0.02        0.40 ± 0.08    0.79 ± 0.04    0.38 ± 0.05     0.19 ± 0.06   0.60 ± 0.03
URM students                        0.67 ± 0.03        0.41 ± 0.09    0.76 ± 0.05    0.37 ± 0.05     0.16 ± 0.06   0.58 ± 0.04
First-generation students           0.72 ± 0.02        0.45 ± 0.07    0.80 ± 0.03    0.43 ± 0.04     0.25 ± 0.06   0.63 ± 0.03
Low SES students                    0.72 ± 0.03        0.35 ± 0.09    0.82 ± 0.05    0.36 ± 0.06     0.17 ± 0.07   0.58 ± 0.04


While the institutional data would require a data request to institutional research at most institutions, the in-class variables should be available to most physics instructors. Table V shows the progression of DFW accuracy and precision as the class progresses.

A model using only the institutional variables was first constructed to determine how well DFW students could be identified using only variables available before the semester begins. This model (Institutional) had superior performance characteristics to the overall model of sample 1, which used the same variables and a larger sample collected over a longer time period. The improved performance quite possibly was the result of sample 1 averaging over many instructional environments while sample 2 contained data from a single instructional design. This suggests that limiting the data used for the classifier to the current implementation of a course may produce superior results, even with lower sample size.

Models using only the in-class data easily available to instructors consistently performed more weakly than those which mixed in-class and institutional data. The in-class-only models improved as the class progressed and became better than the model including only institutional data after the first test was given in week 5. The in-class-only model was substantially better than the institutional model after the second test was given in week 8. As such, if the goal of a classification algorithm is to predict student outcomes well into the class, only in-class data are needed.

The models combining in-class and institutional data added surprisingly little predictive power to the institutional model, particularly early in the class. This further supports the need to access a rich set of institutional data for accurate classification early in a class and suggests predictions made using only institutional data will not be substantially modified using in-class data until the first test is given.

3. Sample 3

As shown in Table I, sample 3 contains many fewer variables than sample 1. The classification model for sample 3 had lower DFW accuracy and precision than similar models for samples 1 and 2. Restricting the variable set of sample 1 to be approximately that of sample 3 (the restricted model) produced a classifier with similar properties to that of sample 3. The difference in classification accuracy, therefore, seems to be the result of the difference in the variables available and not the difference in sample size or differences between the universities.

The student population of sample 3 is substantially more diverse than that of sample 1 or 2. Model performance predicting only the outcomes of minority demographic subgroups was approximately that of the overall model performance, with somewhat lower variation than sample 1. This suggests that the differences in model performance for demographic subgroups observed in sample 1 were not a result of the low representation of those groups in the sample. Low SES students were also analyzed separately; the model performance for low SES students was similar to the overall model performance.

B. Variable importance

Once constructed, classification models can provide physics instructors and departments a much more nuanced picture of student risk and provide tools to better serve their students. This section and the next section will introduce some of the additional insights which can be extracted once a classification model is constructed.

Institutional data are exceptionally complex; random forest classification models allow the identification of the parts of the institutional data that are important for the prediction of student risk and the thresholds in that data that go into classifying a student as at risk.

The first measure useful in further understanding which variables are most important in the classification process is "variable importance." The importance of a variable to one of the model characterization metrics such as DFW accuracy is computed by fitting the model with the variable and then without the variable to determine the mean decrease in the characterization measure when the variable is removed from the model. Figure 1 shows the mean decrease in DFW accuracy, DFW precision, and overall accuracy as the different variables used in the full model are removed for sample 2 using data available in the second week of the class. Similar plots for samples 1 and 3 are presented in the Supplemental Material [37].

The variable importance plots shown in Fig. 1 show that homework average followed by CGPA were the most important variables in accurately identifying DFW students. In addition to these variables, only CmpPct (the percentage of credit hours completed) has an error bar that does not include zero. These results are very different from the variable importance results of study 1, which predicted the AB outcome and used overall accuracy to measure model performance. In study 1, while homework grade grew in variable importance from week to week, it was less important than CGPA until week 5 when test 1 was given. As in study 1, a very limited number of institutional variables were needed to predict grades in a physics class.

While many instructors would select CGPA as an important variable and would hope that homework averages were important, quantitatively having a relative measure of importance is valuable. The variable importance plots in Fig. 1 also identify many variables that seem important, such as high school GPA (HSGPA), ACT or SAT mathematics percentile score (ACTM), and demographic variables, which were not important for the prediction of the DFW outcome.
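In the randomForest package, a closely related per-class permutation importance (permuting rather than removing a variable) is available when the forest is grown with importance = TRUE; a minimal sketch using the hypothetical forest from the earlier fragments:

    # Sketch: variable importance from a forest grown with importance = TRUE.
    # The "DFW" column is the mean decrease in accuracy on actual DFW cases when
    # that variable is permuted; "MeanDecreaseAccuracy" is the overall analogue.
    imp <- importance(rf)
    imp[order(imp[, "DFW"], decreasing = TRUE), ][1:10, ]   # ten most important variables
    varImpPlot(rf)   # quick graphical summary of both importance measures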

C. Applying classification models

The most basic output of a classification model is the assignment of each student in the dataset into one of two classes: those students likely to receive an A, B, or C and those likely to receive a D, F, or W.


Classification algorithms, once constructed, can provide a finer-grained picture of student risk that may be more useful in applying machine learning results to manage instructional interventions for at-risk students. A classification model can also provide the probability a student will receive each outcome. The predicted probability density distribution of receiving an A, B, or C is plotted for each actual grade outcome in Fig. 2. Two plots are provided to improve readability. The distribution of probability estimates of students who actually earn an A or B is very narrow, with most students with a predicted probability above 0.75. This suggests that the students who actually receive A or B in the class are predicted to receive an A, B, or C with very high probability. The probability curve for students earning a C is much broader but still peaked near one. Examination of the C distribution illustrates two key features of the prediction: (1) the vast majority of students who actually earn a C are predicted to do so with probability p > 0.5 and (2) some students who receive a C are predicted to do so with very low probability. As such, an instructor should not interpret a low probability of receiving a C as a guarantee that a student will not succeed in the class. The probability distributions of the F and W outcomes are very broad, showing these students are very difficult to predict accurately. Examination of these distributions can help instructors understand how an individual student's probability estimate translates into actual grade outcomes and inform risk decisions.
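A hedged sketch of extracting these per-student probabilities, reusing the hypothetical forest and data frame of the earlier fragments:

    # Sketch: per-student outcome probabilities (the fraction of trees voting for
    # each class), which can be attached to the class roster and updated weekly.
    prob <- predict(rf, newdata = phys[-train, ], type = "prob")
    risk <- data.frame(p_ABC  = prob[, "ABC"],
                       p_DFW  = prob[, "DFW"],
                       actual = phys$outcome[-train])

    # Density of the predicted ABC probability for each outcome group, in the
    # spirit of Fig. 2 (the figure itself disaggregates by letter grade).
    plot(density(risk$p_ABC[risk$actual == "ABC"]), main = "P(ABC) by outcome")
    lines(density(risk$p_ABC[risk$actual == "DFW"]), lty = 2)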

Variable importance plots quantify the relative importance of the many variables used in the classification model, correcting for the collinearity of many of the variables. These plots, however, do not provide information about the levels of these variables important in making the classification. A random forest grows thousands of decision trees on a subset of the variables; examining a single decision tree using all variables can show the thresholds for the important variables. The decision tree for the training dataset of sample 2 in week 2 of the class is shown in Fig. 3.

FIG. 1. Variable importance of the optimized model predicting DFW for sample 2 using institutional data and data available in class at the end of week 2. The three panels show the decrease in DFW accuracy, the decrease in overall accuracy, and the decrease in DFW precision when each variable is removed. Error bars are one standard deviation in length.

FIG. 2. Predicted probability of earning an A, B, or C for sample 1, disaggregated by the actual grade received in the class. The figure plots the probability density of each outcome. The order of the peaks in the lower figure from left to right is W, F, D, C.


Each node in the tree is labeled with the majority member of the node, either ABC or DFW. The root node (top node) contains the entire training dataset, indicated by the 100% at the bottom of the node. Every node indicates the fraction of the training dataset contained in the node. The fraction of each outcome is shown in the center of the node; for example, the root node contains 10% DFW students and 90% ABC students. The decision condition is printed below the node. If the condition is true for the student, the left branch of the tree is taken; if false, the right branch is taken. For example, the decision condition for the root node is whether the week 2 homework average is above or below 62%. For the 8% of the students below this average, the left branch is taken to node 2. Only 47% of the students in node 2 receive an A, B, or C. For the 3% of these students with CGPA less than 2.5, only 17% receive an A, B, or C (node 10). The decision tree gives a very clear picture of the relative variable importance (higher variables in the tree are more important) and the threshold of risk of receiving a D, F, or W at each level of the tree.
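The paper does not say which package drew the single tree of Fig. 3; the rpart and rpart.plot packages produce trees of this form and are one common choice, as in this hedged sketch:

    # Sketch: a single decision tree on the training data. rpart is assumed here;
    # the paper does not state which tree package produced Fig. 3.
    library(rpart)
    library(rpart.plot)

    tree <- rpart(outcome ~ ., data = phys[train, ], method = "class",
                  control = rpart.control(cp = 0.01))  # complexity penalty limits tree size
    rpart.plot(tree, extra = 104)  # label nodes with class fractions and % of the data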

IV. DISCUSSION

This study sought to answer three research questions; they will be addressed in the order proposed.

RQ1: How can machine learning algorithms be applied to predict unbalanced physics class outcomes? Study 1 used random forests and logistic regression to predict which students would receive an A or B in introductory physics. The default random forest parameters were used to build the models, and the models were characterized by their overall accuracy, κ, and AUC. Because the outcome variable was fairly balanced, with 63% of the students receiving an A or B, overall accuracy provided an acceptable measure of model performance. The pure guessing accuracy was 63%, and therefore, this statistic could vary over the range 63%-100% as variables were added to the model.

In the current work, the methods introduced in study 1 were unproductive because the outcome variable, predicting the DFW outcome, was substantially unbalanced with only 10% (sample 2) to 20% (sample 3) of the students receiving this outcome. For this outcome, the pure guessing overall accuracy (simply predicting everyone receives an A, B, or C) is from 80% to 90%, making it an inappropriate statistic to judge model quality. This work introduced the DFW accuracy and precision as more useful statistics to evaluate model performance. In sample 1, using the default random forest algorithm parameters (Table V, Default model) produced a model with very low DFW accuracy, identifying only 16% of the students who actually received a D, F, or W in the test dataset; however, 57% of its predictions were correct. This does not necessarily make it a bad model, rather a model that is tuned for a specific purpose where it is much more important for the predictions to be correct than it is to identify the most potentially at-risk students. This might be useful for an application that tries to identify students for a high-cost or non-negligible-risk intervention where only the most likely at-risk students could be accommodated.

Multiple methods were explored to improve model performance: oversampling, undersampling, hyperparameter tuning, and grid search. This exploration is described in the Supplemental Material [37]. All methods improved the balance of DFW accuracy and precision. Oversampling led to models that overfit the data and was not used. Grid search showed that, for this dataset, it was always possible to use hyperparameter tuning by adjusting the decision threshold without having to undersample to produce a model with a balance of DFW accuracy and precision. The decision threshold for the models in Table V, excluding the default model and the models applied only to underrepresented groups, was adjusted for each model to balance DFW accuracy and precision. For the overall model of sample 1, this produced a model with substantially higher DFW accuracy and κ than the default model; however, it still only identified 43% of the students who would receive a D, F, or W (DFW accuracy of 0.43) and had κ = 0.36, in the range of fair agreement by Cohen's criteria.

Sample 2 restricted the time frame in which the institutional data were collected to a 3-year period in which the course studied had a consistent instructional environment. Even though the size of the sample was much smaller, model performance was improved, showing that it is important to collect the training sample for a period where the class was presented in the same form as the class in which the model will be used.
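For reference, one way to implement the undersampling the authors examined is the strata and sampsize arguments of randomForest, which draw a class-balanced bootstrap sample for every tree; this is a sketch under the same hypothetical objects as the earlier fragments, not necessarily the authors' implementation.

    # Sketch: undersample the majority (ABC) class within each bootstrap sample
    # so every tree sees a balanced draw; strata/sampsize are randomForest options.
    n_dfw  <- sum(phys$outcome[train] == "DFW")
    rf_bal <- randomForest(outcome ~ ., data = phys[train, ],
                           ntree    = 500,
                           strata   = phys$outcome[train],
                           sampsize = c(n_dfw, n_dfw))  # one entry per outcome level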

FIG. 3. Decision tree for predicting the DFW outcome for sample 2 using institutional data and data available in class at the end of week 2.


The sample 2 model using only institutional variableswas much better than models using only in-class variablesearly in the semester. If an instructor wants to developclassification models for prediction of students at risk earlyin the semester, accessing a set of institutional data cansubstantially improve the models. The combination ofinstitutional and in-class variables gave the highest modelperformance with an improvement of 3% in week 1, 6% inweek 2, 9% in week 5 (when test 1 grades were available),and 18% in week 8 (when test 2 grades were available)compared to the model containing only institutional var-iables. As such, for identification of at-risk students early inthe semester most of the prediction accuracy can beachieved with institutional data alone.Sample 3 included a more restricted set of institutional

variables than sample 1, but included a variable indicatingsocioeconomic status and featured a more demographicallydiverse population. The overall model for this sample hadweaker performance metrics than the overall model forsample 1 or the institutional model for sample 2. When theset of variables used in sample 1 was restricted to beapproximately those used in sample 3, model performancewas commensurate. It is, therefore, important for improvingmodel performance to work with institutional research toprovide the machine learning algorithms with as rich a setof data as possible.RQ2: Does classification accuracy differ for underre-

presented groups in physics? If so, how and why does itdiffer? For samples 1 and 3, once the model was constructedfor the full training dataset, the overall model was used toclassify demographic subgroups in the test dataset sepa-rately, as shown in Table V. Thesemodels examinedwomen,URM students, first-generation college students, and lowSES students. In both samples, the model performancemetrics for some minority demographic groups were differ-ent (either better or worse) than the overall model; however,these differences were within one standard deviation of theoverallmodel. As such, the classifier built on the full trainingdataset predicted the outcomes of underrepresented physicsstudents with approximately equal accuracy. While thedifferences observed in Table V are within the error of thesample, should significant differences be detected, it ispossible to retune the models for each underrepresentedgroup separately.Figure 1 and similar figures in the Supplemental Material

[37] show the demographic variables, gender, URM,FirstGen, and SES are of low importance in the classi-fication models. This is likely because these factors alreadyhave a general effect on other variables included in themodels such as CGPA. The Supplemental Material [37]includes an analysis which undersamples the majoritydemographic class (for example, men) to produce a morebalanced dataset (for example, a dataset with the samenumber of men and women) (Supplemental Figs. 7–9 [37]).The variable importance of the demographic variables used

in this study was fairly consistent with the rate of under-sampling showing that the low importance was not simply aresult of the lower number of students from minoritydemographic groups in the sample.To further investigate the low variable importance of the

To further investigate the low variable importance of the demographic variables, we examined a more diverse population (sample 3). Model performance metrics were consistent with those obtained from sample 1, suggesting the low variable importance was not the result of the restricted number of underrepresented students in the sample.

RQ3: How can the results of a machine learning analysis be used to better understand and improve physics instruction? Once a classification model is constructed, the same model can be used to characterize new groups of students. Sections III B and III C presented three different possible analyses that can be performed with classification models that have classroom applications.

The first analysis computed the variable importance of each variable in the classifier, Fig. 1. This is done by finding the mean decrease in some performance metric if the variable is removed from the model. This analysis allows the identification of the variables which are most predictive of a student receiving a D, F, or W. This can show a working instructor where to look in complex institutional datasets and allow departments to shape their data requests.
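One common way to compute such a measure is permutation importance, which shuffles a variable rather than removing it and records the mean drop in a chosen metric. The sketch below is an illustration using scikit-learn (hypothetical variable names; not necessarily the authors' implementation).

```python
# Illustrative sketch: permutation importance for a fitted random forest.
# Assumes pandas DataFrames X_train/X_test and outcome vectors y_train/y_test.
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                            random_state=0)
rf.fit(X_train, y_train)

# Mean decrease in balanced accuracy when each predictor is shuffled
imp = permutation_importance(rf, X_test, y_test,
                             scoring="balanced_accuracy",
                             n_repeats=20, random_state=0)
for name, drop in sorted(zip(X_test.columns, imp.importances_mean),
                         key=lambda pair: -pair[1]):
    print(f"{name}: {drop:.3f}")
```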

The second analysis computed a probability of receiving an A, B, or C for each individual student. This was plotted for each actual grade received in Fig. 2. This allows an individual quantitative risk to be assigned to each student. This risk could be updated as the semester progresses based on in-class performance.
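In code, the per-student probability is simply the forest's predicted class probability. The following is an illustrative sketch (hypothetical names) that assumes the ABC outcome is coded as 1 and reuses the fitted classifier `rf` from the sketch above.

```python
# Illustrative sketch: individual probability of receiving an A, B, or C.
import pandas as pd

proba = rf.predict_proba(X_new)               # columns ordered as rf.classes_
abc_col = list(rf.classes_).index(1)          # assumes 1 codes the ABC outcome
risk = pd.DataFrame({"student_id": ids,       # hypothetical student identifiers
                     "p_abc": proba[:, abc_col]})
# Lowest probabilities first; could be recomputed weekly as new grades arrive
print(risk.sort_values("p_abc").head(10))
```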

The final analysis computed a decision tree, Fig. 3. This tree shows the decision thresholds which indicate the levels of the variables that are important in classifying at-risk students. As long as the instructional setting and assignment policy remain consistent, these trees can be reused semester to semester without having to rerun the analysis. The tree shows that homework average, CGPA, and the percent of hours completed were important in the decision to classify a student at risk of a DFW outcome.
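A tree of this kind can be grown with any standard CART implementation. The sketch below (illustrative; hypothetical variable names) fits a shallow scikit-learn tree and prints its decision thresholds as text, which is one simple way to expose cutoffs like those shown in Fig. 3.

```python
# Illustrative sketch: a shallow decision tree exposing interpretable thresholds.
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=3, class_weight="balanced",
                              random_state=0)
tree.fit(X_train, y_train)
print(export_text(tree, feature_names=list(X_train.columns)))
```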

These analysis results represent examples of the additional tools classification algorithms can provide instructors; many more examples could be given. The following represent some of the applications of these results being considered at Institution 1. These applications are designed around the principle that any additional instructional activity must potentially benefit all students. The models are far from perfect and, as such, all students may actually be at risk, so any intervention must be available to any student.

Informing resource allocation.—Students in physics classes at Institution 1 elect laboratory sections where a substantial part of the interactive instruction in the course is presented. Because a success probability can be generated for each student, an average probability of success could be calculated for each laboratory section. The physics department has a learning assistant (LA) [29] program. If sufficient LAs were available, one could provide additional LAs to at-risk sections.
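As a sketch of how such a section-level average could be computed (illustrative only; the roster DataFrame and column names are assumptions), each student's predicted success probability is averaged within their laboratory section.

```python
# Illustrative sketch: mean predicted success probability per laboratory section.
# Assumes `roster` is a pandas DataFrame with columns "lab_section" and "p_abc"
# (the per-student success probability from the classifier).
section_risk = roster.groupby("lab_section")["p_abc"].mean().sort_values()
print(section_risk.head())   # sections with the lowest average success probability
```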

Planning revised assignment policy.—The decision tree in Fig. 3 and variable importance measures in Fig. 1 show that homework grades in the second week of the class are the most important variable for predicting success and give a homework score threshold of 62% as the highest level decision for predicting success or failure. To develop the habit of completing homework and investing sufficient effort to do well on homework, a policy allowing the reworking of homework assignments which received a grade of less than 60% for additional (or initial) credit could be implemented early in the class.

Planning student communication.—Instructors can use the variable importance results to provide general advice to students with low homework grades and encourage them to seek additional help by attending office hours or to change habits so homework assignments are started earlier and sufficient time is allowed for completion. In general, an instructor of a large service course does not have time to personally communicate with each student; however, the combination of the individual success probability, variable importance, and variable decision threshold would allow an instructor to monitor and communicate directly with a small subset of students particularly at risk in the class. These communications could let the students know that the instructor noticed that early homework assignments needed additional work and suggest strategies to the students for improvement, opening channels of personal communication with at-risk students.
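A hedged sketch of how such a short contact list might be generated follows. The 62% homework threshold comes from the decision tree discussed above; the probability cutoff of 0.4 and all column names are assumptions for illustration.

```python
# Illustrative sketch: flag a small group of students for a personal check-in.
# Assumes `roster` has columns "email", "hw_average" (0-1), and "p_abc".
to_contact = roster[(roster["p_abc"] < 0.4) & (roster["hw_average"] < 0.62)]
for _, s in to_contact.iterrows():
    print(f"Contact {s['email']}: week-2 homework {s['hw_average']:.0%}, "
          f"predicted success {s['p_abc']:.0%}")
```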

Many other potential instructional uses of this type of analysis are possible. Naturally, if the intervention is successful, it will modify student outcomes, changing students' risk profiles. The classifier will need to be rebuilt using student outcomes after the implementation of the intervention to reflect this modified risk.

While using the random forest algorithm to make predictions is technically fairly straightforward for instructors trained in physics (the base code is presented in the Supplemental Material [37]), obtaining the institutional dataset may present a substantial barrier for overworked instructors of large service introductory classes. As such, we present some recommendations for managing the process of obtaining institutional data.

Gathering additional data for use by instructors should probably be the responsibility of a departmental committee or staff. The data required for different classes are quite similar. A departmental data committee would also be able to establish ethical standards for the use and handling of the data. Some effort will be needed to understand the data available at the institutional level and to work with institutional research to fine-tune the data request. For example, if one requests a basic set of demographic and descriptive variables about students enrolled in a course over a number of semesters, the GPA variable provided will probably be the student's current GPA, where one actually wants the student's GPA before he or she enrolled in the class of interest. Some interaction would also be required to develop variables such as the student's math readiness or the fraction of classes completed. However, once a set of variables is identified, institutional records can quickly generate the data for the department each semester. Once the institutional data are acquired and understood, applying the machine learning code is fairly straightforward. It is also worth pursuing the possibility that institutional research could handle the entire process and provide a machine learning risk analysis to interested instructors. Student retention is of vital interest to most institutions, with retention in core mathematics and science classes an important part of the puzzle.

V. ETHICAL CONSIDERATIONS

The results of a machine learning classification represent a new tool for physics instructors to shape instruction; as with any tool, it can be correctly used or misused. If an instructor is to use the predictions of a classification algorithm, it is important that these results do not bias their treatment of individual students. Figure 2 shows that it is possible for students with very low predicted probability of earning an A, B, or C to get a C or higher in the class. Machine learning algorithms will never be 100% accurate, and this should be taken into account in any application of the results of the algorithms. Further, while the classification results may be used to direct resources to the students most at risk, this should be done with the goal of improving instruction for all students. Machine learning results should also not be used to exclude students from additional educational activities to support at-risk students. Because the predictions are not 100% accurate, additional tutoring sessions or similar resources should be available to all; however, the results of classification models could be used to deliver encouragement to the students most at risk to avail themselves of these opportunities. One should also be aware that individual features of the instructional environment can affect predictive accuracy [42] and be aware of the general ethical considerations of using institutional data [43].

VI. CONCLUSIONS

This work applied the random forest machine learning algorithm to predict whether introductory mechanics students would receive a grade of D or F or withdraw from a physics class. Metrics and methods applied in previous work produced classification models with poor performance; however, selecting metrics appropriate for unbalanced outcomes and tuning the random forest models greatly improved the classification accuracy of the DFW outcome. Classification models performed similarly for students from two institutions with very different demographic characteristics. Models with a richer set of institutional variables were somewhat (7%) more accurate than models with a limited set of variables. The addition of in-semester variables, particularly homework averages and test scores, improved model performance. The institutional model far outperformed a model using only in-semester variables early in the semester; the performance of the in-semester-only models exceeded that of the institutional-only models once the first test was included as a variable.

The classifier trained on the full set of students produced somewhat different performance for women, underrepresented minority students, and first-generation college students, with some metrics improved and some weaker for these students. Once a classifier is constructed, multiple new analyses are available, allowing the direction of additional resources to at-risk students.

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation under Grants No. ECR-1561517 and No. HRD-1834569.

[1] D. E. Meltzer and R. K. Thornton, Resource letter ALIP–1: Active-learning instruction in physics, Am. J. Phys. 80, 478 (2012).

[2] S. Freeman, S. L. Eddy, M. McDonough, M. K. Smith, N. Okoroafor, H. Jordt, and M. P. Wenderoth, Active learning increases student performance in science, engineering, and mathematics, Proc. Natl. Acad. Sci. U.S.A. 111, 8410 (2014).

[3] President's Council of Advisors on Science and Technology, Report to the President. Engage to excel: Producing one million additional college graduates with degrees in science, technology, engineering, and mathematics, Executive Office of the President, Washington, DC, 2012, https://eric.ed.gov/?id=ED541511.

[4] K. Rask, Attrition in STEM fields at a liberal arts college: The importance of grades and pre-collegiate preferences, Econ. Educ. Rev. 29, 892 (2010).

[5] X. Chen, STEM attrition: College students' paths into and out of STEM fields, Report No. NCES 2014–001, National Center for Education Statistics, US Dept. of Education, Washington, DC, 2013, https://eric.ed.gov/?id=ED544470.

[6] E. J. Shaw and S. Barbuti, Patterns of persistence in intended college major with a focus on STEM majors, NACADA J. 30, 19 (2010).

[7] A. V. Maltese and R. H. Tai, Pipeline persistence: Examining the association of educational experiences with earned degrees in STEM among US students, Sci. Educ. 95, 877 (2011).

[8] G. Zhang, T. J. Anderson, M. W. Ohland, and B. R. Thorndyke, Identifying factors influencing engineering student graduation: A longitudinal and cross-institutional study, J. Eng. Educ. 93, 313 (2004).

[9] B. F. French, J. C. Immekus, and W. C. Oakes, An examination of indicators of engineering students' success and persistence, J. Eng. Educ. 94, 419 (2005).

[10] R. M. Marra, K. A. Rodgers, D. Shen, and B. Bogue, Leaving engineering: A multi-year single institution study, J. Eng. Educ. 101, 6 (2012).

[11] C. W. Hall, P. J. Kauffmann, K. L. Wuensch, W. E. Swart, K. A. DeUrquidi, O. H. Griffin, and C. S. Duncan, Aptitude and personality traits in retention of engineering students, J. Eng. Educ. 104, 167 (2015).

[12] P. Baepler and C. J. Murdoch, Academic analytics and data mining in higher education, Int. J. Scholarship Teach. Learn. 4, 17 (2010).

[13] R. S. J. D. Baker and K. Yacef, The state of educational data mining in 2009: A review and future visions, J. Educ. Data Mining 1, 3 (2009).

[14] Z. Papamitsiou and A. A. Economides, Learning analytics and educational data mining in practice: A systematic literature review of empirical evidence, J. Educ. Tech. Soc. 17, 49 (2014), https://www.jstor.org/stable/jeductechsoci.17.4.49.

[15] A. Dutt, M. A. Ismail, and T. Herawan, A systematic review on educational data mining, IEEE Access 5, 15991 (2017).

[16] C. Romero and S. Ventura, Educational data mining: A review of the state of the art, IEEE Trans. Syst. Man Cybern. C 40, 601 (2010).

[17] C. Zabriskie, J. Yang, S. DeVore, and J. Stewart, Using machine learning to predict physics course outcomes, Phys. Rev. Phys. Educ. Res. 15, 020120 (2019).

[18] A. Peña-Ayala, Educational data mining: A survey and a data mining-based analysis of recent works, Expert Syst. Appl. 41, 1432 (2014).

[19] C. Romero, S. Ventura, P. G. Espejo, and C. Hervás, Data mining algorithms to classify students, in Proceedings of the 1st International Conference on Educational Data Mining, Montreal, 2008, edited by R. S. Joazeiro de Baker, T. Barnes, and J. E. Beck (International Working Group on Educational Data Mining, Montreal, Quebec, Canada, 2008).

[20] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R (Springer-Verlag, New York, 2017), Vol. 112.

[21] A. C. Müller and S. Guido, Introduction to Machine Learning with Python: A Guide for Data Scientists (O'Reilly Media, Boston, MA, 2016).


[22] A. M. Shahiri, W. Husain, and N. A. Rashid, A review on predicting student's performance using data mining techniques, Procedia Comput. Sci. 72, 414 (2015).

[23] S. Huang and N. Fang, Predicting student academic performance in an engineering dynamics course: A comparison of four types of predictive mathematical models, Comput. Educ. 61, 133 (2013).

[24] F. Marbouti, H. A. Diefes-Dux, and K. Madhavan, Models for early prediction of at-risk students in a course using standards-based grading, Comput. Educ. 103, 1 (2016).

[25] L. P. Macfadyen and S. Dawson, Mining LMS data to develop an early warning system for educators: A proof of concept, Comput. Educ. 54, 588 (2010).

[26] U. bin Mat, N. Buniyamin, P. M. Arsad, and R. Kassim, An overview of using academic analytics to predict and improve students' achievement: A proposed proactive intelligent intervention, in Proceedings of the 2013 IEEE 5th Conference on Engineering Education (ICEED) (IEEE, New York, 2013), pp. 126–130.

[27] J. M. Aiken, R. Henderson, and M. D. Caballero, Modeling student pathways in a physics bachelor's degree program, Phys. Rev. Phys. Educ. Res. 15, 010128 (2019).

[28] US News & World Report: Education, https://premium.usnews.com/best-colleges, accessed Feb. 23, 2019.

[29] V. Otero, S. Pollock, and N. Finkelstein, A physics department's role in preparing physics teachers: The Colorado Learning Assistant model, Am. J. Phys. 78, 1218 (2010).

[30] L. C. McDermott and P. S. Shaffer, Tutorials in Introductory Physics (Prentice-Hall, Upper Saddle River, NJ, 1998).

[31] A. Elby, R. E. Scherr, T. McCaskey, R. Hodges, T. Bing, D. Hammer, and E. F. Redish, Open source tutorials in physics sensemaking, http://umdperg.pbworks.com/w/page/10511218/Open Source Tutorials, accessed Sept. 17, 2018.

[32] A. L. Traxler, X. C. Cid, J. Blue, and R. Barthelemy, Enriching gender in physics education research: A binary past and a complex future, Phys. Rev. Phys. Educ. Res. 12, 020114 (2016).

[33] R. K. Thornton and D. R. Sokoloff, Assessing student learning of Newton's laws: The Force and Motion Conceptual Evaluation and the evaluation of active learning laboratory and lecture curricula, Am. J. Phys. 66, 338 (1998).

[34] Pell grants, https://www.scholarships.com/financial-aid/federal-aid/federal-pell-grants, accessed July 11, 2020.

[35] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees (Wadsworth and Brooks/Cole, Monterey, CA, 1984).

[36] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer-Verlag, New York, 2009).

[37] See Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevPhysEducRes.16.020130 for model tuning, investigation on underrepresented groups, and sample random forest code.

[38] T. Fawcett, An introduction to ROC analysis, Pattern Recogn. Lett. 27, 861 (2006).

[39] J. Cohen, Statistical Power Analysis for the Behavioral Sciences (Academic Press, New York, 1977).

[40] D. G. Altman, Practical Statistics for Medical Research (CRC Press, Boca Raton, FL, 1990).

[41] D. W. Hosmer, Jr., S. Lemeshow, and R. X. Sturdivant, Applied Logistic Regression (John Wiley & Sons, New York, 2013), Vol. 398.

[42] D. Gašević, S. Dawson, T. Rogers, and D. Gasevic, Learning analytics should not promote one size fits all: The effects of instructional conditions in predicting academic success, Internet High. Educ. 28, 68 (2016).

[43] L. D. Roberts, V. Chang, and D. Gibson, Ethical considerations in adopting a university- and system-wide approach to data and learning analytics, in Big Data and Learning Analytics in Higher Education (Springer, New York, 2017), pp. 89–108.
