Researchers propose using AI to predict which college students might fail physics classes

In a paper published on the preprint server arXiv.org, researchers affiliated with West Virginia University and California State Polytechnic University investigate the use of machine learning algorithms to identify at-risk students in introductory physics classes. They claim it could be a powerful tool for educators and struggling college students alike, but critics argue that such technologies could harm those students with biased or misleading predictions.

Physics and other core science courses form hurdles for science, technology, engineering, and mathematics (STEM) majors early in their college careers. (Studies show roughly 40% of students planning engineering and science majors end up switching to other subjects or failing to get a degree.) While physics pedagogies have developed a range of research-based practices to help students overcome challenges, some strategies have substantial per-class implementation costs. Moreover, not all are appropriate for every student.

It's the researchers' assertion that this calls for an algorithmic method of identifying at-risk students, particularly in physics. To this end, they build on previous work that used ACT scores, college GPA, and data collected within a physics class (such as homework grades and test scores) to predict whether a student would receive an A or B in the first and second semester.

But studies show AI is relatively poor at predicting complex outcomes even when trained on large corpora -- and that it has a bias problem. For instance, word embedding, a common algorithmic training technique that involves linking words to vectors, unavoidably picks up (and at worst amplifies) prejudices implicit in source text and dialogue. And Amazon's internal recruitment tool -- which was trained on resumes submitted over a 10-year period -- was reportedly scrapped because it showed bias against women.

Nevertheless, the researchers drew samples from introductory calculus-based physics classes at two large academic institutions to train a student performance-predicting AI algorithm. The first and second corpora included physical science and engineering students at an Eastern college serving roughly 21,000 undergraduate students, with sample sizes of 7,184 and 1,683 students, respectively. A third came from a primarily undergraduate and Hispanic-serving university with approximately 26,000 students in the Western U.S.

The samples were quite diverse in terms of makeup and demographics. The first and second were collected during different time frames (2000-2018 and 2016-2019) and included mostly Caucasian students (80%), with the second reflecting curricular changes during the 2011 and 2015 academic years. By contrast, the third covered a single year (2017) and was largely Hispanic (46%) and Asian (21%), with students who took a mix of lectures and active-learning-style classes.

The researchers trained what's called a random forest on the samples to predict students' final physics grades. In machine learning, random forests are an ensemble method that constructs a multitude of decision trees and combines their outputs, taking the majority vote for classification tasks and the mean prediction for regression -- in this case, to classify students as likely to receive an A, B, or C (ABC students) or to receive a D or F or withdraw (W) (DFW students).
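To make the voting idea concrete, here is a minimal Python sketch of how an ensemble of trees might settle on a label. It is not the researchers' model: the three "trees" are hand-written stand-ins, and the field names (`college_gpa`, `homework_avg`, `act_math`) and thresholds are invented for illustration. A real random forest learns its trees from training data.

```python
from collections import Counter

# Hypothetical stand-ins for trained decision trees. Each maps a student
# record to "ABC" or "DFW". The thresholds are invented for illustration.
def tree_gpa(student):
    return "DFW" if student["college_gpa"] < 2.5 else "ABC"

def tree_homework(student):
    return "DFW" if student["homework_avg"] < 60 else "ABC"

def tree_act(student):
    return "DFW" if student["act_math"] < 20 else "ABC"

FOREST = [tree_gpa, tree_homework, tree_act]

def forest_predict(student, trees=FOREST):
    """Return the label that the most trees vote for."""
    votes = Counter(tree(student) for tree in trees)
    return votes.most_common(1)[0][0]

# Two of the three stand-in trees flag this (made-up) student as DFW.
student = {"college_gpa": 2.3, "homework_avg": 72, "act_math": 18}
print(forest_predict(student))
```

With an odd number of trees and two classes there is always a clear majority, which is one reason sketches like this tend to use an odd ensemble size.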

According to the researchers, an algorithm trained on the first sample predicted "DFW students" with only 16% accuracy, likely because of the low proportion of DFW students (12%) in the training set. They note that when trained on the entire sample, DFW accuracy was lower for women and higher for underrepresented minority students, which they problematically say points to a need to demographically tune models.
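A short Python sketch shows why a small DFW share makes this hard. It assumes "DFW accuracy" means the fraction of actual DFW students the model flags, and it uses a made-up class of 100 students with the 12% DFW rate quoted above. A model that simply labels everyone ABC looks strong overall while catching no at-risk students at all.

```python
def overall_accuracy(actual, predicted):
    """Fraction of all students whose predicted label matches the true one."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def class_accuracy(actual, predicted, label):
    """Fraction of students truly in `label` who were predicted as `label`."""
    hits = [p == label for a, p in zip(actual, predicted) if a == label]
    return sum(hits) / len(hits)

# Illustrative class: 12 DFW students and 88 ABC students (12% DFW).
actual = ["DFW"] * 12 + ["ABC"] * 88

# A trivial "model" that predicts ABC for every student.
always_abc = ["ABC"] * len(actual)

print(overall_accuracy(actual, always_abc))       # 0.88
print(class_accuracy(actual, always_abc, "DFW"))  # 0.0
```

A classifier trained on such lopsided data can drift toward this trivial behavior, which is consistent with the low DFW accuracy the researchers report.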

Demographically sensitive at-risk student prediction models are fraught, needless to say. An estimated 1,400 U.S. colleges including Georgia State are using algorithmic techniques to identify students who might be struggling so they can provide support, even encouraging those students to change their majors. But while national graduation rates started ticking back up again in 2016 after years of steep decline, there's a fear the algorithms might be reinforcing historical inequities, funneling low-income students or students of color into "easier" and lower-paying majors.

"There is historic bias in higher education, in all of our society," Iris Palmer, a senior advisor for higher education at think tank New America, told AMP Reports. "If we use that past data to predict how students are going to perform in the future, could we be baking some of that bias in? What will happen is they'll get discouraged ... and it'll end up being a self-fulfilling prophecy for those particular students."

In this latest study, when applied to the second sample, the researchers found the random forest performed marginally better (which they attribute to limiting the scope to three years and one institution as opposed to a decade and several institutions). They also found that institutional variables like gender, standardized test scores, Pell grant eligibility, and credit hours received from AP courses were less consequential than in-class data such as weekly homework and quiz grades. Random forests trained on the in-class data became better than institutional data-based models after week five of the physics classes and "substantially" better after around the eighth week. That said, the institutional variables and in-class data had more predictive power when combined: Compared with an institutional variable-only model, a model trained on both showed a 3% performance improvement in week one, 6% in week two, 9% in week five, and 18% in week eight.

With respect to the third sample, the researchers say models trained on it had lower DFW accuracy and precision (i.e., the share of students flagged as DFW who actually went on to earn a D or F or withdraw) than models for the first and second corpora. The performance of models predicting only the outcomes of minority demographic subgroups in the third sample was approximately that of the overall model performance, according to the researchers, suggesting differences in performance for subgroups in the first sample weren't a result of those groups' low representation.
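The two metrics answer different questions, as a brief Python sketch can show. It assumes "DFW accuracy" is the fraction of actual DFW students the model catches and "DFW precision" is the fraction of DFW predictions that turn out to be right. The counts are invented for illustration and are not taken from the paper.

```python
def dfw_precision(true_pos, false_pos):
    """Of the students flagged as DFW, the fraction who really were DFW."""
    return true_pos / (true_pos + false_pos)

def dfw_accuracy(true_pos, false_neg):
    """Of the students who really were DFW, the fraction the model flagged."""
    return true_pos / (true_pos + false_neg)

# Invented counts for a hypothetical model:
true_pos = 20   # DFW students correctly flagged
false_pos = 30  # ABC students wrongly flagged as DFW
false_neg = 80  # DFW students the model missed

print(dfw_precision(true_pos, false_pos))  # 0.4
print(dfw_accuracy(true_pos, false_neg))   # 0.2
```

In this made-up case most flags are false alarms and most at-risk students go unflagged, so a model can score poorly on both measures at once.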

The researchers caution that no model will ever be 100% accurate, as evidenced by their best-performing model for the first sample (it achieved 57% accuracy overall, or only slightly better than chance). Yet they assert machine learning classification represents a tool for physics instructors to shape instruction. "If an instructor is to use the predictions of a classification algorithm, it is important that these results do not bias his or her treatment of individual students," the coauthors of the study wrote. "Machine learning results should ... not be used to exclude students from additional educational activities to support at-risk students ... However, the results of classification models could be used to deliver encouragement to the students most at risk to avail themselves of these opportunities."