Interpretable Machine Learning Models and Symbolic Regressions Reveal Transfer of Per- and Polyfluoroalkyl Substances (PFASs) in Plants: A New Small-Data Machine Learning Method to Augment Data and Obtain Predictive Equations

Bibliographic Details
Main Authors: Yuan Zhang, Yanting Li, Yang Li, Lin Zhao, Yongkui Yang
Format: Article
Language: English
Published: MDPI AG 2025-07-01
Series: Toxics
Subjects:
Online Access: https://www.mdpi.com/2305-6304/13/7/579
Description
Summary: Machine learning (ML) techniques are becoming increasingly valuable for modeling the transport of pollutants in plant systems. However, two challenges remain when using ML to predict pollutant migration in hydroponic systems: small sample sizes and a lack of quantitative calculation functions. To study the bioaccumulation of per- and polyfluoroalkyl substances (PFASs), we identified the key factors and derived quantitative calculation equations using data augmentation, ML, and symbolic regression. First, after data preprocessing, feature expansion was performed on the input data; the most important step was data augmentation. The original training set was expanded nine times by combining the synthetic minority oversampling technique (SMOTE) and a variational autoencoder (VAE). Subsequently, four ML models were applied to the test set to predict the selected output parameters. Categorical boosting (CatBoost) had the highest prediction accuracy (R² = 0.83). The Shapley Additive Explanation (SHAP) values indicated that molecular weight and exposure time were the most important parameters. We then applied three symbolic regression models to obtain accurate prediction equations based on the original and augmented data. Based on the augmented data, the high-dimensional sparse interaction equation exhibited the highest accuracy (R² = 0.776). Our results indicate that this method could provide crucial insights into PFAS absorption and accumulation in plant roots.
ISSN: 2305-6304
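
The summary names two ideas that are easy to illustrate in isolation: SMOTE-style augmentation, which creates synthetic samples by interpolating between a sample and one of its nearest neighbors, and the R² score used to report accuracy. The following is a minimal sketch using only the Python standard library, not the authors' implementation. The function names, the parameter `k`, and the example rows are hypothetical; this record does not state how the article adapts SMOTE to regression data or how it combines it with the variational autoencoder, so the sketch simply interpolates whole rows.

```python
import math
import random


def smote_like_augment(rows, n_new, k=3, seed=0):
    """Return n_new synthetic rows interpolated between near neighbours.

    rows: list of equal-length lists of floats (hypothetical feature rows).
    This is an illustrative assumption, not the procedure from the article.
    """
    if len(rows) < 2:
        raise ValueError("need at least two rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(rows) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(rows))
        base = rows[i]
        # Rank every other row by Euclidean distance to the base row.
        others = [j for j in range(len(rows)) if j != i]
        others.sort(key=lambda j: math.dist(base, rows[j]))
        # Pick one of the k nearest neighbours at random.
        neighbour = rows[rng.choice(others[:k])]
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical two-feature rows, purely for demonstration.
    demo_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
    for row in smote_like_augment(demo_rows, n_new=3, k=2):
        print(row)
```

Because each synthetic row is a convex combination of two real rows, every generated value stays within the range of the original data. That keeps the augmented set plausible, but it also means interpolation alone cannot add information beyond what the original samples span.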