Interpretable Machine Learning Models and Symbolic Regressions Reveal Transfer of Per- and Polyfluoroalkyl Substances (PFASs) in Plants: A New Small-Data Machine Learning Method to Augment Data and Obtain Predictive Equations
Machine learning (ML) techniques are becoming increasingly valuable for modeling the transport of pollutants in plant systems. However, two challenges (small sample sizes and a lack of quantitative calculation functions) remain when using ML to predict migration in hydroponic systems. For the bioacc...
| Main Authors: | Yuan Zhang, Yanting Li, Yang Li, Lin Zhao, Yongkui Yang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-07-01 |
| Series: | Toxics |
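| Subjects: | machine learning; data augmentation; symbolic regression; PFAS bioaccumulation; quantitative prediction |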
| Online Access: | https://www.mdpi.com/2305-6304/13/7/579 |
| _version_ | 1849733856952844288 |
|---|---|
| author | Yuan Zhang Yanting Li Yang Li Lin Zhao Yongkui Yang |
| author_sort | Yuan Zhang |
| collection | DOAJ |
| description | Machine learning (ML) techniques are becoming increasingly valuable for modeling the transport of pollutants in plant systems. However, two challenges (small sample sizes and a lack of quantitative calculation functions) remain when using ML to predict migration in hydroponic systems. For the bioaccumulation of per- and polyfluoroalkyl substances (PFASs), we studied the key factors and quantitative calculation equations based on data augmentation, ML, and symbolic regression. First, feature expansion was performed on the input data after data preprocessing; the most important step was data augmentation. The original training set was expanded nine times by combining the synthetic minority oversampling technique (SMOTE) and a variational autoencoder (VAE). Subsequently, four ML models were applied to the test set to predict the selected output parameters. Categorical boosting (CatBoost) had the highest prediction accuracy (R² = 0.83). Shapley Additive Explanations (SHAP) values indicated that molecular weight and exposure time were the most important parameters. We applied three symbolic regression models to obtain accurate prediction equations based on the original and augmented data. Based on augmented data, the high-dimensional sparse interaction equation exhibited the highest accuracy (R² = 0.776). Our results indicate that this method could provide crucial insights into absorption and accumulation in plant roots. |
| format | Article |
| id | doaj-art-a007b066afb445e5b4c05415bbf4be50 |
| institution | DOAJ |
| issn | 2305-6304 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Toxics |
| spelling | doaj-art-a007b066afb445e5b4c05415bbf4be50; 2025-08-20T03:07:57Z; eng; MDPI AG; Toxics; 2305-6304; 2025-07-01; 13; 7; 579; 10.3390/toxics13070579; Yuan Zhang (0), Yanting Li (1), Yang Li (2), Lin Zhao (3), Yongkui Yang (4); affiliations 0 to 4: School of Environmental Science and Engineering, Tianjin University, Tianjin 300350, China; https://www.mdpi.com/2305-6304/13/7/579; machine learning; data augmentation; symbolic regression; PFAS bioaccumulation; quantitative prediction |
| title | Interpretable Machine Learning Models and Symbolic Regressions Reveal Transfer of Per- and Polyfluoroalkyl Substances (PFASs) in Plants: A New Small-Data Machine Learning Method to Augment Data and Obtain Predictive Equations |
| topic | machine learning data augmentation symbolic regression PFAS bioaccumulation quantitative prediction |
| url | https://www.mdpi.com/2305-6304/13/7/579 |
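
The abstract describes expanding a small training set with SMOTE-style oversampling and scoring models by R². The sketch below is only an illustration of those two ideas, not the authors' implementation: it uses the standard library alone, omits the variational autoencoder entirely, and treats each row as features plus target so that the target is interpolated too (an assumption; classic SMOTE is defined for minority-class samples in classification). The function names `smote_like_augment` and `r_squared`, the toy rows, and the choice of eight synthetic rows per original (one possible reading of "expanded nine times") are all hypothetical.

```python
import math
import random


def smote_like_augment(rows, n_new, k=3, seed=0):
    """Return n_new synthetic rows, each interpolated between a randomly
    chosen row and one of its k nearest neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        # Rank all other rows by distance to the base row.
        others = [r for r in rows if r is not base]
        others.sort(key=lambda r: math.dist(base, r))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + t * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Toy rows (two features, one target); the values are made up for illustration.
original = [(1.0, 2.0, 0.5), (2.0, 1.0, 0.7), (3.0, 4.0, 1.1), (4.0, 3.0, 1.3)]
augmented = original + smote_like_augment(original, n_new=8 * len(original))
print(len(augmented))  # 36 rows: 4 original + 32 synthetic
```

Because every synthetic row is a convex combination of two real rows, the augmented data stays inside the range of the original measurements; any held-out test set should be split off before augmentation so that R² is computed on real samples only.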