Interpretable Machine Learning Models and Symbolic Regressions Reveal Transfer of Per- and Polyfluoroalkyl Substances (PFASs) in Plants: A New Small-Data Machine Learning Method to Augment Data and Obtain Predictive Equations

Bibliographic Details
Main Authors: Yuan Zhang, Yanting Li, Yang Li, Lin Zhao, Yongkui Yang
Format: Article
Language: English
Published: MDPI AG 2025-07-01
Series: Toxics
Subjects:
Online Access: https://www.mdpi.com/2305-6304/13/7/579
collection DOAJ
description Machine learning (ML) techniques are becoming increasingly valuable for modeling the transport of pollutants in plant systems. However, two challenges (small sample sizes and a lack of quantitative calculation functions) remain when using ML to predict migration in hydroponic systems. For the bioaccumulation of per- and polyfluoroalkyl substances, we studied the key factors and quantitative calculation equations based on data augmentation, ML, and symbolic regression. First, feature expansion was performed on the input data after data preprocessing; the most important step was data augmentation. The original training set was expanded nine times by combining the synthetic minority oversampling technique and a variational autoencoder. Subsequently, the four ML models were applied to the test set to predict the selected output parameters. Categorical boosting (CatBoost) had the highest prediction accuracy (R² = 0.83). The Shapley Additive Explanation values indicated that molecular weight and exposure time were the most important parameters. We applied three symbolic regression models to obtain accurate prediction equations based on the original and augmented data. Based on augmented data, the high-dimensional sparse interaction equation exhibited the highest accuracy (R² = 0.776). Our results indicate that this method could provide crucial insights into absorption and accumulation in plant roots.
id doaj-art-a007b066afb445e5b4c05415bbf4be50
institution DOAJ
issn 2305-6304
doi 10.3390/toxics13070579
affiliation School of Environmental Science and Engineering, Tianjin University, Tianjin 300350, China (all five authors)
topic machine learning
data augmentation
symbolic regression
PFAS bioaccumulation
quantitative prediction
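The description above says the training set was enlarged by combining the synthetic minority oversampling technique (SMOTE) with a variational autoencoder. The record gives no implementation details, so the following is only a minimal sketch of the SMOTE-style interpolation idea, not the authors' pipeline. It is written in plain Python and omits the variational autoencoder entirely. The function name `smote_like_augment`, the parameters `factor`, `k`, and `seed`, the reading of "expanded nine times" as a ninefold total size, and the toy input values are all assumptions made for illustration.

```python
import math
import random


def smote_like_augment(rows, targets, factor=9, k=3, seed=0):
    """Grow a small regression data set to `factor` times its size.

    Each synthetic sample is a linear interpolation between a randomly
    chosen original row and one of its k nearest neighbours; the target
    value is interpolated with the same weight. Hypothetical helper,
    not taken from the paper.
    """
    rng = random.Random(seed)
    n = len(rows)

    # Indices of the k nearest neighbours (Euclidean) of every row.
    neighbours = []
    for i, row in enumerate(rows):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(row, rows[j]),
        )
        neighbours.append(others[:k])

    new_rows, new_targets = list(rows), list(targets)
    for _ in range((factor - 1) * n):
        i = rng.randrange(n)
        j = rng.choice(neighbours[i])
        lam = rng.random()  # interpolation weight in [0, 1)
        new_rows.append([a + lam * (b - a) for a, b in zip(rows[i], rows[j])])
        new_targets.append(targets[i] + lam * (targets[j] - targets[i]))
    return new_rows, new_targets


# Toy inputs (made-up numbers, not data from the paper):
# columns are molecular weight and exposure time.
X = [[414.1, 1.0], [500.1, 2.0], [314.1, 4.0], [614.1, 7.0], [214.0, 10.0]]
y = [0.8, 1.1, 0.5, 1.6, 0.3]

X_aug, y_aug = smote_like_augment(X, y)
print(len(X_aug))  # 45
```

Interpolating only between near neighbours keeps synthetic points inside the range already covered by the measured samples, which is why the augmented set should still be evaluated against an untouched test set.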