Addressing data imbalance in collision risk prediction with active generative oversampling
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-03-01 |
| Series: | Scientific Reports |
| Subjects: | |
| Online Access: | https://doi.org/10.1038/s41598-025-93851-3 |
| Summary: | Data imbalance is a critical factor affecting the predictive accuracy in collision risk assessment. This study proposes an advanced active generative oversampling method based on Query by Committee (QBC) and Auxiliary Classifier Generative Adversarial Network (ACGAN), integrated with the Wasserstein Generative Adversarial Network (WGAN) framework. Our method selectively enriches minority class samples through QBC and diversity metrics to enhance the diversity of sample generation, thereby improving the performance of fault classification algorithms. By equating the labels of selected samples to those of real samples, we increase the accuracy of the discriminator, forcing the generator to produce more diverse outputs, which is expected to improve classification results. We also propose a method for dynamically adjusting the training epochs of the generator and discriminator based on loss differences to achieve balance in model training. Empirical analysis on four publicly available imbalanced datasets shows that our method outperforms existing methods in terms of precision, recall, F-measure, and G-mean. Specifically, our method’s results are above 0.92 on all evaluation indicators, with an average improvement of 23–28.3% compared to the worst-performing ENN method. This indicates that our method has a significant advantage in handling data imbalance, being able to more accurately identify collision samples and reduce the misclassification rate of non-collision samples. |
|---|---|
| ISSN: | 2045-2322 |
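
The summary reports results in terms of precision, recall, F-measure, and G-mean. As a rough illustration of how these four indicators are commonly computed for a binary imbalanced problem, the sketch below uses the standard textbook definitions, with G-mean taken as the geometric mean of recall (sensitivity) and specificity. The function name `imbalance_metrics`, the choice of label `1` for the minority (collision) class, and the toy labels are assumptions made for illustration; none of them come from the article, and the article's exact evaluation protocol may differ.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return precision, recall, F-measure and G-mean for a binary task.

    `positive` marks the minority class (assumed here to be the
    collision class); every other label is treated as negative.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    # F-measure: harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0

    # G-mean: geometric mean of recall and specificity, so a model
    # that ignores either class scores close to zero.
    g_mean = math.sqrt(recall * specificity)

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


# Toy data (hypothetical): 3 collision samples, 7 non-collision samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
scores = imbalance_metrics(y_true, y_pred)
print(scores)
```

On the toy labels there are 2 true positives, 1 false positive, 1 false negative, and 6 true negatives, so precision, recall, and F-measure are each 2/3 while G-mean is sqrt(4/7), roughly 0.756. G-mean is useful alongside F-measure on imbalanced data because it also accounts for how well the majority (non-collision) class is recognized.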