Overcoming the challenges of data integration in ecosystem studies with machine learning workflows: an example from the Santos project
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Instituto Oceanográfico da Universidade de São Paulo, 2024-04-01 |
| Series: | Ocean and Coastal Research |
| Subjects: | |
| Online Access: | https://www.journals.usp.br/ocr/article/view/222935 |
| Summary: | Integrating intricate environmental data within a unified analytical framework for extensive conservation and monitoring initiatives encounters several challenges. These challenges encompass defining a conceptual model outlining cause-and-effect relationships, addressing dissimilarities in data source quantity and information content, grappling with missing or noisy data, fine-tuning model optimization, achieving accurate predictions, and tackling the issue of imbalanced observations across factors. In the context of the Santos project, dedicated to comprehending the spatio-temporal dynamics of benthic, pelagic, and physical systems for the facilitation of conservation and monitoring programs, the application of machine learning's random forest (RF) technique for modeling univariate data offers notable advantages. This approach adeptly handles non-linearity, covariation, and interactive effects among predictors. For modeling multivariate data sets, a hybrid strategy combining a self-organizing map (SOM) and RF is harnessed to effectively tackle the challenges. Addressing missing values, the bagging imputation technique demonstrated superior performance compared to other methods. Both machine learning techniques discussed herein exhibit resilience against the impact of noisy data, yet the identification of noisy data remains feasible based on model outputs. In scenarios of imbalanced data sets, we investigate the correlation between the RF model's overall statistics and those of individual classes. The joint interpretation of these statistics aids in comprehending model limitations and facilitates discussions on the environmental mechanisms shaping observed patterns. We propose two analytical workflows that not only enable the exploration and enhancement of model accuracy but also facilitate the investigation of potential cause-and-effect relationships inherent in the data. Furthermore, these workflows lay the foundation for implementing long-term learning algorithms, a pivotal increment for monitoring initiatives. Notably, these workflows, alongside the discussed analytical challenges, can be seamlessly implemented within iMESc, an open-source application. |
| ISSN: | 2675-2824 |
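
The summary's point about reading a classifier's overall statistics alongside those of individual classes can be illustrated with a small sketch. This is not the authors' workflow or the iMESc implementation. The function name `per_class_stats` and the labels below are hypothetical, and the data are made up purely to show how a high overall accuracy can coexist with poor recall on a rare class in an imbalanced data set.

```python
from collections import Counter


def per_class_stats(y_true, y_pred):
    """Return overall accuracy plus per-class recall and precision.

    Hypothetical helper for illustration; works on any two equal-length
    sequences of hashable class labels.
    """
    labels = sorted(set(y_true) | set(y_pred))
    # Count each (observed, predicted) pair; missing pairs count as 0.
    pairs = Counter(zip(y_true, y_pred))
    overall = sum(pairs[(c, c)] for c in labels) / len(y_true)
    stats = {}
    for c in labels:
        tp = pairs[(c, c)]
        actual = sum(pairs[(c, p)] for p in labels)      # observed as c
        predicted = sum(pairs[(t, c)] for t in labels)   # predicted as c
        stats[c] = {
            "recall": tp / actual if actual else float("nan"),
            "precision": tp / predicted if predicted else float("nan"),
        }
    return overall, stats


# Made-up imbalanced example: 90 "common" observations and 10 "rare" ones.
y_true = ["common"] * 90 + ["rare"] * 10
# The hypothetical model labels 8 of the 10 rare observations as "common".
y_pred = ["common"] * 98 + ["rare"] * 2

overall, stats = per_class_stats(y_true, y_pred)
print(f"overall accuracy: {overall:.2f}")
for label, s in stats.items():
    print(f"{label}: recall={s['recall']:.2f}, precision={s['precision']:.2f}")
```

In this toy case the overall accuracy looks strong while the recall for the rare class is low, which is the kind of discrepancy a joint reading of overall and per-class statistics is meant to expose.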