Overcoming the challenges of data integration in ecosystem studies with machine learning workflows: an example from the Santos project

Integrating intricate environmental data within a unified analytical framework for extensive conservation and monitoring initiatives encounters several challenges. These challenges encompass defining a conceptual model outlining cause-and-effect relationships, addressing dissimilarities in data sou...

Bibliographic Details
Main Authors: Gustavo Fonseca, Danilo Candido Vieira
Format: Article
Language: English
Published: Instituto Oceanográfico da Universidade de São Paulo 2024-04-01
Series: Ocean and Coastal Research
Subjects: Self-organizing map; Random forest; Oceanography; Modeling; Santos basin
Online Access: https://www.journals.usp.br/ocr/article/view/222935
author Gustavo Fonseca
Danilo Candido Vieira
author_facet Gustavo Fonseca
Danilo Candido Vieira
author_sort Gustavo Fonseca
collection DOAJ
description Integrating intricate environmental data within a unified analytical framework for extensive conservation and monitoring initiatives encounters several challenges. These challenges encompass defining a conceptual model outlining cause-and-effect relationships, addressing dissimilarities in data source quantity and information content, grappling with missing or noisy data, fine-tuning model optimization, achieving accurate predictions, and tackling the issue of imbalanced observations across factors. In the context of the Santos project, dedicated to comprehending the spatio-temporal dynamics of benthic, pelagic, and physical systems for the facilitation of conservation and monitoring programs, the application of machine learning's random forest (RF) technique for modeling univariate data offers notable advantages. This approach adeptly handles non-linearity, covariation, and interactive effects among predictors. For modeling multivariate data sets, a hybrid strategy combining a self-organizing map (SOM) and RF is harnessed to effectively tackle the challenges. Addressing missing values, the bagging imputation technique demonstrated superior performance compared to other methods. Both machine learning techniques discussed herein exhibit resilience against the impact of noisy data, yet the identification of noisy data remains feasible based on model outputs. In scenarios of imbalanced data sets, we investigate the correlation between the RF model's overall statistics and those of individual classes. The joint interpretation of these statistics aids in comprehending model limitations and facilitates discussions on the environmental mechanisms shaping observed patterns. We propose two analytical workflows that not only enable the exploration and enhancement of model accuracy but also facilitate the investigation of potential cause-and-effect relationships inherent in the data. Furthermore, these workflows lay the foundation for implementing long-term learning algorithms, a pivotal increment for monitoring initiatives. Notably, these workflows, alongside the discussed analytical challenges, can be seamlessly implemented within iMESc, an open-source application.
format Article
id doaj-art-93c969301d6d4073a1712c4f8a55d2f8
institution Kabale University
issn 2675-2824
language English
publishDate 2024-04-01
publisher Instituto Oceanográfico da Universidade de São Paulo
record_format Article
series Ocean and Coastal Research
spelling doaj-art-93c969301d6d4073a1712c4f8a55d2f8
2025-08-20T03:51:59Z
eng
Instituto Oceanográfico da Universidade de São Paulo
Ocean and Coastal Research
2675-2824
2024-04-01
71
Suppl. 3
title Overcoming the challenges of data integration in ecosystem studies with machine learning workflows: an example from the Santos project
topic Self-organizing map
Random forest
Oceanography
Modeling
Santos basin
url https://www.journals.usp.br/ocr/article/view/222935