A framework for scalable ambient air pollution concentration estimation

Bibliographic Details
Main Authors: Liam J. Berrisford, Lucy S. Neal, Helen J. Buttery, Benjamin R. Evans, Ronaldo Menezes
Format: Article
Language: English
Published: Cambridge University Press 2025-01-01
Series: Environmental Data Science
Subjects: air quality; data science; machine learning; sustainable development; urban resilience and justice
Online Access: https://www.cambridge.org/core/product/identifier/S2634460225000093/type/journal_article
author Liam J. Berrisford
Lucy S. Neal
Helen J. Buttery
Benjamin R. Evans
Ronaldo Menezes
author_sort Liam J. Berrisford
collection DOAJ
description Ambient air pollution remains a global challenge, with adverse impacts on health and the environment. Addressing air pollution requires reliable data on pollutant concentrations, which form the foundation for interventions aimed at improving air quality. However, in many regions, including the United Kingdom, air pollution monitoring networks are characterized by spatial sparsity, heterogeneous placement, and frequent temporal data gaps, often due to issues such as power outages. We introduce a scalable, data-driven supervised machine learning framework designed to address temporal and spatial data gaps by filling missing measurements within the United Kingdom. The framework uses LightGBM, a gradient boosting algorithm based on decision trees, for efficient and scalable modeling. This approach provides a comprehensive dataset for England throughout 2018 at a 1 km² hourly resolution. Leveraging machine learning techniques and real-world data from the sparsely distributed monitoring stations, we generate 355,827 synthetic monitoring stations across the study area. Validation was conducted to assess the model’s performance in forecasting, estimating missing locations, and capturing peak concentrations. The resulting dataset is of particular interest to a diverse range of stakeholders engaged in downstream assessments supported by outdoor air pollution concentration data for nitrogen dioxide (NO2), ozone (O3), particulate matter with a diameter of 10 μm or less (PM10), particulate matter with a diameter of 2.5 μm or less (PM2.5), and sulphur dioxide (SO2), at a higher resolution than was previously possible.
format Article
id doaj-art-6ed19bf4ab754b929be353beb2250f47
institution DOAJ
issn 2634-4602
language English
publishDate 2025-01-01
publisher Cambridge University Press
record_format Article
series Environmental Data Science
spelling doaj-art-6ed19bf4ab754b929be353beb2250f47
2025-08-20T03:01:15Z
eng
Cambridge University Press
Environmental Data Science
2634-4602
2025-01-01
4
10.1017/eds.2025.9
A framework for scalable ambient air pollution concentration estimation
Liam J. Berrisford (https://orcid.org/0000-0001-6578-3497): BioComplex Laboratory, Department of Computer Science, University of Exeter, Exeter, UK; Department of Mathematics, University of Exeter, Exeter, UK; UKRI Centre for Doctoral Training in Environmental Intelligence, University of Exeter, Exeter, UK
Lucy S. Neal: Met Office, Exeter, UK
Helen J. Buttery (https://orcid.org/0009-0009-9726-5315): Met Office, Exeter, UK
Benjamin R. Evans (https://orcid.org/0000-0003-4696-596X): Met Office, Exeter, UK
Ronaldo Menezes (https://orcid.org/0000-0002-6479-6429): BioComplex Laboratory, Department of Computer Science, University of Exeter, Exeter, UK; Department of Computer Science, Federal University of Ceará, Fortaleza, Brazil
https://www.cambridge.org/core/product/identifier/S2634460225000093/type/journal_article
air quality
data science
machine learning
sustainable development
urban resilience and justice
title A framework for scalable ambient air pollution concentration estimation
title_sort framework for scalable ambient air pollution concentration estimation
topic air quality
data science
machine learning
sustainable development
urban resilience and justice
url https://www.cambridge.org/core/product/identifier/S2634460225000093/type/journal_article