StratLearn-z: Improved photo-$z$ estimation from spectroscopic data subject to selection effects

A precise measurement of photometric redshifts (photo-z) is crucial for the success of modern photometric galaxy surveys. Machine learning (ML) methods show great promise in this context, but suffer from covariate shift in training sets due to selection bias where interesting sources, e.g., high red...

Bibliographic Details
Main Authors: Chiara Moretti, Maximilian Autenrieth, Riccardo Serra, Roberto Trotta, David A. van Dyk, Andrei Mesinger
Format: Article
Language: English
Published: Maynooth Academic Publishing 2025-05-01
Series: The Open Journal of Astrophysics
Online Access: https://doi.org/10.33232/001c.137525
author Chiara Moretti
Maximilian Autenrieth
Riccardo Serra
Roberto Trotta
David A. van Dyk
Andrei Mesinger
collection DOAJ
description A precise measurement of photometric redshifts (photo-z) is crucial for the success of modern photometric galaxy surveys. Machine learning (ML) methods show great promise in this context, but suffer from covariate shift in training sets due to selection bias: interesting sources, e.g., high-redshift objects, are underrepresented, and the corresponding ML models exhibit poor generalisation properties. We present an application of the StratLearn method to the estimation of photo-z (StratLearn-z), validating against simulations where we enforce the presence of covariate shift to different degrees. StratLearn is a statistically principled approach which relies on splitting the combined source and target datasets into strata, based on estimated propensity scores. The propensity score is the probability that an object in the dataset belongs to the source set, given its observed covariates. After stratification, two conditional density estimators are fit separately within each stratum, and then combined via a weighted average. We benchmark our results against the GPz algorithm, quantifying the performance of the two algorithms with a set of metrics. Our results show that the StratLearn-z metrics are only marginally affected by the presence of covariate shift, while GPz shows a significant degradation of performance, specifically concerning the photo-z prediction for fainter objects for which there is little training data. In particular, for the strongest covariate shift scenario considered, StratLearn-z yields a reduced fraction of catastrophic errors, a factor of 2 improvement in the RMSE, and an order-of-magnitude improvement in the bias. We also assess the quality of the predicted conditional redshift estimates using the probability integral transform (PIT) and the continuous ranked probability score (CRPS). The PIT for StratLearn-z indicates that predictions are well-centered around the true redshift value, if conservative in their variance; the CRPS shows marked improvement at high redshifts when compared with GPz. Our Julia implementation of the method, StratLearn-z, is publicly available at https://github.com/chiaramoretti/StratLearn-z.
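The stratification and combination steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Julia implementation. It assumes the propensity scores have already been estimated by some classifier, and the function names (`assign_strata`, `split_by_stratum`, `combine_densities`), the choice of five quantile-based strata, and the mixing weight `weight_a` are all hypothetical choices made for illustration only.

```python
from bisect import bisect_right


def assign_strata(propensity_scores, n_strata=5):
    """Map each propensity score to a stratum index in 0..n_strata-1.

    Assumption of this sketch: cut points are empirical quantiles of the
    combined source and target scores, so strata are roughly equal in size.
    Stratum 0 holds the lowest scores.
    """
    if not propensity_scores:
        raise ValueError("need at least one propensity score")
    ordered = sorted(propensity_scores)
    n = len(ordered)
    # Interior cut points at the k/n_strata quantiles of the combined sample.
    cuts = [ordered[(k * n) // n_strata] for k in range(1, n_strata)]
    return [bisect_right(cuts, score) for score in propensity_scores]


def split_by_stratum(strata, is_source):
    """Group object indices by stratum, keeping source and target apart."""
    groups = {}
    for idx, (stratum, src) in enumerate(zip(strata, is_source)):
        bucket = groups.setdefault(stratum, {"source": [], "target": []})
        bucket["source" if src else "target"].append(idx)
    return groups


def combine_densities(density_a, density_b, weight_a):
    """Weighted average of two conditional densities on a shared redshift grid.

    weight_a is a hypothetical mixing weight in [0, 1]; how it is chosen
    is not specified in the abstract.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(density_a, density_b)]


# Toy example: ten objects with made-up propensity scores.
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
is_source = [False, False, False, True, False, True, True, False, True, True]
strata = assign_strata(scores)
groups = split_by_stratum(strata, is_source)
```

In a full pipeline, the two density estimators would be trained on the `source` indices of each stratum and evaluated on that stratum's `target` indices before being passed to `combine_densities`. Strata with few source objects, like stratum 0 in the toy data, are where covariate shift is strongest.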
format Article
id doaj-art-2188df5ab8324211b4d150f4aba0cab8
institution DOAJ
issn 2565-6120
language English
publishDate 2025-05-01
publisher Maynooth Academic Publishing
record_format Article
series The Open Journal of Astrophysics
spelling doaj-art-2188df5ab8324211b4d150f4aba0cab8 2025-08-20T03:17:44Z eng Maynooth Academic Publishing The Open Journal of Astrophysics 2565-6120 2025-05-01 8 10.33232/001c.137525 https://doi.org/10.33232/001c.137525
title StratLearn-z: Improved photo-$z$ estimation from spectroscopic data subject to selection effects
url https://doi.org/10.33232/001c.137525