Regression for Astronomical Data with Realistic Distributions, Errors, and Nonlinearity

Bibliographic Details
Main Authors: Tao Jing, Cheng Li
Format: Article
Language: English
Published: IOP Publishing 2025-01-01
Series: The Astronomical Journal
Online Access: https://doi.org/10.3847/1538-3881/add891
Description
Summary: We have developed a new regression technique, the maximum likelihood (ML)–based method and its variant, the Kolmogorov–Smirnov (KS) test–based method, designed to obtain unbiased regression results from typical astronomical data. A normalizing flow model is employed to automatically estimate the unobservable intrinsic distribution of the independent variable and the unobservable correlation between uncertainty level and intrinsic value of both independent and dependent variables from the observed data points in a variational-inference-based empirical Bayes approach. By incorporating these estimated distributions, our method comprehensively accounts for the uncertainties associated with both independent and dependent variables. Our tests on both mock data and real astronomical data from PHANGS-ALMA and PHANGS-JWST demonstrate that, given a sufficiently large sample size (>1000), both the ML-based method and the KS-test-based method significantly outperform existing widely used methods, particularly in cases of low signal-to-noise ratios. The KS-test-based method exhibits remarkable robustness against deviations from underlying assumptions, complex intrinsic distributions, varying correlations between uncertainty levels and intrinsic values, inaccuracies in uncertainty estimations, outliers, and saturation effects. For sample sizes between 300 and 1000, the ML-based method yields the best performance. In the low-data regime (<300), the ML-based method maintains comparable performance to other state-of-the-art methods. A GPU-compatible Python implementation of our methods, nicknamed “raddest,” will be made publicly available upon acceptance of this paper.
ISSN: 1538-3881
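
Illustrative note: the summary stresses accounting for uncertainties in the independent variable as well as the dependent one. The sketch below is not the authors' method and does not use the "raddest" package. It only shows, on mock data, the well-known attenuation of an ordinary least-squares slope when the independent variable is noisy. All names (`make_mock`, `ols_slope`) and parameter values are hypothetical. The simple correction at the end assumes a Gaussian intrinsic distribution with known variance and a single known error level, which are the kind of idealized assumptions the summary says the new methods avoid.

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x, ignoring errors in x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx


def make_mock(n, true_slope, intercept, sigma_x, sigma_y, seed=0):
    """Mock sample: intrinsic x ~ N(0, 1), Gaussian noise added to x and y."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x_obs = [xt + rng.gauss(0.0, sigma_x) for xt in x_true]
    y_obs = [intercept + true_slope * xt + rng.gauss(0.0, sigma_y)
             for xt in x_true]
    return x_obs, y_obs


TRUE_SLOPE = 2.0      # assumed intrinsic relation: y = 0.5 + 2 x
INTRINSIC_VAR = 1.0   # variance of the intrinsic x distribution (assumed known)
SIGMA_X = 1.0         # measurement error on x (low signal-to-noise case)

x_obs, y_obs = make_mock(n=20000, true_slope=TRUE_SLOPE, intercept=0.5,
                         sigma_x=SIGMA_X, sigma_y=0.5)

# The naive fit is biased toward zero by var(x) / (var(x) + sigma_x^2).
naive_slope = ols_slope(x_obs, y_obs)
attenuation = INTRINSIC_VAR / (INTRINSIC_VAR + SIGMA_X ** 2)

# Undo the bias, valid only under the idealized assumptions stated above.
corrected_slope = naive_slope / attenuation

print(f"true slope      : {TRUE_SLOPE:.3f}")
print(f"naive OLS slope : {naive_slope:.3f}")
print(f"corrected slope : {corrected_slope:.3f}")
```

With equal intrinsic and error variances, the naive slope comes out near half the true value. When the intrinsic distribution is not Gaussian or the error level varies with the intrinsic value, this one-line correction no longer applies, which is the situation the summary describes.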