Fast, Robust, and Learned Distribution-Based Sorting

Despite a long history of research and achievements in designing sorting routines, new hardware features and application requirements pose advanced challenges that computer scientists are recently tackling through novel data-aware (learned) algorithms. This paper aims to better understand the streng...

Full description

Saved in:
Bibliographic Details
Main Authors: Paolo Ferragina, Mattia Odorisio
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10918649/
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1850023106680193024
author Paolo Ferragina
Mattia Odorisio
author_facet Paolo Ferragina
Mattia Odorisio
author_sort Paolo Ferragina
collection DOAJ
description Despite a long history of research and achievements in designing sorting routines, new hardware features and application requirements pose advanced challenges that computer scientists are recently tackling through novel data-aware (learned) algorithms. This paper aims to better understand the strengths and limitations of learned sorters for numerical data, by addressing two main research and engineering questions: (Q1) Which is the best trade-off, in sorting items, between the efficacy of a “learned” model in approximating the distribution of the input data and its time-space complexity? and (Q2) Which algorithmic structure for distribution-based sorting is suited to leverage a learned model in order to achieve state-of-the-art performance? To answer Q1, we implement a variant of the best-known learned model (i.e., the two-layers RMI) and, also, design a fully new learned model that turns out to be space-time efficient and efficacious in approximating adversarial input distributions. We experimentally test them over 11 datasets of 200M and 800M 64-bit floating-point items, thus offering a comprehensive answer to Q1. We then address Q2 by plugging the best-learned models from above into a distribution-based sorting scheme that leads to three new sorters whose performance is tested over 33 datasets (real and synthetic) of sizes up to 800M items. Our experimental results will show that our sorters perform better than 6 (classic and learned) best-known sorters on 29 out of those 33 datasets, thus achieving new state-of-the-art tradeoffs.
format Article
id doaj-art-1f30e175a781437a928fa88685bf0bb4
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-1f30e175a781437a928fa88685bf0bb42025-08-20T03:01:28ZengIEEEIEEE Access2169-35362025-01-0113451984521410.1109/ACCESS.2025.354962610918649Fast, Robust, and Learned Distribution-Based SortingPaolo Ferragina0https://orcid.org/0000-0003-1353-360XMattia Odorisio1https://orcid.org/0009-0001-5320-8513Department of L’EMbeDS, Sant’Anna School of Advanced Studies, Pisa, ItalyDepartment of Computer Science, University of Pisa, Pisa, ItalyDespite a long history of research and achievements in designing sorting routines, new hardware features and application requirements pose advanced challenges that computer scientists are recently tackling through novel data-aware (learned) algorithms. This paper aims to better understand the strengths and limitations of learned sorters for numerical data, by addressing two main research and engineering questions: (Q1) Which is the best trade-off, in sorting items, between the efficacy of a “learned” model in approximating the distribution of the input data and its time-space complexity? and (Q2) Which algorithmic structure for distribution-based sorting is suited to leverage a learned model in order to achieve state-of-the-art performance? To answer Q1, we implement a variant of the best-known learned model (i.e., the two-layers RMI) and, also, design a fully new learned model that turns out to be space-time efficient and efficacious in approximating adversarial input distributions. We experimentally test them over 11 datasets of 200M and 800M 64-bit floating-point items, thus offering a comprehensive answer to Q1. We then address Q2 by plugging the best-learned models from above into a distribution-based sorting scheme that leads to three new sorters whose performance is tested over 33 datasets (real and synthetic) of sizes up to 800M items. Our experimental results will show that our sorters perform better than 6 (classic and learned) best-known sorters on 29 out of those 33 datasets, thus achieving new state-of-the-art tradeoffs.https://ieeexplore.ieee.org/document/10918649/Sortinglearned sortlearned indexeslearned-based data structures
spellingShingle Paolo Ferragina
Mattia Odorisio
Fast, Robust, and Learned Distribution-Based Sorting
IEEE Access
Sorting
learned sort
learned indexes
learned-based data structures
title Fast, Robust, and Learned Distribution-Based Sorting
title_full Fast, Robust, and Learned Distribution-Based Sorting
title_fullStr Fast, Robust, and Learned Distribution-Based Sorting
title_full_unstemmed Fast, Robust, and Learned Distribution-Based Sorting
title_short Fast, Robust, and Learned Distribution-Based Sorting
title_sort fast robust and learned distribution based sorting
topic Sorting
learned sort
learned indexes
learned-based data structures
url https://ieeexplore.ieee.org/document/10918649/
work_keys_str_mv AT paoloferragina fastrobustandlearneddistributionbasedsorting
AT mattiaodorisio fastrobustandlearneddistributionbasedsorting