Aspen Open Jets: unlocking LHC data for foundation models in particle physics

Foundation models are deep learning models pre-trained on large amounts of data which are capable of generalizing to multiple datasets and/or downstream tasks. This work demonstrates how data collected by the CMS experiment at the Large Hadron Collider can be useful in pre-training foundation models...

Bibliographic Details
Main Authors: Oz Amram, Luca Anzalone, Joschka Birk, Darius A Faroughy, Anna Hallin, Gregor Kasieczka, Michael Krämer, Ian Pang, Humberto Reyes-Gonzalez, David Shih
Format: Article
Language: English
Published: IOP Publishing 2025-01-01
Series: Machine Learning: Science and Technology
Subjects: foundation models; particle physics; transfer learning; generation; jet physics; transformer
Online Access: https://doi.org/10.1088/2632-2153/ade58f
collection DOAJ
description Foundation models are deep learning models pre-trained on large amounts of data that can generalize to multiple datasets and/or downstream tasks. This work demonstrates how data collected by the CMS experiment at the Large Hadron Collider can be useful in pre-training foundation models for high-energy physics (HEP). Specifically, we introduce the AspenOpenJets (AOJs) dataset, consisting of approximately 178 million high-$p_\mathrm{T}$ jets derived from CMS 2016 Open Data. We show how pre-training the OmniJet-α foundation model on AOJs improves performance on generative tasks with significant domain shift: generating boosted top and QCD jets from the simulated JetClass dataset. In addition to demonstrating the power of pre-training a jet-based foundation model on actual proton–proton collision data, we provide the ML-ready derived AOJs dataset for further public use.
id doaj-art-ec5cd64776b7481f8aade6b7a63e0901
institution DOAJ
issn 2632-2153
spelling doaj-art-ec5cd64776b7481f8aade6b7a63e0901 2025-08-20T02:44:16Z eng
IOP Publishing, Machine Learning: Science and Technology, ISSN 2632-2153, 2025-01-01, vol. 6, no. 3, 030601, doi:10.1088/2632-2153/ade58f
Aspen Open Jets: unlocking LHC data for foundation models in particle physics
Oz Amram (https://orcid.org/0000-0002-3765-3123), Fermi National Accelerator Laboratory, Batavia, IL 60510, United States of America
Luca Anzalone (https://orcid.org/0000-0002-0399-8836), Department of Physics and Astronomy (DIFA), University of Bologna, 40127 Bologna, Italy; Istituto Nazionale di Fisica Nucleare (INFN), 40127 Bologna, Italy
Joschka Birk (https://orcid.org/0000-0002-1931-0127), Institut für Experimentalphysik, Universität Hamburg, 22761 Hamburg, Germany
Darius A Faroughy (https://orcid.org/0000-0002-4027-5477), NHETC, Dept. of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854, United States of America
Anna Hallin (https://orcid.org/0000-0002-1551-814X), Institut für Experimentalphysik, Universität Hamburg, 22761 Hamburg, Germany
Gregor Kasieczka (https://orcid.org/0000-0003-3457-2755), Institut für Experimentalphysik, Universität Hamburg, 22761 Hamburg, Germany; Center for Data and Computing in Natural Sciences (CDCS), 22607 Hamburg, Germany
Michael Krämer (https://orcid.org/0000-0002-3089-6827), Institut für Theoretische Teilchenphysik und Kosmologie, RWTH Aachen University, 52074 Aachen, Germany
Ian Pang (https://orcid.org/0000-0002-8225-7269), NHETC, Dept. of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854, United States of America
Humberto Reyes-Gonzalez (https://orcid.org/0000-0003-3283-5208), Institut für Theoretische Teilchenphysik und Kosmologie, RWTH Aachen University, 52074 Aachen, Germany
David Shih (https://orcid.org/0000-0003-3408-3871), NHETC, Dept. of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854, United States of America
topic foundation models
particle physics
transfer learning
generation
jet physics
transformer