Aspen Open Jets: unlocking LHC data for foundation models in particle physics
Foundation models are deep learning models pre-trained on large amounts of data which are capable of generalizing to multiple datasets and/or downstream tasks. This work demonstrates how data collected by the CMS experiment at the Large Hadron Collider can be useful in pre-training foundation models...
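| Main Authors: | Oz Amram, Luca Anzalone, Joschka Birk, Darius A Faroughy, Anna Hallin, Gregor Kasieczka, Michael Krämer, Ian Pang, Humberto Reyes-Gonzalez, David Shih |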
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IOP Publishing, 2025-01-01 |
| Series: | Machine Learning: Science and Technology |
| Subjects: | foundation models; particle physics; transfer learning; generation; jet physics; transformer |
| Online Access: | https://doi.org/10.1088/2632-2153/ade58f |
| _version_ | 1850083508318371840 |
|---|---|
| author | Oz Amram, Luca Anzalone, Joschka Birk, Darius A Faroughy, Anna Hallin, Gregor Kasieczka, Michael Krämer, Ian Pang, Humberto Reyes-Gonzalez, David Shih |
| author_facet | Oz Amram, Luca Anzalone, Joschka Birk, Darius A Faroughy, Anna Hallin, Gregor Kasieczka, Michael Krämer, Ian Pang, Humberto Reyes-Gonzalez, David Shih |
| author_sort | Oz Amram |
| collection | DOAJ |
| description | Foundation models are deep learning models pre-trained on large amounts of data that are capable of generalizing to multiple datasets and/or downstream tasks. This work demonstrates how data collected by the CMS experiment at the Large Hadron Collider can be useful in pre-training foundation models for HEP. Specifically, we introduce the AspenOpenJets (AOJs) dataset, consisting of approximately 178M high-$p_\mathrm{T}$ jets derived from CMS 2016 Open Data. We show how pre-training the OmniJet-α foundation model on AOJs improves performance on generative tasks with significant domain shift: generating boosted top and QCD jets from the simulated JetClass dataset. In addition to demonstrating the power of pre-training a jet-based foundation model on actual proton–proton collision data, we provide the ML-ready derived AOJs dataset for further public use. |
| format | Article |
| id | doaj-art-ec5cd64776b7481f8aade6b7a63e0901 |
| institution | DOAJ |
| issn | 2632-2153 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IOP Publishing |
| record_format | Article |
| series | Machine Learning: Science and Technology |
| spelling | doaj-art-ec5cd64776b7481f8aade6b7a63e0901 · 2025-08-20T02:44:16Z · eng · IOP Publishing · Machine Learning: Science and Technology · 2632-2153 · 2025-01-01 · 6 · 3 · 030601 · 10.1088/2632-2153/ade58f · Aspen Open Jets: unlocking LHC data for foundation models in particle physics · Oz Amram [0] https://orcid.org/0000-0002-3765-3123 · Luca Anzalone [1] https://orcid.org/0000-0002-0399-8836 · Joschka Birk [2] https://orcid.org/0000-0002-1931-0127 · Darius A Faroughy [3] https://orcid.org/0000-0002-4027-5477 · Anna Hallin [4] https://orcid.org/0000-0002-1551-814X · Gregor Kasieczka [5] https://orcid.org/0000-0003-3457-2755 · Michael Krämer [6] https://orcid.org/0000-0002-3089-6827 · Ian Pang [7] https://orcid.org/0000-0002-8225-7269 · Humberto Reyes-Gonzalez [8] https://orcid.org/0000-0003-3283-5208 · David Shih [9] https://orcid.org/0000-0003-3408-3871 · [0] Fermi National Accelerator Laboratory, Batavia, IL 60510, United States of America · [1] Department of Physics and Astronomy (DIFA), University of Bologna, 40127 Bologna, Italy; Istituto Nazionale di Fisica Nucleare (INFN), 40127 Bologna, Italy · [2] Institut für Experimentalphysik, Universität Hamburg, 22761 Hamburg, Germany · [3] NHETC, Dept. of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854, United States of America · [4] Institut für Experimentalphysik, Universität Hamburg, 22761 Hamburg, Germany · [5] Institut für Experimentalphysik, Universität Hamburg, 22761 Hamburg, Germany; Center for Data and Computing in Natural Sciences (CDCS), 22607 Hamburg, Germany · [6] Institut für Theoretische Teilchenphysik und Kosmologie, RWTH Aachen University, 52074 Aachen, Germany · [7] NHETC, Dept. of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854, United States of America · [8] Institut für Theoretische Teilchenphysik und Kosmologie, RWTH Aachen University, 52074 Aachen, Germany · [9] NHETC, Dept. of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854, United States of America · Foundation models are deep learning models pre-trained on large amounts of data that are capable of generalizing to multiple datasets and/or downstream tasks. This work demonstrates how data collected by the CMS experiment at the Large Hadron Collider can be useful in pre-training foundation models for HEP. Specifically, we introduce the AspenOpenJets (AOJs) dataset, consisting of approximately 178M high-$p_\mathrm{T}$ jets derived from CMS 2016 Open Data. We show how pre-training the OmniJet-α foundation model on AOJs improves performance on generative tasks with significant domain shift: generating boosted top and QCD jets from the simulated JetClass dataset. In addition to demonstrating the power of pre-training a jet-based foundation model on actual proton–proton collision data, we provide the ML-ready derived AOJs dataset for further public use. · https://doi.org/10.1088/2632-2153/ade58f · foundation models · particle physics · transfer learning · generation · jet physics · transformer |
| spellingShingle | Oz Amram, Luca Anzalone, Joschka Birk, Darius A Faroughy, Anna Hallin, Gregor Kasieczka, Michael Krämer, Ian Pang, Humberto Reyes-Gonzalez, David Shih; Aspen Open Jets: unlocking LHC data for foundation models in particle physics; Machine Learning: Science and Technology; foundation models, particle physics, transfer learning, generation, jet physics, transformer |
| title | Aspen Open Jets: unlocking LHC data for foundation models in particle physics |
| title_full | Aspen Open Jets: unlocking LHC data for foundation models in particle physics |
| title_fullStr | Aspen Open Jets: unlocking LHC data for foundation models in particle physics |
| title_full_unstemmed | Aspen Open Jets: unlocking LHC data for foundation models in particle physics |
| title_short | Aspen Open Jets: unlocking LHC data for foundation models in particle physics |
| title_sort | aspen open jets unlocking lhc data for foundation models in particle physics |
| topic | foundation models; particle physics; transfer learning; generation; jet physics; transformer |
| url | https://doi.org/10.1088/2632-2153/ade58f |
| work_keys_str_mv | AT ozamram aspenopenjetsunlockinglhcdataforfoundationmodelsinparticlephysics AT lucaanzalone aspenopenjetsunlockinglhcdataforfoundationmodelsinparticlephysics AT joschkabirk aspenopenjetsunlockinglhcdataforfoundationmodelsinparticlephysics AT dariusafaroughy aspenopenjetsunlockinglhcdataforfoundationmodelsinparticlephysics AT annahallin aspenopenjetsunlockinglhcdataforfoundationmodelsinparticlephysics AT gregorkasieczka aspenopenjetsunlockinglhcdataforfoundationmodelsinparticlephysics AT michaelkramer aspenopenjetsunlockinglhcdataforfoundationmodelsinparticlephysics AT ianpang aspenopenjetsunlockinglhcdataforfoundationmodelsinparticlephysics AT humbertoreyesgonzalez aspenopenjetsunlockinglhcdataforfoundationmodelsinparticlephysics AT davidshih aspenopenjetsunlockinglhcdataforfoundationmodelsinparticlephysics |