Enhancing Fault Tolerance in High-Performance Computing: A Real Hardware Case Study on a RISC-V Vector Processing Unit
High-Performance Computing (HPC) systems are designed for large-scale processing and complex dataset analysis leveraging scalability, efficiency, and parallelism, often integrating specialized hardware structures such as Vector Processing Units (VPUs). As these systems have grown in complexity and s...
| Main Authors: | Marcello Barbirotta, Francesco Minervini, Carlos Rojas Morales, Adrian Cristal, Osman Unsal, Mauro Olivieri |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2024-01-01 |
| Series: | IEEE Open Journal of the Computer Society |
| Subjects: | Fault injection; fault tolerance; high-performance computing; RISC-V; vector processing unit |
| Online Access: | https://ieeexplore.ieee.org/document/10694791/ |
| _version_ | 1846163744193773568 |
|---|---|
| author | Marcello Barbirotta, Francesco Minervini, Carlos Rojas Morales, Adrian Cristal, Osman Unsal, Mauro Olivieri |
| author_sort | Marcello Barbirotta |
| collection | DOAJ |
| description | High-Performance Computing (HPC) systems are designed for large-scale processing and complex dataset analysis leveraging scalability, efficiency, and parallelism, often integrating specialized hardware structures such as Vector Processing Units (VPUs). As these systems have grown in complexity and scale, their vulnerability to errors and failures has become an important and complex issue in the HPC world. Our research addresses this challenge by exploring and implementing advanced fault tolerance techniques inside the Vitruvius+ architecture, a partial out-of-order Vector Processing Unit. To the best of our knowledge, this is the first full RTL-level implementation of instruction replication in an HPC-class vector processor for reliability. Specifically, we investigate the integration and interaction of redundancy mechanisms inside the most sensitive architectural units, obtaining a reduction of 75% in non-silent faults causing system failure, proven by an extensive fault injection simulation campaign, with a hardware overhead of only 7.5% and a negligible variation in clock frequency. |
| format | Article |
| id | doaj-art-26116412d4934c9595d32832eb3f552f |
| institution | Kabale University |
| issn | 2644-1268 |
| language | English |
| publishDate | 2024-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Open Journal of the Computer Society |
| spelling | doaj-art-26116412d4934c9595d32832eb3f552f; 2024-11-19T00:03:45Z; eng; IEEE; IEEE Open Journal of the Computer Society; ISSN 2644-1268; 2024-01-01; vol. 5, pp. 553-565; DOI 10.1109/OJCS.2024.3468895; IEEE document 10694791. Title: Enhancing Fault Tolerance in High-Performance Computing: A Real Hardware Case Study on a RISC-V Vector Processing Unit. Authors: Marcello Barbirotta (https://orcid.org/0000-0002-1902-7188; Department of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome, Via Eudossiana, Italy); Francesco Minervini (https://orcid.org/0000-0001-8558-5690; Barcelona Supercomputing Center, Barcelona, Spain); Carlos Rojas Morales (https://orcid.org/0000-0002-7714-0277; Barcelona Supercomputing Center, Barcelona, Spain); Adrian Cristal (https://orcid.org/0000-0003-1277-9296; Barcelona Supercomputing Center, Barcelona, Spain); Osman Unsal (https://orcid.org/0000-0002-0544-9697; Barcelona Supercomputing Center, Barcelona, Spain); Mauro Olivieri (https://orcid.org/0000-0002-0214-9904; Department of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome, Via Eudossiana, Italy). URL: https://ieeexplore.ieee.org/document/10694791/. Keywords: Fault injection; fault tolerance; high-performance computing; RISC-V; vector processing unit |
| title | Enhancing Fault Tolerance in High-Performance Computing: A Real Hardware Case Study on a RISC-V Vector Processing Unit |
| topic | Fault injection; fault tolerance; high-performance computing; RISC-V; vector processing unit |
| url | https://ieeexplore.ieee.org/document/10694791/ |
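
The description above mentions instruction replication and a fault injection campaign. As a rough illustration only, the following Python sketch models the general idea in software: an element-wise vector add is executed twice, the two results are compared, and a single bit flip is injected into one copy. Every name here (`flip_bit`, `vector_add`, `replicated_add`, `campaign`), the 64-bit lane width, and the re-execute-on-mismatch recovery are assumptions made for this example; none of it is taken from the Vitruvius+ RTL or from the article's methodology.

```python
import random


def flip_bit(value: int, bit: int, width: int = 64) -> int:
    """Model a single-event upset by inverting one bit of a width-bit word."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def vector_add(a, b, width=64, fault=None):
    """Element-wise add modulo 2**width.

    `fault` is an optional (lane, bit) pair: a hypothetical upset
    injected into the result of one lane.
    """
    mask = (1 << width) - 1
    out = [(x + y) & mask for x, y in zip(a, b)]
    if fault is not None:
        lane, bit = fault
        out[lane] = flip_bit(out[lane], bit, width)
    return out


def replicated_add(a, b, fault=None):
    """Execute the instruction twice and compare the copies.

    The fault, if any, hits only the first copy. On a mismatch the
    instruction is re-executed (an assumed recovery policy).
    Returns (result, detected).
    """
    first = vector_add(a, b, fault=fault)
    second = vector_add(a, b)
    if first != second:
        return vector_add(a, b), True
    return first, False


def campaign(trials=1000, lanes=8, width=64, seed=0):
    """Inject one random bit flip per trial; return the detected fraction."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(trials):
        a = [rng.getrandbits(width) for _ in range(lanes)]
        b = [rng.getrandbits(width) for _ in range(lanes)]
        fault = (rng.randrange(lanes), rng.randrange(width))
        result, flagged = replicated_add(a, b, fault)
        # The recovered result must match the fault-free reference.
        assert result == vector_add(a, b)
        detected += flagged
    return detected / trials


if __name__ == "__main__":
    print(campaign())
```

In this toy model every injected flip lands in a compared result, so all of them are detected by construction. It does not reproduce the 75% reduction or the 7.5% hardware overhead reported in the description, which come from the authors' RTL-level fault injection campaign on real hardware structures.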