Enhancing Fault Tolerance in High-Performance Computing: A Real Hardware Case Study on a RISC-V Vector Processing Unit

High-Performance Computing (HPC) systems are designed for large-scale processing and complex dataset analysis leveraging scalability, efficiency, and parallelism, often integrating specialized hardware structures such as Vector Processing Units (VPUs). As these systems have grown in complexity and s...

Bibliographic Details
Main Authors: Marcello Barbirotta, Francesco Minervini, Carlos Rojas Morales, Adrian Cristal, Osman Unsal, Mauro Olivieri
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Open Journal of the Computer Society
Subjects: Fault injection, fault tolerance, high-performance computing, RISC-V, vector processing unit
Online Access: https://ieeexplore.ieee.org/document/10694791/
author Marcello Barbirotta
Francesco Minervini
Carlos Rojas Morales
Adrian Cristal
Osman Unsal
Mauro Olivieri
author_sort Marcello Barbirotta
collection DOAJ
description High-Performance Computing (HPC) systems are designed for large-scale processing and complex dataset analysis leveraging scalability, efficiency, and parallelism, often integrating specialized hardware structures such as Vector Processing Units (VPUs). As these systems have grown in complexity and scale, their vulnerability to errors and failures has become an important and complex issue in the HPC world. Our research addresses this challenge by exploring and implementing advanced fault tolerance techniques inside the Vitruvius+ architecture, a partial out-of-order Vector Processing Unit. To the best of our knowledge, this is the first full RTL-level implementation of instruction replication in an HPC-class vector processor for reliability. Specifically, we investigate the integration and interaction of redundancy mechanisms inside the most sensitive architectural units, obtaining a reduction of 75% in non-silent faults causing system failure, proven by an extensive fault injection simulation campaign, with a hardware overhead of only 7.5% and a negligible variation in clock frequency.
format Article
id doaj-art-26116412d4934c9595d32832eb3f552f
institution Kabale University
issn 2644-1268
language English
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Open Journal of the Computer Society
spelling doaj-art-26116412d4934c9595d32832eb3f552f
2024-11-19T00:03:45Z
eng
IEEE
IEEE Open Journal of the Computer Society
2644-1268
2024-01-01
5
553
565
10.1109/OJCS.2024.3468895
10694791
Enhancing Fault Tolerance in High-Performance Computing: A Real Hardware Case Study on a RISC-V Vector Processing Unit
Marcello Barbirotta 0 https://orcid.org/0000-0002-1902-7188
Francesco Minervini 1 https://orcid.org/0000-0001-8558-5690
Carlos Rojas Morales 2 https://orcid.org/0000-0002-7714-0277
Adrian Cristal 3 https://orcid.org/0000-0003-1277-9296
Osman Unsal 4 https://orcid.org/0000-0002-0544-9697
Mauro Olivieri 5 https://orcid.org/0000-0002-0214-9904
Department of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome, Via Eudossiana, Italy
Barcelona Supercomputing Center, Barcelona, Spain
Barcelona Supercomputing Center, Barcelona, Spain
Barcelona Supercomputing Center, Barcelona, Spain
Barcelona Supercomputing Center, Barcelona, Spain
Department of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome, Via Eudossiana, Italy
High-Performance Computing (HPC) systems are designed for large-scale processing and complex dataset analysis leveraging scalability, efficiency, and parallelism, often integrating specialized hardware structures such as Vector Processing Units (VPUs). As these systems have grown in complexity and scale, their vulnerability to errors and failures has become an important and complex issue in the HPC world. Our research addresses this challenge by exploring and implementing advanced fault tolerance techniques inside the Vitruvius+ architecture, a partial out-of-order Vector Processing Unit. To the best of our knowledge, this is the first full RTL-level implementation of instruction replication in an HPC-class vector processor for reliability. Specifically, we investigate the integration and interaction of redundancy mechanisms inside the most sensitive architectural units, obtaining a reduction of 75% in non-silent faults causing system failure, proven by an extensive fault injection simulation campaign, with a hardware overhead of only 7.5% and a negligible variation in clock frequency.
https://ieeexplore.ieee.org/document/10694791/
Fault injection
fault tolerance
high-performance computing
RISC-V
vector processing unit
title Enhancing Fault Tolerance in High-Performance Computing: A Real Hardware Case Study on a RISC-V Vector Processing Unit
topic Fault injection
fault tolerance
high-performance computing
RISC-V
vector processing unit
url https://ieeexplore.ieee.org/document/10694791/