VeloFHE: GPU Acceleration for FHEW and TFHE Bootstrapping

Bit-wise Fully Homomorphic Encryption schemes like FHEW and TFHE offer efficient functional bootstrapping, enabling concurrent function evaluation and noise reduction. While advantageous for secure computations, these schemes suffer from high data expansion, posing significant performance challenge...

Bibliographic Details
Main Authors: Shiyu Shen, Hao Yang, Zhe Liu, Ying Liu, Xianhui Lu, Wangchen Dai, Lu Zhou, Yunlei Zhao, Ray C. C. Cheung
Format: Article
Language: English
Published: Ruhr-Universität Bochum 2025-06-01
Series: Transactions on Cryptographic Hardware and Embedded Systems
Subjects: Fully Homomorphic Encryption; Bootstrapping; FHEW; TFHE; GPU acceleration
Online Access: https://tches.iacr.org/index.php/TCHES/article/view/12211
_version_ 1850233313774534656
author Shiyu Shen
Hao Yang
Zhe Liu
Ying Liu
Xianhui Lu
Wangchen Dai
Lu Zhou
Yunlei Zhao
Ray C. C. Cheung
author_sort Shiyu Shen
collection DOAJ
description Bit-wise Fully Homomorphic Encryption schemes like FHEW and TFHE offer efficient functional bootstrapping, enabling concurrent function evaluation and noise reduction. While advantageous for secure computations, these schemes suffer from high data expansion, posing significant performance challenges in practical applications due to massive ciphertexts. To address these issues, we propose VeloFHE, a CUDA-accelerated design to enhance the efficiency of FHEW and TFHE schemes on GPUs. We develop a novel hybrid four-step Number Theoretic Transform (NTT) approach for fast polynomial multiplication. By decomposing large-scale NTTs into highly parallelizable submodules, incorporating cyclic and negacyclic convolutions, and introducing several memory-oriented optimizations, we significantly reduce both the computational complexity and memory requirements. For blind rotation, besides the gadget decomposition approach, we also apply a recently proposed modulus raising technique to both schemes to alleviate memory pressure. We further optimize it by refining the computational flow to reduce noise from scaling and maintain accumulator compatibility. For key switching, we address input-output parallelism mismatches and offload suitable computations to the CPU, effectively hiding latency through asynchronous execution. Additionally, we explore batching in bootstrapping, developing a general framework that accommodates both schemes with either the gadget decomposition or the modulus raising method. Our experimental results demonstrate significant performance improvements. The proposed NTT implementation shows over 35% improvement compared to recent GPU implementations. On an RTX 4090 GPU, we achieve speedups of 371.86x and 390.44x for FHEW and TFHE gate bootstrapping, respectively, compared to OpenFHE running on a 48-thread CPU at a 128-bit security level. The corresponding throughputs are 7,007 and 11,378 operations per second.
Furthermore, relative to the state-of-the-art GPU implementation [XLK+25], our approach provides speedups of 2.56x, 2.24x, and 2.33x for TFHE gate bootstrapping, homomorphic evaluation of arbitrary functions, and the homomorphic flooring operation, respectively. VeloFHE surpasses some current hardware designs, offering an effective solution for more practical and efficient privacy-preserving computations.
format Article
id doaj-art-d79068f1abab4e38b98704e96f510fd6
institution OA Journals
issn 2569-2925
language English
publishDate 2025-06-01
publisher Ruhr-Universität Bochum
record_format Article
series Transactions on Cryptographic Hardware and Embedded Systems
spelling doaj-art-d79068f1abab4e38b98704e96f510fd6
2025-08-20T02:02:57Z
eng
Ruhr-Universität Bochum
Transactions on Cryptographic Hardware and Embedded Systems
2569-2925
2025-06-01
2025
3
10.46586/tches.v2025.i3.81-114
VeloFHE: GPU Acceleration for FHEW and TFHE Bootstrapping
Shiyu Shen [0]
Hao Yang [1]
Zhe Liu [2]
Ying Liu [3]
Xianhui Lu [4]
Wangchen Dai [5]
Lu Zhou [6]
Yunlei Zhao [7]
Ray C. C. Cheung [8]
City University of Hong Kong, Hong Kong, China
City University of Hong Kong, Hong Kong, China
Zhejiang Lab, Hangzhou, China
Key Laboratory of Cyberspace Security Defense, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Key Laboratory of Cyberspace Security Defense, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Sun Yat-sen University, Shenzhen, China
Nanjing University of Aeronautics and Astronautics, Nanjing, China
Fudan University, Shanghai, China
City University of Hong Kong, Hong Kong, China
https://tches.iacr.org/index.php/TCHES/article/view/12211
Fully Homomorphic Encryption
Bootstrapping
FHEW
TFHE
GPU acceleration
title VeloFHE: GPU Acceleration for FHEW and TFHE Bootstrapping
topic Fully Homomorphic Encryption
Bootstrapping
FHEW
TFHE
GPU acceleration
url https://tches.iacr.org/index.php/TCHES/article/view/12211