High-Efficiency Multi-Standard Polynomial Multiplication Accelerator on RISC-V SoC for Post-Quantum Cryptography

Bibliographic Details
Main Authors: Duc-Thuan Dam, Trong-Hung Nguyen, Thai-Ha Tran, Duc-Hung Le, Trong-Thuc Hoang, Cong-Kha Pham
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10811006/
Description
Summary: The Number Theoretic Transform (NTT) speeds up polynomial multiplication, thereby accelerating implementations of lattice-based post-quantum cryptography (PQC) algorithms. The standardized PQC algorithms FIPS 203 (CRYSTALS-Kyber) and FIPS 204 (CRYSTALS-Dilithium), as well as FIPS 206 (FALCON), which is still in the process of being standardized, all use the NTT to perform polynomial multiplication. This paper proposes a high-speed, low-complexity, run-time configurable accelerator that supports all three standards. First, we propose a unified design that uses four parallel radix-2 butterflies to build a high-speed polynomial multiplier. With this unified design, the accelerator performs NTT, inverse NTT (INTT), point-wise multiplication (PWM), and matrix-vector polynomial multiplication. Second, we propose a compact, configurable reordering unit for effective coefficient processing at high parallelism. As a side benefit, the required memory size is minimal and the memory access pattern is straightforward. Finally, we present a RISC-V SoC architecture with a loosely coupled accelerator that communicates through a register map, together with the data flow used to accelerate NTT-based operations in software. The FPGA implementation results show that NTT/INTT/PWM executions take 224/224/64 clock cycles (CCs) for Kyber, 512/512/128 CCs for Dilithium, 576/576/128 CCs for FALCON-512, and 1280/1280/256 CCs for FALCON-1024, respectively. The Area × Time Product (ATP) results also show superiority over other algorithm-specific and configurable designs, with improvements of up to 82%, 63%, 79%, and 50% for Kyber, Dilithium, FALCON-512, and FALCON-1024, respectively.
The SoC implementation results show that NTT-based operations in software improve by up to 5.29×, 27.49×, 56.79×, and 58.91×, and that the considered algorithms achieve speed-ups of up to 10.53×, 9.81×, 9.57×, and 9.99× compared to previous SW/HW works on RISC-V platforms.
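Note: the operations named in the summary (NTT, INTT, PWM, radix-2 butterflies) can be illustrated with a minimal software sketch. The code below is not taken from the paper. It assumes a toy ring Z_17[x]/(x^8 + 1) with the 16th root of unity psi = 3, and all function names are hypothetical. It multiplies two polynomials by pre-scaling, a forward NTT, point-wise multiplication, an inverse NTT, and post-scaling. The helper `butterfly_cycles` is a rough estimate, also an assumption, that divides the number of radix-2 butterflies by four parallel lanes.

```python
Q = 17    # toy modulus (assumption; the real schemes use larger primes)
PSI = 3   # primitive 16th root of unity mod 17, so PSI**8 == -1 (mod 17)


def bit_reverse(a):
    """Return a copy of `a` permuted into bit-reversed index order."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [0] * n
    for i, x in enumerate(a):
        out[int(format(i, f"0{bits}b")[::-1], 2)] = x
    return out


def ntt(a, omega, q):
    """Iterative radix-2 cyclic NTT; omega is a primitive len(a)-th root."""
    a = bit_reverse(a)
    n = len(a)
    length = 2
    while length <= n:
        half = length // 2
        w_len = pow(omega, n // length, q)
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                # One radix-2 butterfly: (u, v) -> (u + w*v, u - w*v)
                u = a[start + k]
                v = a[start + k + half] * w % q
                a[start + k] = (u + v) % q
                a[start + k + half] = (u - v) % q
                w = w * w_len % q
        length *= 2
    return a


def poly_mul_ntt(a, b, psi, q):
    """Multiply a and b modulo (x^n + 1, q) using NTT, PWM, and INTT."""
    n = len(a)
    omega = psi * psi % q
    # Pre-scale by powers of psi to turn the negacyclic product into a cyclic one
    a_hat = ntt([x * pow(psi, i, q) % q for i, x in enumerate(a)], omega, q)
    b_hat = ntt([x * pow(psi, i, q) % q for i, x in enumerate(b)], omega, q)
    c_hat = [x * y % q for x, y in zip(a_hat, b_hat)]  # point-wise multiplication
    c = ntt(c_hat, pow(omega, -1, q), q)               # inverse NTT, unscaled
    n_inv, psi_inv = pow(n, -1, q), pow(psi, -1, q)
    return [x * n_inv * pow(psi_inv, i, q) % q for i, x in enumerate(c)]


def poly_mul_schoolbook(a, b, q):
    """Reference O(n^2) multiplication modulo (x^n + 1, q)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q  # x^n = -1
    return c


def butterfly_cycles(n, layers, lanes=4):
    """Rough estimate: (n/2 butterflies per layer) * layers / parallel lanes."""
    return (n // 2) * layers // lanes
```

Calling `poly_mul_ntt(a, b, PSI, Q)` on two length-8 coefficient lists should agree with `poly_mul_schoolbook(a, b, Q)`. Under this simple model, `butterfly_cycles(256, 7)`, `butterfly_cycles(512, 9)`, and `butterfly_cycles(1024, 10)` give 224, 576, and 1280, which coincide with the NTT cycle counts reported for Kyber, FALCON-512, and FALCON-1024. The same estimate gives 256 for Dilithium (n = 256, 8 layers) rather than the reported 512, and the summary does not explain the difference, so the model should be read only as a plausibility check.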
ISSN:2169-3536