High-Efficiency Multi-Standard Polynomial Multiplication Accelerator on RISC-V SoC for Post-Quantum Cryptography
| Main Authors: | , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2024-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10811006/ |
| Summary: | Number Theoretic Transform (NTT) enables speeding up polynomial multiplications, thereby accelerating the implementation of lattice-based post-quantum cryptography (PQC) algorithms. Currently, the standardized PQC algorithms FIPS 203 (CRYSTALS-Kyber), FIPS 204 (CRYSTALS-Dilithium), and the one in the process of being standardized FIPS 206 (FALCON) all use the NTT to perform polynomial multiplication. This paper proposes a high-speed, low-complexity, and run-time configurable accelerator that supports all three standards. Firstly, we propose a unified design using four parallel radix-2 butterflies targeting a high-speed polynomial multiplier. With a unified design, the accelerator performs NTT, inverse NTT (INTT), point-wise multiplication (PWM), and matrix-vector polynomial multiplication. Secondly, we propose a compact, configurable reordering unit for effective coefficient processing in high-parallelism. As a bonus, the required memory size is minimal, and the memory access pattern is straightforward. Finally, we present a RISC-V SoC architecture with a loosely coupled accelerator through register-map communication and the data flow to accelerate NTT-based operation in software. The FPGA implementation results show that the achieved speed for NTT/INTT/PWM executions is 224/224/64 clock cycles (CCs) for Kyber, 512/512/128 CCs for Dilithium, 576/576/128 CCs for FALCON-512, and 1280/1280/256 CCs for FALCON-1024, respectively. The Area × Time Product (ATP) results also show superiority over other algorithm-specific and configurable designs, achieving improvement up to 82%, 63%, 79%, and 50% for Kyber, Dilithium, FALCON-512, and FALCON-1024, respectively. The SoC implementation results show that the NTT-based operations have improved by up to 5.29×, 27.49×, 56.79×, and 58.91× in software; and speed-up up to 10.53×, 9.81×, 9.57×, and 9.99× for the considered algorithms compared to previous SW/HW works on RISC-V platforms. |
| ISSN: | 2169-3536 |
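
The NTT, PWM, and INTT steps named in the summary can be illustrated with a small software sketch. This is not the paper's accelerator or its data flow: it is a minimal, assumed model of NTT-based multiplication in Z_q[x]/(x^n + 1) built from radix-2 butterflies. The function names (`ntt`, `negacyclic_mul`, `schoolbook`) and the toy parameters n = 8, q = 17, psi = 3 are hypothetical choices for illustration only. The real schemes use much larger parameters, and Kyber in particular uses an incomplete NTT that this sketch does not model.

```python
def bit_reverse(a):
    """Return a copy of `a` permuted into bit-reversed index order."""
    out = list(a)
    n = len(out)
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            out[i], out[j] = out[j], out[i]
    return out


def ntt(a, omega, q):
    """Cyclic length-n NTT using radix-2 decimation-in-time butterflies.

    `omega` must be a primitive n-th root of unity modulo q.
    """
    a = bit_reverse(a)
    n = len(a)
    length = 2
    while length <= n:
        half = length // 2
        w_len = pow(omega, n // length, q)  # twiddle step for this stage
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                # One radix-2 butterfly: (u, v) -> (u + w*v, u - w*v)
                u = a[start + k]
                v = a[start + k + half] * w % q
                a[start + k] = (u + v) % q
                a[start + k + half] = (u - v) % q
                w = w * w_len % q
        length *= 2
    return a


def negacyclic_mul(f, g, q, psi):
    """Multiply f and g in Z_q[x]/(x^n + 1) via NTT -> PWM -> INTT.

    `psi` must be a primitive 2n-th root of unity modulo q (assumed input).
    """
    n = len(f)
    omega = psi * psi % q
    # Pre-scale by powers of psi so a cyclic NTT yields a negacyclic product.
    fw = [f[i] * pow(psi, i, q) % q for i in range(n)]
    gw = [g[i] * pow(psi, i, q) % q for i in range(n)]
    F, G = ntt(fw, omega, q), ntt(gw, omega, q)
    H = [x * y % q for x, y in zip(F, G)]  # point-wise multiplication (PWM)
    h = ntt(H, pow(omega, -1, q), q)       # inverse transform, unscaled
    n_inv, psi_inv = pow(n, -1, q), pow(psi, -1, q)
    return [h[i] * n_inv * pow(psi_inv, i, q) % q for i in range(n)]


def schoolbook(f, g, q):
    """Reference O(n^2) negacyclic multiplication, used only as a check."""
    n = len(f)
    r = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                r[k] = (r[k] + f[i] * g[j]) % q
            else:
                r[k - n] = (r[k - n] - f[i] * g[j]) % q  # x^n = -1
    return r


f = [1, 2, 3, 4, 5, 6, 7, 8]
g = [8, 7, 6, 5, 4, 3, 2, 1]
assert negacyclic_mul(f, g, q=17, psi=3) == schoolbook(f, g, q=17)
```

The modular inverses use `pow(x, -1, q)`, which requires Python 3.8 or later. The inner loop body is the radix-2 butterfly; a hardware design such as the one summarized above would evaluate several of these per clock cycle instead of one at a time.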