Abstract
This paper presents faster implementations of the lattice-based schemes Dilithium and Kyber on the Cortex-M4. Dilithium is one of three signature finalists in the NIST post-quantum project (NIST PQC), while Kyber is one of four key-encapsulation mechanism (KEM) finalists.
Our optimizations affect the core polynomial arithmetic involving number-theoretic transforms in both schemes. Our main contributions are threefold: We present a faster signed Barrett reduction for Kyber, propose to switch to a smaller prime modulus for the polynomial multiplications cs(1) and cs(2) in the signing procedure of Dilithium, and apply various known optimizations to the polynomial arithmetic in both schemes. Using a smaller prime modulus is particularly interesting as it allows using the Fermat number transform resulting in especially fast code.
We outperform the state-of-the-art for both Dilithium and Kyber. For Dilithium, our NTT and iNTT are faster by 5.2% and 5.7%. Switching to a smaller modulus results in speed-up of 33.1%-37.6% for the relevant operations (sum of the base multiplication and iNTT) in the signing procedure. For Kyber, the optimizations results in 15.9%-17.8% faster matrix-vector product which is a core arithmetic operation in Kyber.