High Performance of Hash-based Signature Schemes

Hash-based signature schemes, whose security is based on properties of the underlying hash functions, are promising candidates for quantum-safe digital signature schemes. In this work, we present a software implementation of two recent standard proposals for hash-based signature schemes, the Leighton-Micali Signature (LMS) scheme and the Extended Merkle Signature Scheme (XMSS), using the AVX2 instruction set on Intel processors. The implementation uses several optimization techniques to speed up the underlying hash functions SHA2 or SHA3 and other building-block functions, which leads to high performance for signature operations in both schemes. On an Intel Skylake processor, using a tree of height 60 with 12 layers, the signing operation for XMSS takes 3,841,199 cycles (1,043 signatures per second) at the 128-bit security level (against quantum attacks). For equivalent security, the LMS system computes a signature in 1,307,376 cycles (3,065 signatures per second). We also provide the first comparative performance results for signing and verification of both schemes using different parameters. The results of our implementation indicate that both LMS and XMSS can achieve high performance using vector instructions on modern processors.
Keywords: post-quantum cryptography; digital signature; Merkle signature; LMS; XMSS


I. INTRODUCTION
A digital signature scheme is an important cryptographic tool for public-key cryptography. Digital signature schemes are widely used for providing authenticity, integrity, and non-repudiation of data. Nowadays, the most commonly used digital signature schemes are ECDSA [1], RSA [2] and DSA [3]. The security of these schemes rests on the difficulty of factoring large integers or computing discrete logarithms. In [4], Shor introduced a polynomial-time quantum algorithm for factoring and computing discrete logarithms. Thus, digital signature schemes that can resist attacks by quantum computers are an active area of research.
A One-Time Signature (OTS) scheme allows using a key pair to sign exactly one message [5]. These schemes are inadequate for most practical situations since each key pair is used only for a single signature. In [6], Merkle proposed N-Time Signature (NTS) schemes, which are built out of one-time signature schemes. The Merkle Signature Scheme (MSS) makes one-time signatures practical by combining N = 2^h of them in a single structure, which is a complete binary tree of height h. These systems have regained interest since 2006 because of their resistance against quantum-computer-aided attacks. Since the security of these schemes is based on the underlying cryptographic hash function, they are called hash-based signature schemes.
Independently of the actual realization of quantum computing, governmental and standardization organizations are encouraging the transition to post-quantum cryptography, i.e., cryptographic schemes not known to be vulnerable to quantum computer attacks [7]. Standardization efforts are under way; for example, the National Institute of Standards and Technology (NIST) is now accepting submissions for quantum-resistant public-key cryptographic algorithms [8].
Several MSS variants have been proposed. An improved Merkle signature scheme (CMSS) [9] builds two chained trees, allowing the signature of 2^40 messages, and also reduces the runtime of key pair and signature generation. The Merkle signatures with virtually unlimited signature capacity (GMSS) [10] allow signing a very large number of messages (2^80) with one key pair. XMSS [11] introduced a signature scheme with minimal security requirements. The hierarchical hash-based signature scheme XMSS^MT [12] allows signing a large but fixed number of messages. SPHINCS [13] is a practical stateless hash-based signature scheme and introduces a new method to randomize tree-based stateless signatures. SPHINCS has significantly larger signatures, which could make it impractical in some scenarios. In 2016, the work in [7] analyzed state management in N-time hash-based schemes and proposed a hybrid stateless/stateful scheme that protects against unintentional copies of the private-key state and has smaller and faster signatures.
There are two proposals for standards of hash-based signature schemes: the first one [14] describes LMS, an adaptation of the Lamport-Diffie-Winternitz-Merkle one-time signature scheme [15]. The second one [16] describes XMSS. Therefore, the design and efficient implementation of secure and practical digital signature schemes are crucial for applications that require data integrity assurance and data origin authentication.
Our Contribution. In this work, we present a software implementation of two recent standard proposals for hash-based signature schemes, LMS and XMSS, using the Intel AVX2 vector instruction set. We use parallel optimization techniques to improve the performance of the underlying hash functions SHA2 and SHA3. We also show how to speed up the main building blocks of LMS and XMSS by taking advantage of the fastest implementations of SHA2 and SHA3. We provide a comparative performance analysis of both schemes.
Organization. The rest of this paper is organized as follows. We describe the Winternitz One-Time Signature schemes (WOTS and WOTS+) in Section II, MSS and XMSS in Section III, the Hierarchical Signatures Scheme (HSS) in Section IV, and the LMS and XMSS drafts in Section V. We present the target microarchitectures in Section VI. We discuss the software optimizations in Sections VII and VIII. In Section IX we show the performance results. In Section X we present the conclusions.

II. WINTERNITZ ONE-TIME SIGNATURE (WOTS) AND (WOTS+)
OTS schemes are used to validate the authenticity of a message by associating a secret private key with a shared public key [14]. In these one-time signatures, each private key must be used only once to sign a given message. As part of the signing process, a digest of the original message is computed using a cryptographic hash function H, and the resulting digest is signed. WOTS [15] is a modification of the Lamport One-Time Signature (LOTS) [5]. WOTS uses a parameter w, which is the number of bits to be signed simultaneously. This scheme produces smaller signatures than Lamport, but increases the number of one-way function evaluations from 1 to 2^w − 1 for each element of the signing key. Hülsing [17] proposed WOTS+, a modification of WOTS that uses a chaining function f^e starting from random inputs. This modification eliminates the requirement to use a collision-resistant hash function.
The WOTS chaining function applies the one-way function f to its input e times in succession. Similarly, the WOTS+ chaining function f^e computes e iterations of f_K on the inputs key K ∈ {0, 1}^n chosen randomly, x ∈ {0, 1}^n and bitmask bm = (bm_1, ..., bm_{w−1}) chosen randomly, with e ∈ N; in each iteration the intermediate value is XORed with the corresponding bitmask before f_K is applied. These schemes are parameterized by a security parameter n and the Winternitz parameter w ∈ N, for w > 1. The values n and w are used to compute len (the number of elements of the signature), where len = len_1 + len_2.
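The two chaining functions can be sketched as follows. This is a minimal illustration rather than the implementation evaluated in this work: the use of SHA-256 over K || x as the keyed function f_K, and the names `wots_chain`, `wots_plus_chain` and `wots_len`, are assumptions made for the example. The length computation follows the usual WOTS+ convention in which w is the base of the digit representation.

```python
import hashlib
import math

def f(key: bytes, x: bytes) -> bytes:
    """Stand-in for the keyed function f_K (assumption: SHA-256 over key || x)."""
    return hashlib.sha256(key + x).digest()

def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length strings."""
    return bytes(p ^ q for p, q in zip(a, b))

def wots_chain(x: bytes, e: int, key: bytes = b"") -> bytes:
    """Plain WOTS chain: apply the one-way function e times."""
    for _ in range(e):
        x = f(key, x)
    return x

def wots_plus_chain(x: bytes, e: int, key: bytes, bm: list) -> bytes:
    """WOTS+ chain: XOR the i-th bitmask in before the i-th application of f_K."""
    for i in range(e):
        x = f(key, xor(x, bm[i]))
    return x

def wots_len(n_bits: int, w: int):
    """Return (len_1, len_2, len) for an n_bits-bit digest and Winternitz base w."""
    len_1 = math.ceil(n_bits / math.log2(w))
    len_2 = math.floor(math.log2(len_1 * (w - 1)) / math.log2(w)) + 1
    return len_1, len_2, len_1 + len_2
```

With an all-zero bitmask the WOTS+ chain collapses to the plain chain under the same key, which is a convenient sanity check when experimenting with the sketch.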

III. MERKLE SIGNATURE SCHEME (MSS)
MSS [15] is a digital signature scheme that consists of three algorithms: key generation, signing and verification. This scheme constructs a binary tree where the leaves are the verification keys, and the public key is the root of the tree. This key pair can sign/verify messages. A tree of height h with 2^h leaves will have 2^h one-time key pairs. The digest of the one-time verification public key, g(pk_0 || ... || pk_{t−1}), will be a leaf of the Merkle tree.

A. MSS key pair generation
First, the signer must select the tree height h ∈ N, h ≥ 2. Merkle uses a cryptographic hash function g : {0, 1}* → {0, 1}^n, where n is a positive integer. The treehash algorithm [15] is used to generate the public key, which is the root of the tree. The authentication path (Aut) is formed by the right sibling nodes on the path connecting the leaf up to the tree root, and it is used to validate the public key. Aut is saved during the execution of the treehash algorithm.

B. MSS signature generation
The signature generation consists of two steps. First, the signature of the message digest g(M) is generated using the WOTS signature algorithm and the corresponding secret key sk_s of the leaf s. The signature SIG = (s, sig_s, Aut) then contains the index of the leaf s, the WOTS signature sig_s, and the authentication path Aut. In the second step, the next authentication path Aut is generated. This step can be done efficiently with the algorithm proposed in [18], which is a modification of the classic authentication path algorithm proposed by Merkle [6].

C. MSS verification
The signature verification consists of two steps. First, the signature sig_s is used to recover a leaf of the tree. Second, the public key of the Merkle tree is validated in the following way. The receiver reconstructs the path (p_0, ..., p_h) from leaf s to the root. The index s is used to decide the order in which the authentication path is reconstructed. Initially, p_0 = Y_s. For each i = 1, 2, ..., h, p_i is computed recursively from p_{i−1} and the authentication path node at height i − 1: the condition ⌊s/2^{i−1}⌋ ≡ 1 mod 2 determines whether p_{i−1} is the right or the left input of g. Finally, if the value p_h is equal to the public key pub, the signature is valid.
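The path reconstruction can be sketched as follows, under stated assumptions: g is instantiated with SHA-256, the tree is built naively instead of with treehash, and the helper names `build_tree`, `auth_path` and `root_from_path` are hypothetical. The loop index i starts at 0, so the test on bit i of s plays the role of the condition ⌊s/2^{i−1}⌋ ≡ 1 mod 2 in the text.

```python
import hashlib

def g(data: bytes) -> bytes:
    """Tree hash function (assumption: SHA-256)."""
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    """Return all levels of the Merkle tree, leaves first.
    The number of leaves must be a power of two."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([g(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def auth_path(levels, s):
    """Sibling of the node on the leaf-to-root path, for each height below the root."""
    return [levels[i][(s >> i) ^ 1] for i in range(len(levels) - 1)]

def root_from_path(leaf, s, aut):
    """Recompute the root from leaf s and its authentication path."""
    p = leaf
    for i, sibling in enumerate(aut):
        if (s >> i) & 1:
            p = g(sibling + p)  # p is a right child, so the sibling goes on the left
        else:
            p = g(p + sibling)  # p is a left child, so the sibling goes on the right
    return p
```

A verifier accepts when `root_from_path` returns the stored public key; any change to the leaf, the index or a path node changes the recomputed root with overwhelming probability.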

D. Extended Merkle Signature Scheme (XMSS)
XMSS [11] is a modification of MSS. This scheme uses a slightly modified version of the Winternitz scheme WOTS+ described in Section II. XMSS is provably forward-secure and efficient when instantiated with two secure and efficient function families: a second-preimage resistant hash function family G_n and a pseudorandom function family F_n. The parameters of XMSS are: n ∈ N, the security parameter; w ∈ N (w > 1), the Winternitz parameter; m ∈ N, the message digest length; and h ∈ N, the height of the binary tree.
An XMSS binary tree is constructed to generate the public key pub. The XMSS tree is a modification of the Merkle tree. A tree of height h has h + 1 levels. The nodes on level j are node_{i,j}, for 0 < j ≤ h and 0 ≤ i < 2^{h−j}. XMSS uses the hash function g_K and a bitmask (bitmaskTree) bm ∈ {0, 1}^{2n}, chosen uniformly at random, where bm_{2i+2j} is the left bitmask and bm_{2i+2j+1} is the right bitmask. The bitmasks are the main difference with respect to other Merkle tree constructions, since they allow replacing the collision-resistant hash function family with a second-preimage resistant hash function family [11]. The nodes are computed as: node_{i,j} = g_K((node_{2i,j−1} ⊕ bm_{2i+2j}) || (node_{2i+1,j−1} ⊕ bm_{2i+2j+1})).
To generate a leaf in the XMSS tree, an L-tree is used. The L-tree [11] uses bitmasks in the same form as in the XMSS tree. The WOTS+ public verification keys (pk_0, ..., pk_{len−1}) are the first len leaves of an L-tree. If len is not a power of 2, then there are not enough leaves to build a complete binary tree. Therefore, a node that has no right sibling is lifted to a higher level of the L-tree until it becomes the right sibling of another node.
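The lifting rule for unpaired nodes can be sketched as follows. This is an illustration only: it assumes SHA-256 as g_K with the key prepended, and it uses one (left, right) bitmask pair per level, which simplifies the bitmask indexing given above. The names `rand_hash` and `ltree` are hypothetical.

```python
import hashlib

def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length strings."""
    return bytes(p ^ q for p, q in zip(a, b))

def rand_hash(key, left, right, bm_left, bm_right):
    """Masked node hash: g_K((left XOR bm_left) || (right XOR bm_right))."""
    return hashlib.sha256(key + xor(left, bm_left) + xor(right, bm_right)).digest()

def ltree(pk, key, bitmasks):
    """Compress the WOTS+ public key elements pk into a single n-byte leaf.
    bitmasks[j] is the (left, right) mask pair used on level j (assumed layout)."""
    nodes = list(pk)
    level = 0
    while len(nodes) > 1:
        bm_l, bm_r = bitmasks[level]
        # Hash adjacent pairs; the range stops before an unpaired last node.
        nxt = [rand_hash(key, nodes[i], nodes[i + 1], bm_l, bm_r)
               for i in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2 == 1:
            nxt.append(nodes[-1])  # node without a right sibling is lifted unchanged
        nodes = nxt
        level += 1
    return nodes[0]
```

For len = 3, for instance, pk_2 is lifted once and then becomes the right sibling of the hash of pk_0 and pk_1.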

IV. HIERARCHICAL SIGNATURES SCHEME (HSS)
A hierarchical signature scheme is an N-time signature scheme that uses other hash-based signatures in its construction [7]. Several schemes use this construction, such as CMSS [9], GMSS [10], XMSS^MT [12], LMS [14] and SPHINCS [13]. The basic construction of HSS consists of a tree with d layers of subtrees, indexed by i = 0, ..., d − 1, where the lowest layer is i = d − 1. The trees on the top and intermediate layers are used to sign the root nodes of the trees on the respective layer below. Trees on the lowest layer are used to sign the actual messages. All trees can have equal height.
An HSS private key consists of the private keys of each level. The public key is the root of the top level. An HSS signature consists of the public keys of levels 1 to (d − 1), along with the signatures in each level, and the signature of the message M with the private key of the lowest level (d − 1). Hierarchical signatures allow a shorter signing time for a message M while offering a larger number of signed messages.

V. LMS AND XMSS DRAFTS
Among the variants of the Merkle scheme, we chose to implement the two standard proposals for hash-based signatures: LMS [14] and XMSS [16]. The LMS system is an adaptation of the original Lamport-Diffie-Winternitz-Merkle one-time signature system [15] and uses WOTS and HSS. XMSS specifies the one-time signature scheme (WOTS+), a single-tree variant (XMSS) and a multi-tree variant (XMSS^MT) of XMSS.

A. LMS
Leighton and Micali [14] introduce a "security string" that is distinct for each invocation of H to improve security against attacks that amortize their effort against multiple invocations of the hash function H. The following fields can appear in a security string: (I, D_ITER, D_PBLC, D_MESG, D_LEAF, D_INTR, C, r, q, i, j), as described in [14]. The values I, D and C must be chosen uniformly at random, or via a pseudorandom process; r is the node number associated with a particular node of a hash tree; q is set to be the leaf number of the hash tree; i is the index of the private key element (pk[i]); and j is the iteration number used when the private key element is being iteratively hashed. To generate a leaf (leaf[q]) in the LMS tree, the hash functions are used: tmp

B. XMSS
XMSS [16] randomizes each hash function call; this means that, aside from the initial message digest, a different key and a different bitmask are used for each hash function call. These values are pseudorandomly generated using a pseudorandom function that takes a key SEED and a 32-byte address ADRS and outputs an n-byte value, where n is the security parameter. There are three different types of addresses: one type for the hashes used in one-time signature schemes, one for hashes used within the main Merkle-tree construction, and one for hashes used in the L-trees.
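The address-based derivation of the key and bitmask can be sketched as follows for SHA2-256 and n = 32. The domain-separation prefixes and the position of the key-and-mask word follow the layout later published in RFC 8391, which is an assumption here since the draft cited as [16] may differ in details; the helper names are hypothetical.

```python
import hashlib

N = 32  # security parameter in bytes

def to_byte(x: int, length: int) -> bytes:
    """Big-endian encoding of x on `length` bytes."""
    return x.to_bytes(length, "big")

def prf(key: bytes, adrs: bytes) -> bytes:
    """PRF(SEED, ADRS): SHA2-256 with domain-separation prefix 3 (assumed layout)."""
    return hashlib.sha256(to_byte(3, 32) + key + adrs).digest()

def f(key: bytes, m: bytes) -> bytes:
    """Keyed function F: SHA2-256 with domain-separation prefix 0 (assumed layout)."""
    return hashlib.sha256(to_byte(0, 32) + key + m).digest()

def set_key_and_mask(adrs: bytes, value: int) -> bytes:
    """Return a copy of the 32-byte address with its last 32-bit word set to value."""
    return adrs[:28] + to_byte(value, 4)

def randomized_f(seed: bytes, adrs: bytes, x: bytes) -> bytes:
    """One randomized call: derive KEY and BM from (SEED, ADRS), then F(KEY, x XOR BM)."""
    key = prf(seed, set_key_and_mask(adrs, 0))
    bm = prf(seed, set_key_and_mask(adrs, 1))
    return f(key, bytes(a ^ b for a, b in zip(x, bm)))
```

Because the address encodes the position of the call (chain, tree node or L-tree node), two calls at different positions use unrelated keys and bitmasks even though they share the same SEED.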

C. Functions used in LMS and XMSS
This section describes the differences between the main functions of LMS and XMSS. Let F be the chain function used to generate the private keys and to sign and verify messages. Let G be the function used to generate the inner nodes of the tree. Let I, r, i, q and j be the security string fields defined in Section V-A; S = I + q; Left and Right be the left and right nodes; KEY be a key; BM, BM_0 and BM_1 be bitmasks; and sk_i be the WOTS secret key. The function uYstr(X) takes a nonnegative integer X as input and returns a byte string of Y/8 bytes.
In the LMS, the main functions have the following input sizes: In the XMSS, the main functions have the following input sizes:

D. Keys LMS and XMSS
The sizes of the private key (SK), the verification key (PK) and the signature (Sig) are described below.
In LMS:
• SK = (q, SEED_sk, SEED_I) has 2n + 4 bytes, given that q requires 4 bytes and that the seed to generate the secret key and the seed to generate the identifier I have n bytes each.
• PK = (I, T[1]) has 3n bytes, given that the identifier I has 2n bytes and the root of the tree (T[1]) has n bytes.
• Sig = (q, sig_ots, auth[0], ..., auth[h − 1]) has (p + 1 + h)n + 4 bytes, given that the index q has 4 bytes, the WOTS signature has a random value C with n bytes and sig with pn bytes, and the authentication path has hn bytes.
In XMSS:
• SK = (idx, wots_sk, SK_PRF, root, SEED) has 4n + 4 bytes, given that the leaf index idx requires 4 bytes, and the secret key wots_sk, the key SK_PRF, the root root and the seed SEED require n bytes each.
• PK = (root, SEED) has 2n bytes, given that the root and SEED require n bytes each.
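The size formulas above can be evaluated directly. In the sketch below the function names are hypothetical, and the values p = 67 and h = 5 used in the usage note are illustrative choices for a single tree, not parameters reported in this work.

```python
def lms_sizes(n: int, p: int, h: int) -> dict:
    """Sizes in bytes of the LMS private key, public key and signature."""
    return {
        "SK": 2 * n + 4,            # q (4 bytes) + two n-byte seeds
        "PK": 3 * n,                # identifier I (2n bytes) + root T[1] (n bytes)
        "Sig": (p + 1 + h) * n + 4, # q + C + p signature elements + h path nodes
    }

def xmss_sizes(n: int) -> dict:
    """Sizes in bytes of the XMSS private key and public key."""
    return {
        "SK": 4 * n + 4,  # idx (4 bytes) + wots_sk, SK_PRF, root, SEED
        "PK": 2 * n,      # root + SEED
    }
```

For n = 32, `lms_sizes(32, 67, 5)` gives a 68-byte private key, a 96-byte public key and a 2340-byte signature, while `xmss_sizes(32)` gives a 132-byte private key and a 64-byte public key.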

E. Security considerations
LMS is provably secure in the random oracle model, as shown by Katz [19]. From Theorem 8 of that reference: for any adversary attacking arbitrarily many instances of the one-time signature scheme, and making at most q hash queries, the probability with which the adversary can forge a signature with respect to any of the instances is at most q · 2^(1−8n) [14]. The format of the inputs to the hash function has the property that each invocation of that function has an input that is distinct from all others, with high probability. This property is important for a proof of security in the random oracle model. Let n be the number of bytes in the output of the hash function. We use n = 32 to have a security level of 128 bits, even assuming that there are quantum computers that can compute the input to an arbitrary function with a computational cost equivalent to the square root of the size of the domain of that function.
XMSS provides strong security guarantees and is even secure when the collision resistance of the underlying hash function is broken. Parameters are accompanied by a bit security value. The meaning of bit security is that a parameter set grants b bits of security if the best attack takes at least 2^(b−1) bit operations to achieve a success probability of 1/2. Hence, to mount a successful attack, an attacker needs to perform 2^b bit operations on average [20]. According to the security proof in [16], it is not sufficient to break the collision resistance of the hash functions to generate a forgery. More specifically, the requirements on the functions used are that F and G are post-quantum multi-function multi-target second-preimage resistant keyed functions, that F fulfills an additional statistical requirement that roughly says that most images have at least two preimages, that PRF is a post-quantum pseudorandom function, and that H_msg is a post-quantum multi-target extended target collision resistant keyed hash function.

VI. TARGET MICROARCHITECTURES
In this section, we describe the microarchitecture details of the Intel processors (Haswell and Skylake) used in this work. The Haswell microarchitecture, launched in 2013, supports the AVX2 vector instruction set, which expanded the integer arithmetic instructions from 128-bit to 256-bit registers. A single AVX2 instruction can operate on eight 32-bit values or four 64-bit values at the same time. These instructions allow four hashes to be processed concurrently for SHA2-512/SHAKE128 and eight hashes for SHA2-256.
The Skylake microarchitecture, released in 2015, is based on the Haswell and Broadwell microarchitectures [21]. Skylake improved the latency of some instructions. Some instructions in Skylake (such as vmov, vpand and vpsllq) have better throughput, which can be exploited to schedule the instructions better.
In the following, we describe general aspects of these micro-architectures, and the most relevant vector instructions used in this work.
In the late 1990s, processor manufacturers focused their efforts on exploiting data parallelism rather than instruction parallelism. Thus, they incorporated functional units that could execute a single instruction over a set of data. This processing fits into the paradigm of parallel computing known as Single Instruction Multiple Data (SIMD) [22].
In 1997, Intel launched its first set of instructions to implement the SIMD paradigm, called Multimedia eXtensions (MMX). MMX added 64-bit registers and vector instructions that enabled the processing of two 32-bit operations; at that time, the architectures had native 32-bit registers [21].
In 1999, Intel released the Streaming SIMD Extensions (SSE), which included eight 128-bit registers (XMMs); the number of registers was later doubled when the size of native registers increased to 64 bits. In the following years, SSE evolved with the launch of the new instruction sets SSE2, SSE3 and SSE4 [21].
In 2011, Intel launched the Advanced Vector eXtensions (AVX) instruction set, which introduced significant contributions to the architecture: it included 256-bit registers, called YMMs, that overlap the XMM registers. AVX also introduced a new encoding format that allows the use of three-operand assembly code, making the assignment of registers more flexible.
Code that is compiled for an instruction set can be executed only if both the CPU and the operating system support that set. Some compilers, like GCC, Clang and ICC, can vectorize operations automatically (without programmer intervention); however, it is not easy to determine whether the code can be vectorized. It is possible to vectorize code explicitly, by writing the code in assembly or by using intrinsic functions. The intrinsic functions are primitive operations in the sense that each intrinsic function is translated into one or more machine instructions.

B. Haswell
The Haswell microarchitecture, Intel's 4th generation Core processor family, was launched in early 2013 and presents a series of performance improvements as well as new instructions. The Bit Manipulation Instruction (BMI) feature group contains instructions that help increase the performance of SHA2 (RORX) and RSA (MULX). In addition, the new AVX2 instruction set promotes integer vector operations from 128 bits to 256 bits, increasing the performance of integer operations [23]. AVX2 has permutation and combination instructions that allow moving the words contained in vector registers [24].

C. Skylake
The Skylake microarchitecture was launched in 2015. Skylake offers the following enhancements: larger internal buffers, higher cache bandwidth, higher throughput, better branch prediction, lower power consumption, throughput balancing, and reduced floating point. A significant portion of the SSE, AVX, AVX2 and general-purpose instructions also had latency improvements [21].

D. Relevant instructions
According to Agner Fog [24], the latency of an instruction is the delay that the instruction generates in a dependency chain, measured in clock cycles. Another factor that influences performance is throughput, which is the maximum number of instructions of the same type that can be executed per clock cycle when the operands of each instruction are independent of the previous instructions.
Table I highlights some instructions of the AVX2 set that are relevant to the efficient implementation of XMSS and LMS. The table shows the latency, the throughput and the execution ports on Haswell and Skylake [24]. Ports 0, 1 and 5 in Table I are the most used ports and are therefore the most critical in determining the efficiency of the implementation. Note that some instructions have better throughput on Skylake; instruction scheduling can take advantage of this fact.

VII. SOFTWARE OPTIMIZATIONS
In this section, we discuss the software optimizations applied in this work for the Intel microarchitectures Haswell and Skylake. Software optimization aims at making software faster and smaller, and goes beyond writing a program with few lines of code. One must consider the costs of software development, the programming language used, the security of the code, and the computing power of processors. We show the most critical parts of our program and how we apply optimizations using AVX2 instructions.
One of the general objectives of this work was to provide techniques that enable the efficient use of vector instruction sets in the implementation of XMSS and LMS. Because both schemes are based on hash functions, this work shows the results of an efficient implementation that uses 256-bit registers to compute concurrently four hash values using SHA2-512/SHAKE128 or eight hash values using SHA2-256.
The first optimization for improving both signature schemes is the computation of multiple hashes at the same time. We call this approach multi-buffer optimization. As the data buffers are independent of each other and hold messages of the same size, it is possible to take advantage of the data-level parallelism of the hash algorithms. In addition, once the data is loaded into the registers, it is processed several times by the hash function, performing several iterations of the hash algorithm on the same data and avoiding memory accesses. The hash results are also returned in the same order as the inputs were sent, which makes the approach easy to implement. This optimization was applied to both hash functions, SHA2 and SHA3.

A. SHA2 optimizations
The 256-bit vector instructions can process four 64-bit words or eight 32-bit words using only one instruction. The SHA2-512/SHAKE128 algorithm works internally with 64-bit words and SHA2-256 with 32-bit words. Thus, taking advantage of the 256-bit registers, four hashes can be processed concurrently for SHA2-512/SHAKE128 and eight hashes for SHA2-256.
For example, consider the calculation of the message schedule word W[t], which in SHA2-256 is W[t] = σ1(W[t−2]) + W[t−7] + σ0(W[t−15]) + W[t−16] for 16 ≤ t ≤ 63. The operations on the values of W[t] for the eight messages can be performed in parallel using 256-bit registers. Each W[t] receives and processes 32-bit values. Thus, it is possible to compute the operations for the calculation of eight hashes at the same time with SHA2-256.
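The lane-wise computation of the message schedule can be mimicked in plain Python, where each list of eight integers stands in for one 256-bit register holding word t of eight independent messages. This is a sketch of the data layout only, not of the AVX2 code; the function names are hypothetical, and σ0 and σ1 are the SHA2-256 small sigma functions from FIPS 180-4.

```python
MASK32 = 0xFFFFFFFF

def rotr(x: int, r: int) -> int:
    """Rotate a 32-bit word right by r bits."""
    return ((x >> r) | (x << (32 - r))) & MASK32

def sigma0(x: int) -> int:
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)

def sigma1(x: int) -> int:
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)

def schedule_scalar(block_words):
    """Message schedule W[0..63] for one 16-word block."""
    w = list(block_words)
    for t in range(16, 64):
        w.append((sigma1(w[t - 2]) + w[t - 7] + sigma0(w[t - 15]) + w[t - 16]) & MASK32)
    return w

def schedule_lanes(blocks):
    """Message schedules for several blocks at once.
    w[t] is a list with one lane per message, like one vector register."""
    lanes = len(blocks)
    w = [[blocks[k][t] for k in range(lanes)] for t in range(16)]  # transpose the input
    for t in range(16, 64):
        w.append([
            (sigma1(a) + b + sigma0(c) + d) & MASK32
            for a, b, c, d in zip(w[t - 2], w[t - 7], w[t - 15], w[t - 16])
        ])
    return w
```

The transposition at the start is what makes every later step a pure lane-wise operation: lane k of every register belongs to message k, so no data ever crosses lanes.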

1) Optimizations based on processor execution ports:
Another optimization was to reformulate the code of the functions that compute the hash of the messages, scheduling the execution of the instructions to improve the throughput. The core strategy of our implementation was to analyze the main functions that process the messages to generate the hash. For the instructions of these functions, we examined the ports available to execute them, their latency and throughput, and the dependencies in the code. We have eliminated several dependencies in the SHA2 algorithm code. This approach allows us to parallelize calculations when there is no dependence between instructions.
As an example, the function Sig1(x) makes three calls to the rotation function Rot_n(x). Sig1(x) performs three shifts to the left (SL), three shifts to the right (SR) and three OR operations. If each call to Rot_n(x) is performed separately, then the instructions of this function will also be executed separately, underutilizing the available ports on the processor.
Both microarchitectures offer three ports to execute the OR instruction, so we unroll the Rot_n(x) function to perform all the SL operations first, followed by the SR operations. The three SL values and the three SR values are then available to execute three OR operations in parallel. In particular, Skylake has one additional port to perform the shift operation, so it is possible to execute two shifts at the same time. This analysis of the logical functions of the SHA2 algorithm has resulted in an implementation that takes advantage of the ports available on the processor and improves the throughput. As an example, Figure 1 illustrates the execution sequence of the instructions of the T_1 function on the Haswell processor. The graph represents the dependencies in a bottom-up layout, where the nodes represent the operations of the instructions and the numbers below each node represent the time in clock cycles. The shifts (SH) to the left (SL) and to the right (SR) in Figure 1 are computed at times 1 to 6. According to Table I, this instruction has a latency of one cycle and a throughput of one on Haswell. The three OR operations are executed at time seven because the throughput of this logical operation is three. Since the latency of the OR instruction is one cycle, the next instructions can already be executed at time eight.
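The regrouping of the rotation instructions can be illustrated as follows. The sketch assumes that Sig1 is the SHA2-256 function built from right rotations by 6, 11 and 25 bits, which is an assumption since the rotation amounts are not restated here. In Python the statement order has no performance effect; the point is only to show that the unrolled form exposes six independent shifts followed by three independent ORs, and that it computes the same value.

```python
MASK32 = 0xFFFFFFFF

def rot_separate(x: int, n: int) -> int:
    """One rotation at a time: shift right, shift left, then OR."""
    return ((x >> n) | (x << (32 - n))) & MASK32

def sig1_separate(x: int) -> int:
    """Sig1 as three separate rotation calls."""
    return rot_separate(x, 6) ^ rot_separate(x, 11) ^ rot_separate(x, 25)

def sig1_unrolled(x: int) -> int:
    """Sig1 with the rotations unrolled: all SL, then all SR, then the three ORs."""
    # Three independent left shifts (SL).
    sl6 = (x << 26) & MASK32
    sl11 = (x << 21) & MASK32
    sl25 = (x << 7) & MASK32
    # Three independent right shifts (SR).
    sr6 = x >> 6
    sr11 = x >> 11
    sr25 = x >> 25
    # Three ORs with no dependency on each other, so they can issue together.
    r6, r11, r25 = sl6 | sr6, sl11 | sr11, sl25 | sr25
    return r6 ^ r11 ^ r25
```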
Overall, the latency is one cycle if we look at isolated instructions, but if we look at the long chain of instructions of the T_1 function, the total latency is about eleven cycles, where the most critical parts are the bit rotations. We can calculate, on average, the total latency of the T_1 function on Haswell with the following formula: Lat_Haswell = 6 (SH) + 4/2 (ADD) + 8/3 (LOGIC) ≈ 11 cycles.
Analyzing the latency and the throughput on the Skylake processor for the T_1 function, we observe that some operations have the same latency but a better throughput. The SH operation has a throughput of two and the ADD operation a throughput of three. Thus, the latency calculation for the Skylake processor can be expressed as: Lat_Skylake = 6/2 (SH) + 4/3 (ADD) + 8/3 (LOGIC) = 7 cycles.
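The two estimates can be reproduced with exact arithmetic. The operation counts and throughputs below are the ones quoted in the formulas above; the function name `chain_latency` is hypothetical.

```python
from fractions import Fraction

# Number of operations of each kind in the T_1 function, as used in the formulas.
COUNTS = {"SH": 6, "ADD": 4, "LOGIC": 8}

# Instructions per cycle for each kind of operation.
HASWELL = {"SH": 1, "ADD": 2, "LOGIC": 3}
SKYLAKE = {"SH": 2, "ADD": 3, "LOGIC": 3}

def chain_latency(counts: dict, throughput: dict) -> Fraction:
    """Average latency estimate: sum of (operation count / throughput)."""
    return sum(Fraction(counts[op], throughput[op]) for op in counts)

haswell = chain_latency(COUNTS, HASWELL)  # 6 + 2 + 8/3 = 32/3, about 10.67 cycles
skylake = chain_latency(COUNTS, SKYLAKE)  # 3 + 4/3 + 8/3 = 7 cycles
```

The estimate drops from roughly 11 cycles to 7 cycles, and almost all of the gain comes from the doubled shift throughput.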
Section IX shows how these optimizations improved the performance of SHA2 in Haswell and Skylake.

B. SHA3 optimization
The Secure Hash Algorithm-3 (SHA-3) is a family of functions that was standardized by NIST in 2015 [26]. This family consists of four cryptographic hash functions and two extendable-output functions (XOFs), called SHAKE128 and SHAKE256. The permutation function used in the SHA-3 family is KECCAK-p[1600, 24], and it is the one that determines the efficiency of the algorithm.
The permutation function KECCAK-p[1600, 24] is composed of five steps that are processed 24 times. The steps are: the θ step, which XORs each word of the state with the parity of the left column and the parity of the right column rotated by one bit; the ρ step, where each word of the state is rotated by a fixed number of bits; the π step, where the words of the state are permuted; the χ step, which applies a non-linear function to the elements of the same row; and the ι step, which XORs the first element of the state with a constant value. The KECCAK-p[1600, 24] function uses a state of 25 words of 64 bits. The use of AVX2 instructions allows gathering four words in the same register and processing four states at the same time. Mapping four states requires 25 variables of 256 bits; after the mapping, each of the 25 variables is composed of one word of each state.
To implement the θ step, we need only XORs and rotations; the AVX2 instructions vpsllq and vpsrlq can be used to emulate rotation instructions. The other four steps can be implemented in blocks of five words at a time; it is important to process these words together to avoid a large number of memory accesses, because the Intel architecture has sixteen 256-bit registers and this implementation uses 25 variables.
The ρ step requires rotating each word of the state by a different number of bits. It is possible to process this step in parallel using the AVX2 instructions vpsllvq and vpsrlvq to emulate a variable rotation. The π step permutes the words of the state; as the corresponding words of the four states are mapped to the same variable, the permutation just changes the names of the variables, so no instruction is required. The χ step is processed in parallel by using one XOR and the vpandn instruction, and the ι step is just one XOR of the first word of the state with a constant. The complete code can be found in [27].
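The four-state mapping and the χ step can be sketched as follows. Each list of four 64-bit integers stands in for one 256-bit register that holds the same word of four independent states; `vpandn` and `vpxor` are plain Python stand-ins that mimic the lane-wise semantics of the corresponding instructions, and `chi_row_4way` is a hypothetical name. This is not the code referenced in [27].

```python
MASK64 = (1 << 64) - 1

def vpandn(a, b):
    """Lane-wise (NOT a) AND b, as the vpandn instruction computes."""
    return [(~x & MASK64) & y for x, y in zip(a, b)]

def vpxor(a, b):
    """Lane-wise XOR."""
    return [x ^ y for x, y in zip(a, b)]

def chi_row_4way(row):
    """χ on one row of five registers, each holding one word from four states."""
    return [
        vpxor(row[x], vpandn(row[(x + 1) % 5], row[(x + 2) % 5]))
        for x in range(5)
    ]

def chi_row_scalar(row):
    """Reference χ on one row of a single state (five 64-bit words)."""
    return [
        row[x] ^ ((~row[(x + 1) % 5] & MASK64) & row[(x + 2) % 5])
        for x in range(5)
    ]
```

Since every operation is lane-wise, lane k of the result depends only on lane k of the inputs, which is what allows four unrelated hash computations to share one instruction stream.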

VIII. OPTIMIZATIONS IN LMS AND XMSS
The following optimizations were applied to the standard proposals LMS and XMSS. We show how these optimizations improved the key generation, signature, and verification algorithms of these schemes. Each of these operations is based on hash functions. Thus, by optimizing the underlying hash functions, we speed up the execution of the signature operations of both schemes.
The optimized functions, based on hash algorithms, were:
• the keyed hash functions of LMS and XMSS;
• the function F of LMS and XMSS;
• the functions PRF and PRG of LMS and XMSS;
• the function H of XMSS.

A. Optimization of keyed hash functions
The keyed hash functions of both LMS and XMSS always work with message blocks of the same size. Therefore, in order to accelerate the computation of the keyed hash functions, we made a specialized implementation based on the size of the message input to be processed, with the block values and the pad values set in advance. Since the pad values are fixed, there is no need to calculate the pad each time the function is called. Figure 2 shows the optimization of the function F of XMSS with a fixed pad for SHA2-256. In the specialized implementation of these functions, we have created an interface that receives and processes 32-bit message words for SHA2-256 and 64-bit message words for SHA2-512/SHAKE128. Implementing these functions with 32/64-bit values for input, processing, and output has significantly reduced the processing time compared with generic functions that receive 8-bit characters, because the conversion between 32/64-bit and 8-bit values is time-consuming for the processor.
The SHA2 hash function processes blocks of 512 bits, while SHA3/SHAKE128 can handle up to 1344 bits of the message at a time. An implementation of the function F of XMSS with SHAKE128 needs to process just a single block, while with SHA2 it must process two blocks to generate the hash value.
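The fixed-pad idea can be sketched for SHA2-256 as follows. The sketch assumes that the input of F is 96 bytes long (a 32-byte prefix, a 32-byte key and a 32-byte message), which is consistent with the two blocks mentioned above but is not a layout taken from Figure 2; `sha256_padding`, `F_PAD` and `f_blocks` are hypothetical names. The padding rule itself is the standard SHA-256 one: a 0x80 byte, zero bytes, and the message length in bits as a 64-bit big-endian integer.

```python
def sha256_padding(msg_len: int) -> bytes:
    """Padding bytes that SHA-256 appends to a message of msg_len bytes."""
    zeros = (55 - msg_len) % 64  # fill so that the total length is a multiple of 64
    return b"\x80" + b"\x00" * zeros + (8 * msg_len).to_bytes(8, "big")

F_INPUT_LEN = 96                      # assumed input length of F, in bytes
F_PAD = sha256_padding(F_INPUT_LEN)   # computed once, reused for every call

def f_blocks(prefix: bytes, key: bytes, m: bytes):
    """Return the two 64-byte blocks fed to the compression function for one F call."""
    data = prefix + key + m + F_PAD
    return [data[i:i + 64] for i in range(0, len(data), 64)]
```

Because the input length never changes, the second block always ends with the same 32 padding bytes, so a specialized implementation can keep them in a register instead of rebuilding them.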

B. Optimization of the function F
The function F is used in the chaining function algorithm to generate the OTS verification keys. In signing, the chaining function algorithm is also used to update the leaves in the authentication path. Thus, reducing the execution time of the function F yields a significant improvement in the performance of both schemes, LMS and XMSS.
Figure 3 shows our implementation of the function F with SHA2-512, which computes four instances in parallel and thus generates four public keys pk at the same time. We load four secret keys sk into the 256-bit vector registers, perform e iterations of the function F, and then store four public keys pk in memory. We can compute the public keys pk in parallel because they are generated independently of each other. Additionally, we keep four instances of the pad value in a 256-bit register because these values are used multiple times.
The use of SIMD instructions reduced the runtime of the function F, which is the computationally most expensive part of key and signature generation.
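The lane-parallel chaining described above can be illustrated with a short portable sketch. Here toy_f is a hypothetical stand-in for the function F (a simple 64-bit mixing function, not a hash), and chain1 and chain4 are illustrative names; the only point is the loop structure, in which every iteration applies the function to four independent lanes.

```c
#include <stdint.h>

#define LANES 4 /* four chains advance together (assumed lane count) */

/* Hypothetical stand-in for F: a 64-bit mixer, NOT a cryptographic hash. */
static uint64_t toy_f(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* Scalar reference: apply the function e times to one element. */
static uint64_t chain1(uint64_t x, unsigned e)
{
    while (e--)
        x = toy_f(x);
    return x;
}

/* Lane-parallel version: load four secret elements, iterate e times on
   all lanes, store four results. The inner loop over lanes is what a
   multi-buffer hash would execute as one SIMD call. */
static void chain4(uint64_t pk[LANES], const uint64_t sk[LANES], unsigned e)
{
    uint64_t lane[LANES];
    for (int i = 0; i < LANES; i++)
        lane[i] = sk[i];
    for (unsigned s = 0; s < e; s++)
        for (int i = 0; i < LANES; i++)
            lane[i] = toy_f(lane[i]);
    for (int i = 0; i < LANES; i++)
        pk[i] = lane[i];
}
```

The sketch works because all lanes run the same number of iterations, which is exactly the situation in key generation.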
1) The function F in signing and verification: For the generation of OTS keys, the optimization of the function F was simple, because in the generation of the keys pk the function F is applied the same number of times to every element of the secret key sk. However, for OTS signature generation and verification, the number of applications of the function F to each signature element depends on the message digest M. We therefore made a small change to the one-time signature and verification algorithms so that the function F can be applied in parallel to elements of the signature.
We added a sorting step before the function F. The vector msg, which contains the number of times the function F is to be applied to each element, is sorted by that number. After sorting, we select the msg elements that have the same value and run F on them in parallel. Once the function F has been executed, the signature elements are placed back in their original order.
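A minimal sketch of this sort-then-batch idea follows, under stated assumptions: toy_f is a hypothetical stand-in for F (not a hash), elements are single 64-bit words rather than n-byte strings, and batches hold at most four elements. The function name chain_sorted and the job_t record are illustrative and do not come from the implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LANES 4

typedef struct { uint32_t steps; uint32_t index; } job_t;

/* Order jobs by iteration count, then by original position. */
static int cmp_job(const void *a, const void *b)
{
    const job_t *x = (const job_t *)a, *y = (const job_t *)b;
    if (x->steps != y->steps) return x->steps < y->steps ? -1 : 1;
    return (x->index > y->index) - (x->index < y->index);
}

/* Hypothetical stand-in for F: a 64-bit mixer, NOT a cryptographic hash. */
static uint64_t toy_f(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* Apply toy_f msg[i] times to sk[i] for every i, grouping elements with
   equal counts into batches of up to LANES. Returns the number of
   batches executed, or -1 if memory allocation fails. */
static int chain_sorted(uint64_t *sig, const uint64_t *sk,
                        const uint32_t *msg, size_t len)
{
    if (len == 0) return 0;
    job_t *jobs = (job_t *)malloc(len * sizeof *jobs);
    if (!jobs) return -1;
    for (size_t i = 0; i < len; i++) {
        jobs[i].steps = msg[i];
        jobs[i].index = (uint32_t)i;
    }
    qsort(jobs, len, sizeof *jobs, cmp_job);

    int batches = 0;
    for (size_t i = 0; i < len; ) {
        size_t n = 1; /* size of the current batch */
        while (n < LANES && i + n < len && jobs[i + n].steps == jobs[i].steps)
            n++;
        uint64_t lane[LANES];
        for (size_t k = 0; k < n; k++)
            lane[k] = sk[jobs[i + k].index];
        for (uint32_t s = 0; s < jobs[i].steps; s++)
            for (size_t k = 0; k < n; k++)
                lane[k] = toy_f(lane[k]);
        for (size_t k = 0; k < n; k++)
            sig[jobs[i + k].index] = lane[k]; /* restore original order */
        batches++;
        i += n;
    }
    free(jobs);
    return batches;
}
```

Batches that are not full leave some lanes idle, which is consistent with signing and verification gaining less from vectorization than key generation.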

C. Optimizing the functions PRG and PRF
The pseudorandom generator PRG generates the secret key elements sk using the PRF function. The secret keys are calculated as sk[i] = PRF(S, toByte(i, 32)), for 0 ≤ i < len. The string S is a randomly generated secret value that is used as a seed to generate all keys sk. The value i is concatenated with the value S to generate the values of sk. Since the elements of sk are generated independently of each other, we can generate eight values of sk at the same time with SHA2-256 and four values of sk with SHAKE128, reducing the execution time of these functions.
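To make the input preparation concrete, the sketch below encodes a batch of eight consecutive indices with toByte(i, 32), understood here as the 32-byte big-endian encoding of i. The names to_byte and prg_indices and the batch size of eight (the SHA2-256 case) are assumptions of the sketch; the hashing itself is omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH 8 /* eight independent inputs, as in the SHA2-256 case */

/* toByte(x, y): write x as a y-byte big-endian string. */
static void to_byte(uint8_t *out, uint64_t x, size_t y)
{
    for (size_t i = y; i-- > 0; ) {
        out[i] = (uint8_t)(x & 0xff);
        x >>= 8;
    }
}

/* Prepare the index strings for sk[first] .. sk[first + 7]. Each one
   would then be combined with the seed S and hashed in its own lane. */
static void prg_indices(uint8_t idx[BATCH][32], uint32_t first)
{
    for (unsigned k = 0; k < BATCH; k++)
        to_byte(idx[k], (uint64_t)first + k, 32);
}
```

Since the eight inputs differ only in the index, they can be prepared up front and processed by one multi-buffer hash call.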
The PRF function generates the pseudorandom values. It is used in key generation and in signing for both LMS and XMSS.

D. Optimizing the implementation of the Ltree
The Ltree from XMSS [11] is used to generate the leaves of XMSS. In this section, we show an optimization of the Ltree generation that improves the performance of generating each leaf of the XMSS tree. This optimization was suggested in [13]. The function G is applied to each concatenation of two child nodes to generate the parent node. We modified the Ltree algorithm to perform eight evaluations of the function G at the same time: we generate eight internal nodes at once from 16 child nodes, which are concatenated two by two. If the number of remaining nodes is not a multiple of 16, we generate the remaining internal nodes one by one in the traditional way. If len is not a power of two, there are not enough leaves to build a complete binary tree. Therefore, a node that has no right sibling is lifted to a higher level of the Ltree until it becomes the right sibling of another node. Figure 4 shows the optimization performed in the generation of the Ltree for w = 16 and l = 67.
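The batching rule can be checked with a small counting sketch. It performs no hashing; it only walks the Ltree level by level and counts how many evaluations of G fall into batches of eight and how many are left for the one-by-one path. The names ltree_plan and ltree_plan_t are illustrative.

```c
typedef struct {
    unsigned hashes;       /* total evaluations of G */
    unsigned simd_batches; /* groups of eight evaluations done at once */
    unsigned scalar_calls; /* evaluations done one by one */
} ltree_plan_t;

/* Count the work needed to reduce l nodes to a single root. */
static ltree_plan_t ltree_plan(unsigned l)
{
    ltree_plan_t p = {0, 0, 0};
    while (l > 1) {
        unsigned pairs = l / 2;       /* parent nodes on this level */
        p.hashes += pairs;
        p.simd_batches += pairs / 8;  /* 16 children -> 8 parents */
        p.scalar_calls += pairs % 8;  /* leftover pairs */
        l = pairs + (l & 1u);         /* an unpaired node is lifted */
    }
    return p;
}
```

For l = 67 the levels contain 33, 17, 8, 4, 2, 1, and 1 pairs, so the sketch reports 66 evaluations in total: 7 batches of eight plus 10 single calls. Most of the work sits in the two lowest levels, where the batches are full.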

IX. PERFORMANCE RESULTS
This section presents the experimental results of our AVX2 implementation of LMS and XMSS. The results were obtained by running benchmarks on a Haswell processor (Core i7-4770 at 3.4 GHz) and a Skylake processor (Core i7-6700K at 4.0 GHz). Intel Turbo Boost and Intel Hyper-Threading were disabled to ensure the reproducibility of the results. Our implementation was written in C and compiled using the GNU C Compiler v6.2.0. The runtimes for signing and verifying with H > 20 are computed as the arithmetic mean over the first one million signatures.

A. Scheme parameters
We selected a set of parameters provided by the drafts of LMS [14] and XMSS [16]. The parameters are: w ∈ N, the Winternitz parameter; h ∈ N, the total height of the tree; d, the number of layers; and n, the output length of the hash function in bytes.
The output length of the chosen hash function determines the security of the system. Against classical computers, the parameter n = 32 provides a 256-bit security level and n = 64 provides a 512-bit security level. Against quantum computers, for 128-bit security we use the SHA2-256 and SHAKE256 functions in LMS and the SHA2-256 and SHAKE128 functions in XMSS. For 256-bit security, we use the SHA2-512 and SHAKE256 functions in XMSS.
The value w influences the execution time and the size of the signature: larger values of w imply longer execution times but smaller signatures. The values H and d affect the signature size. The output length n of the chosen hash function also influences the sizes of the public key, the private key, and the signature. We used w = 2 or w = 4 for LMS and w = 16 for XMSS. The maximum height of XMSS^MT was H = 60 and the maximum number of layers was d = 12, according to [16].
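The effect of w on the signature length can be made concrete with the usual Winternitz length formulas, len_1 = ⌈8n / lg w⌉ and len_2 = ⌊lg(len_1 (w − 1)) / lg w⌋ + 1, in the XMSS convention where w is a power of two giving the number base. The sketch below evaluates them with integer arithmetic; the function names are illustrative. Our understanding is that the LMS draft instead counts w in bits per digit, so that its w = 4 corresponds to base 16 here; treat that mapping as an assumption.

```c
/* floor(log2(x)) for x >= 1 */
static unsigned ilog2(unsigned x)
{
    unsigned r = 0;
    while (x >>= 1)
        r++;
    return r;
}

/* Number of n-byte elements in a Winternitz signature for hash output
   length n (bytes) and base w (a power of two, XMSS convention). */
static unsigned wots_len(unsigned n, unsigned w,
                         unsigned *len1, unsigned *len2)
{
    unsigned lg_w = ilog2(w);
    *len1 = (8 * n + lg_w - 1) / lg_w;            /* message digits */
    *len2 = ilog2(*len1 * (w - 1)) / lg_w + 1;    /* checksum digits */
    return *len1 + *len2;
}
```

For n = 32 and w = 16 this gives len_1 = 64, len_2 = 3, and len = 67. The same formulas give len = 133 for base 4 with n = 32, which shows how the signature length depends on the base.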

B. SHA2/SHA3 implementation results
Table II shows the performance of our single-buffer (64-bit) and multi-buffer (256-bit) implementations of SHA2-256 and SHAKE128 on Haswell and Skylake. The input sizes of these functions were selected according to the input sizes of the functions F and G of LMS and XMSS. The function F processes messages of 104 bytes in LMS and 96 bytes in XMSS. The function G processes messages of 133 bytes in LMS and 128 bytes in XMSS. Thus, with SHA2-256, the function F processes two 512-bit data blocks and the function G processes three 512-bit data blocks. With SHAKE128, the functions F and G each process a single 1344-bit block. Therefore, the single-buffer implementations of F and G achieve better results with SHAKE128.
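These block counts follow directly from the padding rules and can be verified with a short sketch. SHA2-256 appends at least one pad byte and an 8-byte length field to the message before splitting it into 64-byte blocks; SHAKE128 appends at least one pad byte before splitting into 168-byte (1344-bit) blocks. The function names are illustrative, and byte-aligned messages are assumed.

```c
#include <stddef.h>

/* Number of 64-byte compression-function blocks SHA2-256 processes for
   a message of msg_len bytes: message + 0x80 byte + 8-byte length,
   rounded up to a multiple of 64. */
static size_t sha256_blocks(size_t msg_len)
{
    return (msg_len + 1 + 8 + 63) / 64;
}

/* Number of 168-byte blocks SHAKE128 absorbs for a message of msg_len
   bytes: at least one padding byte is always added. */
static size_t shake128_blocks(size_t msg_len)
{
    return msg_len / 168 + 1;
}
```

For the input sizes above, sha256_blocks returns 2 for 96 and 104 bytes and 3 for 128 and 133 bytes, while shake128_blocks returns 1 in all four cases.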
For the multi-buffer implementations, the functions F and G were faster with SHA2-256 than with SHAKE128. The single-buffer function F with SHAKE is 1.2× faster than the single-buffer SHA2-256 implementation. However, SHA2-256 processes eight independent hash values simultaneously, whereas SHAKE128 processes only four independent hash values in parallel. As a result, the speedup of the multi-buffer version over the single-buffer version is approximately 4.6× with SHA2 and approximately 2.4× with SHAKE.
We also note in Table II that the function F with SHA2-256 achieves a speedup of 4.6× per hash on Haswell and 4.9× per hash on Skylake. Performance on Skylake is better because of the architectural features presented in Section VI.
In the following sections, we show that the performance obtained in the multi-buffer implementation of the hash functions carries over to the performance of the key generation, signing, and verification algorithms.

C. XMSS/LMS implementation results
In this section, we present the results of our XMSS/LMS single-buffer (64-bit) and multi-buffer (256-bit) implementation with the hash functions SHA2 and SHAKE.
In Table III, we compare our single-buffer (64-bit) XMSS implementation with the single-buffer implementation presented on the author's website [28]. The results were obtained on a Haswell processor. Our implementation achieves a speedup of 1.4× over the implementation in [28] for key generation, signing, and verification. This improvement is due to the specialized implementation of each function of XMSS. Table IV presents the results of the single-buffer (64-bit) and multi-buffer (256-bit) implementations of XMSS with SHA2 and SHAKE for different security levels. We compare our single-buffer and multi-buffer results for h = 20.
The speedups due to the multi-buffer optimization for key generation, signing, and verification, respectively, are 4.4×, 4.2×, and 2.4× with SHA2-256; 2.6×, 2.5×, and 2.0× with SHAKE128; 2.4×, 2.4×, and 2.0× with SHA2-512; and 3.3×, 3.3×, and 2.8× with SHAKE256. The speedup is greatest for key generation because the function F_SIMD is executed the same number of times on all elements of the secret key. In the WOTS signing and verification processes, however, it was necessary to sort the elements of the signature before applying the function F_SIMD, because the number of applications of the function depends on the bits of the message.
The runtimes with SHA2-512/SHAKE256 are longer than with SHA2-256/SHAKE128, but they provide a higher security level (256 bits). At the 128-bit security level, the single-buffer XMSS performs better with SHAKE128 than with SHA2-256. However, the multi-buffer version with SHA2-256 has better runtimes than the multi-buffer version with SHAKE128, as a consequence of the performance of these functions presented in Table II.

D. Hierarchical signatures scheme implementation results
In this section, we show the performance results of our software for both hierarchical schemes, HSS and XMSS^MT, on the Skylake processor. We use the parameter w = 4 for LMS and w = 16 for XMSS, so the length of the OTS signature is len = 67 for both schemes. Table VI gives the runtimes of the multi-buffer HSS using the hash functions SHA2-256 and SHAKE256 at the 128-bit security level. Notice that increasing the number of layers reduces the runtime for key generation and signing; however, the runtime for verification increases because the signatures of all layers must be checked. For subtrees that have the same height, the signing time remains constant. Thus, increasing the total tree height allows more signatures to be produced without affecting the performance of signing and verifying. Additionally, increasing the number of layers makes the secret key and the signature larger, because they store information for each layer.
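The relation between the total height, the number of layers, and the subtree height can be illustrated with a small helper. It assumes, as in XMSS^MT, that all layers use subtrees of the same height, so that h must be a multiple of d; the names mt_shape and mt_shape_t are illustrative, and the helper only computes sizes.

```c
#include <stdint.h>

typedef struct {
    unsigned subtree_height;     /* h / d */
    uint64_t leaves_per_subtree; /* 2^(h/d) */
    uint64_t total_signatures;   /* 2^h */
} mt_shape_t;

/* Derive the shape of a multi-tree with total height h and d layers.
   Returns 0 on success, -1 if the parameters are not supported here
   (d == 0, h not a multiple of d, or h too large for 64 bits). */
static int mt_shape(unsigned h, unsigned d, mt_shape_t *out)
{
    if (d == 0 || h % d != 0 || h > 63)
        return -1;
    out->subtree_height = h / d;
    out->leaves_per_subtree = (uint64_t)1 << (h / d);
    out->total_signatures = (uint64_t)1 << h;
    return 0;
}
```

For h = 60 and d = 12, each subtree has height 5 and only 32 leaves, while the whole structure supports 2^60 signatures. This shows how the capacity can grow with h while the subtrees handled during signing stay small.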

E. Analysis of results
In this section, we examine the results with AVX2 and compare the schemes LMS and XMSS.
Figure 5 compares the performance of the multi-buffer and single-buffer implementations of XMSS/LMS with AVX2. The multi-buffer implementations with SHA2-256 perform better because they execute eight hashes at the same time, whereas SHAKE allows only four hashes in parallel.
Table VIII summarizes the code size, for the C implementation, of the main functions of the schemes LMS and XMSS. We ran the GNU command nm on Linux, which reports the sizes of the objects in a file. Note that the LMS code is approximately 1.23× larger than the XMSS code for the single-buffer implementation, because LMS uses two hash functions, H_leaf and H_node, for the generation of the leaves of the tree, whereas XMSS uses only the Ltree.
Table IX shows the key lengths and runtimes of both schemes, LMS and XMSS, for a tree with height h = 20 at the 128-bit security level (n = 32 bytes). With w = 4 for LMS and w = 16 for XMSS, the length of the OTS signature is len = 67 for both schemes. Note that, for the selected parameters, the LMS secret key is shorter than the XMSS secret key. On the other hand, the LMS public key is larger than the XMSS public key, and the signatures of both schemes have the same size. Since LMS makes fewer calls to the underlying hash function than XMSS, the implementation of LMS with SHA2-256 is approximately 2.7× faster than the implementation of XMSS with SHA2-256. In addition, LMS does not use an Ltree and performs fewer operations when generating the internal nodes of the binary tree.
According to the results presented, the use of AVX2 contributes significantly to the performance of the standard proposals for the Merkle scheme and its variants. At the 128-bit security level, if the computer does not support AVX2 instructions, then the single-buffer implementation of LMS/XMSS with the hash function SHAKE128 is a good option because it has better runtimes. However, if these instructions are available, the multi-buffer implementation of LMS/XMSS with SHA2-256 is the best option. If the signature scheme is chosen on the basis of runtime, LMS could be used because of its better execution times. However, if security is the greater concern, XMSS is the better option because it provides strong security guarantees: XMSS is existentially unforgeable under adaptively chosen message attacks (EU-CMA), it is forward secure, and it is considered safe even when the collision resistance of the underlying hash function is broken.

X. CONCLUSION
The emerging transition to post-quantum cryptography requires digital signature schemes that resist attacks by quantum computers. Hash-based signature schemes are promising candidates for replacing the current signature schemes because their security does not depend on number-theoretic problems such as integer factorization. These schemes are the object of current standardization efforts. Many improvements have already been made to the MSS, making it feasible for many present-day applications. However, some issues remain, such as the limited number of signatures, storage requirements, state management, and slow key-pair generation, which leads to an important question: how can we apply the Merkle scheme in current applications? New variants have emerged to mitigate the storage problem, for example by using pseudorandom generators to reduce the key size. The use of multi-trees has made it possible to increase the number of signatures and to reduce the time needed to generate signature and verification keys.
In this work, we presented an efficient software implementation of the Merkle scheme proposals (LMS and XMSS) using the AVX2 vector instruction set on Intel processors. We showed that our implementation significantly improves the execution times of the key generation, signing, and verification algorithms of these standards. We used several optimization techniques to increase the software performance of both schemes. Our results show the feasibility of using these post-quantum schemes in practical applications.
Signature generation: To generate the signature of a message M, first compute the message digest d = g(M). Then d is split into len_1 binary blocks, resulting in d = (m_0 || ... || m_{len_1 − 1}), where || denotes concatenation. The checksum c is computed and appended to d, where c can be divided into len_2 blocks, c = (c_0 || ... || c_{len_2 − 1}).
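As an illustration of this splitting step, the sketch below handles the case n = 32 and w = 16, where the digest yields len_1 = 64 four-bit blocks and the checksum yields len_2 = 3 blocks. It assumes the XMSS-style checksum c = Σ (w − 1 − m_i); the function name wots_digits and the fixed parameters are choices made for the sketch.

```c
#include <stdint.h>

#define N    32 /* digest length in bytes */
#define W    16 /* Winternitz base (XMSS convention) */
#define LEN1 64 /* 4-bit blocks in the digest */
#define LEN2 3  /* 4-bit blocks in the checksum */

/* Split a 32-byte digest into 64 base-16 blocks and append the three
   base-16 blocks of the checksum, most significant block first. */
static void wots_digits(uint8_t out[LEN1 + LEN2], const uint8_t digest[N])
{
    unsigned csum = 0;
    for (int i = 0; i < N; i++) {
        out[2 * i]     = digest[i] >> 4;  /* high nibble */
        out[2 * i + 1] = digest[i] & 15;  /* low nibble */
    }
    for (int i = 0; i < LEN1; i++)
        csum += (W - 1) - out[i];         /* at most 64 * 15 = 960 */
    out[LEN1]     = (csum >> 8) & 15;
    out[LEN1 + 1] = (csum >> 4) & 15;
    out[LEN1 + 2] = csum & 15;
}
```

Each of the 67 resulting values gives the number of times the function F is applied to the corresponding secret key element during signing.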
PRF: The function Hash(toByte(3, n) || KEY || M) receives the values KEY and M as input. We created the function PRF_SIMD, which receives eight values of KEY and eight values of M with SHA2-256, or four values of each with SHAKE128. It processes these values in parallel and returns eight or four pseudorandom values, respectively.

TABLE IV. PERFORMANCE FIGURES OF XMSS FOR PARAMETERS h = 20 AND w = 16 FOR DIFFERENT SECURITY LEVELS ON SKYLAKE

Table V presents the timing results of our software for the multi-buffer version of LMS with SHA2-256 and SHAKE256 at the 128-bit security level. The speedup obtained with multi-buffer SHA2-256 and w = 4 is 4.2×, 4.1×, and 2.0× for key generation, signing, and verification, respectively. With multi-buffer SHAKE256 and w = 4, the speedup is 2.7×, 2.6×, and 1.8× for key generation, signing, and verification, respectively. A larger value of w results in shorter signatures but slower signing operations overall; it has little effect on security.

TABLE V. PERFORMANCE FIGURES OF LMS AT 128-BIT SECURITY LEVEL FOR DIFFERENT VALUES OF w ON SKYLAKE

TABLE VI. PERFORMANCE FIGURES OF HSS MULTI-BUFFER FOR w = 4 AND DIFFERENT VALUES OF h AND d ON SKYLAKE

TABLE VII. PERFORMANCE FIGURES OF XMSS^MT MULTI-BUFFER FOR w = 16 AND DIFFERENT VALUES OF h AND d ON SKYLAKE

TABLE VIII. SIZE OF LMS AND XMSS CODES

TABLE IX. SIZES AND RUNTIMES OF LMS AND XMSS WITH SHA2-256 FOR 2^20 SIGNATURES