Read an Excerpt
Chapter 5: Performance of Twofish
Twofish has been designed from the start with performance in mind. It is efficient on a variety of platforms: 32-bit CPUs, 8-bit smart cards, and dedicated VLSI hardware. More importantly, though, Twofish has been designed to allow several layers of performance tradeoffs, depending on the relative importance of encryption speed, key setup, memory use, hardware gate count, and other implementation parameters. The result is a highly flexible algorithm that can be implemented efficiently in a variety of cryptographic applications.
All these options are interoperable; these are simply implementation tradeoffs and do not affect the mathematics of Twofish. One end of a communication could use the fastest Pentium II implementation, and the other the cheapest hardware implementation.
5.1 Performance on Large Microprocessors
Table 5.1 gives Twofish's performance, encryption or decryption, for different key scheduling options and on several modern microprocessors using different languages and compilers. This table shows our results for many different implementations. Each implementation is presented on a single line. The first column gives the CPU the implementation was run on (PPro/II = Pentium Pro/Pentium II, U-SPARC = Ultra- SPARC, PPC = Power PC). The second column is the programming language (ASM = assembly language, MS C = Microsoft Visual C++ 4.2, BC = Borland C 5.0, C = standard C compiler). The keying options are explained below. The code size column contains the approximate total code size (in bytes) of the routines for encryption, decryption, and key setup, where available. All remaining numbers in the row are inclock cycles. For each key size we show the number of clock cycles required for the key setup, and the number of clock cycles required to encrypt a single block. The times for encryption and decryption are identical in assembly, and encryption is slightly slower than decryption in C; only the encryption (i.e., the larger) number is given. There is no time required to set up the algorithm except for key setup. The time required to change a key is the same as the time required to set up a key.
For example, on a Pentium Pro a fully optimized assembly-language version of Twofish can encrypt or decrypt data in 258 clock cycles per block, or 16.1 clock cycles per byte, after a 12700-clock key setup (equivalent to encrypting 45 blocks). On a 200 MHz Pentium Pro microprocessor, this translates to a throughput of just under 90 Mbits/sec.
5.1.1 Keying Options
We implemented four different keying options. All of our keying options precompute Ki for i = 0,...,39 and use 160 bytes of RAM to store these constants. The differences occur in the way the function g is implemented. There are several other possible keying options, each with slightly different setup/throughput tradeoffs, but the examples listed below are representative of the range of possibilities.
Full Keying. This option performs the full key precomputations. Using 4 Kbytes of table space, each S-box is expanded to a 8-by-32-bit table that combines both the S- box lookup and the multiply by the column of the MDS matrix. Using this option, a computation of g consists of four table lookups, and three XORS. Encryption and decryption speeds are constant regardless of key size.
Partial Keying. For applications where few blocks are encrypted with a single key, it may not make sense to build the complete key schedule. The partial keying option precomputers the four S-boxes in 8-by-8-bit tables, and uses four fixed 8-by- 32-bit MDS tables to perform the MDS multiply. This reduces the key-schedule table space to 1 Kbyte. For each byte, the last of the q-box lookups is in fact incorporated into the MDS table, so only k of the q-boxes are incorporated into the 8-by-8-bit S-box table that is built by the key schedule. Encryption and decryption speeds are again constant regardless of key size.
Minimal Keying. For applications where very few blocks are encrypted with a single key, there is a further possible optimization. Compared to partial keying, one less layer of q-boxes is precomputed into the S-box table, and the remaining q-box is done during the encryption. For the 128-bit key this is particularly efficient, as precomputing the S-boxes now consists of copying the table of the appropriate q- box and XORing it with a constant (which can be done word-by-word instead of byte-by-byte). This option uses a 1 Kbyte table to store the partially precomputed S-boxes. The necessary key bytes from S are of course precomputed, as they are needed in every round.
Zero Keying. The zero keying option does not precompute any of the Sboxes, and thus needs no extra tables. Instead, every entry is computed on the fly. The key setup time consists purely of computing the Ki values and S. For an application that cannot have any key setup time, the time it takes to encrypt one block is the sum of the key setup time and encryption time for the zero keying option. Compiled. In this option, available only in assembly language, the subkey constants are directly embedded into a key-specific copy of the code, saving memory fetches and allowing the use of the Pentium LEA opcode to perform both the PHT and the subkey addition all in a single clock cycle. Some additional setup time is required, as well as about an extra 5000 bytes of memory to hold the "compiled" code, but this option allows the fastest execution time of 258 clocks per block on the Pentium Pro. The setup time for a Pentium more than doubles over the full keying case because of its smaller cache size, but the Pentium MMX setup time is still comparable to the Pentium Pro setup time. However, almost all the extra time required is consumed merely in copying the code; the table does not reflect the fact that, once a single key has been initialized, future keys can be compiled at a cost of only a few hundred more clocks than full key schedule time.
5.1.2 Code and Data Size
As shown in Table 5.1, the code size for a fully optimized Twofish implementation on a Pentium Pro ranges from about 8450 bytes in assembler to 14100 bytes in Borland C. In assembler, the encryption and decryption routines are each about 2250 bytes in size; the remaining 4000 bytes of code are in the key scheduling routines, but about 2200 bytes of that total can be discarded if only 128-bit keys are needed. In Borland C, the encryption and decryption routines are each about 4500 bytes in length, and the key schedule routine is slightly less than 5000 bytes. Note that each routine fits easily within the code cache of a Pentium or a Pentium Pro. These sizes axe larger than Blowfish but very similar to a fully optimized assembly language version of DES. Note that, with the exception of the zero keying option, the code sizes in the table are for fully unrolled implementations; in either C or assembler, it is possible to achieve significantly smaller code sizes using loops with round counters, at a cost in performance.
In addition to the code, there axe about 4600 bytes of fixed tables for the MDS matrix, qO and qj required for key setup, and each key requires about 4300 bytes of key-dependent data tables for the full keying option. During encryption or decryption, these tables fit easily in the Pentium data cache. The other keying options use less data and table space, as discussed above.
5.1.3 Large Memory Implementations
For machines with sufficient RAM and a good memory cache subsystem, large precomputed tables can be used to reduce the key setup time for Twofish even further. For example, in compiled, full, or partial keying modes, the first two levels of qO and q1 lookups with one key byte can be precomputed for all four S-boxes, requiring 256 Kbytes of table (four tables of 64 Kbytes each). This approach saves roughly 2000 clocks per key setup on the Pentium Pro in assembly language; details are shown in Table 5.2. For instance, the compiled mode key setup for 128-bit keys on a Pentium Pro can be reduced from 8700 clocks to 6500 clocks. Unfortunately, the savings on Pentium and Pentium MMX CPUs seems to depend on the performance of the L2 cache subsystem (which is included in the Pentium Pro and thus is more predictable); the gain seems to range from 500 clocks down to nothing. Implementing this "big table" version in C also leads to savings of about 1000 clocks per key setup on the Pentium Pro, depending on the quality of the compiler; again, Pentium performance gains are minimal.
For the ultimate in key agility, a full 256 Mbytes of precomputed tables could comprise all four S-boxes for the final two stages of qO, q1, covering all 216 key byte possibilities for the 128-bit key case, and including the MDS matrix multiply. With a good memory subsystem, such a version should cut another 1000 clocks or so out of the above key setup times. Clearly, this is a fairly expensive solution (at least with today's technology), but it illustrates the flexibility of Twofish very nicely.
5.1.4 Total Encryption Times
Any performance measures that do not take key setup into account are only valid for asymptotically large amounts of text. For shorter messages, performance is the sum of key setup and encryption. For very short messages, the key setup time can overwhelm the encryption speed.
Table 5.3 gives Twofish's performance on the Pentium Pro (assembly language version), both 128-bit key setup and encryption, for a variety of message lengths. This table assumes the best of our implementations (not including the large-memory implementations) for the particular length of text.
5.1.5 Hash Function Performance
Hash functions built from block ciphers generally require the algorithm to be uniquely keyed for each text block [Pre93, Sch96]. Assuming the chaining variables are the plaintext and ciphertext, and the block to be hashed is the key, Twofish can hash text at a rate of 175 clock cycles per byte on a Pentium, assuming a 128-bit key, and 132 clock cycles per byte on a Pentium Pro/II.
5.1.6 Language, Compiler, and Processor Choice
As with most algorithms, the choices of language and compiler can have a huge impact on performance. It is clear that the Borland C 5.0 compiler chosen as the standard AES reference is not the best optimizing compiler. For example, the Microsoft Visual C++ 4.2 compiler generates Twofish code that is at least 20 percent faster than Borland on a Pentium computer, with both set to optimize for speed (630 clocks per block for Microsoft Visual C++ 4.2 versus 870 clocks per block for Borland C 5.0); on a Pentium Pro/II, the difference between the compilers is not quite as large (e.g., 600 clocks/block vs. 640 clocks/block), but it is still significant. Part of the difference stems from the inability of the Borland C compiler to generate intrinsic rotate instructions, despite documentation claiming that it is possible. This problem alone accounts for nearly half of the speed difference between Borland and Microsoft. The remaining speed difference comes simply from poorer code generation. The Borland C compiler is uniformly slower than Microsoft's compiler. The encryption speed in Microsoft C of 40 Pentium clocks per byte (i.e., 630 clocks/block at 16 bytes/block) is over ten percent faster than the best known DES assembly language implementation on the same platform. However, coding the Twofish algorithm in assembly language achieves speeds of 258 clocks, achieving a very significant speedup over any of the C implementations.
To make matters even more complicated, the assembly language that optimizes performance on a Pentium (or Pentium MMX) is drastically different from the assembly language required to maximize speed on a Pentium Pro or Pentium II, even though the final code size and speed achieved on each platform are almost identical. For example, the Pentium Pro/II CPUs can perform only one memory read per clock cycle, while the Pentium and Pentium MMX can perform two. However, the Pentium Pro/II can perform two ALU operations per clock in addition to memory accesses, while the Pentium can process only a total of two ALU operations or memory accesses per clock. These (and other) significant architectural differences result in the fact that running the optimal Pentium Twofish encryption code on a Pentium Pro results in a slowdown of nearly 2:1-and vice versa! Fortunately, it is relatively simple to detect the CPU type at run-time and select which version of the assembly code to use. Another anomaly is that the key schedule setup time is considerably faster (43 percent) on the Pentium MMX than on the Pentium, not because the key schedule uses any MMX instructions, but simply because of the larger cache size of the MMX chip.
Empirically, there also seems to be some anomalous behavior of the C compilers. In almost all cases, the encryption and decryption routines in C achieved speeds within a few percent of each other. However, there were cases in the table where the two speeds differed by considerably more than ten percent (we used the larger number in the table), which is very odd because the "inner loop" C code used is virtually identical. We also noted several cases where compiler switches that seemed unrelated to performance optimization sometimes caused very large changes in timings.
It should be noted that performance numbers for Pentium II processors are almost identical in all cases to those for a Pentium Pro, which is not surprising since Intel claims they have the same core. The Pentium and Pentium MMX achieve almost identical speeds for encryption and decryption, although, as noted above, the MMX key setup times are faster, due mainly to the larger cache size.
The bottom line is that, when comparing the relative performance of different algorithms, using the same language and compiler for all implementations helps to make the comparison meaningful, but it does not guarantee a valid measure of the relative speeds. We have listed many different software performance metrics across platforms and languages to facilitate speed comparisons between Twofish and other algorithms. Our belief is that, on any given platform (e.g., Pentium Pro), the assembly-language performance numbers are the best numbers to use to gauge absolute performance, since they are unaffected by the vagaries and limitations of the compiler (e.g., inability to produce rotate opcodes). High-level languages (e.g., C, Java) axe also important because of the ease of porting to different platforms, but once an algorithm becomes standardized, it will ultimately be coded in assembly for the most popular platforms.
5.2 Performance on Smart Cards
Twofish is ideally suited for smart cards. It can fit on the smallest smart cards while exhibiting reasonable performance, and can take advantage of more RAM or more powerful processors with increased performance. It can operate efficiently in environments that require rapid key changes, and can be implemented in dedicated hardware in only a few gates.
We implemented Twofish on a 6805 CPU, which is a typical smart card processor, with several different space-time tradeoff options [SKW+99b]. Our results are shown in Table 5.4. The code size includes both encryption and decryption. The block encryption and decryption times are almost identical. If only encryption is required, minor improvements in code size and speed can be obtained. The only key schedule precomputation time required in this implementation is the Reed-Solomon mapping used to generate the S-box key material S from the key M, which requires slightly over 1750 clocks per key. This setup time can be made considerably shorter at the cost of two additional 256-byte ROM tables. It should also be observed that the lack of a second index register on the 6805 has a significant impact on the code size and performance, so a different CPU with multiple index registers (e.g., 6502) might be a better fit for Twofish.
5.2.1 RAM Usage
For any encryption algorithm, memory usage can be divided into two parts: that required to hold the expanded key, and that required as working space to encrypt or decrypt text (including the text block). In applications where a smart card holds a single key for a long period of time, the key can be put into EEPROM or even ROM, greatly reducing RAM requirements. Most applications, however, require the smart card to encrypt using session keys, which change with each transaction. In these situations, the expanded key must be stored in RAM, along with working space to perform the encryption.
Twofish - the 128-bit key version - can be implemented in a smart card in 60 bytes of RAM. This includes the text block, key, and working space. If a slightly expanded key (16 bytes of the key plus another eight bytes of the Reed-Solomon results (S)) can be stored in ROM or EEPROM, then Twofish can be implemented in only 36 bytes of RAM. In either case, there is zero key setup time for the next encryption operation with the same key.
Larger key sizes require more RAM to store the larger keys: 36 bytes for 192-bit keys and 48 bytes for 256-bit keys. If these applications can store key material in ROM or EEPROM, then these key lengths can be implemented on smart cards with only 36 bytes of RAM. All of this RAM can be reused for other purposes between block encryption operations.
For smart cards with larger memory to hold key-dependent data, encryption speed can increase considerably. This is because the round keys can be precomputed as part of the expanded key, requiring a total of 184 bytes of key memory. As shown in Table 5.4, this option nearly halves the encryption time. If the smart card has enough additional memory available to hold 1 Kbyte of precomputed S-box in either RAM, ROM, or EEPROM (for a total of 1208 bytes), performance improves further. Finally, as shown in the final row of Table 5.4, if the entire precomputed S-box plus MDS table can be held in memory (3256 bytes), the speed can again be increased slightly more. It should be noted that some of these "large RAM" implementations save 512 bytes of code space by assuming that certain tables are not required in ROM, with the entire precomputation being performed instead on the host that sets the key in the smart card. If the smart card has to perform its own key expansion the code size will increase. This increase has its own space/time tradeoff options.
This flexibility makes Twofish well-suited for both small and large smartcard processors: Twofish works in the most RAM-poor environments, while at the same time it is able to take advantage of both moderate-RAM cards and large-RAM cards....