Table of Contents

Optimizing bitslice DES S-box expressions

The algorithm consists of two major parts:

Represent the truth table of a logical function of 5 inputs as a 32-bit value x. Let complexity of x, which we'll denote with C(x), be the minimum (or near-minimum) number of logical gates for a particular input. FIXME What is meant by “for a particular input”? Need to clarify. OK :?:

Standard inputs are: 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF. For these, C(input) is 0.

For example, complexity of 0x0F00F0FF is 2 for the SSE instruction set and standart inputs ( C(0x0F00F0FF)==2 ):

0x0F000F00 = 0x0F0F0F0F & ~0x00FF00FF;
0x0F00F0FF = x0F000F00 ^ 0x0000FFFF;

If We fix 0x0F00F0FF as result, We will get particular inputs: 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF, 0x0F000F00, 0x0F00F0FF

For these inputs C(0x0F000F00)==0 and C(0x0F00F0FF)==0.

In order to get all logical functions of complexity n, the algorithm must try:

For example, for n==4:

Also, algorithm should filter out the logical functions, whose complexity is known and less than n.

For filtering is a very suitable array of bytes with size 4294967296 (2^32), lev[] in sbox32.cpp.

Functions with small complexity (for example from 0 to 3) can be placed in separate arrays of 32-bit words for speed-up, a[] in sbox32.cpp

Target functions are marked by msb (most significant bit) of lev[x].

Search time of breadth-first search does not depend on count of search arguments (target functions). Search time depend on minimal complexity of arguments. If argument too complex, the search can take from several hours to several days on Intel (R) Core i7 CPU. If We want to complete the search in a reasonable time, We should restrict maximal complexity manually for non-first searches.

After successful finding some targets, result write to log, and 1-3 functions (dependency) on each target also marked as targets. And iterations go downto level 1. Double iterations (up and down) does not significant slow down search process: hi-level and most time-consumed iteration executed only once.

Statistics for x86 instruction set:

Statistics for SSE instruction set:

Statistics for GPU instruction set (and, or, xor, not, selb):

On each step We can find from 0 to N sets of functions. 0 - if level limit reached. Each set consists of one target function and C(X)-1 intermediate functions. sbox32 write this sets to files named result.level. These results can be used by following runs of sbox32.

For i=0 to N-1 do begin
 If all targets found then generate s-boxes
 else begin
  Fix particular inputs as previous inputs+set(i)
  Exclude found function from targets (in sbox32.cpp used 32-bit mask)
  Go deep
 end if
end for

Relevant files

Generated S4 expressions using only and/or/xor/not/and-not "gates", code generation and instruction scheduling programs for x86-64 16-register SSE2 and MMX (directly reusable for plain 8-register SSE2), and hand-tuned code

FIXME This should probably not be on the same wiki page with the algorithm description (perhaps the latter will be moved to a sub-page?)