
Questions about Superoptimization

2016-12-30

These questions are about this series of posts:

If you have any thoughts, leave a comment. I'm using Reddit for blog comments.

  1. What's the shortest function that implements the spec? We found one with four instructions, but didn't rule out the possibility of one with three.

  2. In general, how does the minimum solution length relate to the dimensions of the problem (the distribution of 221 values into 21 groups, and 8 bits)? When does the problem become impossible?

This is a combinatorics problem. Fifteen years ago I would have asked on a Usenet math newsgroup.
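
To get a rough sense of scale for these first two questions, here's a back-of-envelope count of candidate expressions by length. It assumes each instruction applies one of three byte-wide operations (&, ^, +) with an arbitrary 8-bit constant; the real search space from the earlier posts allows more operand choices, so treat these counts as lower bounds.

# Crude model: each instruction is one of three byte-wide ops with an
# arbitrary 8-bit constant.  The real search also allows operands like
# `id` itself, so these counts are lower bounds.
OPS = 3
CONSTANTS = 256

def num_candidates(length):
    return (OPS * CONSTANTS) ** length

for k in range(1, 6):
    print('%d instruction(s): %d candidates' % (k, num_candidates(k)))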

A reason not to use this technique is that the numbers will change as the code evolves, and I don't necessarily want to do a new search every time.

  3. Can we optimize the search? I noticed symmetries in the results: for example, 12 functions had the same output distribution as (id & 230) & (id + 13). Most search spaces for superoptimization are so big that they need to use stochastic techniques, but we could probably do some math for this specific problem to prune the search space.

  4. What if the enum labeling were constrained by two or more lookup tables? That is, how would we find two expressions to replace two lookup tables on the same set of identifiers? This will probably happen in Oil, and that's another reason I won't use this solution right away.

  5. Our problem was to map 221 arbitrary IDs to 21 arbitrary IDs, which is an easier problem than mapping 221 specific integers to 21 specific integers. It's so easy that we can solve it with single-threaded Python (the core check is sketched below).

What other kinds of functions can be easily superoptimized?
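
Here's the core check mentioned above, as a minimal sketch in Python, along with the stricter fixed-value check for contrast. It assumes a dict group_of from each of the 221 token IDs to its group (0 through 20); the names and representation are mine, not the code from the earlier posts.

def induces_partition(f, group_of):
    """True if f sends every ID in a group to a single value and distinct
    groups to distinct values.  The output values themselves don't matter,
    which is what makes the free-labeling problem easy."""
    value_of_group = {}  # group -> output value claimed by that group
    group_of_value = {}  # output value -> group that claimed it
    for id_, group in group_of.items():
        v = f(id_) & 0xFF
        if group in value_of_group:
            if value_of_group[group] != v:
                return False  # one group split across two outputs
        elif v in group_of_value:
            return False      # two groups collide on one output
        else:
            value_of_group[group] = v
            group_of_value[v] = group
    return True

def signature(f, group_of):
    """Output tuple over all IDs; candidates with equal signatures are
    interchangeable, so a search can skip repeats."""
    return tuple(f(i) & 0xFF for i in sorted(group_of))

def matches_exactly(f, kind_of):
    """The more constrained version: hit fixed target values."""
    return all(f(i) & 0xFF == k for i, k in kind_of.items())

A brute-force search would test each candidate with induces_partition, and signature shows one way to skip the interchangeable candidates noted above.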

  6. How does the solution compare with perfect hashing? I believe perfect hash functions would be longer because the problem is more constrained. And some algorithms generate hashes that require lookup tables, which defeats the purpose of this optimization. (The Python code in this blog post may be useful.)

  7. Do any compilers do this? They might do other kinds of superoptimization, but this problem appears unique because allowing the enum labels to vary makes the search exponentially easier.

C and C++ specify that enumerators without explicit values are assigned 0 to N-1 in order, but the compiler should be able to detect whether the real value is actually "observed" by any code. Strongly-typed enums in C++11 require an explicit static_cast<int> to be used as an integer.

  8. What other techniques are there for replacing lookup tables with functions? Several Google searches yielded nothing.

  9. Finally, what's the difference between the code generated by GCC and the code generated by Clang? I wanted to verify that we're actually removing a memory access, so I compiled the code with -O3 under both compilers.

Clang produces a sequence of four instructions, as you might expect (xor, add, and, and), but GCC produces a sequence with a movzx instruction in the middle.

I suspect there's no real difference: the movzx is a register-to-register operation, and it happens at the end of the Clang code too. So we have indeed removed a memory access with our algorithm.

This function:

kind = 175 & id & ((id ^ 173) + 11)

compiles to this:

blog-code/id-kind-func$ ./run compare-gcc-clang
...
GCC

0000000000400580 <_Z10LookupKind2Id>:
  400580:       89 f8                   mov    eax,edi
  400582:       81 e7 af 00 00 00       and    edi,0xaf  # 0xaf is 175
  400588:       83 f0 ad                xor    eax,0xffffffad
  40058b:       0f b6 c0                movzx  eax,al
  40058e:       83 c0 0b                add    eax,0xb  # 0xb is 11
  400591:       21 f8                   and    eax,edi
  400593:       c3                      ret
  400594:       66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
  40059b:       00 00 00
  40059e:       66 90                   xchg   ax,ax

...
Clang

00000000004005f0 <_Z10LookupKind2Id>:
  4005f0:       40 88 f8                mov    al,dil
  4005f3:       34 ad                   xor    al,0xad  # 0xad is 173
  4005f5:       04 0b                   add    al,0xb   # 0xb is 11
  4005f7:       40 20 f8                and    al,dil
  4005fa:       24 af                   and    al,0xaf  # 0xaf is 175
  4005fc:       0f b6 c0                movzx  eax,al
  4005ff:       c3                      ret
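
As a sanity check, here's a small Python model of my reading of the two listings (assuming the id argument fits in one byte). Both orderings agree with the source expression on all 256 inputs.

def source(i):
    return 175 & i & ((i ^ 173) + 11)

def gcc_order(i):
    t = ((i ^ 173) & 0xFF) + 11  # xor, movzx, add (the add can carry into bit 8)
    return t & (i & 175)         # the final and masks that carry back off

def clang_order(i):
    t = ((i ^ 173) + 11) & 0xFF  # byte-wide xor and add; the add wraps in al
    return t & i & 175

assert all(source(i) == gcc_order(i) == clang_order(i) for i in range(256))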

Versions:

blog-code/id-kind-func$ ./run.sh show-versions
c++ (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4
...

clang version 3.8.0 (tags/RELEASE_380/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
...

Is there any other difference between these two compilations? I'm not sure what the nop and xchg are for in the GCC listing -- they both look like no-op instructions.

I'm looking forward to any insight you have into these questions.