Low-level GPU programming

This is content moved from our archived GSoC ideas page:

This project was worked on under GSoC 2013, but there's more to do on it.

Starting in 2011, we've made considerable progress on adding GPU support to John the Ripper, via CUDA and OpenCL. In the process, we've also identified limitations of these high-level approaches.

For example, for DES-based crypt(3) hashes, there's a substantial performance improvement from specializing the code to a given salt value. While we can specialize OpenCL source code and build per-salt OpenCL kernels at runtime, this takes tens of minutes for the 4096 salt values. This delays program startup, or at least the time until the program gets to running at full speed.

For another example, for bcrypt hashes we (and two other projects) have achieved only CPU-like performance on current high-end GPUs. While there's a good explanation for that (not enough local memory to fully use the SIMD units and to hide the latencies), we're not entirely convinced that nothing better can be done by programming AMD GCN GPUs (such as the HD 7970) at a level below OpenCL - that is, at the AMD IL and/or AMD GCN ISA level. For example, to what extent is the limitation of 256 VGPRs per work-item inherent to GCN? Can we bypass it with a non-standard programming model (e.g. have a work-item access what would normally be another work-item's VGPRs)? (Apparently not, or at least not easily.) Since the combined size of VGPRs per CU is 4x larger than the size of local memory per CU, and there's support for indexed access to VGPRs, this may let us run more concurrent instances of bcrypt (up to 5x more?) and thereby achieve greater performance.
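The "4x larger" and "up to 5x" figures can be checked with a rough capacity estimate. The sketch below is only a back-of-the-envelope calculation: it assumes 64 KiB of local memory per CU (so 256 KiB of VGPRs at the 4x ratio), and it counts only bcrypt's four 256-entry 32-bit S-boxes, ignoring the P-array and any registers needed for other purposes. The names used are made up for this illustration, and the results are optimistic upper bounds, not achievable occupancy.

```python
# Back-of-the-envelope capacity estimate for bcrypt instances per GCN CU.
# All sizes below are assumptions for illustration, not measurements.

KIB = 1024

LDS_PER_CU = 64 * KIB           # assumed local memory (LDS) per CU
VGPR_PER_CU = 4 * LDS_PER_CU    # "4x larger" than local memory, i.e. 256 KiB

# bcrypt's Blowfish state is dominated by four S-boxes of 256 32-bit words.
SBOX_BYTES = 4 * 256 * 4        # 4096 bytes per bcrypt instance


def max_instances(storage_bytes, per_instance_bytes=SBOX_BYTES):
    """Upper bound on instances whose S-boxes fit in the given storage."""
    return storage_bytes // per_instance_bytes


# S-boxes kept in local memory only (the conventional OpenCL approach).
lds_only = max_instances(LDS_PER_CU)

# S-boxes spread over local memory and VGPRs (the speculative approach).
lds_plus_vgprs = max_instances(LDS_PER_CU + VGPR_PER_CU)

if __name__ == "__main__":
    print("instances with LDS only:      ", lds_only)
    print("instances with LDS plus VGPRs:", lds_plus_vgprs)
    print("ratio:                        ", lds_plus_vgprs / lds_only)
```

Under these assumptions, using the VGPRs in addition to local memory gives five times as many concurrent instances as local memory alone, which is where an estimate of "up to 5x" would come from.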

A sub-task here is to explore ways to write lower-level GPU code, possibly with specific focus on AMD GCN and/or NVIDIA Maxwell, and also to analyze OpenCL-generated code at a low level to identify its shortcomings. We may also produce custom development tools, for example to allow for runtime code specialization (e.g. updating binary kernels implementing DES-based crypt(3) for specific salt values, which may be done a lot quicker than building OpenCL kernels from source). Another sub-task is to make use of the gained knowledge and the created tools to make John the Ripper run faster.
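To illustrate what per-salt specialization means for DES-based crypt(3), here is a minimal sketch. In traditional crypt(3), the 12-bit salt perturbs the DES expansion E: for each set salt bit i, outputs i and i+24 of E are swapped. With the salt known in advance, these indices become constants that can be folded into the code. The function names, the mapping of salt bits to positions, and the emitted statement format (arrays `x`, `r`, `key`) are assumptions made for this example; this is not John the Ripper's actual code generator, and it produces source-like text rather than patching a binary kernel.

```python
# Standard DES expansion table E (1-based source bit numbers, 48 entries).
DES_E = [
    32,  1,  2,  3,  4,  5,
     4,  5,  6,  7,  8,  9,
     8,  9, 10, 11, 12, 13,
    12, 13, 14, 15, 16, 17,
    16, 17, 18, 19, 20, 21,
    20, 21, 22, 23, 24, 25,
    24, 25, 26, 27, 28, 29,
    28, 29, 30, 31, 32,  1,
]


def specialize_e(salt):
    """Return the E table for one 12-bit salt value.

    For every set salt bit i, entries i and i+24 are swapped. Which salt
    bit maps to which i depends on how the salt was decoded (assumption:
    bit 0 of the integer corresponds to i = 0).
    """
    if not 0 <= salt < 4096:
        raise ValueError("traditional crypt(3) salts are 12 bits")
    e = list(DES_E)
    for i in range(12):
        if (salt >> i) & 1:
            e[i], e[i + 24] = e[i + 24], e[i]
    return e


def emit_specialized_lines(salt):
    """Emit hypothetical kernel statements with salt-dependent indices
    folded into constants (x, r and key are made-up array names)."""
    return [
        "x[%d] = r[%d] ^ key[%d];" % (k, src - 1, k)
        for k, src in enumerate(specialize_e(salt))
    ]


if __name__ == "__main__":
    for line in emit_specialized_lines(1)[:3]:
        print(line)
```

Each of the 4096 salt values yields a different table, which is why building a separate kernel per salt from source is slow. A tool that patches only the affected operands in an already built kernel would avoid repeating the full compilation.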

Other relevant pages on this wiki:

Existing GPU assembler projects, for AMD GPUs:

for NVIDIA GPUs:

for Intel GPUs:

Other external resources: