Low-level GPU programming

This is content moved from our archived GSoC ideas page:

This project was worked on under GSoC 2013, but there's more to do on it.

Starting in 2011, we've made considerable progress on adding GPU support to John the Ripper, via CUDA and OpenCL. In the process, we've also identified limitations of these high-level approaches.

For example, for DES-based crypt(3) hashes, there's a substantial performance improvement from specializing the code to a given salt value. While we can specialize OpenCL source code and build per-salt OpenCL kernels at runtime, this takes tens of minutes for the 4096 salt values. This delays program startup, or at least the time until the program gets to running at full speed.

For another example, for bcrypt hashes we (and two other projects) have achieved only CPU-like performance on current high-end GPUs. While there's a good explanation for that (not enough local memory to fully use the SIMD units and to hide the latencies), we're not entirely convinced that nothing better can be done by programming AMD GCN GPUs (such as the HD 7970) at a level below OpenCL - that is, at the AMD IL and/or AMD GCN ISA level. For example, to what extent is the limitation of 256 VGPRs per work-item inherent to GCN? Can we bypass it with a non-standard programming model (e.g. have a work-item access what would normally be another work-item's VGPRs)? (Apparently not, or at least not easily.) Since the combined size of VGPRs per CU is 4x the size of local memory per CU, and there's support for indexed access to VGPRs, using VGPRs in addition to local memory may let us run more concurrent instances of bcrypt (up to 5x more?) and thereby achieve greater performance.
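The "4x" and "up to 5x" figures above can be checked with a bit of arithmetic. The sketch below assumes the commonly documented GCN layout of 4 SIMD units per CU with a 64 KB VGPR file each, and 64 KB of local memory (LDS) per CU. It counts only the four bcrypt (Blowfish) S-boxes as per-instance state. The constant and function names are ours, for illustration only. The result is an optimistic upper bound: a real kernel also needs VGPRs for everything else, so the actual gain, if any, would be smaller.

```python
# Rough upper bound on concurrent bcrypt instances per GCN compute unit (CU).
# Assumed hardware figures (not taken from the text above):
VGPR_BYTES_PER_SIMD = 64 * 1024   # VGPR file per SIMD unit
SIMDS_PER_CU = 4                  # SIMD units per CU
LDS_BYTES_PER_CU = 64 * 1024      # local memory (LDS) per CU

# bcrypt keeps four Blowfish S-boxes of 256 32-bit words per instance.
# The 18-word P-array (72 bytes) is ignored here for simplicity.
SBOX_BYTES = 4 * 256 * 4


def instances_per_cu(storage_bytes, state_bytes=SBOX_BYTES):
    """How many S-box sets fit in the given amount of on-chip storage."""
    return storage_bytes // state_bytes


vgpr_bytes_per_cu = VGPR_BYTES_PER_SIMD * SIMDS_PER_CU
lds_only = instances_per_cu(LDS_BYTES_PER_CU)
combined = instances_per_cu(LDS_BYTES_PER_CU + vgpr_bytes_per_cu)

if __name__ == "__main__":
    print("VGPRs vs. local memory per CU:", vgpr_bytes_per_cu // LDS_BYTES_PER_CU, "x")
    print("instances with local memory only:", lds_only)
    print("instances with local memory + VGPRs:", combined)
    print("ratio:", combined // lds_only, "x")
```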

A sub-task here is to explore ways to write lower-level GPU code, possibly with specific focus on AMD GCN and/or on NVIDIA Maxwell, and also to analyze OpenCL-generated code at a low level to identify its shortcomings. We may also produce custom development tools, for example to allow for runtime code specialization (e.g. updating binary kernels implementing DES-based crypt(3) for specific salt values, which may be done a lot quicker than building OpenCL kernels from source). Another sub-task is to make use of the gained knowledge and the created tools to make John the Ripper run faster.
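To make concrete what "specializing for a salt value" means for DES-based crypt(3): the 12-bit salt only changes which data bits feed the S-boxes, by swapping entries of the DES expansion (E) table. The Python sketch below is an illustration only, not code from John the Ripper. The function name `specialize_e` is ours, and the salt bit numbering (bit i swaps E entries i and i+24) follows the traditional crypt(3) description. For a fixed salt the table is a constant, so a specialized kernel can hard-code the bit selection instead of looking it up at runtime. A binary patching tool would, in effect, need to update only the operands derived from this table.

```python
# Standard DES expansion (E) table: 48 selections from a 32-bit half-block.
E = [
    32,  1,  2,  3,  4,  5,
     4,  5,  6,  7,  8,  9,
     8,  9, 10, 11, 12, 13,
    12, 13, 14, 15, 16, 17,
    16, 17, 18, 19, 20, 21,
    20, 21, 22, 23, 24, 25,
    24, 25, 26, 27, 28, 29,
    28, 29, 30, 31, 32,  1,
]


def specialize_e(salt):
    """Return the E table as modified by a 12-bit crypt(3) salt.

    Assumption: salt bit i (0 = least significant) swaps entries i and
    i + 24, as in the traditional crypt(3) algorithm.
    """
    if not 0 <= salt < 4096:
        raise ValueError("salt must fit in 12 bits")
    e = list(E)
    for i in range(12):
        if (salt >> i) & 1:
            e[i], e[i + 24] = e[i + 24], e[i]
    return e


if __name__ == "__main__":
    # Each of the 4096 salts yields a distinct table, hence 4096 kernel variants.
    tables = {tuple(specialize_e(s)) for s in range(4096)}
    print(len(tables), "distinct per-salt E tables")
```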

Other relevant pages on this wiki:

Existing GPU assembler projects, for AMD GPUs:

CLRadeonExtender: Assembler and disassembler for AMD GCN with support for AMD Catalyst and GalliumCompute binary formats (free: GNU LGPL 2.1+, GNU GPL 2+, GNU FDL 1.2)
cmingcnasm: C language MINimal GCN ASseMbler (source code available, GNU AGPLv3)
GCN assembler in C# (Windows, C# source, gratis but not free software)
gcnasm, our GSoC 2013 project to implement an AMD GCN assembler (unfinished yet almost usable, free)
- gcnasm fork with some bugs fixed (usable, free)
- scrypt (Litecoin mining) implemented with gcnasm fork above
Assembler for AMD HD 69xx cards (source code available, but no license provided)
Pascal + assembler + IDE for AMD GCN ISA (Windows, closed-source)
- Download link for the above (the link currently on the blog is broken)
- Forum thread where this project was introduced by its author

for NVIDIA GPUs:

MaxAs: Assembler for NVIDIA Maxwell architecture (old project)
asfermi: Assembler for NVIDIA Fermi architecture
Cubin Utilities (decuda and cudasm)
Usable assembly language for GPUs: a success story (published paper, but the qhasm-cudasm tool is not released)

for Intel GPUs:

Code demonstrating how to load custom ISA on Intel Haswell GPUs via OpenGL (assembler, disassembler, loader for HD Graphics 4400 under Windows 8.1, released under GPLv3)

Other external resources:

Alexander Tarasikov's blog post with links to open source low-level GPU projects