This is currently known as format “bf-opencl” in John the Ripper 1.7.9-jumbo-6. Bcrypt is not known to be GPU-friendly, primarily because of the large amount of memory each bcrypt hash uses. To make matters worse, the memory access pattern is pseudo-random, which makes it very difficult to cache the data in faster memory. This leaves us with two choices:
1. Use the slow but large global memory and spend more time fetching the operands than processing them.
2. Use the fast but small LDS (64 KB) memory and severely limit the number of concurrent threads.
The LDS implementation is 2x faster than the global memory implementation despite the hardware being underutilized. There is also a third option: make some use of global memory in order to utilize the otherwise idle SIMD units. The LDS implementation could be four times as fast on a 7970 had there been sufficient LDS. It is only a matter of time before GPUs ship with larger LDS, at which point we could fully utilize all the SIMD units.
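To illustrate why the access pattern is hard to cache, here is a minimal Python sketch of the Blowfish round function F, which bcrypt evaluates over and over. The four table indices are simply the bytes of the data being encrypted, so successive lookups land at effectively random offsets within the S-boxes. The S-box contents below are placeholder values, not the real Blowfish constants, and the names `blowfish_f` and `sbox_offsets` are illustrative only; this is not the code used in bf-opencl.

```python
MASK32 = 0xFFFFFFFF

# Placeholder tables: four S-boxes of 256 32-bit words each (4 KB in total).
# Real Blowfish initializes these from the digits of pi; bcrypt then rewrites
# them per hash, which is why every hash needs its own private copy.
S = [[(box * 256 + i) * 0x9E3779B1 & MASK32 for i in range(256)]
     for box in range(4)]


def sbox_offsets(x):
    """Split a 32-bit word into the four byte indices used by F."""
    return (x >> 24) & 0xFF, (x >> 16) & 0xFF, (x >> 8) & 0xFF, x & 0xFF


def blowfish_f(x, s=S):
    """Blowfish round function: ((S0[a] + S1[b]) ^ S2[c]) + S3[d] mod 2^32."""
    a, b, c, d = sbox_offsets(x)
    return ((((s[0][a] + s[1][b]) & MASK32) ^ s[2][c]) + s[3][d]) & MASK32


# Chaining F's output into its next input loosely mimics how the Feistel
# rounds make each set of table indices depend on the previous lookups.
x = 0x01234567
for _ in range(4):
    print(sbox_offsets(x))
    x = blowfish_f(x)
```

Because the next indices are only known once the previous lookups have completed, there is no access pattern for the hardware to prefetch or coalesce, which is what pushes the per-hash tables into LDS.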
On the CPU the situation is different. Each Bulldozer module has 32 KB of L1 cache, which is enough to store four sets of 4 KB S-boxes and four P-boxes. We run two threads per BD module (one on each core), and each thread processes two interleaved hashes to better exploit the internal parallelism of the core. We also do not need any more concurrent hashes than this to fully exploit a BD module.
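The cache budget above can be checked with a few lines of arithmetic. This sketch assumes the standard Blowfish state layout (four S-boxes of 256 32-bit words plus an 18-word P-array) and takes the 32 KB L1 figure from the text. It also assumes two threads per module with two interleaved hashes each, which is the reading that matches the four S-box sets mentioned above. The constant names are illustrative.

```python
SBOX_BYTES = 4 * 256 * 4      # four S-boxes, 256 entries, 4 bytes each = 4096
PBOX_BYTES = 18 * 4           # 18-entry P-array of 32-bit subkeys = 72
L1_BYTES = 32 * 1024          # L1 size per module, as quoted in the text

THREADS_PER_MODULE = 2        # assumption: one thread per core, two cores
HASHES_PER_THREAD = 2         # interleaved hashes per thread

hashes = THREADS_PER_MODULE * HASHES_PER_THREAD
working_set = hashes * (SBOX_BYTES + PBOX_BYTES)

print(hashes, "hashes need", working_set, "bytes of", L1_BYTES)
print("fits in L1:", working_set <= L1_BYTES)
```

Under these assumptions the four concurrent hashes occupy roughly half of the quoted L1 capacity, leaving room for the remaining working data.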
The basic architectural difference between VLIW4 and GCN is that VLIW4 throws more hardware at each SPMD (work-item) than GCN does. Each PE (processing element), which processes one SPMD, has 4 ALUs on VLIW4 vs. 1 ALU on GCN. Consequently, GCN offers more SPMD-level parallelism than VLIW4 for the same number of ALUs (advertised as stream cores): there are 64 PEs per compute unit (CU) on GCN vs. 16 PEs on VLIW4. We also have 64 KB of LDS per CU on GCN vs. 32 KB on VLIW4. At 4 KB of S-boxes per hash, we can run 8 concurrent Blowfish hashes per CU on VLIW4 vs. 16 on GCN. Hardware utilization is therefore 8/16 or 50% on VLIW4 vs. 16/64 or 25% on GCN. Does that make VLIW4 more efficient than GCN? Unfortunately, no. Each Blowfish hash does not have enough internal parallelism to exploit the extra hardware given to each SPMD on VLIW4. Can we increase the internal parallelism like we do on the Bulldozer CPUs? The answer is again no, but this time it is due to the limited LDS.
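The utilization figures quoted above follow directly from the LDS and PE counts. The sketch below reproduces that arithmetic; the `ComputeUnit` class and its method names are hypothetical helpers written for this illustration, and the 4 KB per-hash figure is the S-box size used elsewhere on this page.

```python
from dataclasses import dataclass

STATE_KB = 4  # per-hash S-box storage kept in LDS


@dataclass
class ComputeUnit:
    name: str
    lds_kb: int   # LDS per compute unit, in KB
    pes: int      # processing elements per compute unit

    def concurrent_hashes(self):
        # One hash per work-item; each needs its own 4 KB slice of LDS.
        return self.lds_kb // STATE_KB

    def utilization(self):
        # Fraction of PEs that have a hash to work on.
        return self.concurrent_hashes() / self.pes


for cu in (ComputeUnit("VLIW4", 32, 16), ComputeUnit("GCN", 64, 64)):
    print(f"{cu.name}: {cu.concurrent_hashes()} hashes/CU, "
          f"{cu.utilization():.0%} of PEs busy")
```

Note that the utilization here counts busy PEs only. It says nothing about how many of the 4 ALUs inside each VLIW4 PE are doing useful work, which is the point made in the paragraph above.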
Blowfish could be a better fit for some APUs due to their large L3 cache shared between the CPU and GPU. The current Trinity APU from AMD does not have this feature, but hopefully the next generation of APUs will support it. Intel Ivy Bridge CPUs have this feature enabled, but their HD 4000/2500 GPU is probably not powerful enough for the job. Although L3 is significantly slower than L1 or L2, it is still many times faster than global memory, so this could result in better hardware utilization on APUs.
In 1.7.9-jumbo-6, “bf-opencl” is pre-configured for AMD Radeon HD 7970. For other GPUs:
You may use this expression to calculate the WORK_GROUP_SIZE for different devices:
WORK_GROUP_SIZE = (usable LDS size per WORK_GROUP in KB) / 4
E.g., the usable LDS size per work group is 32 KB on the 7970, although the actual LDS size is 64 KB:
WORK_GROUP_SIZE = 32 / 4 = 8 (for 7970)
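The expression above is easy to wrap in a small helper when trying other devices. This is only a sketch: the function name `work_group_size` and the 16 KB and 48 KB inputs are hypothetical values chosen for illustration, not measured device properties. Only the 32 KB figure for the 7970 comes from the text.

```python
def work_group_size(usable_lds_kb, state_kb=4):
    """WORK_GROUP_SIZE = usable LDS per work group (KB) / 4."""
    if usable_lds_kb < state_kb:
        raise ValueError("not enough LDS for even one bcrypt state")
    return usable_lds_kb // state_kb


print(work_group_size(32))  # HD 7970: 32 KB usable per work group
print(work_group_size(16))  # hypothetical device with 16 KB usable
print(work_group_size(48))  # hypothetical device with 48 KB usable
```

A reasonable starting point for `usable_lds_kb` on an unknown device is probably the local memory size that OpenCL reports for it (`CL_DEVICE_LOCAL_MEM_SIZE`), though that is an assumption rather than something stated here. The benchmark output itself points to `opencl_bf_std.h` for the device-specific settings.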
# CPU is FX-8120 at stock clocks
user@bull:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf
Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... (8xOMP) DONE
Raw: 5300 c/s real, 664 c/s virtual

# HD 7970 at stock clocks (925 MHz core)
user@bull:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
****Please see 'opencl_bf_std.h' for device specific optimizations****
Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
Raw: 4143 c/s real, 238933 c/s virtual

# HD 7970 overclock to match the CPU ;-)
user@bull:~/john-1.7.9-jumbo-6/run$ DISPLAY=:0 aticonfig --od-enable --od-setclocks=1225,1375
AMD Overdrive(TM) enabled
Default Adapter - AMD Radeon HD 7900 Series
New Core Peak : 1225
New Memory Peak : 1375
user@bull:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
****Please see 'opencl_bf_std.h' for device specific optimizations****
Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
Raw: 5471 c/s real, 358400 c/s virtual
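As a quick sanity check on these numbers, the following sketch compares the gain in "real" c/s with the gain in core clock, and the GPU against the CPU. All figures are copied from the benchmark output above; the variable names are illustrative.

```python
stock = {"core_mhz": 925, "cps": 4143}         # HD 7970 at stock clocks
overclocked = {"core_mhz": 1225, "cps": 5471}  # HD 7970 overclocked
cpu_cps = 5300                                 # FX-8120, 8 OpenMP threads

clock_gain = overclocked["core_mhz"] / stock["core_mhz"]
speed_gain = overclocked["cps"] / stock["cps"]

print(f"core clock x{clock_gain:.3f}, throughput x{speed_gain:.3f}")
print(f"GPU vs CPU at stock: {stock['cps'] / cpu_cps:.2f}")
print(f"GPU vs CPU overclocked: {overclocked['cps'] / cpu_cps:.2f}")
```

In these runs the throughput gain tracks the core clock gain closely (both are about 1.32x), and the overclocked GPU ends up only slightly ahead of the CPU.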