GPU development

This page currently contains mostly obsolete content moved from the John the Ripper GPU support page.

Slow and Fast hashes

“Slow” hashes are those that implement multiple iterations of a cryptographic primitive for computation of just one hash. The various modern Unix crypt(3) flavors are an example of these.

“Fast” hashes are those that rely on a single computation (or very few computations) of a cryptographic primitive. NTLM is an example.

Currently gpu sha256 calculations are divided into three following phases:

  • CPU-1) generate candidate passwords
    1. GPU-1) copy candidate passwords from main memory to gpu memory
    2. GPU-2) calculate hash for each candidate passwords
    3. GPU-3) copy candidate hash from gpu memory to main memory
  • CPU-2) compare calculated hashes with pattern

Following created via nVidia Visual Profiler charts shows time consumption during accordingly fast and slow hash computation. These are two variants of SHA256, fast - 1 iteration and slow 5000 iterations.

For example on the same GPU we are able to compute 2'400'000 fast-SHA256 hashes per second and only 2'520 slow-SHA256 hashes per second.

CUDA tips and tricks

Loop unrolling

NVCC offers preprocessor loop unrolling pragma, which should be placed line before loop.
#pragma unroll N
Of course N should be known at compile time. If it is not, but you know that it is always, for example, a multiple of 4, you can use
#pragma unroll 4
with good results (sometimes this is preferable even for larger loops where you know N). Unrolling might cause nvcc to use more registers than we expected, and overall performance drop. To avoid that we need to tell compiler “man you can't use more than X registers!”. It can be acheived by adding
–maxrregcount=X to nvcc parametes. Please keep in mind that setting to restrictive values may cause hard to find computations errors.

Trivial loops will be unrolled without the pragma, by the optimizer. Some complicated loops will just get slower with full unrolling so be sure to test it if you add it.

For tracking API errors “CUDA by Example” authors suggests use following macro-function:

static void HandleError(cudaError_t err,const char *file,int line)
if(err!=cudaSuccess){
printf("%s in %s at line %d\n",cudaGetErrorString(err),file,line);
exit(EXIT_FAILURE);
 }
}
#define HANDLE_ERROR(err) (HandleError(err,__FILE__,__LINE__))

CUDA development environment

OpenCL tips and tricks

Loop unrolling

OpenCL offers preprocessor loop unrolling pragma, which should be placed line before loop.
#pragma unroll N
Of course N should be known at compile time (for static loops you do not even have to specify N). If it is not, but you know that it is always, for example, a multiple of 4, you can use
#pragma unroll 4
with good results. Sometimes this is preferable even for larger loops where you know N.

Trivial loops will be unrolled anyway, by the optimizer. Some complicated loops will just get slower with full unrolling so be sure to test it if you add it.

Various caveats
  • In OpenCL kernels, a “long” is always 64 bits. This is important to know when porting CPU code (actually on intel a long is “arch size” so you should never make any assumptions of its size other than it's at least the size of an int - and what good is that?)
  • All non-kernel functions should be either “static” or “inline” (and as far as I know there is no difference). Not declaring either will result in build warnings (or worse) on some platforms.
  • Any global varaibles must be __constant.
  • void pointers are not allowed in OpenCL kernels. When porting CPU code, change them to whatever suits the function best.
  • For maximum portability, never use 8-bit or 16-bit stores. Use macros that read/modify/write a full 32-bit int instead. For performance you should of course try to do as much as possible using full ints (eg. copy four characters at a time of a key instead of calling an expensive bit-flogging macro four times).
Clang warnings

OpenCL warnings are not displayed by default in JtR, but we still can see them using Clang compiler. Adding the following code to .cl file will reduce amount of false warnings/errors:

#ifdef CLANG 
#define uint	unsigned int
#define uchar	unsigned char
uint get_global_id(int index);
#endif

Now we can compile .cl file using CLANG:

clang -S -emit-llvm -o test.ll -Weverything -DCLANG -x cl pbkdf2_hmac_sha256_unsplit_kernel.cl 

Sample output:

pbkdf2_hmac_sha256_unsplit_kernel.cl:155:16: warning: cast from 'unsigned char *' to 'unsigned int *' increases
    required alignment from 1 to 4 [-Wcast-align]
      uint *buf32 = (uint *) buf;
                    ^~~~~~~~~~~~
pbkdf2_hmac_sha256_unsplit_kernel.cl:247:24: warning: comparison of integers of different signs: 'int' and
    'unsigned int' [-Wsign-compare]
      for (round = 1; round < rounds; round++) {
                      ~~~~~ 
2 warnings generated.

nVidia GT200 architecture details:http://www.anandtech.com/show/2549/1

nVidia CUDA optimizations techniques Advanced CUDA

Benchmarks

Here are sample benchmark results for CUDA patches.

Hardware

  • ID : processor, memory, gpu, operating system
  • C-01: i3 2100, 4GB 1333MHz, GeForce 9800GT, slackware 13.1 32bit
  • C-02: GeForce 460m, 64bit
  • C-03: C2Duo P7350 2GHz,GF 9600m
  • C-04: 9800GTX
  • C-05: GTX 570
  • C-06: GTX 460 1024M

raw-SHA256

Benchmarking: SHA256CUDA [SHA256]… DONE

john-1.7.6-sha256cuda-0.diff
  • C-01 : Raw: 5734K c/s real, 5745K c/s virtual
  • C-03 : Raw: 1795k c/s real, 1795k c/s virtual
  • C-04 : Raw: 4456k c/s real 4412k c/s virtual
  • C-06 : Raw: 10443K c/s real, 10527K c/s virtual

portable phpass

Benchmarking: PHPASSCUDA [PORTABLE-MD5]… DONE

john-1.7.6-phpasscuda-0.diff
  • C-01 : Raw: 27459 c/s real, 27510 c/s virtual
  • C-02 : Raw: 49461 c/s real, 49461 c/s virtual
  • C-05 : Raw: 177075 c/s real, 177075 c/s virtual
  • C-06 : Raw: 102400 c/s real, 102603 c/s virtual

john-1.7.6-phpasscuda-1.diff
  • C-06 : Raw: 187200 c/s real, 187560 c/s virtual
john-1.7.8-phpasscuda-2.diff
  • C-06 : Raw: 237804 c/s real, 237804 c/s virtual
john-1.7.8-phpasscuda-3.diff
  • C-06 : Raw: 292129 c/s real, 292129 c/s virtual

john-1.7.8-phpassopencl-0.diff
  • C-06 : Raw: 217396 c/s real, 218265 c/s virtual

SHA256-based crypt

john-1.7.7-cryptsha256cuda-0.diff
  • C-01 : Raw: 1518 c/s real, 1518 c/s virtual [14 blocks, 192 threads]
  • C-06 : Raw: 4553 c/s real, 4553 c/s virtual [30 blocks, 256 threads]
john-1.7.7-cryptsha256cuda-1.diff
  • C-06 : Raw: 6813 c/s real, 6813 c/s virtual [14 blocks, 256 threads]
john-1.7.7-cryptsha256cuda-2.diff
  • C-06 : Raw: 7798 c/s real, 7798 c/s virtual [14 blocks, 320 threads]

john-1.7.8-cryptsha256opencl-0.diff
  • C-06 : Raw: 5472 c/s real, 5472 c/s virtual [14 blocks, 320 threads]

SHA512-based crypt

john-1.7.7-cryptsha512cuda-0.diff
  • C-06 : Raw: 4794 c/s real, 7798 c/s virtual [14 blocks, 512 threads]

MD5-based crypt

john-1.7.7-cryptmd5cuda-0.diff
  • C-06 : Raw: 122328 c/s real, 122572 c/s virtual [28 blocks, 512 threads]
john-1.7.7-cryptmd5cuda-1.diff
  • C-06 : Raw: 143360 c/s real, 143360 c/s virtual [28 blocks, 512 threads]
john-1.7.7-cryptmd5cuda-2.diff
  • C-06 : Raw: 351260 c/s real, 355509 c/s virtual [28 blocks, 384 threads]
john-1.7.8-cryptmd5cuda-3.diff
  • C-06 : Raw: 391372 c/s real, 394529 c/s virtual [28 blocks, 384 threads]

MSCash

john-1.7.8-mscashcuda-0.diff
  • C-06 : Raw: 20671K c/s real, 20671K c/s virtual [21 blocks, 512 threads]

MSCash2

CUDA

john-1.7.8-mscash2cuda-0.diff
  • C-06 : Raw: 8286 c/s real, 8302 c/s virtual [21 blocks, 256 threads]

OpenCL

OpenCL version of MSCash2 supports cracking on multiple GPUs. For more details read the 'common_opencl_pbkdf2.h'.

      Benchmarks: AMD Radeon HD 7970: 92.5K c/s real 
                  ATI Radeon HD 4890: 19.9K c/s real
                  Nvidia GTX 570    : 26.7K c/s real
                  AMD Radeon HD 7970 + Nvidia GTX 570: 113.5K c/s real

Thanks

Special thanks for:

  • Solar Designer, Alexandru Tudorica, Bogdan Calin

for submitting their results.

Maintenance

Situation of the source code

Maintainer Updated (1) Fast (2) Pass TS (3)
ssha-opencl Samuele? No
raw-md4-opencl Dhiru Yes No Yes
raw-md5-opencl Dhiru Yes No Yes
nt-opencl Samuele? No
raw-sha1-opencl Samuele? No
md5crypt-opencl Lukas Yes (a) No Yes
phpass-opencl Lukas
mysql-sha1-opencl Samuele? No
sha256crypt-opencl Claudio Yes No Yes
sha512crypt-opencl Claudio Yes No Yes
mscash2-opencl Sayantan
wpapsk-opencl Lukas/magnum Yes Yes
keychain-opencl Dhiru
agilekeychain-opencl Dhiru
strip-opencl Dhiru
zip-opencl Dhiru
encfs-opencl Dhiru
odf-opencl Dhiru
sxc-opencl Dhiru
gpg-opencl Dhiru
dmg-opencl Dhiru
xsha512-opencl Myrice No
xsha512-ng-opencl Claudio Yes No Yes
raw-sha512-opencl Myrice No
raw-sha512-ng-opencl Claudio Yes No Yes
raw-sha256-opencl Claudio Yes No Yes
bf-opencl Sayantan No
pwsafe-opencl Dhiru Yes (a) Yes
des-opencl Sayantan No
office2007-opencl magnum Yes Yes
office2010-opencl magnum Yes Yes
office2013-opencl magnum Yes Yes
ntlmv2-opencl magnum No
krb5pa-sha1-opencl magnum Yes
rar-opencl magnum Yes No

Notes

(1) Source code follows the latest guidelines.
(2) Fast (or on par) compared to other tools.
(3) Succed on tests like this:

  • ./jtrts.pl -noprelims -stoponerror -type x-opencl
  • ./jtrts.pl -noprelims -stoponerror -type x-opencl -passthru ”-pla:1 -de:1”
  • ./jtrts.pl -noprelims -stoponerror -type x-opencl -passthru ”-pla:1”

(a) No splitted kernel

Guidelines

All formas need to be tested on CPU, NVIDIA and AMD (see (3) above).

  • Must have
    • Honor LWS and GWS environment variables (overriding john.conf if applicable)
    • Auto-tuning (if this is slow, it should allow use of john.conf for saving default values - LWS_CONFIG, GWS_CONFIG)
    • Adjust min_keys_per_crypt to local_work_size.
    • Adjust global worksize to the count argument inside crypt_all().
    • Implement a proper done method.
    • Quick-response to “events” (like key presses). (slow hashes)
    • Split kernel execution. (slow hashes)
  • Important
    • Adjust max duration for GPU and CPU accordingly. One-two seconds is maximum for not hitting watchdogs.
    • Auto-tuning by doing less iterations (slow hashes)
    • Treat the new_keys condition (salted).
    • Auto-tuning with detailed information (something like):
      . pass xfer: 10.01 ms, crypt: 3.46 ms, result xfer: 1.84 ms
      . gws:  16384      7068 c/s  35341675 rounds/s   2.318 sec per crypt_all()

Testing

At this time, this section contains only a few formats (more to come).

Acceptance Tests

./jtrts.pl -noprelims -stoponerror -type raw-sha256-opencl raw-sha512-ng-opencl sha256crypt-opencl sha512crypt-opencl xsha512-ng-opencl
./jtrts.pl -noprelims -stoponerror -type raw-sha256-opencl raw-sha512-ng-opencl sha256crypt-opencl sha512crypt-opencl xsha512-ng-opencl -passthru "-pla:1 -de:1"
./jtrts.pl -noprelims -stoponerror -type raw-sha256-opencl raw-sha512-ng-opencl sha256crypt-opencl sha512crypt-opencl xsha512-ng-opencl -passthru "-pla:1"

for i in `../run/john -inc -stdout | head -10000 | shuf | head -200`; do echo $i | mkpasswd -m sha-512 -P 0 -R 1233 ; done > ~/testhashes
../run/john -fo:sha512crypt-opencl ~/testhashes --incremental

for i in `../run/john -inc -stdout | head -10000 | shuf | head -200`; do echo $i | mkpasswd -m sha-256 -P 0 -R 1233 ; done > ~/testhashes
../run/john -fo:sha256crypt-opencl ~/testhashes --incremental

for i in `../run/john -inc -stdout | head -10000 | shuf | head -200`; do echo $i | mkpasswd -m sha-512 -P 0 -R 1233 ; done > ~/testhashes
../run/john -fo:sha512crypt-opencl ~/testhashes --incremental -pla:1

for i in `../run/john -inc -stdout | head -10000 | shuf | head -200`; do echo $i | mkpasswd -m sha-256 -P 0 -R 1233 ; done > ~/testhashes
../run/john -fo:sha256crypt-opencl ~/testhashes --incremental -pla:1

for i in `../run/john -inc -stdout | head -10000 | shuf | head -200`; do echo $i | mkpasswd -m sha-512 -P 0 ; done > ~/testhashes
../run/john -fo:sha512crypt-opencl ~/testhashes --incremental -pla:1

for i in `../run/john -inc -stdout | head -10000 | shuf | head -200`; do echo $i | mkpasswd -m sha-256 -P 0 ; done > ~/testhashes
../run/john -fo:sha256crypt-opencl ~/testhashes --incremental -pla:1
john/development/GPU.txt · Last modified: 2013/07/13 23:26 by ukasz
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported
Recent changes RSS feed Donate to DokuWiki Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki Powered by OpenVZ Powered by Openwall GNU/*/Linux Bookmark and Share