Differences

This shows you the differences between two versions of the page.

--- ideas [2015/03/03 08:17]
solar link to GSoC 2015 page for Openwall
+++ ideas [2015/03/27 03:35] (current)
solar ZTEX board is now available to students in the US
@@ Line 7: / Line 7: @@
 ===== Tasks considered for 2015 =====
-The (sub-)headings below serve as "student roles" for the purposes of our [[apply|application template]].
+The main headings below (such as "John the Ripper SIMD support enhancements") serve as "student roles" for the purposes of our [[apply|application template]] for GSoC.  When applying outside of GSoC, such as for our Summer of Security, the sub-headings (such as "John the Ripper SIMD support enhancements: AVX2 sub-task") may also be used as "student roles" (those sub-tasks are deemed too small for GSoC, but are OK for Summer of Security).
 ==== John the Ripper jumbo robustness improvements ====
@@ Line 15: / Line 15: @@
 It is very easy for new code to get into jumbo, or we wouldn't have received and integrated support for so many "formats" (hundreds).  Unfortunately, this also means that many of the formats supported in jumbo are not robust when faced with invalid input.
-The task here is to identify specific issues (how jumbo may misbehave on specific invalid inputs - e.g., dereference a NULL pointer or incur an integer overflow or simply load an impossible hash encoding for cracking instead of rejecting it right away) and to patch those.  To find the issues, you'll be expected to use a combination of fuzzing (such as with [[http://lcamtuf.coredump.cx/afl/|afl]] and with simple custom John the Ripper specific fuzzers that you'll need to write) and source code reviews (both manual reviews and automated static analysis).  And you'll be expected to patch all of the issues yourself, not merely communicate them to jumbo developers.
+The task here is to identify specific issues (how jumbo may misbehave on specific invalid inputs - e.g., dereference a NULL pointer or incur an integer overflow or simply load an impossible hash encoding for cracking instead of rejecting it right away) and to patch those.  To find the issues, you'll be expected to use a combination of fuzzing (such as with [[http://lcamtuf.coredump.cx/afl/|afl]] and with simple custom John the Ripper specific fuzzers e.g. [[http://www.openwall.com/lists/john-dev/2015/03/07/10|like]] [[http://www.openwall.com/lists/john-dev/2015/03/07/15|these]] that you'll need to write) and source code reviews (both manual reviews and automated static analysis).  And you'll be expected to patch all of the issues yourself, not merely communicate them to jumbo developers.
 Unfortunately, due to sheer size of would-be attack surface, we don't expect this project to result in a confidence level sufficient for us to declare jumbo being ready to process untrusted input in the security sense, but we do expect the project, if successful, to result in a much more robust jumbo, where almost all of the formats will just happen to be robust in practice.
@@ Line 23: / Line 23: @@
 After completion of the initial project, we'll expect you to stay on the team as "jumbo robustness officer".
-==== John the Ripper support for Xeon Phi ====
+==== John the Ripper SIMD support enhancements ====
-Difficulty: low to medium
+Difficulty: medium
-We're interested in addition of Xeon Phi's MIC intrinsics to existing SIMD-enabled code in John the Ripper.  This is not so much to achieve great speeds on Xeon Phi itself (the speeds are expected to be on par with GPUs, which are generally cheaper), but is mostly to prepare for AVX-512 in near-future CPUs, given that MIC and AVX-512 are largely compatible at source code intrinsics level (even though the resulting opcodes are different).
+So far, John the Ripper -jumbo uses SIMD intrinsics for up to Intel AVX and AMD XOP, and for bitslice DES also ARM NEON and PowerPC AltiVec.  It also contains some x86 and x86-64 assembly code that uses SSE2 instructions, and x86(-64) CPU detection routines capable of checking for up to AVX and XOP support.
-This task also involves getting John the Ripper -jumbo built for Xeon Phi.  So far, we only built for Xeon Phi the core, non-jumbo John the Ripper tree, which is less demanding in terms of library dependencies, and we added MIC intrinsics for John the Ripper's bitslice DES code.
+We need to add AVX2 and MIC/AVX-512 intrinsics, and then detection of the CPU's AVX2 and AVX-512 support as appropriate.  We also need to optimize the code in other ways (interleaving and maybe bitslicing).  We wouldn't mind the addition of ARM NEON, PowerPC AltiVec, and/or other non-x86 SIMD intrinsics to more JtR "formats" as well (beyond bitslice DES).
+=== AVX2 ===
+We need to add optional use of 256-bit AVX2 intrinsics to JtR -jumbo formats that now use up to AVX/XOP intrinsics, and we need to add AVX2 support detection.
+AVX2 might or might not run faster on current Haswell CPUs, but we need it for future CPUs anyway.  We already have 256-bit AVX (the "floating-point" one) for bitslice DES, but it's no faster than 128-bit on current CPUs.  It's currently only enabled in 32-bit builds, as it might be partially compensating for having only 8 registers there (the performance differences between 128-bit and 256-bit AVX are very small, though).  AVX2 might help for MD4/MD5/SHA-1/SHA-2.  In [[http://www.openwall.com/php_mt_seed/|php_mt_seed]], AVX2 actually helps a lot (as tested on our i7-4770K).
+We're able to provide remote access to a Haswell machine to test this on, if the student doesn't have one of their own.
+=== MIC (Xeon Phi), AVX-512 ===
+Similarly, we need to add Xeon Phi's 512-bit MIC intrinsics to existing SIMD-enabled code in John the Ripper.  This is not so much to achieve great speeds on Xeon Phi itself (the speeds are expected to be on par with GPUs, which are generally cheaper), but is mostly to prepare for AVX-512 in near-future CPUs, given that MIC and AVX-512 are largely compatible at source code intrinsics level (even though the resulting opcodes are different).
+This task also involves getting John the Ripper -jumbo built for Xeon Phi.  So far, we only built for Xeon Phi the core, non-jumbo John the Ripper tree, which is less demanding in terms of library dependencies, and we added MIC intrinsics for John the Ripper's bitslice DES code.  (See [[http://www.openwall.com/lists/john-dev/2015/03/05/2|this posting]] and [[http://www.openwall.com/lists/john-dev/2015/03/11/1|another one]] on the limited achievements so far.)
 We're able to provide remote access to [[:HPC/Village|a machine with Xeon Phi]] for this project.
-==== Better SIMD implementations of SHA-2 hashes and higher-level schemes ====
+=== non-x86 SIMD intrinsics ===
-Difficulty: medium
+We'd accept the addition of SIMD intrinsics for popular non-x86 architectures such as ARM and PowerPC to more JtR formats.  Right now, we only have those for bitslice DES, but their addition for MD4/MD5/SHA-1/SHA-2 should be straightforward.
-Thanks to a contribution by Jeremi Gosney, we have straightforward SIMD implementations of SHA-256 and SHA-512 as John the Ripper formats.  However, these might not fully hide modern and future CPUs' instruction latencies and/or fully exploit their instruction-level parallelism (beyond SIMD).  Thus, interleaving of instructions (or C source code SIMD intrinsics) needs to be added and optimal interleave factors determined for current CPUs.
+We're able to provide remote access to an ARM development board with a NEON-capable CPU.  For Power architecture, [[https://gcc.gnu.org/wiki/CompileFarm|GCC Compile Farm]] or a [[https://www.runabove.com/instances/ibm-power8.xml|cloud hosting provider]] may be used.  (Some other options are the older PowerPC-based Macs, Rosetta on x86-based Macs with OS X up to 10.6, and [[https://people.debian.org/~aurel32/qemu/powerpc/|QEMU]].  Obviously, the options involving emulation are less suitable for code optimization.)
-Additionally, although this is unlikely on current CPUs and GPUs, bitslice implementations //might// turn out to deliver better performance on some CPUs or/and GPUs (and even if not yet, which is likely, they may be valuable for relatively quickly re-testing this on future hardware).  A possible source for speedup is the bit rotate operations, of which there are many in SHA-2, and which become no-ops with a bitslice implementation (on the other hand, many intermediary values will have to be in L1 cache or local memory rather than in registers, which may slow things down to a larger degree).  We've previously demonstrated how it is possible to create a bitslice implementation of MD5 ([[http://www.openwall.com/lists/john-users/2007/11/12/3|original]], [[http://www.openwall.com/lists/john-users/2007/11/12/5|revised]]), and SHA-2 are similar in design.
+This sub-task is considered optional, but it's also a straightforward one.
-One sub-task here is to try and optimize the straightforward SIMD implementations further.  Another sub-task is to create bitslice implementations and to compare their performance against the straightforward implementations on actual hardware.  And a third sub-task (the most important one in terms of its immediate practical benefit) is to migrate JtR's implementations of higher-level password hashing schemes, etc. that are based (in part) on SHA-256 and SHA-512 (and currently use non-SIMD implementations of these) to one or both kinds of the SIMD implementations.  This applies to SHA-crypt (password hashes supported in glibc 2.7+ and in use by certain modern Linux distributions and by DragonFly BSD), Mac OS X 10.7+ salted SHA-512 password hashes, Drupal 7 password hashes, and more.
+=== Better SIMD implementations of SHA-2 hashes and higher-level schemes ===
+We currently have straightforward SIMD implementations of SHA-256 and SHA-512 (computing, respectively, 4 or 2 of these hashes per 128-bit SIMD vector).  However, these implementations might not fully hide modern and future CPUs' instruction latencies and/or fully exploit their instruction-level parallelism (beyond SIMD).  Thus, interleaving of instructions (or C source code SIMD intrinsics) needs to be added and optimal interleave factors determined for current CPUs.
+Additionally, although this is unlikely on current CPUs and GPUs, [[wp>Bit_slicing#Modern_use|bitslice]] implementations //might// turn out to deliver better performance on some CPUs or/and GPUs (and even if not yet, which is likely, they may be valuable for relatively quickly re-testing this on future hardware).  A possible source for speedup is the bit rotate operations, of which there are many in SHA-2, and which become no-ops with a bitslice implementation (on the other hand, many intermediary values will have to be in L1 cache or local memory rather than in registers, which may slow things down to a larger degree).  We've previously demonstrated how it is possible to create a bitslice implementation of MD5 ([[http://www.openwall.com/lists/john-users/2007/11/12/3|original]], [[http://www.openwall.com/lists/john-users/2007/11/12/5|revised]], [[http://www.openwall.com/lists/john-dev/2015/03/11/10|latest]]), and SHA-2 are similar in design.
+One sub-task here is to try and optimize the straightforward SIMD implementations further.  Another sub-task is to create bitslice implementations and to compare their performance against the straightforward implementations on actual hardware.  And a third sub-task (the most important one in terms of its immediate practical benefit) is to ensure JtR's implementations of higher-level password hashing schemes, etc. that are based (in part) on SHA-256 and SHA-512 will use one of the optimized SIMD implementations.  This applies to SHA-crypt (password hashes supported in glibc 2.7+ and in use by certain modern Linux distributions and by DragonFly BSD), Mac OS X 10.7+ salted SHA-512 password hashes, Drupal 7 password hashes, and more.  As of this writing, some of the high-level schemes use the existing (not yet interleaved) SIMD implementations of SHA-256 and SHA-512, and some don't even use the existing SIMD implementations yet (use the older non-SIMD implementations or even calls into OpenSSL libcrypto).  When this project is complete, all of them should use the most optimal SIMD implementations.
+The bitslicing sub-task is beyond the "medium" difficulty we have stated for this GSoC project idea, but it is considered a stretch goal.  We won't consider a GSoC project failed if the student does not succeed solely at the bitslicing sub-task while having completed everything else.
 ==== John the Ripper support for PHC finalists ====
@@ Line 47: / Line 69: @@
 Difficulty: medium to high
-We need to decide on (unofficial) crypt(3) encoding syntax for [[https://password-hashing.net/candidates.html|PHC finalists]] and optionally some non-finalists, and implement them in John the Ripper.
+We need to decide on (unofficial) crypt(3) encoding syntax for [[https://password-hashing.net/candidates.html|Password Hashing Competition (PHC) finalists]] (Argon, battcrypt, Catena, Lyra2, Makwa, Parallel, POMELO, Pufferfish, yescrypt) and optionally some non-finalists, and implement them in John the Ripper.
 At first, this should be done in straightforward manner using the readily available implementations, preferably adding OpenMP parallelization around calls into the existing implementations (except for those determined not to be MT-safe, if any).
@@ Line 61: / Line 83: @@
 Difficulty: medium
-Additional implementations of [[http://www.openwall.com/presentations/PHDays2014-Yescrypt/|yescrypt]], our PHC finalist password hashing scheme, should be produced in/for programming languages other than C (for which we already have implementations).  For each language, two implementations may be produced: simpler and optimized, unless the difference between such two implementations in a given language in terms of simplicy and performance would turn out (or be reasonably expected) to be small.  Additionally, where relevant two different optimized implementations may be produced instead of one.  For example, in C we have 3 implementations total: reference (mostly unoptimized, but simpler), optimized non-SIMD, and optimized SIMD - and a similar separation could make sense for other languages that provide explicit SIMD support (these days [[https://hacks.mozilla.org/2014/10/introducing-simd-js/|even JavaScript does]]).  The exact set of languages is up to you - please communicate it to us first and we'll decide whether to proceed.
+Additional implementations of [[http://www.openwall.com/presentations/PHDays2014-Yescrypt/|yescrypt]], our PHC finalist password hashing scheme, should be produced in/for programming languages other than C (for which we already have implementations).  For each language, two implementations may be produced: simpler and optimized, unless the difference between such two implementations in a given language in terms of simplicity and performance would turn out (or be reasonably expected) to be small.  Additionally, where relevant two different optimized implementations may be produced instead of one.  For example, in C we have 3 implementations total: reference (mostly unoptimized, but simpler), optimized non-SIMD, and optimized SIMD - and a similar separation could make sense for other languages that provide explicit SIMD support (these days [[https://hacks.mozilla.org/2014/10/introducing-simd-js/|even JavaScript does]]). We are looking for someone with experience in writing clean, fast, and application developer friendly implementations of cryptographic algorithms in multiple languages, preferably including modern ones such as Go and Rust, as well as older ones such as JavaScript, Java, Python, Perl, PHP. The exact set of languages is up to you - please communicate it to us first and we'll decide whether to proceed.
 Besides targeting additional programming languages, we also welcome simpler/cleaner and/or better optimized implementations in C with and without intrinsics and assembly (for architectures of your choice, to be discussed with us in advance), but in this project we'll view this as an extra.
@@ Line 73: / Line 95: @@
 During and shortly after GSoC 2013, we [[http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/|successfully implemented bcrypt]] on [[http://www.parallella.org|Parallella]] board's Epiphany coprocessor (both 16- and 64-core versions), as well as on Xilinx Zynq 7020 and 7045 FPGAs.
-We're interested in adding support for more hashes and ciphers on more FPGA boards, with one of the current targets being bcrypt on the quad Spartan 6 LX150 [[http://www.ztex.de/usb-fpga-1/usb-fpga-1.15y.e.html|ZTEX 1.15y]], which we're able to provide (at least) remote access to.  (We may also provide the boards themselves to students in Moscow, Russia and Zagreb, Croatia.)
+We're interested in adding support for more hashes and ciphers on more FPGA boards, with one of the current targets being bcrypt on the quad Spartan 6 LX150 [[http://www.ztex.de/usb-fpga-1/usb-fpga-1.15y.e.html|ZTEX 1.15y]], which we're able to provide (at least) remote access to.  (We may also provide the boards themselves to students in the US, in Moscow, Russia, and in Zagreb, Croatia.)
 ==== descrypt on ZTEX 1.15y FPGA boards ====
@@ Line 83: / Line 105: @@
 With bcrypt, we expect the candidate passwords to be (mostly or fully) provided by the master system (where John the Ripper is running on a CPU), whereas with descrypt they will have to be mostly generated by FPGAs themselves (but with some control from the master).  These two hash types themselves also need to be implemented totally differently (although free implementations for both already exist, as shown above).
-==== Low-level GPU programming ====
+==== radare ideas ====
-Difficulty: high
+Specifically in GSoC 2015, we intend to [[http://www.openwall.com/lists/announce/2015/03/10/1|act as an umbrella organization]] for [[http://radare.org/r/|radare reverse-engineering framework]], so their [[http://rada.re/gsoc/|ideas]] are included in here by reference.  The expectation is that we'll end up accepting one student for a radare task.
-Starting in 2011, we've made considerable progress on adding GPU support to John the Ripper, via CUDA and OpenCL.  In the process, we've also identified limitations of these high-level approaches.  For example, for DES-based crypt(3) hashes, there's substantial performance improvement from specializing the code to a given salt value.  While we can specialize OpenCL source code and build per-salt OpenCL kernels at runtime, this takes tens of minutes for the 4096 salt values.  This delays program startup or at least the time until the programs gets to running at full speed.  For another example, for bcrypt hashes we (and two other projects) have achieved only CPU-like performance on current high-end GPUs.  While there's good explanation for that (not enough local memory to fully use the SIMD units and to hide the latencies), we're not entirely convinced that nothing better can be done by programming AMD GCN GPUs (such as the HD 7970) at a level below OpenCL - that is, at AMD IL or/and AMD GCN ISA level.  For example, to what extent is the limitation of 256 VGPRs per work-item inherent to GCN?  Can we bypass it with a non-standard programming model (e.g. have a work-item access what would normally be another work-item's VGPRs)?  ([[http://article.gmane.org/gmane.comp.security.phc/2216|Apparently not, or at least not easily.]])  Since the combined size of VGPRs per CU is 4x larger than the size of local memory per CU, yet there's support for indexed access to VGPRs, this may let us run more concurrent instances of bcrypt (up to 5x more?) and thereby achieve greater performance.
+If you're interested in applying for a radare task under Openwall's umbrella for GSoC, please contact us via e-mail at **radare at openwall.com**,
+which reaches both of our projects at once. To contact radare only, such as if you're not eligible for GSoC anyway but want to participate in radare project's own Radare Summer of Code (RSoC), please e-mail them at **rsoc at radare.org**. Finally, to contact Openwall only, such as for a project not involving radare, please see our [[http://openwall.info/wiki/apply|application template page]].
-The task here is to explore ways to write lower-level GPU code, possibly with specific focus on AMD GCN or/and on NVIDIA Maxwell, and also to analyze OpenCL-generated code at a low level to identify its shortcomings.  We may also produce custom development tools, such as to allow for runtime code specialization (e.g. updating binary kernels implementing DES-based crypt(3) for specific salt values, which may be done a lot quicker than building OpenCL kernels from source).
-Other relevant pages on this wiki:
-  * [[john/development/AMD-IL|Development in AMD IL]]
-  * [[john/development/GCN-ISA|Development in AMD GCN ISA]]
-AMD documentation:
-  * [[http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf|AMD GCN ISA]]
-Related third-party projects, for AMD GPUs:
-  * [[http://www.ast.cam.ac.uk/~stg20/amdstream/|Assembler for AMD HD 69xx cards]] (source code available, but no license provided)
-  * [[http://realhet.wordpress.com|Pascal + assembler + IDE for AMD GCN ISA]] (Windows, closed-source)
-    * [[http://x.pgy.hu/~worm/het/hp/|Download link for the above]] as the link currently on the blog above is broken
-    * [[http://devgurus.amd.com/thread/159954|Forum thread where this project was introduced by its author]]
-for NVIDIA GPUs:
-  * [[http://www.openwall.com/lists/john-dev/2012/03/24/13|Usable assembly language for GPUs: a success story]] (published paper, but the qhasm-cudasm tool is not released)
-  * [[http://code.google.com/p/asfermi/|asfermi: An assembler for the NVIDIA Fermi Instruction Set]]
-  * [[https://github.com/laanwj/decuda/wiki|Cubin Utilities (decuda and cudasm)]]
 ==== Own creative and relevant idea ====
 Your own creative and relevant idea - please propose it to us first, then describe it again when you apply.
+In particular, we'll consider ideas involving [[http://www.musl-libc.org|musl]] standard C library (here are [[http://www.openwall.com/lists/musl/2015/02/10/12|some examples]]) and/or [[http://www.openwall.com/Owl/|Openwall GNU/*/Linux]] (or Owl), our Linux distro.
 ===== Licensing =====