Openwall Project ideas

Some of these ideas were initially suggested for Google Summer of Code (GSoC), but they're also relevant to our own Summer of Security program. When applying to us for GSoC or otherwise, please use our application template (which also includes info on how to contact us).

Although we have a lot of ideas, our mentoring capacity is limited and many of the ideas would be incompatible if worked on during the same summer. Thus, in 2015 we intend to work on a subset of these ideas only (as well as on many other things).

Tasks considered for 2015

The (sub-)headings below serve as “student roles” for the purposes of our application template.

John the Ripper jumbo robustness improvements

Difficulty: low to medium

It is very easy for new code to get into jumbo, or we wouldn't have received and integrated support for so many “formats” (hundreds). Unfortunately, this also means that many of the formats supported in jumbo are not robust when faced with invalid input.

The task here is to identify specific issues (ways jumbo may misbehave on specific invalid inputs - e.g., dereference a NULL pointer, incur an integer overflow, or simply load an impossible hash encoding for cracking instead of rejecting it right away) and to patch them. To find the issues, you'll be expected to use a combination of fuzzing (such as with afl and with simple custom John the Ripper-specific fuzzers that you'll need to write) and source code reviews (both manual reviews and automated static analysis). You'll also be expected to patch all of the issues yourself, not merely communicate them to jumbo developers.

Unfortunately, due to the sheer size of the would-be attack surface, we don't expect this project to result in a confidence level sufficient for us to declare jumbo ready to process untrusted input in the security sense. We do expect the project, if successful, to result in a much more robust jumbo, where almost all of the formats will just happen to be robust in practice.

We'd also appreciate code quality improvements that don't currently affect runtime behavior of the program.

After completion of the initial project, we'll expect you to stay on the team as “jumbo robustness officer”.

John the Ripper support for Xeon Phi

Difficulty: low to medium

We're interested in adding Xeon Phi's MIC intrinsics to existing SIMD-enabled code in John the Ripper. This is not so much to achieve great speeds on Xeon Phi itself (the speeds are expected to be on par with GPUs, which are generally cheaper), but mostly to prepare for AVX-512 in near-future CPUs, given that MIC and AVX-512 are largely compatible at the source code intrinsics level (even though the resulting opcodes are different).

This task also involves getting John the Ripper -jumbo built for Xeon Phi. So far, we only built for Xeon Phi the core, non-jumbo John the Ripper tree, which is less demanding in terms of library dependencies, and we added MIC intrinsics for John the Ripper's bitslice DES code.

We're able to provide remote access to a machine with Xeon Phi for this project.

Better SIMD implementations of SHA-2 hashes and higher-level schemes

Difficulty: medium

Thanks to a contribution by Jeremi Gosney, we have straightforward SIMD implementations of SHA-256 and SHA-512 as John the Ripper formats. However, these might not fully hide modern and future CPUs' instruction latencies and/or fully exploit their instruction-level parallelism (beyond SIMD). Thus, interleaving of instructions (or C source code SIMD intrinsics) needs to be added and optimal interleave factors determined for current CPUs.

Additionally, although this is unlikely on current CPUs and GPUs, bitslice implementations might turn out to deliver better performance on some CPUs and/or GPUs (and even if not yet, which is likely, they may be valuable for relatively quickly re-testing this on future hardware). A possible source of speedup is the bit rotate operations, of which there are many in SHA-2 and which become no-ops in a bitslice implementation (on the other hand, many intermediate values will have to live in L1 cache or local memory rather than in registers, which may slow things down to a larger degree). We've previously demonstrated how it is possible to create a bitslice implementation of MD5 (original, revised), and SHA-2 are similar in design.

One sub-task here is to try and optimize the straightforward SIMD implementations further. Another sub-task is to create bitslice implementations and to compare their performance against the straightforward implementations on actual hardware. And a third sub-task (the most important one in terms of its immediate practical benefit) is to migrate JtR's implementations of higher-level password hashing schemes, etc. that are based (in part) on SHA-256 and SHA-512 (and currently use non-SIMD implementations of these) to one or both kinds of the SIMD implementations. This applies to SHA-crypt (password hashes supported in glibc 2.7+ and in use by certain modern Linux distributions and by DragonFly BSD), Mac OS X 10.7+ salted SHA-512 password hashes, Drupal 7 password hashes, and more.

John the Ripper support for PHC finalists

Difficulty: medium to high

We need to decide on (unofficial) crypt(3) encoding syntax for PHC finalists and optionally some non-finalists, and implement them in John the Ripper.

At first, this should be done in a straightforward manner using the readily available implementations, preferably adding OpenMP parallelization around calls into the existing implementations (except for those determined not to be MT-safe, if any).

Then defense- and attack-optimized implementations should be produced. The former should compute individual hashes more efficiently, eliminating any redundancy there might have been in the previously available implementations (e.g., common subexpressions) and using CPUs' multi-issue and SIMD capabilities to the extent possible within the parallelism available during computation of individual hashes. The latter may additionally compute multiple hashes per thread in parallel in order to further increase CPUs' instruction issue rate (through interleaving of instructions corresponding to different hash computations) and/or SIMD vector width usage, beyond what was possible for individual hashes.

Similarly, OpenCL implementations should be created, also computing multiple hashes in parallel and thus intended for attack rather than defense, thereby letting the compiler make similar instruction-level optimizations. These may be benchmarked against the manually-optimized (such as with C and SIMD intrinsics) attack-optimized implementations mentioned above, and more importantly they should then be tuned for specific GPU architectures.

Since these password hashing schemes are (mostly) not in use yet, the intent of this project is not so much to enhance John the Ripper but is mostly to research the attack and defense speed ratios, as well as the speeds themselves, of these hashing schemes on currently available commodity hardware (CPUs and GPUs).

yescrypt implementations

Difficulty: medium

Additional implementations of yescrypt, our PHC finalist password hashing scheme, should be produced in/for programming languages other than C (for which we already have implementations). For each language, two implementations may be produced: simpler and optimized, unless the difference between two such implementations in a given language in terms of simplicity and performance would turn out (or be reasonably expected) to be small. Additionally, where relevant, two different optimized implementations may be produced instead of one. For example, in C we have 3 implementations total: reference (mostly unoptimized, but simpler), optimized non-SIMD, and optimized SIMD - and a similar separation could make sense for other languages that provide explicit SIMD support (these days even JavaScript does). The exact set of languages is up to you - please communicate it to us first and we'll decide whether to proceed.

Besides targeting additional programming languages, we also welcome simpler/cleaner and/or better optimized implementations in C with and without intrinsics and assembly (for architectures of your choice, to be discussed with us in advance), but in this project we'll view this as an extra.

bcrypt on ZTEX 1.15y FPGA boards

Difficulty: high

Some hash and cipher types don't fit GPUs very well.

During and shortly after GSoC 2013, we successfully implemented bcrypt on Parallella board's Epiphany coprocessor (both 16- and 64-core versions), as well as on Xilinx Zynq 7020 and 7045 FPGAs.

We're interested in adding support for more hashes and ciphers on more FPGA boards, with one of the current targets being bcrypt on the quad Spartan 6 LX150 ZTEX 1.15y, which we're able to provide (at least) remote access to. (We may also provide the boards themselves to students in Moscow, Russia and Zagreb, Croatia.)

descrypt on ZTEX 1.15y FPGA boards

Difficulty: high

Similarly to the task above, we'd also like to implement DES-based crypt(3), also known as descrypt, on ZTEX 1.15y in a way that would integrate with John the Ripper. An implementation of descrypt on these boards is already available, but revising it to run against multiple hashes (likely with different salts) and with a smarter ordering of candidate passwords may be challenging.

With bcrypt, we expect the candidate passwords to be (mostly or fully) provided by the master system (where John the Ripper is running on a CPU), whereas with descrypt they will have to be mostly generated by FPGAs themselves (but with some control from the master). These two hash types themselves also need to be implemented totally differently (although free implementations for both already exist, as shown above).

Low-level GPU programming

Difficulty: high

Starting in 2011, we've made considerable progress on adding GPU support to John the Ripper, via CUDA and OpenCL. In the process, we've also identified limitations of these high-level approaches. For example, for DES-based crypt(3) hashes, there's a substantial performance improvement from specializing the code to a given salt value. While we can specialize OpenCL source code and build per-salt OpenCL kernels at runtime, this takes tens of minutes for the 4096 salt values, delaying program startup or at least the time until the program gets to running at full speed.

For another example, for bcrypt hashes we (and two other projects) have achieved only CPU-like performance on current high-end GPUs. While there's a good explanation for that (not enough local memory to fully use the SIMD units and to hide the latencies), we're not entirely convinced that nothing better can be done by programming AMD GCN GPUs (such as the HD 7970) at a level below OpenCL - that is, at AMD IL and/or AMD GCN ISA level. For example, to what extent is the limitation of 256 VGPRs per work-item inherent to GCN? Can we bypass it with a non-standard programming model (e.g., have a work-item access what would normally be another work-item's VGPRs)? (Apparently not, or at least not easily.) Since the combined size of VGPRs per CU is 4x larger than the size of local memory per CU, yet there's support for indexed access to VGPRs, this might let us run more concurrent instances of bcrypt (up to 5x more?) and thereby achieve greater performance.

The task here is to explore ways to write lower-level GPU code, possibly with specific focus on AMD GCN and/or NVIDIA Maxwell, and also to analyze OpenCL-generated code at a low level to identify its shortcomings. We may also produce custom development tools, such as ones allowing for runtime code specialization (e.g., updating binary kernels implementing DES-based crypt(3) for specific salt values, which may be done a lot quicker than building OpenCL kernels from source).


Own creative and relevant idea

Your own creative and relevant idea - please propose it to us first, then describe it again when you apply.

Licensing

With few exceptions (such as for changes to existing Linux kernel code, which is under GPLv2 anyway), we require any contributed code to be made available under a cut-down BSD license. The wiki page linked from here is for JtR, but we'd like to use this approach for most other projects as well. By applying to work on one of the ideas with us, you signify your acceptance of these terms and your intent to license your code contributions accordingly.

This approach permits us to combine contributed code with differently-licensed third-party code, and it does not lock us to a specific Open Source license for our releases.

Additionally, it permits us to create and sell proprietary revisions of our programs, which we're currently doing with JtR Pro. As you can see from the feature sets of free JtR vs. JtR Pro, we're not abusing this ability to artificially cripple our free software. The free JtR remains the main one, where features get implemented first, with “Pro” being branched off some free versions for those users who prefer a pre-packaged “product”. Overall, the introduction of JtR Pro has helped development of the free JtR so far, by letting us spend more time on the project (vs. doing more client-facing work on other projects). We assume that students applying for work on JtR are comfortable with this.

Tasks that are not currently offered to new contributors

We have a separate wiki page with archived ideas (from prior years).

ideas.1425367072.txt · Last modified: 2015/03/03 08:17 by solar
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported