These are the steps required to add GCN ASM code to John the Ripper. These are yet untested, until I manage to get CRC32 working.
__kernel void sample(__global uint *in, __global uint *out) { int gid = get_global_id(0); int x = in[gid]; x *= 2; x ^= in[gid] * 2; x *= 2; x ^= in[gid] * 2; // ...
The Local Data Store (LDS) is a cache that we might use to achieve a performance increase in some cases.
For example, there is a DS_ADD_U64
instruction, which performs 64-bit addition in one step, using LDS. This might help with formats that require 64-bit addition.
The problem is that getting the LDS to work with a patched binary that didn't have DS instructions originally seems impossible without enough documentation. We know that there are ATI CAL comment sections that describe the VGPR/SGPR counts. Similarly there are bytes in the ATI CAL comments that describe LDS size, but they don't seem to have any effect.
For now, we can just generate dummies that have LDS turned on, but when we need to use LDS with a format that didn't previously use it we will run into problems.
Here are the differences of the ATI CAL comments from the binaries of a kernel that uses LDS and one that doesn't. Note that there are other differences outside the ATI CAL notes, but it is uncertain whether they matter or not.
Format: [offset] -- [LDS value] != [non-LDS value] 0x0108 -- 00 != 30 0x0109 -- 00 != 78 0x010a -- 00 != 30 0x010b -- 00 != 30 0x0110 -- 00 != 31 0x0111 -- 00 != 38 0x0112 -- 00 != 2c 0x0113 -- 00 != 20 0x01d4 -- 04 != 03
The offsets above are counted from the first appearance of the string “ATI CAL” in the generated ELF. What we don't understand is why there are zero values in the kernel that uses LDS and non-zero values in the kernel that does not. The opposite would make sense. The last line means that the LDS version uses 4 VGPRs instead of 3. Interestingly the LDS size parameters are the same for both of the binaries.
There is no official AMD GCN ISA assembler released. Realhet from the AMD developer forums created HetPas, which is capable of assembling GCN microcode, but it is a closed source tool and it does not support Linux.
Quoting Solar from the john-dev list:
Speaking of the source language for generating GCN ISA code from, we're mostly interested in having an assembler or maybe just slightly higher level language (such as structured assembly - with if/else, loops, etc to reduce the need for explicit updates of the exec masks, and for conditional branches and labels). Maybe something along the lines of qhasm and qhasm-cudasm, but supporting GCN.
john-dev thread about the project
john-dev thread on a possible way to optimize kernels by Solar
The link to the repository can be found at the bottom of the page in the Links section.
This is where I will add and update information on the current status of the assembler as well as plans and list of items to fix/change.
All instructions are supported, except:
Generally the Southern Islands Instruction Set Architecture documentation contains very little information on assembly format. What is worse is that it doesn't say much about operand format and order either.
There are several strange things in the ISA files generated by OpenCL. This means that simply taking the text format of the ISA and trying to parse it with gcnasm will result in errors.
DS format
The DS format takes up to 6 parameters (without extra flags). These are VDST, ADDR, DATA0, DATA1, OFFSET0, OFFSET1. Here are 2 examples from OpenCL generated ISA files.
ds_write_b32 v2, v3 ; D8340000 00000302 ds_write2_b32 v16, v15, v17 offset1:1 ; D8380100 00110F10
The operand order for these instructions seems to be ADDR, DATA0, [DATA1], [OFFSET0], [OFFSET1].
What is not clear is where the VDST should be specified? The destination register is usually the first one specified. Are these special instructions, where VDST is not specified? From the microcode we can see that the value of the VDST part is v0 in both cases. Which instructions are special?
I haven't found any answer for these questions in the ISA architecture documentation.
VOPC format
VOPC can be represented as VOP3b, and the VOP3b format is used often in the ISA generated by OpenCL. However operands in the ISA format don't follow the usual VOP3b order, which is VDST, SDST, SRC0, SRC1, [SRC2]. Example:
v_cmp_eq_i32 s[6:7], v5, 0 ; D1040006 00010105
The v5 VGPR is used as SRC0 here. The microcode shows that the unspecified SDST value becomes s0. It is probably not used by the instruction, but the fact that it is skipped by the ISA code makes parsing much harder. Right now the example above cannot be parsed by gcnasm.
Note: Make sure OpenCL is installed before trying to build the binary_gen
tool.
$ git clone https://github.com/balidani/gcnasm.git $ cd gcnasm/src/ $ make $ cd ../tools/opencl_kernel_binary_gen/ $ make $ cd ../../ $ chmod +x tools/assemble.sh
Now that the environment is ready we can run the simplest example, dummy_test.isa
.
It contains a single instruction:
V_MOV_B32 v0, 42 ; v0 = 42 (VOP1)
This will be patched into our dummy ELF, which was generated from the following OpenCL kernel:
__kernel void sample ( __global uint* in, __constant uint* salt, __global uint* out ) { int num = get_global_id(0); int x = 0; ... // The patch will come here out[num] = x; }
After we patch the kernel the value of the v0 VGPR will be used as the output.
Let's see what happens when we run the assembler shell script that automates everything.
First we have to find the appropriate OpenCL platform and device IDs though.
$ run/binary_gen Usage: run/binary_gen <platform_id> <device_id> <binary_path> <kernel_path> Available platform IDs and devices: Platform 0 -- NVIDIA CUDA Device 0 -- GeForce GTX 570 Platform 1 -- AMD Accelerated Parallel Processing Device 0 -- Tahiti Device 1 -- AMD FX(tm)-8120 Eight-Core Processor
We will use Tahiti, which is device 0 of platform 1. Let's run the script:
$ tools/assemble.sh 1 0 test/dummy_test.isa Input (salt): 0 (0), 1 (0), 2 (0), 3 (0), 4 (0), 5 (0), 6 (0), 7 (0), 8 (0), 9 (0), 10 (0), 11 (0), 12 (0), 13 (0), 14 (0), 15 (0), 16 (0), 17 (0), 18 (0), 19 (0), 20 (0), 21 (0), 22 (0), 23 (0), 24 (0), 25 (0), 26 (0), 27 (0), 28 (0), 29 (0), 30 (0), 31 (0), 32 (0), 33 (0), 34 (0), 35 (0), 36 (0), 37 (0), 38 (0), 39 (0), 40 (0), 41 (0), 42 (0), 43 (0), 44 (0), 45 (0), 46 (0), 47 (0), 48 (0), 49 (0), 50 (0), 51 (0), 52 (0), 53 (0), 54 (0), 55 (0), 56 (0), 57 (0), 58 (0), 59 (0), 60 (0), 61 (0), 62 (0), 63 (0), Output: 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
It works! Feel free to try and create new .isa files and add more instructions.
OpenCL generates ELF binaries from the kernel's source code. These binaries can be read and optionally saved with clGetProgramInfo
using the CL_PROGRAM_BINARIES
flag. The binaries can later be loaded without compiling the kernel again, using the clCreateProgramWithBinary
function.
I created a utility to save the binaries generated by OpenCL. The program lists the usage and the available OpenCL platform and device names if it is run without parameters.
Usage: ./binary_gen <platform_id> <device_id> <binary_path> <kernel_path>
The binary_path
parameter specifies the path where the generated binary will be saved. The kernel_path
parameter specifies the path where the kernel's source code will be loaded from. If a file in the path of the binary_path
parameter already exists it will be loaded instead.
The generated ELF contains the following sections:
The .text section contains another ELF file. This is where the microcode (GCN bytecode) is actually stored. These are the sections of the inner ELF:
To be able to add arbitrary GCN microcode to a previously generated OpenCL binary, we need to patch the inner ELF's .text section with new data. To achieve this we are creating an ELF patching tool that uses libelf.
Since this is not a priority, we will use a special binary for now. This binary has a lot of NOP instructions that we can change to valid microcode. This assures that we won't overwrite other sections in the ELF and we won't have to change the information in section headers.
GitHub repository of the elf-patching tool – this needs a lot of work and will be updated later
offset=`readelf -S sample.bin | grep .text | awk -F' ' '{print $6}';` offset=`printf "%d" "0x$offset"` dd if=sample.bin of=sample.text ibs=1 skip=$offset
This tutorial will show how to:
(Note: when the assembler will work, I should create a different tutorial that explains how to use it. This tutorial is more about understanding the ELF/microcode format)
$ git clone https://github.com/balidani/gcnasm.git $ cd gcnasm/tools/opencl_kernel_binary_gen/ $ make $ cd ../../ $ run/binary_gen # Checking out Platform IDs $ run/binary_gen 1 0 sample.bin tools/opencl_kernel_binary_gen/sample.cl [*] Loading kernel from source Saving kernel binary Input (salt): 0 (0), 1 (0), 2 (0), 3 (0), 4 (0), 5 (0), 6 (0), 7 (0), 8 (0), 9 (0), 10 (0), 11 (0), 12 (0), 13 (0), 14 (0), 15 (0), 16 (0), 17 (0), 18 (0), 19 (0), 20 (0), 21 (0), 22 (0), 23 (0), 24 (0), 25 (0), 26 (0), 27 (0), 28 (0), 29 (0), 30 (0), 31 (0), 32 (0), 33 (0), 34 (0), 35 (0), 36 (0), 37 (0), 38 (0), 39 (0), 40 (0), 41 (0), 42 (0), 43 (0), 44 (0), 45 (0), 46 (0), 47 (0), 48 (0), 49 (0), 50 (0), 51 (0), 52 (0), 53 (0), 54 (0), 55 (0), 56 (0), 57 (0), 58 (0), 59 (0), 60 (0), 61 (0), 62 (0), 63 (0), Output: 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105,
The input consists of 64 integers ranging from 0 to 63. Each integer in the output should be of the form input[i] + 42
.
After running this we should find the generated sample.bin as well as the .isa file in the directory.
Let's use the script that cuts the inner ELF. This is the output that we should get (or something similar).
6677+0 records in 13+1 records out 6677 bytes (6.7 kB) copied, 0.00465457 s, 1.4 MB/s
After this we can find a file called sample.text in the directory. This is the inner ELF.
$ readelf -S sample.text There are 6 section headers, starting at offset 0xd0: Section Headers: [Nr] Name Type Addr Off Size ES Flg Lk Inf Al [ 0] NULL 00000000 000000 000000 00 0 0 0 [ 1] .shstrtab STRTAB 00000000 0000a8 000028 00 0 0 0 [ 2] .text PROGBITS 00000000 0005dc 000064 00 0 0 0 [ 3] .data PROGBITS 00000000 000640 001280 1280 0 0 0 [ 4] .symtab SYMTAB 00000000 0018c0 000010 10 5 1 0 [ 5] .strtab STRTAB 00000000 0018d0 000002 00 0 0 0 (...)
If we look closely at the binary with a hex viewer we can see that the microcode is contained in the .text
section.
Looking at the ISA helps because it contains the microcode in comments. It is also possible to find the microcode by searching for these values. Just make sure to remember endianness.
If we want we can use the above mentioned shell script again to cut just the .text
section again.
Let's change something very simple. For example the value the integers in the output buffer are incremented by. This is 42 by default. We can find this value in the ISA:
v_add_i32 v1, vcc, 42, v1 // 00000054: 4A0202AA
Since the microcode is optimized it won't be so simple as changing a byte. To have smaller microcode and better performance, small literal constants like 42 are “inline” constants which means the instruction is not followed by 32 bits representing a constant, because it is embedded in the microcode under a special value.
We have to look at the documentation of the instruction set architecture (find it in the links below). The relevant part can be found on page 11-67 (or 161) of the document. According to this the v_add_i32 instruction takes the following form (field and length in bits):
Magic (1) | OP (6) | VDST (8) | VSRC1 (8) | SRC0 (9)
So the SRC0 value (which represents the constant 42) is at [23:32]. With the microcode value 4A0202AA
this means 0x0AA
. Looking up the operand types we can find this (page 12-17, or 263):
129 - 192 Signed integer 1 to 64
Given that our value is 0xAA == 170
and 170-128 == 42
this checks out. Let's change this value to 63. This means that the last byte will need to be changed to 0xBF
.
We can locate the microcode part of the ELF and change the value by hand using a hex editor. After doing this we should get the correct results:
$ run/binary_gen 1 0 sample_crafted.bin tools/opencl_kernel_binary_gen/sample.cl [*] Loading kernel from binary Input (salt): 0 (0), 1 (0), 2 (0), 3 (0), 4 (0), 5 (0), 6 (0), 7 (0), 8 (0), 9 (0), 10 (0), 11 (0), 12 (0), 13 (0), 14 (0), 15 (0), 16 (0), 17 (0), 18 (0), 19 (0), 20 (0), 21 (0), 22 (0), 23 (0), 24 (0), 25 (0), 26 (0), 27 (0), 28 (0), 29 (0), 30 (0), 31 (0), 32 (0), 33 (0), 34 (0), 35 (0), 36 (0), 37 (0), 38 (0), 39 (0), 40 (0), 41 (0), 42 (0), 43 (0), 44 (0), 45 (0), 46 (0), 47 (0), 48 (0), 49 (0), 50 (0), 51 (0), 52 (0), 53 (0), 54 (0), 55 (0), 56 (0), 57 (0), 58 (0), 59 (0), 60 (0), 61 (0), 62 (0), 63 (0), Output: 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126,
Here is a link to the current version of gcnasm: gcnasm-0.1.tar.gz