Table of Contents

AMD GCN ISA assembler

How to add GCN code to JtR

These are the steps required to add GCN ASM code to John the Ripper. These are yet untested, until I manage to get CRC32 working.

  __kernel void sample(__global uint *in, __global uint *out) {
  int gid = get_global_id(0);
  int x = in[gid];
  
  x *= 2;
  x ^= in[gid] * 2;
  x *= 2;
  x ^= in[gid] * 2;
  // ...

Current status (problems)

Local Data Store

The Local Data Store (LDS) is a cache that we might use to achieve a performance increase in some cases. For example, there is a DS_ADD_U64 instruction, which performs 64-bit addition in one step, using LDS. This might help with formats that require 64-bit addition.

The problem is that getting the LDS to work with a patched binary that didn't have DS instructions originally seems impossible without enough documentation. We know that there are ATI CAL comment sections that describe the VGPR/SGPR counts. Similarly there are bytes in the ATI CAL comments that describe LDS size, but they don't seem to have any effect.

For now, we can just generate dummies that have LDS turned on, but when we need to use LDS with a format that didn't previously use it we will run into problems.

Binary difference

Here are the differences of the ATI CAL comments from the binaries of a kernel that uses LDS and one that doesn't. Note that there are other differences outside the ATI CAL notes, but it is uncertain whether they matter or not.

Format: [offset] -- [LDS value] != [non-LDS value]
0x0108 -- 00 != 30
0x0109 -- 00 != 78
0x010a -- 00 != 30
0x010b -- 00 != 30
0x0110 -- 00 != 31
0x0111 -- 00 != 38
0x0112 -- 00 != 2c
0x0113 -- 00 != 20
0x01d4 -- 04 != 03

The offsets above are counted from the first appearance of the string “ATI CAL” in the generated ELF. What we don't understand is why there are zero values in the kernel that uses LDS and non-zero values in the kernel that does not. The opposite would make sense. The last line means that the LDS version uses 4 VGPRs instead of 3. Interestingly the LDS size parameters are the same for both of the binaries.

Motivation

Why did we decide to create our own asm?

There is no official AMD GCN ISA assembler released. Realhet from the AMD developer forums created HetPas, which is capable of assembling GCN microcode, but it is a closed source tool and it does not support Linux.

Quoting Solar from the john-dev list:

Speaking of the source language for generating GCN ISA code from, we're
mostly interested in having an assembler or maybe just slightly higher
level language (such as structured assembly - with if/else, loops, etc
to reduce the need for explicit updates of the exec masks, and for
conditional branches and labels).  Maybe something along the lines of
qhasm and qhasm-cudasm, but supporting GCN.

john-dev thread about the project

john-dev thread on a possible way to optimize kernels by Solar

HetPas

TODO:

GCN Assembler status

The link to the repository can be found at the bottom of the page in the Links section.

This is where I will add and update information on the current status of the assembler as well as plans and list of items to fix/change.

All instructions are supported, except:

Concerns

Generally the Southern Islands Instruction Set Architecture documentation contains very little information on assembly format. What is worse is that it doesn't say much about operand format and order either.

There are several strange things in the ISA files generated by OpenCL. This means that simply taking the text format of the ISA and trying to parse it with gcnasm will result in errors.

DS format

The DS format takes up to 6 parameters (without extra flags). These are VDST, ADDR, DATA0, DATA1, OFFSET0, OFFSET1. Here are 2 examples from OpenCL generated ISA files.

ds_write_b32  v2, v3                      ; D8340000 00000302
ds_write2_b32  v16, v15, v17 offset1:1    ; D8380100 00110F10

The operand order for these instructions seems to be ADDR, DATA0, [DATA1], [OFFSET0], [OFFSET1].

What is not clear is where the VDST should be specified? The destination register is usually the first one specified. Are these special instructions, where VDST is not specified? From the microcode we can see that the value of the VDST part is v0 in both cases. Which instructions are special?

I haven't found any answer for these questions in the ISA architecture documentation.

VOPC format

VOPC can be represented as VOP3b, and the VOP3b format is used often in the ISA generated by OpenCL. However operands in the ISA format don't follow the usual VOP3b order, which is VDST, SDST, SRC0, SRC1, [SRC2]. Example:

v_cmp_eq_i32  s[6:7], v5, 0                ; D1040006 00010105

The v5 VGPR is used as SRC0 here. The microcode shows that the unspecified SDST value becomes s0. It is probably not used by the instruction, but the fact that it is skipped by the ISA code makes parsing much harder. Right now the example above cannot be parsed by gcnasm.

Issues

Complete list of supported instructions

GCN instruction format spreadsheet

GCN instruction format backup (pdf)

gcnasm quickstart guide

Setting up the environment

Note: Make sure OpenCL is installed before trying to build the binary_gen tool.

$ git clone https://github.com/balidani/gcnasm.git
$ cd gcnasm/src/
$ make
$ cd ../tools/opencl_kernel_binary_gen/
$ make
$ cd ../../
$ chmod +x tools/assemble.sh

Loading a simple example

Now that the environment is ready we can run the simplest example, dummy_test.isa. It contains a single instruction:

V_MOV_B32    v0, 42    ; v0 = 42 (VOP1)

This will be patched into our dummy ELF, which was generated from the following OpenCL kernel:

__kernel void sample (
  __global uint* in, 
  __constant uint* salt, 
  __global uint* out ) 
{
  int num = get_global_id(0);
  int x = 0;
  ... // The patch will come here
  out[num] = x;
}

After we patch the kernel the value of the v0 VGPR will be used as the output.

Let's see what happens when we run the assembler shell script that automates everything.

First we have to find the appropriate OpenCL platform and device IDs though.

$ run/binary_gen
Usage: run/binary_gen <platform_id> <device_id> <binary_path> <kernel_path>

Available platform IDs and devices:
Platform 0 -- NVIDIA CUDA
        Device 0 -- GeForce GTX 570
Platform 1 -- AMD Accelerated Parallel Processing
        Device 0 -- Tahiti
        Device 1 -- AMD FX(tm)-8120 Eight-Core Processor

We will use Tahiti, which is device 0 of platform 1. Let's run the script:

$ tools/assemble.sh 1 0 test/dummy_test.isa
Input (salt):
0 (0), 1 (0), 2 (0), 3 (0), 4 (0), 5 (0), 6 (0), 7 (0), 8 (0), 9 (0), 10 (0), 11 (0), 12 (0), 13 (0), 14 (0), 15 (0), 16 (0), 17 (0), 18 (0), 19 (0), 20 (0), 21 (0), 22 (0), 23 (0), 24 (0), 25 (0), 26 (0), 27 (0), 28 (0), 29 (0), 30 (0), 31 (0), 32 (0), 33 (0), 34 (0), 35 (0), 36 (0), 37 (0), 38 (0), 39 (0), 40 (0), 41 (0), 42 (0), 43 (0), 44 (0), 45 (0), 46 (0), 47 (0), 48 (0), 49 (0), 50 (0), 51 (0), 52 (0), 53 (0), 54 (0), 55 (0), 56 (0), 57 (0), 58 (0), 59 (0), 60 (0), 61 (0), 62 (0), 63 (0),
Output:
42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42,

It works! Feel free to try and create new .isa files and add more instructions.

OpenCL ELF binaries

OpenCL generates ELF binaries from the kernel's source code. These binaries can be read and optionally saved with clGetProgramInfousing the CL_PROGRAM_BINARIES flag. The binaries can later be loaded without compiling the kernel again, using the clCreateProgramWithBinary function.

Sample program

I created a utility to save the binaries generated by OpenCL. The program lists the usage and the available OpenCL platform and device names if it is run without parameters.

Usage: ./binary_gen <platform_id> <device_id> <binary_path> <kernel_path>

The binary_path parameter specifies the path where the generated binary will be saved. The kernel_path parameter specifies the path where the kernel's source code will be loaded from. If a file in the path of the binary_path parameter already exists it will be loaded instead.

Tool directory in the GitHub repository

ELF structure

The generated ELF contains the following sections:

The .text section contains another ELF file. This is where the microcode (GCN bytecode) is actually stored. These are the sections of the inner ELF:

AMD Developer Central question about ELF structure

AMD CAL programming guide

ELF patching

To be able to add arbitrary GCN microcode to a previously generated OpenCL binary, we need to patch the inner ELF's .text section with new data. To achieve this we are creating an ELF patching tool that uses libelf.

Since this is not a priority, we will use a special binary for now. This binary has a lot of NOP instructions that we can change to valid microcode. This assures that we won't overwrite other sections in the ELF and we won't have to change the information in section headers.

Libelf by example

GitHub repository of the elf-patching tool – this needs a lot of work and will be updated later

Inner ELF extraction tool

offset=`readelf -S sample.bin  | grep .text | awk -F' ' '{print $6}';`
offset=`printf "%d" "0x$offset"`
dd if=sample.bin of=sample.text ibs=1 skip=$offset

GCN microcode tutorial

This tutorial will show how to:

(Note: when the assembler will work, I should create a different tutorial that explains how to use it. This tutorial is more about understanding the ELF/microcode format)

Saving binaries
$ git clone https://github.com/balidani/gcnasm.git
$ cd gcnasm/tools/opencl_kernel_binary_gen/
$ make
$ cd ../../
$ run/binary_gen # Checking out Platform IDs
$ run/binary_gen 1 0 sample.bin tools/opencl_kernel_binary_gen/sample.cl
[*] Loading kernel from source
Saving kernel binary
Input (salt):
0 (0), 1 (0), 2 (0), 3 (0), 4 (0), 5 (0), 6 (0), 7 (0), 8 (0), 9 (0), 10 (0), 11 (0), 12 (0), 13 (0), 14 (0), 15 (0), 16 (0), 17 (0), 18 (0), 19 (0), 20 (0), 21 (0), 22 (0), 23 (0), 24 (0), 25 (0), 26 (0), 27 (0), 28 (0), 29 (0), 30 (0), 31 (0), 32 (0), 33 (0), 34 (0), 35 (0), 36 (0), 37 (0), 38 (0), 39 (0), 40 (0), 41 (0), 42 (0), 43 (0), 44 (0), 45 (0), 46 (0), 47 (0), 48 (0), 49 (0), 50 (0), 51 (0), 52 (0), 53 (0), 54 (0), 55 (0), 56 (0), 57 (0), 58 (0), 59 (0), 60 (0), 61 (0), 62 (0), 63 (0),
Output:
42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105,

The input consists of 64 integers ranging from 0 to 63. Each integer in the output should be of the form input[i] + 42.

After running this we should find the generated sample.bin as well as the .isa file in the directory.

Extracting the microcode

Let's use the script that cuts the inner ELF. This is the output that we should get (or something similar).

6677+0 records in
13+1 records out
6677 bytes (6.7 kB) copied, 0.00465457 s, 1.4 MB/s

After this we can find a file called sample.text in the directory. This is the inner ELF.

$ readelf -S sample.text
There are 6 section headers, starting at offset 0xd0:

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .shstrtab         STRTAB          00000000 0000a8 000028 00      0   0  0
  [ 2] .text             PROGBITS        00000000 0005dc 000064 00      0   0  0
  [ 3] .data             PROGBITS        00000000 000640 001280 1280    0   0  0
  [ 4] .symtab           SYMTAB          00000000 0018c0 000010 10      5   1  0
  [ 5] .strtab           STRTAB          00000000 0018d0 000002 00      0   0  0
(...)

If we look closely at the binary with a hex viewer we can see that the microcode is contained in the .text section.

Looking at the ISA helps because it contains the microcode in comments. It is also possible to find the microcode by searching for these values. Just make sure to remember endianness.

If we want we can use the above mentioned shell script again to cut just the .text section again.

Changing the microcode

Let's change something very simple. For example the value the integers in the output buffer are incremented by. This is 42 by default. We can find this value in the ISA:

v_add_i32     v1, vcc, 42, v1        // 00000054: 4A0202AA

Since the microcode is optimized it won't be so simple as changing a byte. To have smaller microcode and better performance, small literal constants like 42 are “inline” constants which means the instruction is not followed by 32 bits representing a constant, because it is embedded in the microcode under a special value.

We have to look at the documentation of the instruction set architecture (find it in the links below). The relevant part can be found on page 11-67 (or 161) of the document. According to this the v_add_i32 instruction takes the following form (field and length in bits):

Magic (1) | OP (6) | VDST (8) | VSRC1 (8) | SRC0 (9)

So the SRC0 value (which represents the constant 42) is at [23:32]. With the microcode value 4A0202AA this means 0x0AA. Looking up the operand types we can find this (page 12-17, or 263):

129 - 192 Signed integer 1 to 64

Given that our value is 0xAA == 170 and 170-128 == 42 this checks out. Let's change this value to 63. This means that the last byte will need to be changed to 0xBF.

We can locate the microcode part of the ELF and change the value by hand using a hex editor. After doing this we should get the correct results:

$ run/binary_gen 1 0 sample_crafted.bin tools/opencl_kernel_binary_gen/sample.cl
[*] Loading kernel from binary
Input (salt):
0 (0), 1 (0), 2 (0), 3 (0), 4 (0), 5 (0), 6 (0), 7 (0), 8 (0), 9 (0), 10 (0), 11 (0), 12 (0), 13 (0), 14 (0), 15 (0), 16 (0), 17 (0), 18 (0), 19 (0), 20 (0), 21 (0), 22 (0), 23 (0), 24 (0), 25 (0), 26 (0), 27 (0), 28 (0), 29 (0), 30 (0), 31 (0), 32 (0), 33 (0), 34 (0), 35 (0), 36 (0), 37 (0), 38 (0), 39 (0), 40 (0), 41 (0), 42 (0), 43 (0), 44 (0), 45 (0), 46 (0), 47 (0), 48 (0), 49 (0), 50 (0), 51 (0), 52 (0), 53 (0), 54 (0), 55 (0), 56 (0), 57 (0), 58 (0), 59 (0), 60 (0), 61 (0), 62 (0), 63 (0),
Output:
63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126,

Download

Here is a link to the current version of gcnasm: gcnasm-0.1.tar.gz

Links