Quite a lot of specifications for Ghidra have already been written for ordinary processor languages, but nothing for graphical ones. It is understandable, because it has its own specifics: predicates, constants through which parameters are transmitted, including other things inherited from shaders. In addition, the format used to store the code is often proprietary, and you need to reverse it yourself.

In this article, we’ll take a look at what’s what with two examples.

The first program is the simplest axpy (analog of hello world for GPGPU). The second helps to understand the implementation of conditions and jumps on the GPU, because everything is different there.

All Nvidia languages ​​use little endian encoding, so immediately copy the bytes from the hex editor to some notepad (for example, Notepad ++) in the opposite order of 8 pieces (the length of the instructions here is constant). Then, through the programmer's calculator (the one from Microsoft is suitable), we translate it into binary code. Next, we look for matches, make up the mask of the instruction, then the operands. To decode and search for the mask, the hex editor and the cuobjdump disassembler were used, sometimes assembler is required, as in AMDGPU (because the disassembler is not available there, but this is a topic for a separate article). It works like this: try sequentially inverting all the suspicious bits in the calculator, then we get a new hexadecimal value for the bytes, we substitute them in a binary compiled via nvcc or assembler if it exists, which is not always the case. Then we check through cuobjdump.

I spread the source in the format (mainly in C, without pluses and OOP for closer connection with the machine GPU code), then disasm + bytes at once, because it’s more convenient, they just do not need to be interchanged.

Copy to axpy.cu and compile via cmd: nvcc axpy.cu --cubin --gpu-architecture sm_30
The resulting ELF file named axpy.cubin is disassembled in the same place: cuobjdump axpy.cubin -sass

Example 1:

__global__ void axpy(float param_1, float* param_2, float* param_3) { unsigned int uVar1=threadIdx.x; param_2[uVar1]=param_1 * param_3[uVar1]; } 

Dump
/*0000*//* 0x22c04282c2804307 *//*0008*/MOV R1, c[0x0][0x44];/* 0x2800400110005de4 *//*0010*/S2R R0, SR_TID.X;/* 0x2c00000084001c04 *//*0018*/MOV32I R5, 0x4;/* 0x1800000010015de2 *//*0020*/ISCADD R2.CC, R0, c[0x0][0x150], 0x2;/* 0x4001400540009c43 *//*0030*/LD.E R2, [R2];/* 0x8400000000209c85 *//*0038*/ISCADD R4.CC, R0, c[0x0][0x148], 0x2;/* 0x4001400520011c43 *//*0040*//* 0x20000002e04283f7 *//*0048*/IMAD.U32.U32.HI.X R5, R0, R5, c[0x0][0x14c];/* 0x208a800530015c43 *//*0050*/FMUL R0, R2, c[0x0][0x140];/* 0x5800400500201c00 *//*0058*/ST.E [R4], R0;/* 0x9400000000401c85 *//*0060*/EXIT;/* 0x8000000000001de7 *//*0068*/BRA 0x68;/* 0x4003ffffe0001de7 *//*0070*/NOP;/* 0x4000000000001de4 *//*0078*/NOP;/* 0x4000000000001de4 */ 


Decompilation Result
void axpy(float param_1,float *param_2,float *param_3) { uint uVar1; uVar1=*&threadIdx.x; param_2[uVar1]=param_3[uVar1] * param_1; return; } 


Example 2:

__global__ void predicates(float* param_1, float* param_2) { unsigned int uVar1=threadIdx.x + blockIdx.x * blockDim.x; if ((uVar1 > 5) & (uVar1 < 10)) param_1[uVar1]=uVar1; else param_2[uVar1]=uVar1; } 

Dump
/*0000*//* 0x2272028042823307 *//*0008*/MOV R1, c[0x0][0x44];/* 0x2800400110005de4 *//*0010*/S2R R0, SR_TID.X;/* 0x2c00000084001c04 *//*0018*/S2R R3, SR_CTAID.X;/* 0x2c0000009400dc04 *//*0020*/IMAD R0, R3, c[0x0][0x28], R0;/* 0x20004000a0301ca3 *//*0028*/MOV32I R3, 0x4;/* 0x180000001000dde2 *//*0030*/IADD32I R2, R0, -0x6;/* 0x0bffffffe8009c02 *//*0038*/I2F.F32.U32 R4, R0;/* 0x1800000001211c04 *//*0040*//* 0x22c042e04282c2c7 *//*0048*/ISETP.GE.U32.AND P0, PT, R2, 0x4, PT;/* 0x1b0ec0001021dc03 *//*0050*/@P0 ISCADD R2.CC, R0, c[0x0][0x148], 0x2;/* 0x4001400520008043 *//*0058*/@P0 IMAD.U32.U32.HI.X R3, R0, R3, c[0x0][0x14c];/* 0x208680053000c043 *//*0060*/@P0 ST.E [R2], R4;/* 0x9400000000210085 *//*0068*/@P0 EXIT;/* 0x80000000000001e7 *//*0070*/ISCADD R2.CC, R0, c[0x0][0x140], 0x2;/* 0x4001400500009c43 *//*0078*/MOV32I R3, 0x4;/* 0x180000001000dde2 *//*0080*//* 0x2000000002e04287 *//*0088*/IMAD.U32.U32.HI.X R3, R0, R3, c[0x0][0x144];/* 0x208680051000dc43 *//*0090*/ST.E [R2], R4;/* 0x9400000000211c85 *//*0098*/EXIT;/* 0x8000000000001de7 *//*00a0*/BRA 0xa0;/* 0x4003ffffe0001de7 *//*00a8*/NOP;/* 0x4000000000001de4 *//*00b0*/NOP;/* 0x4000000000001de4 *//*00b8*/NOP;/* 0x4000000000001de4 */ 


Decompilation Result
void predicates(float *param_1,float *param_2) { uint uVar1; uVar1=*&blockIdx.x * (int)_DAT_constants_00000028 + *&threadIdx.x; if (uVar1 - 6 < 4) { param_1[uVar1]=(float)uVar1; return; } param_2[uVar1]=(float)uVar1; return; } 


It is easy to guess that the tests were originally tailored to machine code so that the compiler had nothing to optimize. For everything else, you would have to manually cancel the optimizations. In complex examples, this may not be possible at all, so for such cases you have to trust the decompiler and the frontend.

In general, the rule is this: to test the front-end, we take any simple (with a minimum of possible optimizations) first suitable (reproducing errors) example. For the rest, the decompiled code will already be with optimizations (or just somehow correct through refactoring). But for now, the main task is at least just the right code that does the same thing as the machine code. This is Software Modeling. “Software modeling” itself does not involve refactoring, translating C into C++, restoring classes, and even more so such things as identifying templates.

Now we are looking for patterns for mnemonics, operands and modifiers.

To do this, compare the bits (in binary) between the suspicious instructions (or strings, if they are more convenient to call). You can also take advantage of the fact that other users post in their questions on stackoverflow like “help understand binary/sass/machine code”, use tutorials (including in Chinese) and other resources.So, the main operation number is stored in bits 58-63, but there are additional bits 0-4 (they distinguish between the instructions “I2F”, “ISETP”, “MOV32I”), somewhere in their place 0-2 (to neglect 3- 4 bits in empty instructions, in the specification they are marked as “UNK”).

For registers and constant numbers, you can experiment with a disassembler in order to find all the bits that affect the dump output, like the one laid out under the spoiler. All the fields that I managed to find are in the specification on Github, file CUDA.slaspec, section token.

Then you need to come up with addresses for the registers, again they are on Github. This is necessary because at the micro level, Sleigh registers registers as global variables in space with the type “register_space”, but since Since their space is not marked as “inferable” (and certainly it cannot be), then they in the decompiler become either local variables (most often with the “Var” interfix, but sometimes there was also a “local” prefix) or parameters (the “ param_ "). SP was never useful, it is needed mainly formally to ensure the decompiler is working. A PC (something like IP from x86) is needed to emulate.

Then there are predicate registers, something like flags, but more of a “general purpose” than for a predetermined goal, such as overflow, (non) equal to zero, etc.
Then, a lock register for modeling a bunch of instructions ISCADD.CC and IMAD.HI, because the first of them in my implementation calculates for itself and for the second, in order to avoid transferring part of the sum to the high 4 bytes, because this will ruin the decompilation. But then you need to lock the next register until the IMAD.HI operation is completed. Something similar, i.e. the discrepancy between the official documentation and the expected output of the decompiler was already in the SPU module for the same Ghidra.

Then there are special registers that are implemented through cpool so far. In the future, I plan to replace them with the characters defined by default for some kind of "inferable" space. These are the same threadIdx, blockIdx.

Then we bind the variables to the fields dest, par0, par1, par2, res. Then come the sub-tables, and after them - for the sake of which everything was started - the main (root) tables with the main instructions.

Here it is necessary to strictly follow the “operand mnemonics” format, however, relief is given for modifiers, which, nevertheless, must be attached to the mnemonics or to the section with operands. No other formats are allowed, even the same Hexagon DSP asm will have to be adapted to this syntax, which, however, is not very difficult.

The final step will be to write an implementation for instructions in the Pcode firmware language. The only thing I would like to note from the first example is the ISCADD.CC and IMAD.HI instructions, where the first one takes a pointer to registers and dereferences them as pointers for 8 bytes instead of 4. This is done intentionally in order to better adapt to the decompiler and his behavior, despite what is written in the Nvidia documentation about transferring part of the amount.

For the second example, it is better to check the decompiler settings opposite the inscription “Simplify predication”. The point is that predicates are one and the same condition for different instructions, essentially nothing more than the well-known "SIMD", or its next equivalent. Those. if the predicate bit is set, then the instructions are executed, moreover, in a row.

You also need to make a habit of immediately writing an implementation for each instruction, and not just a prototype (operand mnemonics), because there is also a decompiler, emulator and other analyzers.
But in general, writing an implementation in Pcode is an even simpler task than writing a grammar for a byte decoder. It was quick to fix the implementation for some complex instructions from x86 (and not only), thanks to a very convenient intermediate language, a single midland (optimizer), 2 backends (mainly C; as an alternative - Java/C #, more like the last, etc. K. goto appears from time to time, but not labeled break).
In the following articles, there may also be frontends for managed languages ​​such as DXBC, SPIR-V, they will use the Java/C # backend. But so far, only machine codes are planned, because bytecodes require a special approach.

Проект
Ghidra

Справки:

Pcode
Sleigh .

Source