I hate software: January 2016

Recently I've been reading several interesting posts on hacking Windows OpenGL drivers to exploit the "GL_ARB_get_program_binary" extension to access raw GPU instructions.

"Hacking GCN via OpenGL" by Tomasz Stachowiak (@h3r2tic) - Presentation on OneDrive
"You Compiled This, Driver. Trust Me…." about hacking Intel Haswell GPU - The blog by Joshua Barczak @JoshuaBarczak

Reverse engineering is always cool and interesting. However, much of this work has already been done, more than once, by people writing open-source drivers for Linux, and I believe these projects are a good way to jump-start into GPU architecture without reading huge vendor datasheets.

Since I'm interested in GPU drivers, I decided to put up links to several open-source projects that could be interesting to fellow hackers and serve as a reference, potentially more useful than vendor documentation (which is not available for some GPUs - you know, the green ones).

Please also take a look at this interesting blog post for the detailed list of GPU datasheets - http://renderingpipeline.com/graphics-literature/low-level-gpu-documentation/

* AMD R600 GPU Instruction Set Manual. It is an uber-useful artifact because many modern GPUs (older AMD desktop GPUs, Qualcomm Adreno GPUs and Xbox 360 GPU) bear a lot of similarities with AMD R300/R600 GPU series.
http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf

* Xenia Xbox 360 Emulator. It features a Dynamic Binary Translation module that translates the AMD instruction stream into GLSL shaders at runtime. The Xbox 360 has an AMD Radeon GPU, similar to the ones in the R600 series and Adreno.
https://github.com/benvanik/xenia

Let's look at "src/xenia/gpu". Warning: the code in this project seems to change very often and the links may quickly become stale, so take a look yourself.
One interesting file is "register_table.inc", which has register definitions for the MMIO space, allowing you to figure out a lot about how the GPU is configured.
The most interesting file is "src/xenia/gpu/ucode.h", which contains the structures that describe the ISA.
"src/xenia/gpu/packet_disassembler.cc" contains the logic to disassemble the actual GPU packet stream at the binary level.
"src/xenia/gpu/shader_translator.cc" contains the logic to translate the GPU bytecode to the IR.
"src/xenia/gpu/glsl_shader_translator.cc" contains the code to translate the Xenia IR to GLSL.
Part of the Xbox->GLSL logic is also contained in the "src/xenia/gpu/gl4/gl4_command_processor.cc" file.

* FreeDreno - the open-source driver for the Qualcomm/ATI Adreno GPUs. These GPUs are based on the same architecture as the ATI R300 and consequently the AMD R600 series, and both the ISAs and the register spaces are quite similar. This driver was written by Rob Clark, an experienced GPU driver engineer who (unlike most FOSS GPU hackers) had previously worked on another GPU at a vendor and thus had enough experience to successfully reverse-engineer a GPU.

https://github.com/freedreno/mesa

Take a look at "src/gallium/drivers/freedreno/a3xx/fd3_program.c" to get an idea of how the GPU command stream packet is formed on the A3XX series GPU, which is used in most Android smartphones today. Well, it's the same for most GPUs - there is a ring buffer to which the CPU submits commands and from which the GPU reads them out, but the actual format of the buffer is vendor- and model-specific.
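To make the ring-buffer idea concrete, here is a minimal Python sketch of the general scheme. It is not the Adreno packet format: the class name, the pointer names and the packet contents are all made up for illustration, and a real GPU exposes its read and write pointers as hardware registers rather than Python attributes.

```python
import struct


class CommandRing:
    """Toy command ring: the CPU advances wptr, the GPU advances rptr.

    Hypothetical model for illustration only; real rings live in GPU-visible
    memory and their packet layout is vendor- and model-specific.
    """

    def __init__(self, num_dwords):
        self.buf = bytearray(num_dwords * 4)
        self.size = num_dwords
        self.wptr = 0  # next dword the CPU will write
        self.rptr = 0  # next dword the GPU will read

    def free_dwords(self):
        # One slot always stays empty so that wptr == rptr means "empty".
        return (self.rptr - self.wptr - 1) % self.size

    def submit(self, dwords):
        """CPU side: copy a packet into the ring and bump the write pointer."""
        if len(dwords) > self.free_dwords():
            raise BufferError("ring full, wait for the GPU to catch up")
        for value in dwords:
            struct.pack_into("<I", self.buf, self.wptr * 4, value)
            self.wptr = (self.wptr + 1) % self.size

    def consume(self):
        """GPU side: read every dword between rptr and wptr."""
        out = []
        while self.rptr != self.wptr:
            out.append(struct.unpack_from("<I", self.buf, self.rptr * 4)[0])
            self.rptr = (self.rptr + 1) % self.size
        return out
```

In this model a driver would call submit() with an encoded packet and then tell the hardware about the new write pointer; consume() stands in for the GPU front-end fetching the commands.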

I really love this work and this driver because it was the first open-source driver for mobile ARM SoCs that provided full support for OpenGL ES and non-ES, enough to run Gnome Shell and other window managers with GPU acceleration, and it also supports most Qualcomm GPUs.

* Mesa. Mesa is the open-source stack for several compute and graphics APIs for Linux: OpenGL, OpenCL, OpenVG, VAAPI and OpenMAX.

Contrary to popular belief, it is not a "slow" and "software-rendering" thing. Mesa has several backends. "Softpipe" is indeed a slow, CPU-side implementation of OpenGL, and "llvmpipe" is a faster one that uses LLVM for vectorizing. But the most interesting part is the drivers for the actual hardware. Currently, Mesa supports most popular GPUs (with the notable exception of ARM Mali). You can peek into the Mesa source code to get insight into how to control the 2D/ROP engines of the GPUs and how to upload the actual instructions (the ISA bytecode) to the GPU. Example links are below. It's a holy grail for GPU hacking.

"http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i915/i915_program.c" contains the code to emit I915 instructions.
The code for the latest GPUs - Broadwell - is in the "i965" driver.

The code to set up the compute unit - the number of threads/groups - http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i965/brw_compute.c#L76
The dreaded CURBE described in Joshua's post - http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i965/brw_curbe.c#L295
The code to encode the instructions for the Broadwell EU (Execution Unit) - http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i965/brw_eu_emit.c
Broadwell Instruction Disassembler - http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i965/brw_disasm.c

While browsing the Mesa code, one will notice the mysterious "bo" name. It stands for "Buffer Object" and is an abstraction for a GPU memory region. It is handled by the userspace library called "libdrm" - http://cgit.freedesktop.org/mesa/drm/tree/

* Linux Kernel "DRM/KMS" drivers. KMS/DRM is a facility of the Linux kernel that handles GPU buffer management, basic GPU initialization and power management in kernel space. The interesting thing about these drivers is that you can peek at the places where the instructions are actually transferred to the GPU - the registers containing physical addresses of the code.
There are other artifacts. For example, I like this code in the Intel GPU driver (i915) which does runtime patching of GPU instruction stream to relax security requirements.

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_cmd_parser.c

* Beignet - an OpenCL implementation for Intel Ivy-Bridge and later GPUs. Well, it actually works.
http://cgit.freedesktop.org/beignet/
The interesting code to emit the actual command stream is at http://cgit.freedesktop.org/beignet/tree/src/intel/intel_gpgpu.c

* LibVA Intel H.264 Encoder driver. Note that the proprietary MFX SDK also uses the LibVA internally, albeit a patched version with newer code which typically gets released and merged into upstream LibVA after it goes through Intel internal testing.

http://cgit.freedesktop.org/vaapi/intel-driver/tree/
An interesting artifact is the set of sources with the encoder pseudo-assembly.

http://cgit.freedesktop.org/vaapi/intel-driver/tree/src/shaders/h264/mc/interpolate_Y_8x8.asm
http://cgit.freedesktop.org/vaapi/intel-driver/tree/src/shaders/h264/mc/AVCMCInter.asm

Btw, if you want zero-copy texture sharing to pass OpenGL rendering directly to the H.264 encoder with either OpenGL ES or non-ES, please refer to my earlier post - http://allsoftwaresucks.blogspot.ru/2014/10/abusing-mesa-by-hooking-elfs-and-ioctl.html and to the Weston VAAPI recorder.

* Intel KVM-GT and Xen-GT. These are GPU virtualization solutions from Intel. They emulate the PCI GPU in the VM. They expose the MMIO register space fully compatible with the real GPU so that the guest OS uses the vanilla or barely modified driver. The host contains the code to reinterpret the GPU instruction stream and also to set up the host GPU memory space in such a way that it's partitioned equally between the clients (guest VMs).

There are useful papers, and interesting pieces of code.

http://www.linux-kvm.org/images/f/f3/01x08b-KVMGT-a.pdf
https://github.com/01org/KVMGT-qemu/blob/master/hw/display/vga-vgt.c - the QEMU PCI device code.
Linux Kernel Patch - https://github.com/01org/KVMGT-kernel/commit/a17514b9696754f21d37c112139064a91a34a73e
The LKML message with the detailed announcement - https://lwn.net/Articles/624516/

* Intel GPU Tools

A lot of useful tools. The most useful one IMHO is "intel_perf_counters", which shows compute unit, shader and video encoder utilization. There are other interesting tools like the register dumper.

http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/tree/tools

It also has the Broadwell GPU assembler

http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/tree/assembler

And the ugly shaders in ugly assembly, similar to those in LibVA

http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/tree/shaders/gpgpu/gpgpu_fill.gxa

I once wrote a tool complementary to the register dumper - it uploads a register dump back to the GPU registers. I used it to debug some issues with the eDP panel initialization in my laptop.

https://github.com/astarasikov/intel-gpu-dump-upload

* LunarGLASS is a collection of experimental GPU middleware. As far as I understand it, it is an ongoing initiative to build prototype implementations of upcoming Khronos standards before trying to get them merged into Mesa, and also to allow the code to be used with any proprietary GPU SW stack, not tied to Mesa. Apparently it also has a "GLSL" backend, which should allow you to try out SPIR-V on most GPU stacks.

The main site - http://lunarg.com/lunar-glass/
The SPIR-V frontend source code - https://github.com/LunarG/LunarGLASS/tree/master/Frontends/SPIRV
The code to convert LunarG IR to Mesa IR - https://github.com/LunarG/LunarGLASS/blob/master/mesa/src/mesa/program/ir_to_mesa.cpp

* Lima driver project. It was a project for reverse-engineering the ARM Mali driver. The authors (Luc Verhaegen aka "libv" and Connor Abbott) have decoded some parts of the initialization routine and the ISA. Unfortunately, the project seems to have stopped due to legal issues. However, it has had a significant impact on open-source GPU drivers, because Connor Abbott went on to implement some of the ideas (namely, NIR - the "new IR") in the Mesa driver stack.

Some interesting artifacts:
The place where the Geometry Processor data (attributes and uniforms) are written to the GPU physical memory:
https://github.com/limadriver/lima/blob/master/limare/lib/gp.c#L333
"Render State" - a small structure describing the GPU memory, including the shader program address.
https://github.com/limadriver/lima/blob/master/limare/lib/render_state.h

* Etnaviv driver project. It was a project to reverse-engineer the Vivante GPU used by the Freescale i.MX SoC series. It was done by Wladimir J. van der Laan. The driver has reached a mature state, successfully running games like Quake. Now, there is a version of Mesa with this driver built in.

https://github.com/etnaviv/etna_viv
https://github.com/laanwj/etna_viv
https://github.com/etnaviv/mesa
http://wiki.wandboard.org/index.php/Etna_viv

The problem.

Another tale from the endless firmware endeavours.
The other day I ended up in a weird situation. I was locked in a room with three firmwares - one of them was built by myself from source, while the other two came in binary form from the vendor. Naturally the one I built was not fully working, and one of the binaries worked fine. So I set out on a quest to trace all PCI MMIO accesses.

When one thinks about it, several solutions come to mind. The first idea is to run a QEMU/Xen VM and pass through the PCI device. Unfortunately the board we had did not have an IOMMU (or VT-d in Intel terminology). The board had the Intel BayTrail SoC, which is basically a CPU for phones.
Another option would be to binary-patch the prebuilt firmware and introduce a set of wrapper functions around MMIO access routines and add corresponding calls to printf() for debugging but this approach is very tedious and error-prone, and I try to avoid patching binaries because I'm extremely lazy.

So I figured the peripherals of interest (namely, the I2C controller and a NIC) were simple devices configured by simple register writes. The host did not pass any shared memory to the device, so I used a simple technique. I patched QEMU and introduced "fake" PCI devices with the vendor and device IDs of the devices I wanted to "pass through". By default, MMIO reads return zero (or some other default value) while writes are ignored. Then I went on and added the real device access. Since I was running QEMU on Linux, I just blacklisted and rmmod'ed the corresponding drivers. Then I used "lspci" to find the addresses of the PCI BARs and the mmap() syscall on "/dev/mem" to access the device memory.
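The actual implementation lives in the QEMU patch linked further down and is written in C. Purely as an illustration of the idea (map a physical range, then log every 32-bit access), here is a Python sketch. The class name and the log format are made up, and Python does not guarantee that each access reaches the bus as a single 4-byte transfer, so treat it as a model rather than a drop-in tool.

```python
import mmap
import struct


class TracedBar:
    """Maps a window of a file and logs every 32-bit register access.

    Hypothetical helper for illustration. On real hardware `path` would be
    "/dev/mem" and `phys_addr` the BAR base address reported by lspci.
    """

    def __init__(self, path, phys_addr, length, log):
        # The offset must be page-aligned; PCI memory BARs normally are.
        self._file = open(path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), length, offset=phys_addr)
        self._log = log

    def read32(self, reg):
        """Read a little-endian 32-bit register at byte offset `reg`."""
        value = struct.unpack_from("<I", self._map, reg)[0]
        self._log.append(("R", reg, value))
        return value

    def write32(self, reg, value):
        """Write a little-endian 32-bit register at byte offset `reg`."""
        struct.pack_into("<I", self._map, reg, value)
        self._log.append(("W", reg, value))

    def close(self):
        self._map.close()
        self._file.close()
```

For experimenting without hardware, an ordinary zero-filled file can stand in for "/dev/mem"; the real thing needs root privileges and the kernel driver for the device unloaded, as described above.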

A picture is worth a thousand words

The code.

The code is quite funky, but thanks to a combination of factors (x86 being a strongly ordered architecture and the code using only 4-byte transfers) the naive implementation worked fine. The result:
1. I can now trace all register reads and writes.
2. I have verified that the "good" firmware still boots and everything is working fine.
So, after that I simply ran all three firmwares and compared the register traces, only to find out that the problems might be unrelated to the driver code (which is a good result because it allowed me to narrow down the amount of code to explore).
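Comparing traces can be as simple as looking for the first place where two runs disagree. A small sketch, assuming each trace is a list of (kind, register offset, value) tuples; that format and the function name are my assumptions here, not what the QEMU patch prints.

```python
from itertools import zip_longest


def first_divergence(good, bad):
    """Return (index, good_entry, bad_entry) at the first mismatch, or None.

    If one trace is shorter, the missing entry is reported as None.
    """
    for index, (g, b) in enumerate(zip_longest(good, bad)):
        if g != b:
            return index, g, b
    return None
```

Reads of status registers can legitimately differ between runs, so in practice it may be worth comparing only the write entries first.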

The branch with my customized QEMU is here. Also, I figured QEMU now has the ultra-useful option called "readconfig" which allows you to supply the board configuration in a text file and avoid messing with cryptic command line switches for enabling PCI devices or HDDs.
https://github.com/astarasikov/qemu/tree/our_board
https://github.com/astarasikov/qemu/commit/2e60cad67c22dca750852346916d6da3eb6674e7

The code for mmapping the PCI memory is here. It is a marvel of YOLO-based design, but hey, X11 does the same, so it is now industry standard and unix-way :)
https://github.com/astarasikov/qemu/commit/2e60cad67c22dca750852346916d6da3eb6674e7#diff-febf7a335ad7cd658784899e875b5cc7R31

Limitations.

There is one obvious limitation, which is also the reason QEMU generally does not support passthrough of MMIO devices on systems without an IOMMU (like VT-d or SMMU).

Imagine a case where the device is not only configured via simple register transfers (such as setting certain bits to enable IRQ masking or clock gating), but system memory is passed to the device via storing a pointer in a register.

The problem is that the address coming from the guest VM (GPA, guest physical address) will not usually match the corresponding HPA (host physical address). The device, however, operates on bus addresses and is not aware of the MMU, let alone VMs. An IOMMU solves this problem by translating the addresses issued by the device: the hypervisor supplies a custom page table that translates the GPAs programmed by the guest into the HPAs that actually back them.
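As a toy model of that translation step: the sketch below uses a single-level table with 4 KiB pages. The class and method names are invented for illustration, and real IOMMUs use multi-level tables and per-device contexts.

```python
PAGE_SIZE = 4096


class ToyIommu:
    """Single-level GPA -> HPA page table, as a hypervisor might program it."""

    def __init__(self):
        self.page_table = {}  # guest frame number -> host frame number

    def map_page(self, gpa, hpa):
        """Hypervisor side: back one guest page with one host page."""
        if gpa % PAGE_SIZE or hpa % PAGE_SIZE:
            raise ValueError("addresses must be page-aligned")
        self.page_table[gpa // PAGE_SIZE] = hpa // PAGE_SIZE

    def translate(self, bus_addr):
        """Device side: turn the address the guest programmed into an HPA."""
        frame, offset = divmod(bus_addr, PAGE_SIZE)
        if frame not in self.page_table:
            raise LookupError("DMA fault: unmapped address 0x%x" % bus_addr)
        return self.page_table[frame] * PAGE_SIZE + offset
```

Without this step the device would use the guest's address as if it were a host address, which is exactly what happens in the mmap()-based hack when a register holds a pointer.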

So if one wanted to provide a pass-through solution for sharing memory on systems without an IOMMU, one would either need to ensure that the host and guest have 1:1 GPA/HPA mappings for the addresses used by the device, or come up with an elaborate scheme that detects register writes containing memory buffer addresses and sets up a corresponding buffer in the host memory space, but then one would need to deal with keeping the host and guest buffers consistent. Besides, that would require knowledge of the device registers, so it would be non-trivial to build a completely generic, plug-and-play solution.


Monday, January 25, 2016

On GPU ISAs and hacking

Monday, January 18, 2016

Abusing QEMU and mmap() or Poor Man's PCI Pass-Through
