Monday, October 24, 2016

running without an ARM and a leg: Windows Mobile in QEMU


One project I had in mind for a long time was getting Windows Mobile to run in QEMU.
I think it's a lovely OS with a long history, and the project seemed like a nice technical challenge.

I initially started working on it back in 2014, and the plan was to later run it in KVM on a Cortex-A15 with Virtualization Extensions. However, I had to suspend it because I started working on two other challenges - running XNU on Xen (aka LLC) and later doing GSoC (running FreeBSD in an ARM emulator).

Now since I've got some free time on my hands, I decided to finally get back to this project and cross it off my TODO list.

Choosing the emulation target.

In order to run Windows CE in QEMU (or any OS, for that matter), it would be necessary to either develop a Board Support Package with all the drivers for a specific virtual machine, or take the opposite approach and emulate some machine for which a ROM image already exists.

For Windows, there is an emulator developed by Microsoft which is unsurprisingly called just Device Emulator. It emulates a real board - the MINI2440, based on the Samsung S3C2440 SoC, which is an ancient ARMv4 CPU. It turns out this is the same SoC that's used in OpenMoko, so there is an old fork of QEMU with support for most of the peripherals. So the choice of platform seemed a no-brainer.

So I took the QEMU fork supporting the MINI2440 and tried to adapt it to run the unmodified Windows Mobile images from Microsoft. Needless to say, I made sure the images were placed into memory at the correct addresses, but the code seemed to crash spontaneously and never got past enabling the MMU.

The first idea that comes to mind is of course to take the latest QEMU and see if it fixes anything. However, trying random changes until something works is actually quite a crappy approach.

So I decided to single-step the execution and see what happens. QEMU provides a very useful GDB interface for this purpose (activated with the "-s -S" switches).

Caches are hard.

The first surprise came from the bootloader code. Before launching the kernel, the Windows CE bootloader disables the MMU, and at that moment QEMU crashes spectacularly. Initially I tried hacking around the issue by adding address-translation code to "exec-all.h". However, it didn't solve the problem, and the heuristic started looking too complex, which suggested I was on the wrong track.

I thought about it and realized that disabling the MMU is a tricky thing, because on ARM the Program Counter is usually ahead of the current instruction: the CPU prefetches a couple of instructions ahead. So we have a caching problem. In this Windows CE code, there is a NOP instruction between disabling the MMU and jumping to the new physical address from a register. The hypothesis was that QEMU did not fetch the needed instructions correctly and it was necessary to add a translation cache for them. The reality was funnier.

As it turned out, QEMU contained a hack to cache a couple of instructions because... because the Linux kernel for PXA270 had the same problem in the opposite scenario - when the MMU was enabled. I commented out the hack, and it made the boot-up progress further. (PXA Cache Hack).

Stacks are not easy either.

The next thing I knew, soon after enabling the MMU the code crashed when trying to access the stack at the very peculiar virtual address 0xffffce70. I examined the QEMU translation code and found out that it correctly locates the physical address, but the permission bits were incorrect. I decided to force it to return RW access for that particular address range and voila - Windows Mobile boots to the Home Screen successfully. (Hack to force RW stack).

Windows Mobile 5.0 Smartphone Edition on QEMU on Mac OS X.
Now that everything seemed to boot, I decided to take another look at the MMU issue and fix it properly. The first idea I had was to compare the ARM920T (ARMv4) and ARM1136 (ARMv6) page table formats. It is worth noting that ARMv4 did not have the TEX remap bits, and the last-level page table descriptors had a different type field (the last two bits). It turned out that QEMU (probably like real ARM CPUs) supported all page types, even ARMv4 ones on an ARMv6 target, and the TEX/caching bits were simply ignored. After careful examination of the code and all the page table parsing, I found a typo that had already been fixed in upstream QEMU (MMU Typo).
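For reference, the descriptor type mentioned above lives in the low two bits of each second-level entry. A minimal sketch of decoding it, following the classic ARM short-descriptor format (the helper below is my own illustration, not QEMU code):

```python
# Decode the type field (last two bits) of an ARM second-level
# page-table descriptor. Names follow the classic ARMv4/v5
# short-descriptor format; AP/TEX layouts differ between revisions.
DESCRIPTOR_TYPES = {
    0b00: "fault",
    0b01: "large page (64KB)",
    0b10: "small page (4KB)",
    0b11: "tiny page (1KB, ARMv4/v5) / extended small page",
}

def decode_l2_descriptor(desc):
    """Return (type, AP0 bits) for a 32-bit second-level descriptor."""
    dtype = DESCRIPTOR_TYPES[desc & 0b11]
    ap0 = (desc >> 4) & 0b11  # AP bits for the first subpage
    return dtype, ap0

# Example: a small-page descriptor for PA 0x30000000 with AP0 = 0b11 (RW)
dtype, ap = decode_l2_descriptor(0x30000000 | (0b11 << 4) | 0b10)
print(dtype, ap)  # small page (4KB) 3
```

A wrong permission-bit decode here is exactly the kind of typo that produced the bogus RW fault on the stack address above.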

You can grab the cleaner tree without the ugly intermediate hacks: . If you need to have a look at the older WIP stuff, it's at .

Running Pocket PC version.

Preparing PocketPC Image.

Windows Mobile Standard (aka Smartphone) ships as a NOR flash image and is XIP (Execute-in-Place). The Pocket PC image comes in a different form: it is already relocated to RAM, so we need to launch it directly via the "-kernel" parameter. We still need to create a NOR memory image for it; we can use the smartphone image as a base, but an empty 32MB file should work as well.

  • Grab the "romtools" scripts at github:
  • python ../ppc_50/_208PPC_USA_bin
  • dd if=_208PPC_USA_bin-80070000-81491ed0.bin of=SPHIDPI_BIN bs=$((0x30000)) seek=1 conv=notrunc
  • ./arm-softmmu/qemu-system-arm -show-cursor -m 256 -M mini2440 -serial stdio -kernel ../wm5/50_ppc/_208PPC_USA_bin-80070000-81491ed0.bin  -pflash ../wm5/50_ppc/SPHIDPI_BIN

This one did not boot and kept hanging at the boot splash screen. The host CPU usage was high, which indicated some busy activity inside the VM, like an IRQ storm. After hooking up the debugger, it turned out that the VM was hanging in two busy loops.

One of them was in the audio driver - during the splash screen Windows plays a welcome sound. This was worked around by setting the "TX FIFO Ready" status in the sound codec. The second freeze was in the touchscreen driver, but that looks like a MINI2440 emulation bug - once the touchscreen is disabled, the workqueue timer in QEMU is permanently disabled and the status bits are not updated properly. I commented out the routine which disables the touchscreen controller, since it's a VM anyway (not a real device, where this would have a power-saving impact).
Obligatory screenshot: Windows Mobile 5.0 Pocket PC

Retarded idea.

I was fantasizing about adding logic to QEMU which would detect if the emulation was stuck invoking an MMIO device handler in a loop, and fuzz the returned register value until it got unstuck. While not very practical, one could reverse-engineer the register bits blindly this way.
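A hypothetical sketch of the idea, with all names made up: if the guest polls the same MMIO register too many times in a row, start returning random values until the guest moves on.

```python
import random

class FuzzingMMIORegion:
    """Toy model: fuzz a register's value once a busy loop is detected."""
    def __init__(self, stuck_threshold=1000):
        self.regs = {}
        self.last_read = None
        self.repeat_count = 0
        self.stuck_threshold = stuck_threshold

    def read(self, offset):
        if offset == self.last_read:
            self.repeat_count += 1
        else:
            self.last_read, self.repeat_count = offset, 0
        if self.repeat_count > self.stuck_threshold:
            # Busy loop detected: try random status bits until the
            # guest stops polling this register.
            self.regs[offset] = random.getrandbits(32)
        return self.regs.get(offset, 0)

# A guest loop waiting for a "TX FIFO ready"-style bit eventually unsticks:
dev = FuzzingMMIORegion(stuck_threshold=10)
reads = 0
while not dev.read(0x10) & 0x1:
    reads += 1
print("unstuck after", reads, "reads")
```

Logging which random values unstick the guest would hint at the meaning of the individual register bits.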

Further ideas.

I don't think I'll invest more time into this project because there's little value, but I'm considering developing an Android port of this QEMU fork just for the fun of it. Perhaps a better option would be to emulate a newer SoC such as the Qualcomm 8250 and run an ARMv7 image from the HTC HD2.

It's quite sad that most if not all Android phones ship with the HYP (Hypervisor) mode disabled, so KVM is a no-go. On the other hand, allowing custom hypervisors opens up a Pandora's box of security issues, so it might be a good decision. Luckily for those of us who like to tinker with hardware, most TrustZone implementations also contain exploitable bugs, so there are some possibilities to uncover the potential :)

Today, many companies working with embedded SoCs seek GPU virtualization solutions. The demand is particularly high in the automotive sector, where people are forced to use eclectic mixes of relatively old SoCs while retaining binary compatibility. So it would be interesting to prototype a solution similar to Intel's GVT.

I think it would be nice to use the Qualcomm Adreno GPU as an emulation target. It is relatively well reverse-engineered: there is the Mesa FreeDreno driver for it, Qualcomm commits patches to the upstream KMS driver, and most importantly the GPU ISA is similar to older AMD Radeon GPUs, which AMD has documented extensively. Besides, a similar GPU is used in the Xbox 360, so the Xenia emulator, which simulates the Xbox GPU, is one more place to learn from.

Wednesday, September 28, 2016

Trying out and debugging Open-Source OpenCL on AMD Radeon RX480


I have decided to check out the state of OpenCL support for AMD GPUs in Linux Open-Source stack. In this post I will describe the experience of debugging issues across the whole stack trying to get a few basic apps up and running.

I have the AMD RX480 GPU, which supports the GCN 8.0.1 instruction set and has the code name "polaris10". At the time of writing, both the Linux kernel and Mesa in Debian were too old and did not support this GPU. Besides, as we will see later, the OpenCL in Mesa does not work out of the box, so we would need to learn to build it from source anyway.

Building the software.

A page on the FreeDesktop website has some relatively outdated instructions, but the general approach is the same.
You will need to install a relatively fresh kernel (I built Linux 4.7.0-rc7). I also installed the polaris10 firmware manually, but now it seems to be shipped in Linux/Debian.

I'm posting the steps I went through to build LLVM and Mesa with OpenCL support. After writing it all down I realized that perhaps everything here is redundant and I should write a repo manifest to clone everything with one command.

Getting the sources.

I'm also posting the folder tree and git hashes just in case.

Build CLC

CLC (libclc) is the runtime library for OpenCL. It contains code that is compiled to LLVM bitcode and linked against your apps. It provides intrinsics and implementations of the functions defined by the OpenCL standard, such as "get_global_id" and "mad". In case you're wondering, CUDA works exactly the same way, and the binary SDK from NVIDIA ships the relevant bitcodes (you can disassemble them with llvm-dis if you're interested).
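For illustration, here is what an implementation of get_global_id ultimately encodes, rendered as Python (per the OpenCL spec's definition; real CLC code reads hardware registers instead):

```python
# Hypothetical Python rendering of the OpenCL work-item id relation:
# the global id in a dimension is derived from the work-group id,
# the work-group size, the local id, and the global offset.
def get_global_id(group_id, local_size, local_id, global_offset=0):
    return global_offset + group_id * local_size + local_id

# Work-item 3 of work-group 2, with 64 work-items per group:
print(get_global_id(2, 64, 3))  # 131
```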

Actually, some of the definitions failed to compile with the latest LLVM, and I needed to add explicit type casts. You can use the patch (from this github gist) and apply it on top of 88b82a6f70012a903b10dfc1e2304d3ef2e76dbc to fix it.

git clone
./ -g ninja --with-llvm-config=/home/alexander/Documents/workspace/builds/llvm/build/bin/llvm-config
ninja install

Get LLVM Sources.

mkdir -p ~/Documents/workspace/builds/llvm/
I have a useful script to check out the latest llvm, clang and libcxx.

Get LLVM ROC branch.

This is optional. I used the ROC branch initially because I thought it would fix the issue with codegen (FLAT instructions), but it did not, and otherwise it seems to behave identically to vanilla LLVM.

git remote add roc
git fetch roc
git checkout roc/amd-common
cd tools/clang/
git remote add roc
git fetch roc
git checkout roc/amd-common

List of git hashes:
  • llvm - 1819637 Merge branch amd-master into amd-common
  • llvm/tools/clang/tools/extra - 079dd6a [clang-rename] Add comment after namespace closing
  • llvm/tools/clang - f779a93 Merge branch amd-master into amd-common
  • llvm/projects/libcxx - d979eed Fix Bug 30240 - std::string: append(first, last) error when aliasing.
  • llvm/projects/compiler-rt - 5a27c81 asan: allow __asan_{before,after}_dynamic_init without registered globals
  • llvm/projects/libcxxabi - 9f08403 [lit] Replace print with lit_config.note().

Build LLVM.

#create the build directory
mkdir -p ~/Documents/workspace/builds/llvm/build/
cd ~/Documents/workspace/builds/llvm/build/


#add the symlink to make LLVM pick up internal headers when building Mesa
cd ~/Documents/workspace/builds/llvm/build/include
ln -s $(echo $PWD/../tools/clang/include/clang) clang

In principle, it is not necessary to install LLVM; it's enough to add it to the PATH, and clang will pick up the necessary libraries itself.

Build MESA

Before building Mesa, we need to prepend the path to the "bin" directory of our custom LLVM build to the PATH variable so that our clang is picked up as the compiler. I also had to add a symlink to the source code in the build directory because some headers were not getting picked up, though I think a cleaner way would be to add it to CFLAGS.

I was using Mesa git 0d7ec8b7d0554382d5af6c59a69ca9672d2583cd.

git clone git://

The configure script seems to have an incorrect regex for getting the LLVM version, which causes compilation to fail with the latest LLVM (4.0.0). Here is a patch to fix it and also force the radeonsi chip class to VI (Volcanic Islands). The latter is not strictly necessary, but I used it during debugging to ensure the correct code path is always hit. Grab the diff at

export PATH=~/Documents/workspace/builds/llvm/build/bin:$PATH
cd ~/Documents/workspace/builds/mesa/mesa/
make clean

./ --enable-texture-float  --enable-dri3 --enable-opencl --enable-opencl-icd --enable-sysfs --enable-gallium-llvm --with-gallium-drivers=radeonsi --prefix=/opt/my_mesa --with-egl-platforms=drm --enable-glx-tls
make install

Now, before running any OpenCL application, we'll need to override the library path to point to our custom Mesa.
export LD_LIBRARY_PATH=/opt/my_mesa/lib:/home/alexander/Documents/workspace/builds/llvm/build/lib

Useful Links

AMD Presentations about GCN ISA.

GCN ISA Manual

Intel OpenCL Samples


I've used the older version because I thought it was shipped as a tarball while the latest one seemed to be an executable file (though it actually was a tarball with an executable script).

Trying it out.

Assertion in LLVM codegen.

So we can now try running any OpenCL application, and we'll hit an assertion.
./BitonicSort -p Clover

After reading the source code and the LLVM git log, it turned out that the codegen only emits the required FLAT atomic instructions for certain targets.
So what can we do? Let's try to see if we can force LLVM to emit the abovementioned "FLAT Atomics".
It turns out the code is already there, and there is a flag which is enabled by default when the LLVM target is "AMD HSA".

Now, let's think what could the possible limitations of this approach be?

FLAT Instructions

Let's see the description from the GCN Manual. "Flat memory instructions let the kernel read or write data in memory, or perform atomic operations on data already in memory. These operations occur through the texture L2 cache.".

I have not fully understood the difference between these new FLAT instructions and the older MUBUF/MTBUF ones. As it seems to me, before GCN3 different address spaces (global, private etc.) could only be accessed through different instructions, while FLAT instructions allow accessing any piece of GPU (and host) memory by its virtual address (hence the name, since the virtual address space is flat). So it seems that as long as the kernel driver sets up the GPU page tables correctly and we only use memory allocated through the OpenCL API, we should be fine.

FWIW, let's try running the code and see if it works.

LLVM Flat Instructions hack

As it has been mentioned before, we need to force LLVM to generate the "FLAT" instructions to access memory in GPU code.
A proper way to fix this would be to add the options in the Mesa source code at the location where it instantiates the LLVM compiler (clang).
To save us some time we can hack the relevant piece of the code generator in the LLVM directly (see the patch at the end of the article).

Trying it out again.

I have tried the following samples from the AMD App SDK and the Intel Samples. I didn't want to write sample code myself, and besides, OpenCL claims to be portable, so running the code from other IHVs' SDKs should be a good stress test of the toolchain.


AMD App SDK:
  • BlackScholesDP
  • BinomialOption
  • BitonicSort
  • BinarySearch
  • BufferBandwidth

Intel Samples:
  • Bitonic Sort
  • Montecarlo
  • God Rays
All of them worked, and the "verification" step, which computes the data on the host and compares it to the GPGPU result, passed! You can take a look at the screenshots and logs at the end of the post.

The "God Rays" demo even produces the convinceable picture.

Running Protonect (libfreenect2)

One of the apps I'm particularly interested in running is Protonect, the demo app for libfreenect2, the open-source driver for the Microsoft Kinect v2 RGBD ToF camera. Let's build it with OpenCL support and invoke it from the shell via "./Protonect cl".

And we're hitting an assertion!
'llvm::AsmPrinter::~AsmPrinter(): Assertion `!DD && Handlers.empty() && "Debug/EH info didn't get finalized"'.
Since it happens in a destructor, for the purposes of testing we can simply comment it out, because the worst thing that could happen is a memory leak.

Let's try running it again!
And we're hitting another error. This time it is "unsupported initializer for address space". Okay, let's debug it. First, let's grep for the string verbatim.

Good, we're hitting it in only one place. Now, the debug message is not particularly helpful because it does not give us the precise location or the name of the variable which caused the error; it only shows the prototype of the function. Let's just print the address space type and try to find out what might be causing it (see the patch at the end of the article).

What's this? Good, good, we see that the address space enum value is "0". Looking it up reveals that it's the private address space. Okay, what could cause the private address space to be used? Function-local arrays! Let's look for one! Great, here it is - see the "const float gaussian[9]" array in the "filterPixelStage1" function? Let's try commenting it out and replacing the "gaussian[j]" access with some constant (let it be 1.0f since it's a weighted average; if we chose 0.0f we'd see nothing in the output buffer). Yay, it worked!

Since we can no longer use the private address space, we need to find a way to get rid of the locally-declared array. OpenCL 1.1 does not support the static storage class.
One option would be to add another kernel argument and just pass the array there.
It might be slower, though, because the array would be placed into a slower region of cache-coherent memory.

Another option would be to compute the values in-place and since it's just a 3x3 convolution kernel for a gaussian blur it's easy to come up with a crude approximation formula which is what I've done (see the patch at the end of the post).
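As a quick sanity check of that crude stand-in (the exp(0.6 - 0.3*(|yi|+|xi|)) expression from the patch at the end of the post), here is a comparison against a reference Gaussian. The sigma below is my own arbitrary pick; the original libfreenect2 constants are not reproduced here:

```python
from math import exp

def crude(xi, yi):
    # The in-place approximation used in the patch
    return exp(0.6 - 0.3 * (abs(yi) + abs(xi)))

def gaussian(xi, yi, sigma=1.0):
    # Reference (unnormalized) Gaussian for comparison
    return exp(-(xi * xi + yi * yi) / (2 * sigma * sigma))

for yi in (-1, 0, 1):
    crude_row = [round(crude(xi, yi) / crude(0, 0), 3) for xi in (-1, 0, 1)]
    gauss_row = [round(gaussian(xi, yi), 3) for xi in (-1, 0, 1)]
    print(crude_row, gauss_row)
# [0.549, 0.741, 0.549] [0.368, 0.607, 0.368]
# [0.741, 1.0, 0.741] [0.607, 1.0, 0.607]
# [0.549, 0.741, 0.549] [0.368, 0.607, 0.368]
```

Both kernels peak at the center and decay monotonically with distance, which is all the bilateral filter weight really needs here.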

So far, Protonect works. Performance is subpar but not completely awful: it's around 4 times slower than an NVIDIA GTX970 with the binary driver (in most real-world scenarios the GTX970 and RX480 are quite close). I think that with a bit of profiling it can be sped up drastically. In fact, one of the contributing factors might be that my display is connected to the Intel iGPU and the PCIe bandwidth is saturated by the framebuffer blitting (it's 4K after all). I'll try with OpenGL running on the Radeon next time.

RadeonOpenCompute Initiative.

Since OpenCL is crap and CUDA is all the hype, it's no wonder many people want to run CUDA on all GPUs.
In principle, to run CUDA on non-NVIDIA hardware, one needs to implement the CUDA runtime and the lowering from the NVPTX intermediate language to the native ISA. The challenging part is actually building the source code, because one would need the proprietary headers from the NVIDIA toolchain. One could create fake headers to mimic the CUDA toolchain, but it is a shady legal area, especially when it comes down to predefined macros and constants.

In fact, there's not much need to emit NVPTX at all; you can lower straight to the ISA. What AMD have done to work around the legal issues is come up with their own language called "HIP", which mimics most of the CUDA design but names keywords and predefined macros differently. Therefore porting is straightforward, even with a search/replace function, but there is also an automated translator based on clang.
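A toy illustration of the search/replace idea (the real automated translator, "hipify", is clang-based and handles far more than this tiny regex, which only covers the trivial cuda* to hip* API renames):

```python
import re

def toy_hipify(source):
    # Rename cudaFoo -> hipFoo; HIP mirrors most of the CUDA runtime API.
    return re.sub(r"\bcuda([A-Z]\w*)", r"hip\1", source)

cuda_snippet = "cudaMalloc(&buf, size); cudaMemcpy(buf, h, size, cudaMemcpyHostToDevice);"
print(toy_hipify(cuda_snippet))
# hipMalloc(&buf, size); hipMemcpy(buf, h, size, hipMemcpyHostToDevice);
```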

GCN ISA: RX480 vs R390

It looks interesting that the Polaris10 (RX480) seems to use an older version of the ISA (8.0.1) while the older R390 uses 8.0.3. Not sure if it's a bug in the documentation. However, it's interesting that AMD GPUs consist of multiple units (such as the video encoder/decoder) which seem to be picked in arbitrary combinations when designing a new chip.

HIP without Mesa.

HIP/ROC ships its own runtime, and since all memory access is done through DRM via ioctls on "/dev/dri/cardX", Mesa is technically not needed to implement OpenCL or any other compute API.

However, the open question is buffer sharing between different APIs. I have come across this issue before when dealing with an Intel GPU. The good news is that on Linux there's an EGL extension to export the DRM buffer object (BO) from a GLES context (but not from GLX). You can read the old article about it here:

Outstanding issues.

Slow Compilation time for Kernels.

While running the OpenCL code samples and Protonect, I noticed that they take several seconds to start up, compared to the immediate start with the Intel OpenCL driver (Beignet). I think it might be caused by LLVM. It would be a good idea to profile everything using the Linux "perf" tool.

Private Address Spaces.

As we've seen with the libfreenect2 kernels, private address spaces do not work. It is necessary to figure out whether they are not supported at all or it is just a simple bug.
Until this is resolved, it effectively renders a lot of GPGPU applications unusable.

LLVM Assertions (Bugs).

As mentioned before, running Protonect and compiling libfreenect2 kernels yields the following assertion:
'llvm::AsmPrinter::~AsmPrinter(): Assertion `!DD && Handlers.empty() && "Debug/EH info didn't get finalized"'.

While trying to run one of the OpenCL samples I hit yet another assertion:
'clang::Sema::~Sema(): Assertion `DelayedTypos.empty() && "Uncorrected typos!"' failed.


As we can see, OpenCL support in the Mesa/Clover stack is still not mature.
However, given the RadeonOpenCompute initiative, I'm sure most of the missing features will be in place soon.
So far I'm glad that the just-released GPU's support is on par with older models, and working around the issues was not too hard for me.
I've also satisfied part of my desire to understand the interaction between the different components involved in the OpenCL/Compute pipeline.

I think for a start I will look at the LLVM assertions and see if I can debug them or prepare the test cases to submit upstream.
Next up I'll be trying out HIP to build some CUDA samples.

One idea I had in mind for quite some time was virtualizing a mobile GPU. I think Qualcomm Adreno is a good target because it's relatively well supported by the FreeDreno driver and the ISA is similar to other AMD chips. The plan is to add the ISA decoder and MMIO space emulation to QEMU so that it can be used both in KVM on ARM and in emulation mode on Intel. Of course, the most nerdy way to do it would be to make a translator from the guest ISA to the host ISA. But for a start we could reuse the Virgil driver as a target.
I think it would be a very useful thing for running legacy applications in a virtualized environment (such as Windows RT or automotive IVI systems) and could aid in security engineering.
Hopefully I will have enough motivation and time to do it before I'm bound by an NDA :)

Latest update!

Also, check out the latest news! It looks like Mesa has now switched to using the HSA ABI by default, which means the hack for the FLAT instructions will not be needed with more recent versions - they will be enabled automagically!

I started trying OpenCL on the RX480 around 2 weeks ago; then I spent 1 week debugging and 1 week I was away. Meanwhile some changes seem to have landed upstream, and some of the hacking described here may be redundant. I urge you to check with the latest source code, but I decided to keep this post as-is to describe the debugging process I went through.

Extra: Logs and Screenshots.

God Rays from Intel OpenCL samples.
Protonect running on AMD FOSS OpenCL stack.

LLVM Force FLAT instructions hack.

diff --git a/lib/Target/AMDGPU/AMDGPUSubtarget.cpp b/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
index 3c4b5e7..f6d500c 100644
--- a/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
+++ b/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
@@ -46,7 +46,7 @@ AMDGPUSubtarget::initializeSubtargetDependencies(const Triple &TT,
   // disable it.

   SmallString<256> FullFS("+promote-alloca,+fp64-denormals,+load-store-opt,");
-  if (isAmdHsaOS()) // Turn on FlatForGlobal for HSA.
+  if (1 || isAmdHsaOS()) // Turn on FlatForGlobal for HSA.
     FullFS += "+flat-for-global,+unaligned-buffer-access,";
   FullFS += FS;

Patch for Libfreenect2

@@ -102,8 +102,8 @@ void kernel processPixelStage1(global const short *lut11to16, global const float
  * Filter pixel stage 1
-void kernel filterPixelStage1(global const float3 *a, global const float3 *b, global const float3 *n,
-                              global float3 *a_out, global float3 *b_out, global uchar *max_edge_test)
+void kernel filterPixelStage1(__global const float3 *a, __global const float3 *b, __global const float3 *n,
+                              __global float3 *a_out, __global float3 *b_out, __global uchar *max_edge_test)
   const uint i = get_global_id(0);

@@ -113,7 +113,7 @@ void kernel filterPixelStage1(global const float3 *a, global const float3 *b, gl
   const float3 self_a = a[i];

   const float3 self_b = b[i];


   if(x < 1 || y < 1 || x > 510 || y > 422)
@@ -155,7 +155,9 @@ void kernel filterPixelStage1(global const float3 *a, global const float3 *b, gl
         const int3 c1 = isless(other_norm * other_norm, threshold);

         const float3 dist = 0.5f * (1.0f - (self_normalized_a * other_normalized_a + self_normalized_b * other_normalized_b));
-        const float3 weight = select(gaussian[j] * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);
+        //const float3 weight = 1.0f;//select(gaussian[j] * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);
+        const float gj = exp(0.6 - (0.3 * (abs(yi) + abs(xi))));
+        const float3 weight = select(gj * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);

LLVM Patch for assertion in AsmPrinter destructor.

diff --git a/lib/CodeGen/AsmPrinter/AsmPrinter.cpp b/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
index 0fed4e9..0d63a2a 100644
--- a/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
+++ b/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
@@ -114,6 +114,7 @@ AsmPrinter::AsmPrinter(TargetMachine &tm, std::unique_ptr<MCStreamer> Streamer)

 AsmPrinter::~AsmPrinter() {
+       return;
   assert(!DD && Handlers.empty() && "Debug/EH info didn't get finalized");

   if (GCMetadataPrinters) {

Patch for debugging Address Space issues.

diff --git a/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 682157b..d2a5c4a 100644
--- a/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -766,6 +766,8 @@ SDValue AMDGPUTargetLowering::LowerGlobalAddress(AMDGPUMachineFunction* MFI,
     unsigned Offset = MFI->allocateLDSGlobal(DL, *GV);
     return DAG.getConstant(Offset, SDLoc(Op), Op.getValueType());
+  default:
+       printf("%s: address space type=%d\n", __func__, G->getAddressSpace());

   const Function &Fn = *DAG.getMachineFunction().getFunction();

OpenCL Bandwidth Test (AMD App SDK)

Intel (Beignet GPGPU Driver)

Platform found : Intel

Device  0            Intel(R) HD Graphics Haswell GT2 Desktop
Build:               release
GPU work items:      32768
Buffer size:         33554432
CPU workers:         1
Timing loops:        20
Repeats:             1
Kernel loops:        20
inputBuffer:         CL_MEM_READ_ONLY 
outputBuffer:        CL_MEM_WRITE_ONLY 

Host baseline (naive):

Timer resolution     256.11  ns
Page fault           531.44  ns
CPU read             15.31 GB/s
memcpy()             15.54 GB/s
memset(,1,)          26.54 GB/s
memset(,0,)          27.06 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)

1. Host mapped write to inputBuffer
 clEnqueueMapBuffer -- WRITE (GBPS)     | 9513.290
 memset() (GBPS)                        | 24.746
 clEnqueueUnmapMemObject() (GBPS)       | 6176.168

2. GPU kernel read of inputBuffer
 clEnqueueNDRangeKernel() (GBPS)        | 38.225

 Verification Passed!

3. GPU kernel write to outputBuffer
 clEnqueueNDRangeKernel() (GBPS)        | 26.198

4. Host mapped read of outputBuffer
 clEnqueueMapBuffer -- READ (GBPS)      | 9830.400
 CPU read (GBPS)                        | 15.431
 clEnqueueUnmapMemObject() (GBPS)       | 10485.760

 Verification Passed!


AMD Radeon (OpenCL) 

Platform found : Mesa

Device  0            AMD POLARIS10 (DRM 3.2.0 / 4.7.0-rc7-meow+, LLVM 4.0.0)
Build:               release
GPU work items:      32768
Buffer size:         33554432
CPU workers:         1
Timing loops:        20
Repeats:             1
Kernel loops:        20
inputBuffer:         CL_MEM_READ_ONLY 
outputBuffer:        CL_MEM_WRITE_ONLY 

Host baseline (naive):

Timer resolution     256.12  ns
Page fault           538.31  ns
CPU read             12.19 GB/s
memcpy()             11.38 GB/s
memset(,1,)          20.93 GB/s
memset(,0,)          22.98 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)

1. Host mapped write to inputBuffer
 clEnqueueMapBuffer -- WRITE (GBPS)     | 7586.161
 memset() (GBPS)                        | 6.369
 clEnqueueUnmapMemObject() (GBPS)       | 12822.261

2. GPU kernel read of inputBuffer
 clEnqueueNDRangeKernel() (GBPS)        | 113.481

 Verification Passed!

3. GPU kernel write to outputBuffer
 clEnqueueNDRangeKernel() (GBPS)        | 105.898

4. Host mapped read of outputBuffer
 clEnqueueMapBuffer -- READ (GBPS)      | 9.559
 CPU read (GBPS)                        | 17.179
 clEnqueueUnmapMemObject() (GBPS)       | 4060.750

 Verification Passed!


Thursday, March 10, 2016

Fuzzing Vulkans, how do they work?


Disclaimer: I have not yet fully read the specs on SPIR-V or Vulkan.

I decided to find out how hard it is to crash code working with SPIR-V.
Initially I wanted to crash the actual GPU drivers but for a start I decided
to experiment with GLSLang.

What I got

I used the "afl-fuzz" fuzzer to generate test cases that could crash
the parser. I have briefly examined the generated cases.
I have uploaded the results (which contain the SPIR-V binaries causing
the "spirv-remap" to crash) to the [following location](

Some of them trigger assertions in the code (which is not bad, but perhaps
returning an error code and shutting down cleanly would be better).
Some of them cause the code to hang for a long time or indefinitely (which is worse,
especially if someone intends to use the SPIR-V parser code online in an app).

Perhaps some of the results marked as "hangs" just cause too long compilation
time and could produce more interesting results if the timeout in "afl-fuzz"
is increased.

Two notable examples causing long compilation time are:
"out/hangs/id:000011,src:000000,op:flip1,pos:538" - for this one I waited for
a minute but it still did not complete the compilation while causing 100% CPU load.
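As an aside, the kind of single-bit mutation behind afl-fuzz case names like "op:flip1,pos:538" can be sketched as follows (my own illustration; afl-fuzz of course does far more than this):

```python
def flip1(data, pos):
    # Flip the single bit at bit-position `pos`, MSB-first within each
    # byte, matching afl-fuzz's deterministic flip1 stage.
    out = bytearray(data)
    out[pos // 8] ^= 0x80 >> (pos % 8)
    return bytes(out)

# SPIR-V magic number, little-endian on disk
spirv_header = (0x07230203).to_bytes(4, "little")
mutated = flip1(spirv_header, 0)
print(spirv_header.hex(), "->", mutated.hex())
# 03022307 -> 83022307
```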

A log output of "glslang" indicating that most of the error cases found are handled, but with "abort" instead of graceful shutdown.


I have also tried using these shaders with the NVIDIA driver (since it was the only hardware I could run a real Vulkan driver on).

I have used the "instancing" demo from [SaschaWillems Repository]( .
I patched it to accept the path to the binary shader via the command line.
Next, I fed it the generated test cases. Some of them triggered segfaults inside the NVIDIA driver.
What is curious is that when I used the "hangs" examples, they also caused
the NVIDIA driver to take an extra long time to compile and eventually crash
at random places.

I think it indicates either that there is some common code between the driver
and GLSLang (the reference implementation), or that the specification is missing
some sanity check somewhere and the compiler can get stuck optimizing certain inputs.
Is there a place in the specification that mandates that all the values are
checked to be within the allowed range, and all complex structures (such as
function calls) are checked recursively?

Perhaps I should have a look at other drivers (Mali anyone?).

[ 3672.137509] instancing[26631]: segfault at f ip 00007fb4624adebf sp 00007ffefd72e100 error 4 in[7fb462169000+1303000]
[ 3914.294222] instancing[26894]: segfault at f ip 00007f00b28fcebf sp 00007ffdb9bab980 error 4 in[7f00b25b8000+1303000]
[ 4032.430179] instancing[27017]: segfault at f ip 00007f7682747ebf sp 00007fff46679bf0 error 4 in[7f7682403000+1303000]
[ 4032.915849] instancing[27022]: segfault at f ip 00007fb4e4099ebf sp 00007fff3c1ac0f0 error 4 in[7fb4e3d55000+1303000]
[ 4033.011699] instancing[27023]: segfault at f ip 00007f7272900ebf sp 00007ffdb54261e0 error 4 in[7f72725bc000+1303000]
[ 4033.107939] instancing[27025]: segfault at f ip 00007fbf0353debf sp 00007ffde4387750 error 4 in[7fbf031f9000+1303000]
[ 4033.203924] instancing[27026]: segfault at f ip 00007f0f9a6f0ebf sp 00007ffff85a9dd0 error 4 in[7f0f9a3ac000+1303000]
[ 4033.299138] instancing[27027]: segfault at 2967000 ip 00007fcb42cab720 sp 00007ffcbad45228 error 6 in[7fcb42c26000+19f000]
[ 4033.394667] instancing[27028]: segfault at 36d2000 ip 00007efc789eb720 sp 00007fff26c636d8 error 6 in[7efc78966000+19f000]
[ 4033.490918] instancing[27029]: segfault at 167b15e170 ip 00007f3b02095ec3 sp 00007ffd768cbf68 error 4 in[7f3b01cbe000+1303000]
[ 4033.586699] instancing[27030]: segfault at 2ffc000 ip 00007fdebcc06720 sp 00007fff4fe59bd8 error 6 in[7fdebcb81000+19f000]
[ 4033.682939] instancing[27031]: segfault at 8 ip 00007fb80e7eed50 sp 00007ffe9cd21de0 error 4 in[7fb80e410000+1303000]
[ 4374.480872] show_signal_msg: 27 callbacks suppressed
[ 4374.480876] instancing[27402]: segfault at f ip 00007fd1fc3cdebf sp 00007ffe483ff520 error 4 in[7fd1fc089000+1303000]
[ 4374.809621] instancing[27417]: segfault at 2e0c3910 ip 00007f39af846e96 sp 00007ffe1c6d8f10 error 6 in[7f39af46f000+1303000]
[ 4374.905112] instancing[27418]: segfault at 2dc46a68 ip 00007f7b9ff7af32 sp 00007fff290edf00 error 6 in[7f7b9fba2000+1303000]
[ 4375.001019] instancing[27419]: segfault at f ip 00007f5a4e066ebf sp 00007ffe0b775d70 error 4 in[7f5a4dd22000+1303000]
[ 4375.096894] instancing[27420]: segfault at f ip 00007f7274d49ebf sp 00007ffe96fdea10 error 4 in[7f7274a05000+1303000]
[ 4375.193165] instancing[27421]: segfault at f ip 00007fa3bf3c8ebf sp 00007ffc4117e8d0 error 4 in[7fa3bf084000+1303000]
[ 4375.288969] instancing[27423]: segfault at f ip 00007f50e0327ebf sp 00007ffc02aa1d50 error 4 in[7f50dffe3000+1303000]
[ 4375.385530] instancing[27424]: segfault at f ip 00007f0d9a32eebf sp 00007ffd0298eb40 error 4 in[7f0d99fea000+1303000]
[ 4375.481829] instancing[27425]: segfault at f ip 00007f8400bc5ebf sp 00007ffef0334240 error 4 in[7f8400881000+1303000]
[ 4375.576983] instancing[27426]: segfault at 2dec2bc8 ip 00007f52260afec3 sp 00007fffd2bd1728 error 4 in[7f5225cd8000+1303000]

How to reproduce

Below are the steps I have taken to crash the "spirv-remap" tool.
I believe this issue is worth looking at because some vendors may
choose to build their driver internals on the reference implementation,
which may lead to these bugs creeping directly into shipped software.

0. I used a Debian Linux box. I installed the "afl-fuzz" tool,
and also manually copied the Vulkan headers to "/usr/include".

1. cloned the GLSLang repository
git clone
cd glslang

2. Compiled it with the afl-fuzz instrumentation wrappers
mkdir build
cd build/
cmake -DCMAKE_C_COMPILER=afl-gcc -DCMAKE_CXX_COMPILER=afl-g++ ..
make install
cd ..

3. Compiled a sample shader from the GLSL to the SPIR-V format using
./build/install/bin/glslangValidator -V -i Test/spv.130.frag

4. Verified that the "spirv-remap" tool works on the binary
./build/install/bin/spirv-remap -v -i frag.spv -o /tmp/

5. Fed the SPIR-V binary to the afl-fuzz
afl-fuzz -i in -o out ./build/install/bin/spirv-remap -v -i @@ -o /tmp

6. Quickly discovered several crashes. I attach a screenshot of afl-fuzz
at work.

7. Examined them.

First, I made a hex diff of the good and bad files. The command to generate
the diff is the following:
for i in out/crashes/*; do hexdump in/frag.spv > in.hex && hexdump $i > out.hex && diff -Naur in.hex out.hex; done > hex.diff

Next, I just ran the tool on all cases and generated the log of crash messages.
for i in out/crashes/*; do echo $i && ./build/install/bin/spirv-remap -i $i -o /tmp/ 2>&1 ;done > abort.log


Well, there are two obvious conclusions:
1. Vulkan/SPIR-V is still a WIP and drivers are not yet perfect
2. GPU drivers have always been notorious for poor compilers - not only codegen, but also parsers and validators. Maybe part of the reason is that CPU compilers simply handle more complex code, so more edge cases have already been hit there.

Thursday, February 25, 2016

Notes on firmware and tools


In this blog post I mainly want to summarize my latest experiments with clang's static analyzer, along with some thoughts on what could further be done in the analyzer and on open-source software quality in general. It's mostly some notes I've decided to put up.

Here are the references to the people involved in developing clang static analyzer at Apple. I recommend following them on twitter and also reading their presentation on developing custom checkers for the analyzer.
Ted Kremenek -
Anna Zaks -
Jordan Rose -

"How to Write a Checker in 24 Hours" -
"Checker Developer Manual" - - This one requires some understanding of LLVM, so I recommend getting comfortable with using the analyzer and looking at AST dumps first.

There are not many analyzer plugins that were not made at Apple.


Out of curiosity I ran the clang analyzer on glibc. Setting it up was not a big deal - in fact, all that was needed was to run scan-build. It did a good job of intercepting the gcc calls, and most of the sources were successfully analyzed. I did not do anything special, but even running the analyzer with the default configuration revealed a few true bugs, like the one shown in the screenshot.

For example, the "iconv" function has a check to see if the argument "outbuf" is NULL, which indicates that this is a valid case expected by the authors. The manpage for the function also says that passing a NULL "outbuf" is valid. However, we can see that one of the branches is missing a similar NULL check, which probably resulted from a copy-paste in the past. So, passing valid pointers for "inbuf" and "inbytesleft" and a NULL pointer for "outbuf" leads to a NULL pointer dereference and consequently a SIGSEGV.

Fun fact: my example is also not quite correct, as pointed out by my Twitter followers. The third argument to iconv must be a pointer to the integer, not the integer itself. However, on my machine my example crashes at dereferencing the "outbuf" pointer and not "inbytesleft", because the argument evaluation order in C is unspecified, and on x86_64 arguments are usually pushed to the stack (and therefore evaluated) in reverse order.

Is it a big deal? Who knows. On the one hand, it's a userspace library, and unlike OpenSSL, no one is likely to embed it into kernel-level firmware. On the other, I may very well imagine a device such as a router or a web kiosk where this NULL pointer dereference could be triggered, because internationalization and text manipulation are always a complex issue.

Linux Kernel.

I had this idea of trying to build LLVMLinux with clang for quite a while, but never really had the time to do it. My main interest in doing so was using the clang static analyzer.

Currently, some parts of the Linux kernel fail to build with clang, so I had to use the patches from the LLVMLinux project. They did not apply cleanly, though, and I had to manually edit several patches. Another problem is that "git am" does not support the "fuzzy" strategy when applying patches, so I had to use a script found on GitHub which uses "patch" and "git commit" to do the same thing.

I have pushed my tree to GitHub. I based it on the latest upstream at the time I worked on it (the start of February 2016). The branch is called "4.5_with_llvmlinux".

I've used the following commands to get the analysis results. Note that some files have failed to compile, and I had to manually stop the compilation job for one file that took over 40 minutes.

export CCC_CXX=clang++
export CCC_CC=clang
scan-build make CC=clang HOSTCC=clang -j10 -k


Porting clang instrumentations to major OpenSource projects.

Clang has a number of instrumentation plugins called "Sanitizers" which were largely developed by Konstantin Serebryany and Dmitry Vyukov at Google.

Arguably the most useful tool for C code is AddressSanitizer, which catches use-after-free and out-of-bounds accesses on arrays. There exists a port of the tool for the Linux kernel, called Kernel AddressSanitizer (KASAN), and it has been used to uncover a lot of memory corruption bugs leading to potential vulnerabilities or kernel panics.
Another tool, also based on compile-time instrumentation, is ThreadSanitizer for catching data races; it has also been ported to the Linux kernel.

I think it would be very useful to port these tools to other projects. There is a lot of system-level software driving the critical parts of the system initialization process. To name a few:
  • EDK2 UEFI Development Kit
  • LittleKernel bootloader by Travis Geiselbrecht. It has been extensively used in the embedded world instead of u-boot lately. Qualcomm (and Codeaurora, its Open-Source department) is using a customized version of LK for its SoCs, and nearly every mobile phone with Android has LK inside it. NVIDIA have recently started shipping a version of LK called "TLK" or "Trusted Little Kernel" in its K1 processors, with the largest deployment yet being the Google Nexus 9 tablet.
  • U-boot bootloader, which is still used in a lot of devices.
  • FreeBSD and other kernels. While I don't know how widely they are deployed today, it would still be useful, at least for junior and intermediate kernel hackers.
  • XNU Kernel powering Apple iOS and OS X. I have some feeling that Apple might be using some of the tools internally, though the public snapshots of the source are heavily stripped of newest features.
Btw, if anyone is struggling to get the userspace AddressSanitizer working with NVIDIA drivers, take a look at this thread where I posted a workaround. Long story short, NVIDIA mmaps memory in large chunks up to a certain range, and you need to change the virtual address which ASan claims for itself.

Note that you will likely run into problems if you build a library with a customized ASan and link it to something with an unpatched ASan, so I recommend using a distro where you can rebuild all packages simultaneously if you want to experiment with custom instrumentation - such as NixOS, Gentoo or *BSD.

As I have mentioned before, there is a caveat with all these new instrumentation tools - most of them are made for 64-bit architectures while the majority of embedded code remains 32-bit. There are several ways to address this, such as running memory checkers in a separate process or using a simulator such as a modified QEMU, but it's an open problem.

Other techniques for firmware quality

In one of my previous posts I drew attention to gathering code coverage in low-level software such as bootloaders. With GCOV being quite easy to port, I think more projects should be exposed to it. Also, AddressSanitizer now comes with its own infrastructure for gathering code coverage, which is more extensive than GCOV.

Here's a link to the post which shows how one can easily embed GCOV to get coverage using Little Kernel as an example.

While userspace applications have received quite a lot of fuzzing recently, especially with the introduction of the AFL fuzzer, the kernel side has received little attention. For Linux, there is a syscall fuzzer called Trinity.

There is also an interesting project aimed at fuzzing the kernel through the USB stack (vUSBf).

What should be done is adapting these techniques to other firmware projects. On the one hand, it might get tricky due to the limited debugging facilities available in some kernels. On the other hand, the capabilities provided by simulators such as QEMU are virtually unlimited (pun unintended).

An interesting observation is that there are limited sources of external data input on a system. They include processor interrupt vectors/handlers and MMIO hardware. As for the latter, Linux and most other firmwares have certain facilities for working with MMIO - functions like "readb()", "readw()" and "ioremap()". Also, if we're speaking of a simulator such as QEMU, we can identify memory regions of interest by walking the page table and checking physical addresses against external devices, and also checking the access type bits - for example, data marked as executable is more likely to be firmware code, while uncached contiguous regions of memory are likely DMA windows.

The ARM mbed TLS library is a good example of a project that tries to integrate as many dynamic and static tools into its testing process as possible. However, it can be built as a userspace library on a desktop, which makes it less interesting in the context of firmware security.

Another technique that has been catching my attention for a long time is symbolic execution. In many ways, it poses a problem similar to static analysis - you need to find a set of input values, constrained by certain equations, that triggers a specific execution path leading to an incorrect state (say, a NULL pointer dereference).

Rumor has it that a tool based on this technique, called SAGE, is actively used at Microsoft Research to analyze internal code, but sadly there are not many open-source and well-documented tools one can easily play with and directly plug into an existing project.

An interesting example of applying this idea to a practical problem at a large, established corporation (Intel) is presented in the paper "Symbolic Execution for BIOS Security", which tries to utilize symbolic execution with KLEE for attacking firmware - the SMM handler. You can find more interesting details in the blog of one of the authors of the paper, Vincent Zimmer ( and

Also, not directly related, but here is an interesting presentation about bug hunting on Android.

I guess now I will have to focus on studying recent exploitation techniques used for hacking XNU, Linux and Chromium to get a clearer picture of what's needed :)

Further developing clang analyzer.

One feature missing from clang is the ability to perform analysis across translation units. A translation unit (TU) in clang/LLVM roughly represents a file on disk being parsed. The clang analyzer is implemented as a pass which traverses the AST and is limited to one translation unit. Strictly speaking, that is not quite true - it will analyze included headers recursively. But if you have two separate C files which do not include one another, and a function from one file calls a function from the other, the analysis will not work.

Implementing a checker across multiple sources is not a trivial thing, as pointed out by the analyzer authors in the mailing list, though this topic is often brought up by different people.

Often it is possible to come up with a workaround, especially if one aims at implementing ad-hoc checks for their own project. The simplest one would be creating a top-level "umbrella" C file which includes all files with the implementations of the functions. I have seen some projects do exactly this. The most obvious shortcoming of this approach is that it requires redesigning the whole structure of the project, and it will not work if some of the translation units need custom C compiler options.

Another option would be to dump/serialize the AST and any additional information during the compilation of each TU and process it after the whole project is built. This approach has been proposed multiple times on the mailing list, and there exists at least one paper which claims to do that.

"Towards Vulnerability Discovery Using Extended Compile-time Analysis" -

In fact, the analyzer part itself might very well be independent of the actual parser and could be reused, for example, to analyze data obtained by disassembly, but that's a topic for another research area.

Developing ad-hoc analyzers for certain projects.

While it is very difficult to statically rule out many kinds of errors in an arbitrary project, due either to the state explosion problem or to the dynamic nature of the code, quite often we can design tools to verify contracts specific to a certain project.

An example of such a tool is "sparse" from the Linux kernel. It effectively works as a parser for C code and can be made to run on every C file compiled by GCC while building the kernel. It allows specifying annotations on declarations in the C code, similar to how attributes were later implemented in GCC and Clang.

One notable example of code in the Linux kernel which deals with passing void pointers around and relies on pointer trickery via the "container_of" macro is the workqueue library.

While working on the FreeBSD kernel during GSoC in 2014, I faced a similar problem while developing device drivers - at certain places pointers were cast to void pointers, and cast back to typed ones where needed. It's certainly easy to make a mistake while coding these things.

Now, if we dump enough information during compilation, we can implement advanced checks. For example, when facing a function pointer which is initialized dynamically, we can do two things: first, find all places where it can potentially be initialized; second, find all functions with a matching prototype. While checking all of them might be time-consuming and generate false positives, it will also allow checking more code statically at compilation time.

A notable source of problems when working with C code is that the linking stage is traditionally separated from the compilation stage. The linker usually manipulates abstract "symbols" which are just void pointers. Even though it would be possible to store enough information about the types in a section of the ELF (in fact, DWARF debugging data contains type information) and use it to type-check symbols when linking, this is not usually done.

This leads to certain funky problems. For example, aliases (weak aliases) are a link-time feature. If one defines an alias to some function where the type of the alias does not match the type of the original function, the compiler cannot check it (well, it could if someone wrote a checker, but it does not), and you will get silent memory corruption at runtime. I once ran into this issue when porting the RUMP library, which ships a custom libc, and our C library had a different size for "off_t".


There are two ideas I had floating around for quite a while.

Microkernelizing Linux.

An interesting research problem would be coming up with a tool to automatically convert/rewrite Linux kernel code into a microkernel, with device drivers and other complex pieces of code residing in separate protection domains (processes).

It is interesting for several reasons. One is that a lot of microkernels, such as L4, rely on DDE-Kits to run pieces of Linux and other systems (such as NetBSD in the case of RUMP) to provide drivers. Another is that there's a lot of tested code with large functionality which could possibly be made more secure by minimizing the impact of memory corruption.

Besides obvious performance concerns, there are a lot of research questions around this.

  • Converting accesses to global variables into IPC accesses. The most challenging part would be dealing with function pointers and callbacks.
  • Parsing KConfig files and "#ifdef" statements to ensure all conditional compilation cases are covered when refactoring. This in itself is a huge problem for every C codebase - if you change something in one branch of an "#ifdef", you cannot guarantee you didn't break another branch. To get whole coverage, it would be useful to come up with an automated way to ensure all configurations of "#ifdef" are built.
  • Deciding which pieces of code should be linked statically and reside in the same process. For example, one might want to make the sound subsystem, DRM and VFS run as separate processes, but going as far as converting each TU into a process would be overkill and a performance disaster.

Translating C code to another language. I'm not really sure if it would be useful. It is likely that any complex code involving pointers and arrays would require a target language with similar capabilities (if we're speaking of generating useful and readable idiomatic code, and not just a translator such as emscripten). Therefore, the target language might very well have the same areas of unspecified behavior. Some people have proposed a stricter and more well-defined dialect of C.

One may note that it is not necessary to use clang for any of these tasks. In fact, one can get away with writing a custom parser or hacking GCC; these options are perfectly legal. I have had some of these ideas floating around for at least three years, but back then I didn't have the skills to build GCC and hack on it, and now I've mostly been using clang, but it's not a big deal.


Overall, neither improving the compiler nor switching to another language alone would save us from errors. What's often overlooked is the process of continuous integration, with regression tests to show what became broken, with gathering coverage, with integration tests.
There are a lot of non-obvious parameters which are difficult to measure. For example, how does one set up a regression test that would detect software breakage due to a new compiler optimization?

Besides, we could borrow a lot of practices from the HDL/digital design industry. Beyond coverage, an interesting idea is simulating the same design in different languages with different semantics, in the hope that if there are mistakes at the implementation stage, they are not in the same places, and testing will show where the outputs of the two systems diverge.


Ping me if you're interested in working on the topics mentioned above :)

Monday, January 25, 2016

On GPU ISAs and hacking

Recently I've been reading several interesting posts on hacking Windows OpenGL drivers to exploit the "GL_ARB_get_program_binary" extension to access raw GPU instructions.

Reverse engineering is always cool and interesting. However, a lot of this work has already been done more than once by people directing their efforts at making open-source drivers for Linux, and I believe these projects are a good way to jump-start into GPU architecture without reading huge vendor datasheets.

Since I'm interested in GPU drivers, I decided to put up links to several open-source projects that could be interesting to fellow hackers and serve as a reference, potentially more useful than vendor documentation (which is not available for some GPUs - you know, the green ones).

Please also take a look at this interesting blog post for the detailed list of GPU datasheets -

* AMD R600 GPU Instruction Set Manual. It is an uber-useful artifact because many modern GPUs (older AMD desktop GPUs, Qualcomm Adreno GPUs and the Xbox 360 GPU) bear a lot of similarities to the AMD R300/R600 series.

* Xenia Xbox 360 Emulator. It features a dynamic binary translation module that translates the AMD instruction stream into GLSL shaders at runtime. The Xbox 360 has an AMD Radeon GPU, similar to the ones in the R600 series and Adreno.

Let's look at "src/xenia/gpu". Warning: the code in this project seems to change very often and the links may quickly become stale, so take a look yourself.
One interesting file is "" which has register definitions for MMIO space allowing you to figure out a lot about how the GPU is configured.
The most interesting file is "src/xenia/gpu/ucode.h", which contains the structures describing the ISA.
src/xenia/gpu/ contains the logic to disassemble the actual GPU packet stream at binary level
src/xenia/gpu/ contains the logic to translate the GPU bytecode to the IR
src/xenia/gpu/ the code to translate Xenia IR to GLSL
Part of the Xbox->GLSL logic is also contained in "src/xenia/gpu/gl4/" file.

* FreeDreno - the open-source driver for the Qualcomm/ATI Adreno GPUs. These GPUs are based on the same architecture as the ATI R300 and consequently the AMD R600 series, and both the ISAs and register spaces are quite similar. This driver was done by Rob Clark, a GPU driver engineer with a lot of experience who (unlike most FOSS GPU hackers) previously worked on another GPU at a vendor and thus had enough experience to successfully reverse-engineer a GPU.

Take a look at "src/gallium/drivers/freedreno/a3xx/fd3_program.c" to have an idea how the GPU command stream packet is formed on the A3XX series GPU which is used in most Android smartphones today. Well, it's the same for most GPUs - there is a ring-buffer to which the CPU submits the commands, and from where the GPU reads them out, but the actual format of the buffer is vendor and model specific.

I really love this work and this driver because it was the first open-source driver for mobile ARM SoCs that provided support for OpenGL ES and non-ES complete enough to run GNOME Shell and other window managers with GPU acceleration, and it supports most Qualcomm GPUs.

* Mesa. Mesa is the open-source stack for several compute and graphics APIs on Linux: OpenGL, OpenCL, OpenVG, VAAPI and OpenMAX.

Contrary to popular belief, it is not a "slow" and "software-rendering" thing. Mesa has several backends. "Softpipe" is indeed a slow CPU-side implementation of OpenGL; "llvmpipe" is a faster one using LLVM for vectorizing. But the most interesting part is the drivers for the actual hardware. Currently, Mesa supports the most popular GPUs (with the notable exception of ARM Mali). You can peek into the Mesa source code to get insight into how to control both the 2D/ROP engines of the GPUs and how to upload the actual instructions (the ISA bytecode) to the GPU. Example links are below. It's a holy grail for GPU hacking.

"" contains the code to emit I915 instructions.
The code for the latest GPUs - Broadwell - is in the "i965" driver.

While browsing the Mesa code, one will notice the mysterious "bo" name. It stands for "Buffer Object" and is an abstraction for a GPU memory region. It is handled by the userspace library called "libdrm" -

* Linux kernel "DRM/KMS" drivers. KMS/DRM is a facility of the Linux kernel to handle GPU buffer management and basic GPU initialization and power management in kernel space. The interesting thing about these drivers is that you can peek at the places where the instructions are actually transferred to the GPU - the registers containing physical addresses of the code.
There are other artifacts. For example, I like this code in the Intel GPU driver (i915) which does runtime patching of the GPU instruction stream to relax security requirements.

* Beignet - an OpenCL implementation for Intel Ivy-Bridge and later GPUs. Well, it actually works.
The interesting code to emit the actual command stream is at

* LibVA Intel H.264 Encoder driver. Note that the proprietary MFX SDK also uses the LibVA internally, albeit a patched version with newer code which typically gets released and merged into upstream LibVA after it goes through Intel internal testing.
An interesting artifact is the sources with the encoder pseudo-assembly.

Btw, if you want zero-copy texture sharing to pass OpenGL rendering directly to the H.264 encoder with either OpenGL ES or non-ES, please refer to my earlier post - and to the Weston VAAPI recorder.

* Intel KVM-GT and Xen-GT. These are GPU virtualization solutions from Intel. They emulate the PCI GPU in the VM. They expose the MMIO register space fully compatible with the real GPU so that the guest OS uses the vanilla or barely modified driver. The host contains the code to reinterpret the GPU instruction stream and also to set up the host GPU memory space in such a way that it's partitioned equally between the clients (guest VMs).

There are useful papers, and interesting pieces of code.

* Intel GPU Tools
A lot of useful tools. The most useful one IMHO is "intel_perf_counters", which shows compute unit, shader and video encoder utilization. There are other interesting tools, like the register dumper.

It also has the Broadwell GPU assembler

And the ugly Shaders in ugly assembly similar to those in LibVA

I once wrote a tool complementary to the register dumper - to upload a register dump back to the GPU registers. I used it to debug some issues with the eDP panel initialization on my laptop.

* LunarGLASS is a collection of experimental GPU middleware. As far as I understand it, it is an ongoing initiative to build prototype implementations of upcoming Khronos standards before trying to get them merged to mesa. And also to allow the code to be used with any proprietary GPU SW stack, not tied to Mesa. Apparently it also has the "GLSL" backend which should allow you to try out SPIR-V on most GPU stacks.

The main site -
The SPIR-V frontend source code -
The code to convert LunarG IR to Mesa IR -

* Lima driver project. It was a project to reverse-engineer the ARM Mali driver. The authors (Luc Verhaegen aka "libv" and Connor Abbott) decoded part of the initialization routine and the ISA. Unfortunately, the project seems to have stopped due to legal issues. However, it has had a significant impact on open-source GPU drivers, because Connor Abbott went on to implement some of the ideas (namely NIR, the "new IR") in the Mesa driver stack.

Some interesting artifacts:
The place where the Geometry Processor data (attributes and uniforms) are written to the GPU physical memory:
"Render State" - a small structure describing the GPU memory, including the shader program address.

* Etnaviv driver project. It was a project to reverse-engineer the Vivante GPU used by the Freescale i.MX SoC series, done by Wladimir J. van der Laan. The driver has reached a mature state, successfully running games like Quake. There is now a version of Mesa with this driver built in.