Wednesday, May 29, 2019

Reverse-engineering Samsung Exynos 9820 bootloader and TZ

Reverse-engineering Samsung S10 TEEGRIS TrustZone OS

It's been a while since my last post, huh?
Even though I have quite a lot of stuff I'm planning to write about, time is very limited.

Lately I've been working on reverse engineering and documenting
the S-Boot bootloader and TrustZone OS from the Exynos version
of Samsung Galaxy S10.
TLDR: I can now run S-Boot and TEEGRIS TrustZone TAs in QEMU, but I'm too lazy to find bugs.

It's been a while since I had a Samsung phone, my last was Galaxy S2.
It's also been a while since I last looked into bootloader binaries.

Last year I got an Exynos S9 model, mostly because I was impressed by its
CPU benchmark scores and wanted to run my own code to measure it.
This year I got some spare time but since S10 came out and a lot of people
have already looked at S9 software, I've decided to start reverse engineering
the software from S10.

S-Boot bootloader image layout.

github gist


  • 0x0: probably EPBL (early primitive bootloader) with some USB support
  • 0x13C00: ACPM (Access Control and Power Management?)
  • 0x27800: some PM-related code
  • 0x4CC00: some tables with PM parameters
  • ... -> either charger mode code or PMIC firmware
  • 0xA4000: BL2, the actual s-boot
  • 0x19E000: TEEGRIS SPKG (CSMC)
  • 0x19E02B: TEEGRIS SPKG ELF start (cut from here to load into the disassembler; see the carving sketch below). This probably stands for "Crypto SMC" or "Checkpoint SMC". This handles some SMC calls from the bootloader as part of Secure Boot for Linux.
  • 0x1ACE00: TEEGRIS SPKG (FP_CSMC)
  • 0x1ACE2B: TEEGRIS FP_CSMC (ELF header). My guess is that it's related to the Fingerprint sensor because all it does is set some registers in the GPIO block and USI block (whatever it is).
  • 0x264000: TEEGRIS kernel, relocate to 0xfffffffff0000000 to resolve relocations
  • 0x29e000: EL1 VBAR for TEEGRIS kernel. fffffffff0041630: syscall table, first entry is zero.
  • 0x2D4000: startup_loader package
  • 0x2D4028: startup_loader ELF start. This one's invoked by S-Boot to read the TEEGRIS kernel either from Linux kernel via shared memory or from the LZ4 archive compiled into S-Boot.

There's also one encrypted region containing ARM Trusted Firmware, which is the EL3 monitor code. It's right after a bunch of Rijndael substitution box constants.
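For reference, here's a minimal sketch of carving one of these ELFs (the CSMC SPKG in this case) out of the image so it can be loaded into a disassembler on its own. The file names are placeholders, and the offsets come from the table above (start of the CSMC ELF header, start of the next package), so they will differ between firmware versions.

```
/* Minimal carving sketch. File names are placeholders; the offsets come from
 * the layout table above and will differ between firmware versions. */
#include <stdio.h>

int main(void)
{
    const long elf_start = 0x19E02B;   /* TEEGRIS SPKG (CSMC) ELF header */
    const long elf_end   = 0x1ACE00;   /* next package (FP_CSMC) begins here */
    FILE *in  = fopen("sboot.bin", "rb");
    FILE *out = fopen("csmc.elf", "wb");
    if (!in || !out)
        return 1;

    fseek(in, elf_start, SEEK_SET);
    for (long i = elf_start; i < elf_end; i++) {
        int c = fgetc(in);
        if (c == EOF)
            break;
        fputc(c, out);
    }
    fclose(out);
    fclose(in);
    return 0;
}
```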

Running S-Boot in QEMU.

I've long wanted to run S-Boot in QEMU for reverse engineering it.
I think I mentioned this idea to my colleague Fred two years ago, which kind of motivated him to write this great post about the Exynos4210 early bootloader in SROM.
Check out his blog if you're interested in Samsung, btw.
https://fredericb.info/2018/03/emulating-exynos-4210-bootrom-in-qemu.html

Long story short, with a few hacks to emulate some MMIO peripherals, I've prepared a patch for QEMU to run S-Boot from the Exynos9820.
QEMU Support for Exynos9820 S-Boot

SCTLR_EL3 register

According to the ARM ARM, the top half of SCTLR is undefined.
Samsung reused it to store the base address of the S-Boot bootloader.
When running in EL3, part of SCTLR is used when computing the value to write to the VBAR registers, which point to the exception table.
I initially attempted running S-Boot in EL3, but it checks the current EL at runtime; I believe it actually runs at EL1, although the binary supports EL1, EL2 and EL3.

Re-enabling debugging prints

Turns out, early in the boot process the bootloader disables most of the debug logging.
I've prepared a GDB script to work around that.
gdbscript
set *(int*)0x8f16403c = 0

UART

https://github.com/astarasikov/qemu/blob/exynos9820/hw/arm/virt.c#L1900
As usual (see the WM5 post [http://allsoftwaresucks.blogspot.com/2016/10/running-without-arm-and-leg-windows.html]), we can solve this by making the MMIO read request return different data on subsequent reads.
We simply invert the cached value on each invocation.
Using this trick we can bypass busy loops which wait for some bits to be set or cleared.

In fact, emulating two UART registers, status and TX, is enough to get debugging output from the bootloader.
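Here's a self-contained toy model of that flip-on-read trick (not the actual QEMU MMIO callback, whose signature differs); the reset value and the status bit are made up for illustration:

```
#include <stdint.h>
#include <stdio.h>

/* Cached "register" value; hypothetical reset value. */
static uint32_t uart_status_cache = 0x00000000;

/* Emulated MMIO read of the UART status register: return the cached value,
 * then invert it so the next poll sees the opposite bits. */
static uint32_t uart_status_read(void)
{
    uint32_t val = uart_status_cache;
    uart_status_cache = ~uart_status_cache;
    return val;
}

int main(void)
{
    /* A guest busy loop like "while (!(status & TX_EMPTY));" now terminates
     * after at most two reads, whichever polarity it waits for. */
    while (!(uart_status_read() & 0x2 /* hypothetical TX_EMPTY bit */))
        ;
    puts("busy loop exited");
    return 0;
}
```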

Peripherals

We can identify some peripherals either by looking up their addresses in Linux Device Tree files
or by analysing what is done by the code that accesses them.
For example, we can easily identify Timer registers.

EL3 Monitor emulation.


S-Boot calls into the Monitor code (ARM Trusted Firmware) to do some crypto and EFUSE-related operations.
These calls have call numbers starting with a lot of FFFFFF.
It was necessary to enable the "PSCI conduit" in QEMU, which intercepts some SMC calls, and add a simple
handler to allow S-Boot to start properly without crashing.
arm_is_psci_call
if ((param & 0xfffff000) == 0xfffff000) {
    //Exynos SROM
    return true;
}

Putting all the pieces together: running it.

./aarch64-softmmu/qemu-system-aarch64 -m 2048 -M virt -serial stdio -bios ~/Downloads/sw/pda/s10/BL/sboot_bl2.bin   -s -S 2>/dev/null

At this point, we're not emulating most peripherals like I2C, PMIC and USB.
However, the bootloader gets to the point where the memory allocator and the printing subsystem are initialized, which should be enough
to fuzz-test some parsers if we hook the UFS/MMC access functions.


General approach to reverse-engineering

Samsung leaves a lot of debugging prints in their binaries.
Even in the RKP hypervisor, although most strings are obfuscated by getting replaced with their hashes,
some strings in the exception handler are not obfuscated at all.
With this knowledge, it's easy to identify the logging function, then snprintf,
and then strcpy and memcpy. Memcpy and strcpy are often near malloc and free.
Knowing these functions, it's trivial to reverse-engineer the rest.

TEEGRIS intro

In the Exynos version of the Galaxy S10, Samsung have replaced
the TrustZone OS from MobiCore with their own solution called TEEGRIS.

As we've seen before, TEEGRIS kernel and loader are located inside
the BL image along with S-Boot.
The userspace portion - dynamic libraries and TAs (Trusted Applications) -
resides in two locations:

  • System partition ("/system/tee")
  • A TAR-like archive linked into the Linux Kernel
Here is what we can find:

  • 00000000-0000-0000-0000-4b45594d5354 (notice how 4b 45 59 4d 53 54 are the ASCII codes for "KEYMST" (Key Master); see the snippet after this list)
  • 00000000-0000-0000-0000-564c544b5052 VLTKPR (Vault Keeper)
  • 00000005-0005-0005-0505-050505050505 - TSS (TEE Shared Memory Server?)
  • 00000007-0007-0007-0707-070707070707 - ACSD (Access Control and Signing Driver?) basically the loader for TAs with a built-in X.509 parser 
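A trivial snippet illustrating how the ASCII tails of these UUIDs decode (using the KEYMST example from the list above):

```
/* Decode the ASCII tail of a TEEGRIS TA UUID, e.g.
 * 00000000-0000-0000-0000-4b45594d5354 -> "KEYMST". */
#include <stdio.h>

int main(void)
{
    const unsigned char tail[6] = { 0x4b, 0x45, 0x59, 0x4d, 0x53, 0x54 };
    for (int i = 0; i < 6; i++)
        putchar(tail[i]);
    putchar('\n');
    return 0;
}
```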


I wrote a Python script to unpack the (uncompressed) TZAR files.
https://gist.github.com/astarasikov/f47cb7f46b5193872f376fa0ea842e4b#file-unpack_startup_tzar-py
After unpacking the file "startup.tzar" from the S10 kernel tree (LINK),
we can see that it contains a bunch of libraries as well as two TEE applications
which can be identified by their file names resembling GUIDs.

Security mechanisms


  • Boot Time: TEEGRIS kernel and startup_loader reside in the same partition as S-Boot so their integrity should be checked by the early bootloader (in SROM).
  • Run Time: TrustZone applets (TAs) are authenticated using either built-in hashes or X.509 certificates.
  • Trustlets and the TEEGRIS kernel have stack cookies, and they are randomized.

All TAs are ELF files exporting the symbol "TA_InvokeCommandEntryPoint", which
is where requests from Non-Secure EL1 (and other Secure EL0 TAs) are processed.
Additionally, some extra TZ applets can be found in the "system" partition.
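The exported name matches the GlobalPlatform TA interface, so a TA command handler presumably looks roughly like the skeleton below. The types and error code are stubbed here for illustration; a real TA would pull them from the TEE internal API headers, and I haven't verified this against TEEGRIS' own headers.

```
#include <stdint.h>

typedef uint32_t TEE_Result;
typedef struct { uint32_t a, b; } TEE_Param;     /* stand-in for the real type */
#define TEE_ERROR_NOT_SUPPORTED 0xFFFF000A

TEE_Result TA_InvokeCommandEntryPoint(void *session, uint32_t cmd_id,
                                      uint32_t param_types, TEE_Param params[4])
{
    (void)session; (void)param_types; (void)params;
    switch (cmd_id) {
    default:
        /* the "invalid command" path we see in the emulator logs */
        return TEE_ERROR_NOT_SUPPORTED;
    }
}
```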

Identifying TEEGRIS syscalls

Attempt 1 (stupid)

Look for the syscall number and a compare instruction.
For example, for the "recv" syscall, let's search for 0x38 and filter the results by "cmp".
No luck. OK, it's probably using a jump table or a function pointer array instead.

Attempt 2

Let's locate the AArch64 exception table and go from there.
We can find it by looking for a bunch of NOPs (1f 20 03 d5) immediately after a block of zero-filled memory.
We can then find the actual exception handler for EL0 by knowing its offset from the ARM ARM.
https://developer.arm.com/docs/100933/latest/aarch64-exception-vector-table
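A rough sketch of that scan (the file name and the zero-padding/NOP-run thresholds are arbitrary guesses, not something TEEGRIS guarantees):

```
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    static uint8_t buf[4 * 1024 * 1024];
    FILE *f = fopen("teegris_kernel.bin", "rb");  /* hypothetical file name */
    if (!f)
        return 1;
    size_t len = fread(buf, 1, sizeof(buf), f);
    fclose(f);

    /* AArch64 NOP is d503201f, stored little-endian as 1f 20 03 d5 */
    const uint8_t nop[4] = { 0x1f, 0x20, 0x03, 0xd5 };
    for (size_t i = 64; i + 64 < len; i += 4) {
        int ok = 1;
        /* require at least 64 zero bytes before and 16 NOPs starting at i */
        for (size_t j = i - 64; j < i; j++)
            if (buf[j] != 0) { ok = 0; break; }
        for (int k = 0; ok && k < 16; k++)
            if (memcmp(&buf[i + 4 * k], nop, 4) != 0) ok = 0;
        if (ok)
            printf("possible vector table at offset 0x%zx\n", i);
    }
    return 0;
}
```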








P.S.
In fact, the code which launches "startup_loader" sets VBAR_EL1 to the same
address which we've identified before.

Syscalls

Luckily for us, Samsung put wrappers for each syscall into the library called "Libtzsl.so"
so we can easily recover the syscall names from the index in the table.

TEEGRIS IPC

Curiously, Samsung chose to implement two popular POSIX APIs to communicate
between TAs as well as between TAs and the REE (Linux): "epoll" and "sendmsg/recvmsg".

Peripherals such as I2C and RPMB are of course handled by file paths with magic
names, like on most UNIX-like kernels.

List of (most) TEEGRIS syscalls

https://github.com/astarasikov/qemu/blob/teegris_usermode/linux-user/syscall.c#L11590

TEEGRIS emulator

Since I'm better at reverse engineering than at exploitation
and I like writing emulators but hate code review, I decided to
find a way to run TAs on the Linux laptop instead of the actual
device.

Besides doing full-system emulation, QEMU supports the "user" target.
In this case it loads the target ELF binary into memory and translates
instructions to the host architecture, but instead of blindly passing
syscall arguments to real syscalls it can patch them and do any kind of
emulation.

Here are the changes that I needed to make in order to run TEEGRIS binaries instead of Linux ones:

  • ELF entry point: set up AUXVALs in the specific order that "bin/libtzld.so" expects
  • Slightly different ABI: register X7 is used for the syscall number for both ARM32 and ARM64 (see the sketch after this list)
  • https://github.com/astarasikov/qemu/blob/teegris_usermode/linux-user/syscall.c#L11785
  • TLS handling (QEMU bug?)
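For illustration, here's a toy model of the ABI difference only (not the actual QEMU code, which lives in the linked branch): on Linux/AArch64 the syscall number lives in X8, while TEEGRIS puts it in X7, so the dispatch just has to read a different register.

```
#include <stdint.h>
#include <stdio.h>

/* Fake CPU state standing in for QEMU's guest register file. */
struct fake_cpu { uint64_t xregs[31]; };

static uint64_t dispatch(struct fake_cpu *cpu)
{
    uint64_t num = cpu->xregs[7];   /* TEEGRIS: syscall number in X7, not X8 */
    printf("TEEGRIS syscall #%llu\n", (unsigned long long)num);
    return 0;
}

int main(void)
{
    struct fake_cpu cpu = { .xregs = { [7] = 0x38 /* e.g. recv */ } };
    cpu.xregs[0] = dispatch(&cpu);  /* return value goes back in X0 */
    return 0;
}
```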

Current Status.


  • Boots TAs, both 32-bit and 64-bit
  • Currently does not support launching TAs from TAs (thread_create)
  • Currently only the invalid command handler is reached. Need to improve recvmsg or patch the library code as a workaround.
  • But overall it should be possible to build a fuzzer for TAs in less than a week of work now.



Here's one idea: now that we know we've emulated enough of syscalls
for a TA to boot and start message processing, we can just override the
return address and arguments for one of the syscalls which are invoked
in the message processing loop and redirect the execution directly
to TA_InvokeCommandEntryPoint.

For this proof of concept I've manually identified the entry point address
and adjusted it according to the ELF base load address and QEMU-specific load
offset. Of course, it would be better to automate this part so that the TA loader
is more generic, but as every software engineer knows, those who write
good code get scooped by those who don't.



This kind of works in that we're getting messages from inside the TA: check the full log at https://pastebin.com/sVtWk5CD and search for "keymaster [ERR]".
However, it fails early when validating the message contents.
We need to generate the correct ASN.1 payload which should be doable
since ASN.1 grammar templates are compiled into the binary.

Ideas for future research


  • Hook malloc/free and some other functions and invoke native system C library calls.
  • Hook the QEMU JIT (TCG) or interpreter to check memory accesses against ASAN shadow memory. This way we can enable Address Sanitizer for binary blobs, similarly to how Valgrind does memory debugging. Since QEMU user mode runs TAs in the same address space as itself, we can use the ASAN allocator or libdislocator to detect OOB memory accesses. Unicorn is kind of hard to use for this because it does not allow easily setting up MMIO traps; it only allows registering chunks of normal memory.
  • Finish reverse-engineering ASN.1 format for Keymaster and fuzz this TA.
  • Run TEEGRIS kernel in QEMU as well to fuzz syscalls.
  • A Ghidra script to rename functions according to the debug strings passed to invocations of print callees
  • Look at the ring buffer implementation in the shared memory.

Running TEEGRIS Emulator

export TEE_CMD=777
qemu/teegris$ ../arm-linux-user/qemu-arm -z fuzz_keymaster/in/test0.bin -cpu max ./00000000-0000-0000-0000-4b45594d5354.elf

Debugging panics with GDB


Related Projects

Post from Daniel Komaromy on reverse-engineering Galaxy S8 which mostly focuses on the other part of the picture: getting from Linux into Secure EL0.


Blog from Blue Frost Security on reverse-engineering S9 TrustZone. The OS kernel is different but actual TAs are the same.

Monday, October 24, 2016

running without an ARM and a leg: Windows Mobile in QEMU

Introduction.

One project I had in mind a long time ago was getting Windows Mobile to run in QEMU.
I think it's a lovely OS with a long history and the project seemed like a nice technical challenge.

Initially I started working on it two years ago back in 2014 and the plan was to later run it in KVM on Cortex-A15 with Virtualization Extensions. However, I had to suspend it because I started working on two other challenges - running XNU on Xen (aka Virtu.al LLC) and later doing GSoC (running FreeBSD in ARM emulator).

Now since I've got some free time on my hands, I decided to finally get back to this project and cross it off my TODO list.

Choosing the emulation target.

In order to run Windows CE on QEMU (or any OS for that matter) it would be necessary to either develop a Board Support Package with all the drivers for a specific virtual machine or take the opposite approach and emulate some machine for which there already exists a ROM image.

For Windows, there is the emulator developed by Microsoft which is unsurprisingly called just Device Emulator. It emulates a real board - MINI2440 based on Samsung S3C2440 SoC which is an ancient ARMv4 CPU. Turns out, this is the same SoC that's used in OpenMoko so there is an old fork of QEMU with the support for most of the peripherals. So the choice of the platform seemed a no-brainer.

So I took the QEMU fork supporting MINI2440 and tried to adapt it to running the unmodified Windows Mobile images from Microsoft. Needless to say, I made sure the images are placed into memory at the correct addresses but the code seemed to crash spontaneously and never got past enabling MMU.

The first idea that comes to mind is of course to take the latest QEMU and see if it fixes anything. However, trying random changes until something works is actually quite a crappy approach.

So, I decided to single-step the execution and see what happens. QEMU provides the very useful GDB interface (which can be activated with the "-s -S" switches) for this purpose.

Caches are hard.

The first surprise came from the bootloader code. Before launching the kernel, the Windows CE bootloader disables the MMU. At this moment QEMU crashes spectacularly. Initially I tried hacking around the issue by adding code to translate the addresses in "exec-all.h". However, it didn't solve the problem and the heuristic started looking too complex, which suggested I was on the wrong track.

I started thinking about it and realized that disabling the MMU is a tricky thing, because on ARM the Program Counter is usually ahead of the current instruction, so the CPU has to fetch a couple of instructions ahead. So we have a caching problem. In this Windows CE code, there is a NOP instruction between disabling the MMU and jumping to the new PA from a register. The hypothesis was that QEMU did not fetch the needed instructions correctly and it was necessary to add a translation cache for them. The reality was more funny.

As it turned out, QEMU contained a hack to cache a couple of instructions because... because the Linux kernel for PXA270 had the same problem in the opposite scenario - when the MMU was enabled. I decided to comment out the hack and it made the boot-up progress further. (PXA Cache Hack).

Stacks are not easy either.

Next thing I know is that soon after enabling the MMU the code crashes when trying to access the stack at a very peculiar virtual address of 0xffffce70. I examined the QEMU translation code and found out that it correctly locates the physical address but the permission bits are incorrect. I decided to force it to return RW access for the particular address range and voila - Windows Mobile boots to Home Screen successfully. (Hack to force RW stack).

Windows Mobile 5.0 Smartphone Edition on QEMU on Mac OS X.
Now that everything seemed to boot, I decided to take another look at the MMU issue and fix it properly. The first idea I had was to compare the ARM920T (ARMv4) and ARM1136 (ARMv6) page table formats. It is worth noting that ARMv4 did not have the TEX remap bits, and the last-level page tables had a different type (the last two bits). It turned out that QEMU (probably like real ARM CPUs) supported all types of pages, even ARMv4 ones on an ARMv6 target, and the TEX/caching bits were simply ignored. After careful examination of the code and all the page table parsing I found a typo that was already fixed in the upstream QEMU (MMU Typo).

You can grab the cleaner tree without the ugly intermediate hacks: https://github.com/astarasikov/qemu/tree/wm5 . If you need to have a look at the older WIP stuff, it's at https://github.com/astarasikov/qemu/tree/wm5_with_hacks .

Running Pocket PC version.

Preparing PocketPC Image.

Windows Mobile Standard aka Smartphone comes shipped as the NOR flash image and it is XIP (Execute-in-Place). Pocket PC Image comes in a different form. It comes as an image already relocated to the RAM so we need to launch it directly via the "-kernel" parameter. We need to create the NOR memory image for it. We can use the smartphone image as a base, but an empty 32MB file should work as well.

  • Grab the "romtools" scripts at github: https://github.com/pinkavaj/romtools
  • python b000ff-to-bin.py ../ppc_50/_208PPC_USA_bin
  • dd if=_208PPC_USA_bin-80070000-81491ed0.bin of=SPHIDPI_BIN bs=$((0x30000)) seek=1 conv=notrunc
  • ./arm-softmmu/qemu-system-arm -show-cursor -m 256 -M mini2440 -serial stdio -kernel ../wm5/50_ppc/_208PPC_USA_bin-80070000-81491ed0.bin  -pflash ../wm5/50_ppc/SPHIDPI_BIN

This one did not boot and kept hanging at the boot splash screen. The host CPU usage was high and it indicated there was some busy activity inside the VM like an IRQ storm. After hooking up the debugger it turned out that the VM was hanging in two busy loops.

One of them was in the Audio driver - during the splash screen Windows plays a welcome sound. This was worked around by setting the "TX FIFO Ready" status in the sound codec. The second freeze was in the touchscreen driver but that looks like a MINI2440 emulation bug - once the touchscreen is disabled, the workqueue timer in qemu is permanently disabled and the status bits are not updated properly. I commented out the routine which disables the touchscreen controller since it's a VM anyway (not a real device where it would have a power-saving impact).
Obligatory screenshot: Windows Mobile 5.0 Pocket PC

Retarded idea.

I was fantasizing about adding logic to QEMU which would detect whether the emulation is stuck invoking an MMIO device handler in a loop, and fuzz the returned register value until it gets unstuck. While not very practical, one could reverse-engineer the register bits blindly this way.
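A toy model of the idea (nothing like real QEMU internals, and the "stuck" threshold is arbitrary):

```
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint64_t last_addr;
static unsigned same_addr_reads;

/* If the guest keeps reading the same MMIO register without making progress,
 * start returning pseudo-random values until the busy loop exits. */
static uint32_t mmio_read_fuzzed(uint64_t addr, uint32_t real_value)
{
    same_addr_reads = (addr == last_addr) ? same_addr_reads + 1 : 0;
    last_addr = addr;
    if (same_addr_reads > 10000)        /* heuristic "stuck" threshold */
        return (uint32_t)rand();
    return real_value;
}

int main(void)
{
    /* Simulated guest busy loop polling bit 0 of a made-up status register
     * that the device model never sets. */
    unsigned iterations = 0;
    while (!(mmio_read_fuzzed(0x4e00000cULL, 0) & 1))
        iterations++;
    printf("unstuck after %u reads\n", iterations);
    return 0;
}
```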

Further ideas.

I don't think I'll invest more time into this project because there's little value but I'm considering developing an Android port of this QEMU fork just for the fun of it. Perhaps a better option would be to emulate a newer SoC such as Qualcomm 8250 and run an ARMv7 image from HTC HD2.

It's quite sad that most if not all Android phones ship with HYP mode (Hypervisor) disabled, so KVM is a no-go. On the other hand, allowing custom hypervisors opens up a Pandora's box of security issues, so it might be a good decision. Luckily for us who like to tinker with hardware, most TrustZone implementations also contain exploitable bugs, so there are some possibilities to uncover the potential :)

Today, many companies working with embedded SoCs seek GPU virtualization solutions. The demand is particularly high in the automotive sector, where people are forced to use eclectic mixes of relatively old SoCs while retaining binary compatibility. So it would be interesting to prototype a solution similar to Intel's GVT.

I think it would be nice to use the Qualcomm Adreno GPU as an emulation target. It is relatively well reverse-engineered - there exists the Mesa FreeDreno driver for it, Qualcomm commits patches to the upstream KMS driver, and most importantly the GPU ISA is similar to older AMD Radeon GPUs, which is extensively documented by AMD. Besides, a similar GPU is used in the Xbox 360, so one more place to learn from is the Xenia emulator, which simulates the Xbox GPU.

Wednesday, September 28, 2016

Trying out and debugging Open-Source OpenCL on AMD Radeon RX480

Introduction.

I have decided to check out the state of OpenCL support for AMD GPUs in Linux Open-Source stack. In this post I will describe the experience of debugging issues across the whole stack trying to get a few basic apps up and running.

I have the AMD RX480 GPU which supports GCN 8.0.1 instruction set and has the code name "polaris10". At the time of writing, both Linux kernel and the Mesa in Debian were too old and did not support this GPU. Besides, as we will see later, the OpenCL in Mesa does not work out of the box and we would need to learn to build it from source anyway.

Building the software.

A page at FreeDesktop website has some relatively outdated instructions but the general approach is the same. https://dri.freedesktop.org/wiki/GalliumCompute/
You will need to install a relatively fresh kernel (I built Linux 4.7.0-rc7). I also installed the polaris10 firmware manually but now it seems to be shipped in Linux/Debian.

I'm posting the steps I went through to build LLVM and Mesa with OpenCL support. After writing it I realized that perhaps everything here is redundant and I should just write a repo manifest to clone everything with one command.

Getting the sources.

I'm also posting the folder tree and git hashes just in case.

Build CLC

CLC is the runtime library for OpenCL. It contains code that is compiled to LLVM bitcode and linked against your apps. It provides intrinsics and implementations of the functions defined by the OpenCL standard, such as "get_global_id" and "mad". In case you're wondering, CUDA works exactly the same way and the binary SDK from NVIDIA ships the relevant bitcodes (and you can disassemble them with llvm-dis if you're interested).

Actually, some of the definitions failed to compile with the latest LLVM and I needed to add explicit type casts. You can use the patch (from this github gist https://gist.github.com/astarasikov/9f00dee718f217c6a9715510dc09d300) and apply it on top of 88b82a6f70012a903b10dfc1e2304d3ef2e76dbc to fix it.

git clone https://github.com/llvm-mirror/libclc.git
./configure.py -g ninja --with-llvm-config=/home/alexander/Documents/workspace/builds/llvm/build/bin/llvm-config
ninja
ninja install

Get LLVM Sources.

mkdir -p ~/Documents/workspace/builds/llvm/
I have a useful script to check out the latest llvm, clang and libcxx. https://gist.github.com/astarasikov/a2e9287a34381f680d58

Get LLVM ROC branch.

This is optional. I used the ROC branch initially because I thought it would fix the issue with codegen (FLAT instructions), but it did not, and otherwise it seems to behave identically to vanilla LLVM.

git remote add roc https://github.com/RadeonOpenCompute/llvm.git
git fetch roc
git checkout roc/amd-common
cd tools/clang/
git remote add roc https://github.com/RadeonOpenCompute/clang.git
git fetch roc
git checkout roc/amd-common

List of git hashes:
  • llvm - 1819637 Merge branch amd-master into amd-common
  • llvm/tools/clang/tools/extra - 079dd6a [clang-rename] Add comment after namespace closing
  • llvm/tools/clang - f779a93 Merge branch amd-master into amd-common
  • llvm/projects/libcxx - d979eed Fix Bug 30240 - std::string: append(first, last) error when aliasing.
  • llvm/projects/compiler-rt - 5a27c81 asan: allow __asan_{before,after}_dynamic_init without registered globals
  • llvm/projects/libcxxabi - 9f08403 [lit] Replace print with lit_config.note().

Build LLVM.

#create the build directory
mkdir -p ~/Documents/workspace/builds/llvm/build/
cd ~/Documents/workspace/builds/llvm/build/

cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DLLVM_TARGETS_TO_BUILD="AMDGPU;X86" -DLLVM_INCLUDE_TESTS=OFF -DLLVM_VERSION_SUFFIX="" ../llvm/ -DBUILD_SHARED_LIBS=ON
ninja

#add the symlink to make LLVM pick up internal headers when building Mesa
cd ~/Documents/workspace/builds/llvm/build/include
ln -s $(echo $PWD/../tools/clang/include/clang) clang

In principle, it is not necessary to install llvm, it's enough to add it to the PATH and clang will pick up the necessary libraries itself.

Build MESA

Before building Mesa, we need to prepend the path to the "bin" directory of our custom LLVM build to the PATH variable so that clang is picked up as the compiler. I also had to add a symlink to the source code in the build directory because some headers were not getting picked up but I think there's a cleaner way to add it to CFLAGS.

I was using Mesa git 0d7ec8b7d0554382d5af6c59a69ca9672d2583cd.

git clone git://anongit.freedesktop.org/mesa/mesa

The configure.ac seems to have an incorrect regex for getting the LLVM version, which causes compilation to fail with the latest LLVM (4.0.0). Here is a patch to fix it and also force the radeonsi chip class to VI (Volcanic Islands). The latter is not strictly necessary but I used it during debugging to ensure the correct code path is always hit. Grab the diff at https://gist.github.com/astarasikov/6146dbbd07d0dc3bea2ee6a8b979eaa8

export PATH=~/Documents/workspace/builds/llvm/build/bin:$PATH
cd ~/Documents/workspace/builds/mesa/mesa/
make clean

./autogen.sh --enable-texture-float  --enable-dri3 --enable-opencl --enable-opencl-icd --enable-sysfs --enable-gallium-llvm --with-gallium-drivers=radeonsi --prefix=/opt/my_mesa --with-egl-platforms=drm --enable-glx-tls
make install

Now, before running any OpenCL application, we'll need to override the library path to point to our custom Mesa.
export LD_LIBRARY_PATH=/opt/my_mesa/lib:/home/alexander/Documents/workspace/builds/llvm/build/lib

Useful Links

AMD Presentations about GCN ISA.

GCN ISA Manual

Intel OpenCL Samples

AMD App SDK

I've used the older version because I thought it came as a tarball, while the latest one seemed to be an executable file (though actually it was a tarball with an executable script).

Trying it out.

Assertion in LLVM codegen.

So we can try running any OpenCL application now and we'll hit the assertion.
./BitonicSort -p Clover

After reading the source code and the LLVM git log, it turns out the assertion is hit when selecting "FLAT" atomic instructions, which are not enabled for this target.
So what can we do? Let's try to see if we can force LLVM to emit the abovementioned "FLAT Atomics".
Turns out the code is already there and there is a flag, which is enabled by default when the LLVM target is "AMD HSA".

Now, let's think what could the possible limitations of this approach be?

FLAT Instructions

Let's see the description from the GCN Manual. "Flat memory instructions let the kernel read or write data in memory, or perform atomic operations on data already in memory. These operations occur through the texture L2 cache.".

I have not fully understood the difference between these new FLAT instructions and older MUBUF/MTBUF. As it seems to me, before GCN3 it used to be the case that different address spaces (global, private etc) could only be accessed through different instructions and FLAT instructions allow accessing any piece of GPU (and host) memory by the virtual address (hence the name since virtual address space is flat). So it seems that as long as the kernel driver sets up GPU page tables correctly and we're only using the memory allocated through OpenCL API we should be fine.

FWIW, let's try running the code and see if it works.

LLVM Flat Instructions hack

As it has been mentioned before, we need to force LLVM to generate the "FLAT" instructions to access memory in GPU code.
A proper way to fix it would be to add the options to the Mesa source code into the location where it instantiates the LLVM compiler (clang).
To save us some time we can hack the relevant piece of the code generator in the LLVM directly (see the patch at the end of the article).

Trying it out again.

I have tried the following samples from the AMD App SDK and Intel Samples. I didn't want to write sample code myself and besides OpenCL claims to be portable so running the code from other IHVs or SDKs should be a good stress test of the toolchain.

AMD

  • BlackScholesDP
  • BinomialOption
  • BitonicSort
  • BinarySearch
  • BufferBandwidth

Intel

  • Bitonic Sort
  • Montecarlo
  • God Rays
All of them worked and the "verification" step which computes the data on the host and compares to the GPGPU result has passed! You can take a look at the screenshots and logs at the end of the post.

The "God Rays" demo even produces the convinceable picture.

Running Protonect (libfreenect2)

One of the apps I'm particularly interested in running is Protonect which is the demo app for libfreenect2, the open-source driver for Microsoft Kinect v2 RGBD ToF camera. Let's build it with OpenCL support and invoke it from the shell via "./Protonect cl".

And we're hitting an assertion!
'llvm::AsmPrinter::~AsmPrinter(): Assertion `!DD && Handlers.empty() && "Debug/EH info didn't get finalized"'.
Since it happens in a destructor, for the purposes of testing we can simply comment it out, because the worst thing that could happen is a memory leak.

Let's try running it again!
And we're hitting another error. This time it is "unsupported initializer for address space". Okay, let's debug it. First, let's grep the string verbatim.

Good, we're hitting it in only one place. Now, the debugging message is not particularly helpful because it does not give us the precise location or the name of the variable which caused this error; it only shows the prototype of the function. Let's try to just print the address space type and find out what might be causing it (see the patch at the end of the article).

What's this? Good, good, we see that the address space enum value is "0". Looking it up reveals that it's a private address space. Okay, what could cause the private address space to be used? Function-local arrays! Let's look for one! Great, here it is, see the "const float gaussian[9]" array in the "filterPixelStage1" function? Let's try commenting it out and replacing the "gaussian[j]" access with some constant (let it be 1.0f since it's a weighted average, otherwise if we choose 0.0f we'll see nothing in the output buffer). Yay, it worked!

Since we can no longer use the private address space, we need to find a way to get rid of the locally-declared array. OpenCL 1.1 does not support the static storage class.
One option would be to add another kernel argument and just pass the array there.
It might be slower, though, because it would get placed into the slower region of cache-coherent memory.

Another option would be to compute the values in-place and since it's just a 3x3 convolution kernel for a gaussian blur it's easy to come up with a crude approximation formula which is what I've done (see the patch at the end of the post).

So far, Protonect works. Performance is subpar but not completely awful. It's around 4 times slower than the NVIDIA GTX970 with the binary driver (in most real-world scenarios the GTX970 and RX480 are quite close). I think that with a bit of profiling it can be sped up drastically. In fact, one of the contributing factors might be that my display is connected to the Intel iGPU and the PCIe bandwidth is saturated by the framebuffer blitting (it's 4K after all). I'll try with OpenGL running on the Radeon next time.

RadeonOpenCompute Initiative.

Since OpenCL is crap and CUDA is all the hype, it's no wonder many people want to run CUDA on all GPUs.
In principle, to run CUDA on non-NVIDIA hardware, one needs to implement the CUDA runtime and the lowering from the NVPTX intermediate language to the native ISA. The challenging part is actually building the source code because one would need the proprietary headers from the NVIDIA toolchain. One could create the fake headers to mimic the CUDA toolchain but it is a shady legal area especially when it comes down to predefined macros and constants.

In fact, there's not much need to emit NVPTX at all and you can lower straight to the ISA. What AMD have done to work around legal issues is they've come up with their own language called "HIP" which mimics most of CUDA design but keywords and predefined macros are named differently. Therefore, porting is straightforward even with a search/replace function, but there's an automated translator based on clang.

GCN old RX480 vs R390

It looks interesting that Polaris10 (RX480) seems to use an older version of the ISA (8.0.1) while the older R390 uses 8.0.3. Not sure if it's a bug in documentation. However, it's interesting that AMD GPUs consist of multiple units (such as video encoder/decoder) and they seem to be picked up in arbitrary order when designing a new chip.

HIP without Mesa.

HIP/ROC ships its own runtime, and since all memory access is done through DRM via ioctls to "/dev/dri/cardX", Mesa is technically not needed to implement OpenCL or whatever compute API.

However, the open question is buffer sharing between different APIs. I have come across this issue before when dealing with an Intel GPU. The good news is that on Linux there's an EGL extension to export the DRM buffer object (BO) from the GLES context (but not from GLX). You can read the old article about it here: https://allsoftwaresucks.blogspot.com/2014/10/abusing-mesa-by-hooking-elfs-and-ioctl.html

Outstanding issues.

Slow Compilation time for Kernels.

While running the OpenCL code samples and Protonect, I've noticed that they take several seconds to start up, compared to the immediate start when using the Intel OpenCL driver (Beignet). I think it might be caused by LLVM. It would be a good idea to profile everything using the Linux "perf" tool.

Private Address Spaces.

As we've seen with the libfreenect2 kernels, private address spaces do not work. It is necessary to figure out whether they are not supported at all or it is just a simple bug.
Until this is resolved, it effectively renders a lot of GPGPU applications unusable.

LLVM Assertions (Bugs).

As mentioned before, running Protonect and compiling libfreenect2 kernels yields the following assertion:
'llvm::AsmPrinter::~AsmPrinter(): Assertion `!DD && Handlers.empty() && "Debug/EH info didn't get finalized"'.

While trying to run one of the OpenCL samples I hit yet another assertion:
'clang::Sema::~Sema(): Assertion `DelayedTypos.empty() && "Uncorrected typos!"' failed.

Conclusions.

As we can see, the OpenCL support is still not mature with the Mesa/Clover stack.
However given the RadeonOpenCompute initiative I'm sure most of the missing features will be in place soon.
So far I'm glad that the just-released GPU support is on par with older models and working around the issues was not too hard for me.
I've also satisfied part of my desire to understand the interaction between different components involved in the OpenCL/Compute pipeline.

I think for a start I will look at the LLVM assertions and see if I can debug them or prepare the test cases to submit upstream.
Next up I'll be trying out HIP to build some CUDA samples.

One idea I had in mind for quite some time was virtualizing a mobile GPU. I think Qualcomm Adreno is a good target because it's relatively well supported by the FreeDreno driver and the ISA is similar to other AMD chips. The plan is to add the ISA decoder and MMIO space emulation to QEMU so that it can be used both in KVM on ARM and in emulation mode on Intel. Of course, the most nerdy way to do it would be to make a translator from the guest ISA to the host ISA. But for a start we could reuse the Virgil driver as a target.
I think it would be a very useful thing for running legacy applications in a virtualized environment (such as Windows RT or automotive IVI systems) and could aid in security engineering.
Hopefully I will have enough motivation and time to do it before I'm bound by an NDA :)

Latest update!

Also, check out the latest news! Looks like Mesa has now switched to using the HSA ABI by default which means that the hack for the FLAT instructions will not be needed with more recent versions and they will be enabled automagically! https://www.phoronix.com/scan.php?page=news_item&px=RadeonSI-HSA-Compute-Shaders

I started trying OpenCL on RX480 around 2 weeks ago, then I spent 1 week debugging and 1 week I was away. Meanwhile some changes seem to have landed upstream and some of the hacking described here may be redundant. I urge you to check with the latest source code but I decided to keep this post just to describe the process of debugging I went through.

Extra: Logs and Screenshots.

God Rays from Intel OpenCL samples.
Protonect running on AMD FOSS OpenCL stack.

LLVM Force FLAT instructions hack.

diff --git a/lib/Target/AMDGPU/AMDGPUSubtarget.cpp b/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
index 3c4b5e7..f6d500c 100644
--- a/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
+++ b/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
@@ -46,7 +46,7 @@ AMDGPUSubtarget::initializeSubtargetDependencies(const Triple &TT,
   // disable it.

   SmallString<256> FullFS("+promote-alloca,+fp64-denormals,+load-store-opt,");
-  if (isAmdHsaOS()) // Turn on FlatForGlobal for HSA.
+  if (1 || isAmdHsaOS()) // Turn on FlatForGlobal for HSA.
     FullFS += "+flat-for-global,+unaligned-buffer-access,";
   FullFS += FS;

Patch for Libfreenect2

@@ -102,8 +102,8 @@ void kernel processPixelStage1(global const short *lut11to16, global const float
 /*******************************************************************************
  * Filter pixel stage 1
  ******************************************************************************/
-void kernel filterPixelStage1(global const float3 *a, global const float3 *b, global const float3 *n,
-                              global float3 *a_out, global float3 *b_out, global uchar *max_edge_test)
+void kernel filterPixelStage1(__global const float3 *a, __global const float3 *b, __global const float3 *n,
+                              __global float3 *a_out, __global float3 *b_out, __global uchar *max_edge_test)
 {
   const uint i = get_global_id(0);

@@ -113,7 +113,7 @@ void kernel filterPixelStage1(global const float3 *a, global const float3 *b, gl
   const float3 self_a = a[i];

   const float3 self_b = b[i];

-  const float gaussian[9] = {GAUSSIAN_KERNEL_0, GAUSSIAN_KERNEL_1, GAUSSIAN_KERNEL_2, GAUSSIAN_KERNEL_3, GAUSSIAN_KERNEL_4, GAUSSIAN_KERNEL_5, GAUSSIAN_KERNEL_6, GAUSSIAN_KERNEL_7, GAUSSIAN_KERNEL_8};
+  //const float gaussian[9] = {GAUSSIAN_KERNEL_0, GAUSSIAN_KERNEL_1, GAUSSIAN_KERNEL_2, GAUSSIAN_KERNEL_3, GAUSSIAN_KERNEL_4, GAUSSIAN_KERNEL_5, GAUSSIAN_KERNEL_6, GAUSSIAN_KERNEL_7, GAUSSIAN_KERNEL_8};

   if(x < 1 || y < 1 || x > 510 || y > 422)
   {
@@ -155,7 +155,9 @@ void kernel filterPixelStage1(global const float3 *a, global const float3 *b, gl
         const int3 c1 = isless(other_norm * other_norm, threshold);

         const float3 dist = 0.5f * (1.0f - (self_normalized_a * other_normalized_a + self_normalized_b * other_normalized_b));
-        const float3 weight = select(gaussian[j] * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);
+        //const float3 weight = 1.0f;//select(gaussian[j] * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);
+        const float gj = exp(0.6 - (0.3 * (abs(yi) + abs(xi))));
+        const float3 weight = select(gj * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);

LLVM Patch for assertion in AsmPrinter destructor.

diff --git a/lib/CodeGen/AsmPrinter/AsmPrinter.cpp b/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
index 0fed4e9..0d63a2a 100644
--- a/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
+++ b/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
@@ -114,6 +114,7 @@ AsmPrinter::AsmPrinter(TargetMachine &tm, std::unique_ptr<MCStreamer> Streamer)
 }

 AsmPrinter::~AsmPrinter() {
+       return;
   assert(!DD && Handlers.empty() && "Debug/EH info didn't get finalized");

   if (GCMetadataPrinters) {

Patch for debugging Address Space issues.

diff --git a/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 682157b..d2a5c4a 100644
--- a/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -766,6 +766,8 @@ SDValue AMDGPUTargetLowering::LowerGlobalAddress(AMDGPUMachineFunction* MFI,
     unsigned Offset = MFI->allocateLDSGlobal(DL, *GV);
     return DAG.getConstant(Offset, SDLoc(Op), Op.getValueType());
   }
+  default:
+       printf("%s: address space type=%d\n", __func__, G->getAddressSpace());
   }

   const Function &Fn = *DAG.getMachineFunction().getFunction();

OpenCL Bandwidth Test (AMD App SDK)

Intel (Beignet GPGPU Driver)

Platform found : Intel

Device  0            Intel(R) HD Graphics Haswell GT2 Desktop
Build:               release
GPU work items:      32768
Buffer size:         33554432
CPU workers:         1
Timing loops:        20
Repeats:             1
Kernel loops:        20
inputBuffer:         CL_MEM_READ_ONLY 
outputBuffer:        CL_MEM_WRITE_ONLY 

Host baseline (naive):

Timer resolution     256.11  ns
Page fault           531.44  ns
CPU read             15.31 GB/s
memcpy()             15.54 GB/s
memset(,1,)          26.54 GB/s
memset(,0,)          27.06 GB/s


AVERAGES (over loops 2 - 19, use -l for complete log)
--------


1. Host mapped write to inputBuffer
 ---------------------------------------|---------------
 clEnqueueMapBuffer -- WRITE (GBPS)     | 9513.290
 ---------------------------------------|---------------
 memset() (GBPS)                        | 24.746
 ---------------------------------------|---------------
 clEnqueueUnmapMemObject() (GBPS)       | 6176.168


2. GPU kernel read of inputBuffer
 ---------------------------------------|---------------
 clEnqueueNDRangeKernel() (GBPS)        | 38.225

 Verification Passed!


3. GPU kernel write to outputBuffer
 ---------------------------------------|---------------
 clEnqueueNDRangeKernel() (GBPS)        | 26.198


4. Host mapped read of outputBuffer
 ---------------------------------------|---------------
 clEnqueueMapBuffer -- READ (GBPS)      | 9830.400
 ---------------------------------------|---------------
 CPU read (GBPS)                        | 15.431
 ---------------------------------------|---------------
 clEnqueueUnmapMemObject() (GBPS)       | 10485.760

 Verification Passed!


Passed!

AMD Radeon (OpenCL) 

Platform found : Mesa


Device  0            AMD POLARIS10 (DRM 3.2.0 / 4.7.0-rc7-meow+, LLVM 4.0.0)
Build:               release
GPU work items:      32768
Buffer size:         33554432
CPU workers:         1
Timing loops:        20
Repeats:             1
Kernel loops:        20
inputBuffer:         CL_MEM_READ_ONLY 
outputBuffer:        CL_MEM_WRITE_ONLY 

Host baseline (naive):

Timer resolution     256.12  ns
Page fault           538.31  ns
CPU read             12.19 GB/s
memcpy()             11.38 GB/s
memset(,1,)          20.93 GB/s
memset(,0,)          22.98 GB/s


AVERAGES (over loops 2 - 19, use -l for complete log)
--------


1. Host mapped write to inputBuffer
 ---------------------------------------|---------------
 clEnqueueMapBuffer -- WRITE (GBPS)     | 7586.161
 ---------------------------------------|---------------
 memset() (GBPS)                        | 6.369
 ---------------------------------------|---------------
 clEnqueueUnmapMemObject() (GBPS)       | 12822.261


2. GPU kernel read of inputBuffer
 ---------------------------------------|---------------
 clEnqueueNDRangeKernel() (GBPS)        | 113.481

 Verification Passed!


3. GPU kernel write to outputBuffer
 ---------------------------------------|---------------
 clEnqueueNDRangeKernel() (GBPS)        | 105.898


4. Host mapped read of outputBuffer
 ---------------------------------------|---------------
 clEnqueueMapBuffer -- READ (GBPS)      | 9.559
 ---------------------------------------|---------------
 CPU read (GBPS)                        | 17.179
 ---------------------------------------|---------------
 clEnqueueUnmapMemObject() (GBPS)       | 4060.750

 Verification Passed!


Passed!

Thursday, March 10, 2016

Fuzzing Vulkans, how do they work?

Introduction


Disclaimer: I have not yet fully read the specs on SPIR-V or Vulkan.

I decided to find out how hard it is to crash code working with SPIR-V.
Initially I wanted to crash the actual GPU drivers but for a start I decided
to experiment with GLSLang.

What I got


I used the "afl-fuzz" fuzzer to generate test cases that could crash
the parser. I have briefly examined the generated cases.
I have uploaded the results (which contain the SPIR-V binaries causing
the "spirv-remap" to crash) to the [following location](https://drive.google.com/file/d/0B7wcN-tOkdeRTGItSDhFM0JYUEk/view?usp=sharing)

Some of them trigger assertions in the code (which is not bad, but perhaps
returning an error code and shutting down cleanly would be better).
Some of them cause the code to hang for a long time or indefinitely (which is worse,
especially if someone intends to use the SPIR-V parser code online in the app).

Perhaps some of the results marked as "hangs" just cause too long compilation
time and could produce more interesting results if the timeout in "afl-fuzz"
is increased.

Two notable examples causing long compilation time are:
"out/crashes/id:000000,sig:06,src:000000,op:flip1,pos:15"
"out/hangs/id:000011,src:000000,op:flip1,pos:538" - for this one I waited for
a minute but it still did not complete the compilation while causing 100% CPU load.

A log output of "glslang" indicating that most of the error cases found are handled, but with "abort" instead of graceful shutdown.
http://pastebin.com/BnZ63tKJ

NVIDIA

I have also tried using these shaders with the NVIDIA driver (since it was the only hardware I could run a real Vulkan driver on).

I have used the "instancing" demo from [SaschaWillems Repository](https://github.com/SaschaWillems/Vulkan) .
I patched it to accept the path to the binary shader via the command line.
Next, I fed it with the generated
test cases. Some of them triggered segfaults inside the NVIDIA driver.
What is curious is that when I used the "hangs" examples, they also caused
the NVIDIA driver to take extra long time to compile and eventually crash
at random places.

I think it indicates either that there is some common code between the driver
and GLSLang (the reference implementation) or the specification is missing
some sanity check somewhere and the compiler can get stuck optimizing certain
code.
Is there a place in the specification that mandates that all the values are
checked to be within the allowed range, and all complex structures (such as
function calls) are checked recursively?



Perhaps I should have a look at other drivers (Mali anyone?).

```
[ 3672.137509] instancing[26631]: segfault at f ip 00007fb4624adebf sp 00007ffefd72e100 error 4 in libnvidia-glcore.so.355.00.29[7fb462169000+1303000]
[ 3914.294222] instancing[26894]: segfault at f ip 00007f00b28fcebf sp 00007ffdb9bab980 error 4 in libnvidia-glcore.so.355.00.29[7f00b25b8000+1303000]
[ 4032.430179] instancing[27017]: segfault at f ip 00007f7682747ebf sp 00007fff46679bf0 error 4 in libnvidia-glcore.so.355.00.29[7f7682403000+1303000]
[ 4032.915849] instancing[27022]: segfault at f ip 00007fb4e4099ebf sp 00007fff3c1ac0f0 error 4 in libnvidia-glcore.so.355.00.29[7fb4e3d55000+1303000]
[ 4033.011699] instancing[27023]: segfault at f ip 00007f7272900ebf sp 00007ffdb54261e0 error 4 in libnvidia-glcore.so.355.00.29[7f72725bc000+1303000]
[ 4033.107939] instancing[27025]: segfault at f ip 00007fbf0353debf sp 00007ffde4387750 error 4 in libnvidia-glcore.so.355.00.29[7fbf031f9000+1303000]
[ 4033.203924] instancing[27026]: segfault at f ip 00007f0f9a6f0ebf sp 00007ffff85a9dd0 error 4 in libnvidia-glcore.so.355.00.29[7f0f9a3ac000+1303000]
[ 4033.299138] instancing[27027]: segfault at 2967000 ip 00007fcb42cab720 sp 00007ffcbad45228 error 6 in libc-2.19.so[7fcb42c26000+19f000]
[ 4033.394667] instancing[27028]: segfault at 36d2000 ip 00007efc789eb720 sp 00007fff26c636d8 error 6 in libc-2.19.so[7efc78966000+19f000]
[ 4033.490918] instancing[27029]: segfault at 167b15e170 ip 00007f3b02095ec3 sp 00007ffd768cbf68 error 4 in libnvidia-glcore.so.355.00.29[7f3b01cbe000+1303000]
[ 4033.586699] instancing[27030]: segfault at 2ffc000 ip 00007fdebcc06720 sp 00007fff4fe59bd8 error 6 in libc-2.19.so[7fdebcb81000+19f000]
[ 4033.682939] instancing[27031]: segfault at 8 ip 00007fb80e7eed50 sp 00007ffe9cd21de0 error 4 in libnvidia-glcore.so.355.00.29[7fb80e410000+1303000]
[ 4374.480872] show_signal_msg: 27 callbacks suppressed
[ 4374.480876] instancing[27402]: segfault at f ip 00007fd1fc3cdebf sp 00007ffe483ff520 error 4 in libnvidia-glcore.so.355.00.29[7fd1fc089000+1303000]
[ 4374.809621] instancing[27417]: segfault at 2e0c3910 ip 00007f39af846e96 sp 00007ffe1c6d8f10 error 6 in libnvidia-glcore.so.355.00.29[7f39af46f000+1303000]
[ 4374.905112] instancing[27418]: segfault at 2dc46a68 ip 00007f7b9ff7af32 sp 00007fff290edf00 error 6 in libnvidia-glcore.so.355.00.29[7f7b9fba2000+1303000]
[ 4375.001019] instancing[27419]: segfault at f ip 00007f5a4e066ebf sp 00007ffe0b775d70 error 4 in libnvidia-glcore.so.355.00.29[7f5a4dd22000+1303000]
[ 4375.096894] instancing[27420]: segfault at f ip 00007f7274d49ebf sp 00007ffe96fdea10 error 4 in libnvidia-glcore.so.355.00.29[7f7274a05000+1303000]
[ 4375.193165] instancing[27421]: segfault at f ip 00007fa3bf3c8ebf sp 00007ffc4117e8d0 error 4 in libnvidia-glcore.so.355.00.29[7fa3bf084000+1303000]
[ 4375.288969] instancing[27423]: segfault at f ip 00007f50e0327ebf sp 00007ffc02aa1d50 error 4 in libnvidia-glcore.so.355.00.29[7f50dffe3000+1303000]
[ 4375.385530] instancing[27424]: segfault at f ip 00007f0d9a32eebf sp 00007ffd0298eb40 error 4 in libnvidia-glcore.so.355.00.29[7f0d99fea000+1303000]
[ 4375.481829] instancing[27425]: segfault at f ip 00007f8400bc5ebf sp 00007ffef0334240 error 4 in libnvidia-glcore.so.355.00.29[7f8400881000+1303000]
[ 4375.576983] instancing[27426]: segfault at 2dec2bc8 ip 00007f52260afec3 sp 00007fffd2bd1728 error 4 in libnvidia-glcore.so.355.00.29[7f5225cd8000+1303000]
```

How to reproduce


Below are the steps I have taken to crash the "spirv-remap" tool.
I believe this issue is worth looking at because some vendors may
choose to build their driver internals based on the reference implementations
which may lead to these bugs creeping directly into the shipped software as-is.

0. I have used the Debian Linux box. I have installed the "afl-fuzz" tool,
and also manually copied the Vulkan headers to "/usr/include".

1. cloned the GLSLang repository
```
git clone git@github.com:KhronosGroup/glslang.git
cd glslang
```

2. Compiled it with afl-fuzz
```
mkdir build
cat SetupLinux.sh 
cd build/
cmake -DCMAKE_C_COMPILER=afl-gcc -DCMAKE_CXX_COMPILER=afl-g++ ..
cd ..
```

3. Compiled a sample shader from the GLSL to the SPIR-V format using
```
./build/install/bin/glslangValidator -V -i Test/spv.130.frag
```

4. Verified that the "spirv-remap" tool works on the binary
```
./build/install/bin/spirv-remap -v -i frag.spv -o /tmp/
```

5. Fed the SPIR-V binary to the afl-fuzz
```
afl-fuzz -i in -o out ./build/install/bin/spirv-remap -v -i @@ -o /tmp
```

6. Quickly discovered several crashes. I attach the screenshot of afl-fuzz
at work.


7. Examined them.

First, I made a hex diff of the good and bad files. The command to generate
the diff is the following:
```
for i in out/crashes/*; do hexdump in/frag.spv > in.hex && hexdump $i > out.hex && diff -Naur in.hex out.hex; done > hex.diff
```

Next, I just ran the tool on all cases and generated the log of crash messages.
```
for i in out/crashes/*; do echo $i && ./build/install/bin/spirv-remap -i $i -o /tmp/ 2>&1 ;done > abort.log
```

Conclusions

Well, there are two obvious conclusions:
1. Vulkan/SPIR-V is still a WIP and drivers are not yet perfect
2. GPU drivers have always been notorious for poor compilers - not only codegen, but parsers and validators. Maybe part of the reason is that CPU compilers simply handle more complex code and therefore more edge cases have been hit already.

Thursday, February 25, 2016

Notes on firmware and tools

Introduction.

In this blog post I mainly want to summarize my latest experiments at using clang's static analyzer and some thoughts on what could be further done at the analyzer, and open-source software quality in general. It's mostly some notes I've decided to put up.

Here are the references to the people involved in developing clang static analyzer at Apple. I recommend following them on twitter and also reading their presentation on developing custom checkers for the analyzer.
Ted Kremenek - https://twitter.com/tkremenek
Anna Zaks - https://twitter.com/zaks_anna
Jordan Rose - https://twitter.com/UINT_MIN

"How to Write a Checker in 24 Hours" - http://llvm.org/devmtg/2012-11/Zaks-Rose-Checker24Hours.pdf
"Checker Developer Manual" - http://clang-analyzer.llvm.org/checker_dev_manual.html - This one requires some understanding of LLVM, so I recommend to get comfortable with using the analyzer and looking at AST dumps first.

There are not so many analyzer plugins which were not made at Apple.


GLibc.

Out of curiosity I ran the clang analyzer on glibc. Setting it up was not a big deal - in fact, all that was needed was to just run the scan-build. It did a good job of intercepting the gcc calls and most of the sources were successfully analyzed. In fact, I did not do anything special, but even running the analyzer with the default configuration revealed a few true bugs, like the one showed in the screenshot.

For example, the "iconv" function has a check to see if the argument "outbuf" is NULL, which indicates that it is a valid case expected by the authors. The manpage for the function also says that passing a NULL argument as "outbuf" is valid. However, we can see that one of the branches is missing the similar check for NULL pointer, which probably resulted from a copy-paste in the past. So, passing valid pointers to "inbuf" and "inbytesleft" and a NULL pointer for "outbuf" leads to the NULL pointer dereference and consequently a SIGSEGV.

Fun fact: my example is also not quite correct, as pointed out by my Twitter followers. The third argument to iconv must be a pointer to the integer, not the integer. However, on my machine my example crashes at dereferencing the "outbuf" pointer and not the "inbytesleft", because the argument evaluation order in C is not specified, and on x86_64 arguments are usually pushed to the stack (and therefore evaluated) in the reverse order.
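For reference, a corrected, hypothetical reproducer of that scenario might look like this (it is not the original snippet from the screenshot):

```
/* Valid inbuf/inbytesleft, NULL outbuf: allowed by the manpage, but one glibc
 * branch reportedly dereferences outbuf without a NULL check. */
#include <iconv.h>
#include <string.h>

int main(void)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return 1;
    char in[] = "hello";
    char *inbuf = in;
    size_t inleft = strlen(in);
    iconv(cd, &inbuf, &inleft, NULL, NULL);  /* outbuf == NULL */
    iconv_close(cd);
    return 0;
}
```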

Is it a big deal? Who knows. On the one hand, it's a userspace library, and unlike OpenSSL, no one is likely to embed it into kernel-level firmware. On the other, I may very well imagine a device such as a router or a web kiosk where this NULL pointer dereference could be triggered, because internationalization and text manipulation are always complex.


Linux Kernel.

I had this idea of trying to build LLVMLinux with clang for quite a while, but never really had the time to do it. My main interest in doing so was using the clang static analyzer.

Currently, some parts of Linux Kernel fail to build with clang, so I had to use the patches from the LLVMLinux project. They failed to apply cleanly though. I had to manually edit several patches. Another problem is that "git am" does not support the "fuzzy" strategy when applying the patches, so I had to use a certain script found on GitHub which uses "patch" and "git commit" to do the same thing.
https://gist.github.com/kfish/7425248

I have pushed my tree to github. I based it off the latest upstream at the time when I worked on it (which was the start of February 2016). The branch is called "4.5_with_llvmlinux".
https://github.com/astarasikov/linux/tree/4.5_with_llvmlinux

I've used the following commands to get the analysis results. Note that some files have failed to compile, and I had to manually stop the compilation job for one file that took over 40 minutes.

export CCC_CXX=clang++
export CCC_CC=clang
scan-build make CC=clang HOSTCC=clang -j10 -k


Ideas.

Porting clang instrumentations to major OpenSource projects.

Clang has a number of instrumentation plugins called "Sanitizers" which were largely developed by Konstantin Serebryany and Dmitry Vyukov at Google.

Arguably the most useful tool for C code is AddressSanitizer, which allows catching Use-After-Free and Out-of-Bounds accesses on arrays. There exists a port of the tool for the Linux Kernel, called Kernel AddressSanitizer, or KASAN, and it has been used to uncover a lot of memory corruption bugs, leading to potential vulnerabilities or kernel panics.
Another tool, also based on compile-time instrumentation, is the ThreadSanitizer for catching data races, and it has also been ported to the Linux Kernel.
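
For readers who have not used it, here is a minimal sketch of the kind of bug AddressSanitizer reports (build with something like "clang -g -fsanitize=address"):

#include <stdlib.h>

int main(void) {
    int *p = malloc(4 * sizeof(int));
    p[0] = 42;
    free(p);
    /* Read of freed memory: ASan reports a heap-use-after-free here. */
    return p[0];
}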


I think it would be very useful to port these tools to other projects. There is a lot of system-level software driving critical aspects of the system initialization process. To name a few:
  • EDK2 UEFI Development Kit
  • LittleKernel bootloader by Travis Geiselbrecht. It has been used extensively in the embedded world lately instead of u-boot. Qualcomm (and CodeAurora, its open-source department) uses a customized version of LK for its SoCs, so nearly every Android phone built around a Qualcomm chip has LK inside it. NVIDIA has recently started shipping a version of LK called "TLK" or "Trusted Little Kernel" in its K1 processors, with the largest deployment yet being the Google Nexus 9 tablet.
  • U-boot bootloader, which is still used in a lot of devices.
  • FreeBSD and other kernels. While I don't know how widely they are deployed today, it would still be useful, at least for junior and intermediate kernel hackers.
  • XNU Kernel powering Apple iOS and OS X. I have some feeling that Apple might be using some of the tools internally, though the public snapshots of the source are heavily stripped of newest features.
Btw, if anyone is struggling to get userspace AddressSanitizer working with NVIDIA drivers, take a look at this thread where I posted a workaround. Long story short, NVIDIA mmaps memory in large chunks up to a certain address range, and you need to change the virtual address which ASan claims for itself.

Note that you will likely run into problems if you build a library with a customized ASan and link it to something built with unpatched ASan, so if you want to experiment with custom instrumentation I recommend using a distro where you can rebuild all packages simultaneously, such as NixOS, Gentoo or *BSD.

As I have mentioned before, there is a caveat with all these new instrumentation tools - most of them are made for 64-bit architectures, while the majority of embedded code is still 32-bit. There are several ways to address this, such as running memory checkers in a separate process or using a simulator such as a modified QEMU, but it remains an open problem.

Other techniques for firmware quality

In one of my previous posts I drew attention to gathering code coverage in low-level software such as bootloaders. With GCOV being quite easy to port, I think more projects should be exposed to it. Also, AddressSanitizer now comes with its own infrastructure for gathering code coverage, which is more extensive than GCOV.

Here's a link to the post which shows how one can easily embed GCOV to get coverage, using Little Kernel as an example.
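
The embedding itself usually boils down to building with coverage enabled and adding a hook that dumps the counters over whatever channel the firmware has. A rough sketch (the exact hook name varies between toolchain versions - newer libgcov exposes __gcov_dump(), older ones __gcov_flush() - and the I/O plumbing that actually stores the counters is left out):

/* Assumes the target is built with "--coverage" and a libgcov port
 * whose I/O hooks write the .gcda data somewhere retrievable. */
extern void __gcov_dump(void);

void coverage_dump_command(void)
{
    __gcov_dump(); /* serialize the coverage counters */
}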

While userspace applications have received quite a lot of fuzzing recently, especially with the introduction of the AFL fuzzer, the kernel side has received little attention. For Linux, there is a syscall fuzzer called Trinity.

There is also an interesting project on fuzzing the kernel through the USB stack (vUSBf).

What should be done is adapting these techniques to other firmware projects. On the one hand, it might get tricky due to the limited debugging facilities available in some kernels. On the other hand, the capabilities provided by simulators such as QEMU are virtually unlimited (pun unintended).

An interesting observation is that there are limited sources of external data input on a system. They include processor interrupt vectors/handlers and MMIO hardware. As for the latter, Linux and most other firmware have certain facilities for working with MMIO - functions like "readb()", "readw()" and "ioremap()". Also, if we're speaking of a simulator such as QEMU, we can identify memory regions of interest by walking the page table and checking physical addresses against external devices, and by checking the access type bits - for example, data marked as executable is more likely to be firmware code, while uncached contiguous regions of memory are likely DMA windows.
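
As a reminder of what those MMIO access points look like in practice (and therefore what an instrumentation pass or a simulator would want to hook), here is a rough Linux-flavored sketch with a made-up register offset:

#include <linux/io.h>
#include <linux/types.h>

/* Illustrative only: a real driver obtains the region from a resource,
 * checks ioremap() for failure and keeps the mapping around. */
static u32 read_device_status(phys_addr_t phys)
{
    void __iomem *base = ioremap(phys, 0x1000); /* map the MMIO window */
    u32 status = readl(base + 0x04);            /* hypothetical STATUS register */

    iounmap(base);
    return status;
}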

ARM mbed TLS library is a good example of a project that tries to integrate as many dynamic and static tools into its testing process as possible. However, it can be built as a userspace library on a desktop, which makes it less interesting in the context of firmware security.

Another technique that has been catching my attention for a long time is symbolic execution. In many ways it is a problem similar to static analysis - you need to find a set of input values, constrained by certain equations, that triggers a specific execution path leading to an incorrect state (say, a NULL pointer dereference).

Rumor has it that a tool based on this technique, called SAGE, is actively used at Microsoft Research to analyze internal code, but sadly there are not many open-source and well-documented tools one can easily play with and directly plug into an existing project.

An interesting example of applying this idea to a practical problem at a large and established corporation (Intel) is presented in the paper "Symbolic execution for BIOS security", which tries to use symbolic execution with KLEE to attack firmware - the SMM handler. You can find more interesting details in the blog of one of the authors of the paper, Vincent Zimmer (https://twitter.com/vincentzimmer and http://vzimmer.blogspot.com/).
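
To give a feeling for the approach, here is a toy harness of the kind one feeds to KLEE (the function and buffer names are made up; klee_make_symbolic marks the input as symbolic so the engine searches for values that reach the bad path):

#include <klee/klee.h>
#include <stddef.h>

/* Toy parser with a bug: one branch loses the output pointer. */
static int parse(const char *buf, char **out) {
    if (buf[0] == 'X')
        out = NULL;
    *out = (char *)buf;   /* NULL dereference when buf[0] == 'X' */
    return 0;
}

int main(void) {
    char buf[4];
    char *out;
    klee_make_symbolic(buf, sizeof(buf), "buf");
    return parse(buf, &out);
}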

Also, not directly related, but here is an interesting presentation about bug hunting on Android.
http://www.slideshare.net/jiahongfang5/qualcomm2015-jfang-nforest

I guess now I will have to focus on studying recent exploitation techniques used for hacking XNU, Linux and Chromium to get a clearer picture of what needs to be achieved :)

Further developing the clang analyzer.

One feature missing from clang is the ability to perform analysis across translation units. A translation unit (TU) in clang/LLVM roughly represents a file on disk being parsed. The clang analyzer is implemented as a pass which traverses the AST and is limited to a single translation unit. Strictly speaking, that is not quite true - it will analyze includes recursively. But if you have two separate C files which do not include one another, and a function from one file calls a function from the other, the analysis will not work.
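
A minimal illustration with hypothetical file names:

/* a.c - may return NULL, but only this TU sees the implementation. */
#include <stdlib.h>
char *make_buffer(int n) {
    return (n > 0) ? malloc((size_t)n) : NULL;
}

/* b.c - a separate TU that never includes a.c, so the intra-TU
 * analyzer cannot prove the possible NULL return and stays silent. */
char *make_buffer(int n);
void use(void) {
    char *p = make_buffer(0);
    p[0] = 1;   /* potential NULL dereference, missed across TUs */
}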

Implementing a checker across multiple sources is not a trivial thing, as pointed out by the analyzer authors on the mailing list, though this topic is often brought up by different people.

Often it is possible to come up with workarounds, especially if one aims at implementing ad-hoc checks for their own project. The simplest one is creating a top-level "umbrella" C file which includes all the files with the implementations of the functions; I have seen some projects do exactly this. The most obvious shortcomings of this approach are that it may require redesigning the whole structure of the project, and that it will not work if some of the translation units need custom C compiler options.
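
A sketch of such an umbrella TU (hypothetical file names):

/* all.c - includes the implementation files themselves so the analyzer
 * sees the whole project as one translation unit. */
#include "parser.c"
#include "codegen.c"
#include "utils.c"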

Another option would be to dump/serialize the AST and any additional information during the compilation of each TU and process it after the whole project is built. It looks like this approach has been proposed multiple times on the mailing list, and there exists at least one paper which claims to do that.

"Towards Vulnerability Discovery Using Extended Compile-time Analysis" - http://arxiv.org/pdf/1508.04627.pdf

In fact, the analyzer part itself might very well be independent of the actual parser and could be reused, for example, to analyze data obtained from disassembly, but that is a topic for another research area.

Developing ad-hoc analyzers for certain projects.

While it is very difficult to statically rule out many kinds of errors in an arbitrary project, either due to the state explosion problem or the dynamic nature of the code, quite often we can design tools to verify contracts specific to a certain project.

An example of such a tool is "sparse" from the Linux kernel. It effectively works as a parser for C code and can be made to run on every C file GCC compiles while building the kernel. It allows specifying annotations on declarations in the C code, similar to how attributes were later implemented in GCC and Clang.
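
For illustration, the kernel's "__user" annotation looks roughly like this (sparse defines __CHECKER__ while it runs; the exact definition has changed between kernel versions):

#ifdef __CHECKER__
# define __user __attribute__((noderef, address_space(1)))
#else
# define __user
#endif

long peek_user(const char __user *src)
{
    /* return *src;   <- sparse would warn about dereferencing a
     *                   noderef (user-space) pointer here. */
    return 0;
}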

One notable example of code in the Linux kernel which deals with passing void pointers around and relies on pointer trickery via the "container_of" macro is the workqueue library.
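
The pattern looks roughly like this (a simplified container_of and made-up structure names; the real workqueue code embeds a struct work_struct and recovers the enclosing object in the callback):

#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct work { void (*fn)(struct work *w); };

struct my_device {
    int irq;
    struct work resume_work;   /* embedded member */
};

static void resume_fn(struct work *w)
{
    /* Correct only if w really points at a my_device's resume_work;
     * nothing in the type system enforces that. */
    struct my_device *dev = container_of(w, struct my_device, resume_work);
    (void)dev->irq;
}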

While working on the FreeBSD kernel during GSoC in 2014, I faced a similar problem while developing device drivers - at certain places pointers were cast to void, and cast back to typed pointers where needed. Certainly it's easy to make a mistake while coding these things.

Now, if we dump enough information during compilation, we can implement advanced checks. For example, when facing a function pointer which is initialized dynamically, we can do two things: first, find all places where it can potentially be initialized; second, find all functions with a matching prototype. While checking all of them might be time-consuming and generate false positives, it will also allow checking more code statically at compile time.

A notable source of problems when working with C code is that the linking stage is traditionally separated from the compilation stage. The linker usually manipulates abstract "symbols" which are essentially just void pointers. Even though it would be possible to store enough information about the types in a section of the ELF (in fact, DWARF debugging data contains type information) and use it to type-check symbols at link time, this is usually not done.

This leads to certain funky problems. For example, aliases (weak aliases) are a link-time feature. If one defines an alias to some function, where the type of the alias does not match the type of the original function, the compiler cannot check it (well, it could if someone wrote a checker, but it does not), and you get silent memory corruption at runtime. I once ran into this issue when porting the RUMP library, which ships a custom libc, and our C library had a different size for "off_t".
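
A sketch of that failure mode with made-up function names (recent compilers may warn about the incompatible prototypes, but the link still goes through):

/* The "real" implementation takes a 64-bit offset... */
long long seek64(int fd, long long off) { return off; }

/* ...while the alias is declared with a 32-bit one.  Callers of
 * seek32() pass 4 bytes where 8 are expected; the mismatch is resolved
 * by the linker, which never compares the types. */
int seek32(int fd, int off) __attribute__((weak, alias("seek64")));

int main(void) { return (int)seek32(0, 1); }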

Refactoring

There are two ideas I had floating around for quite a while.

Microkernelizing Linux.

An interesting research problem would be coming up with a tool to automatically convert/rewrite Linux kernel code into a microkernel, with device drivers and other complex pieces of code living in separate protection domains (processes).

It is interesting for several reasons. One is that a lot of microkernels, such as L4, rely on DDE-Kits to run pieces of Linux and other systems (such as NetBSD in the case of RUMP) to provide drivers. Another is that there is a lot of tested code with a large amount of functionality which could possibly be made more secure by minimizing the impact of memory corruption.

Besides the obvious performance concerns, there are a lot of research questions here.

  • Converting accesses to global variables into IPC accesses. The most challenging part would be dealing with function pointers and callbacks.
  • Parsing KConfig files and "#ifdef" statements to ensure all conditional compilation cases are covered when refactoring. This in itself is a huge problem for every C codebase - if you change something in one branch of an "#ifdef" statement, you cannot guarantee you didn't break another branch. To get full coverage, it would be useful to come up with an automated way to ensure all configurations of "#ifdef" are built.
  • Deciding which pieces of code should be linked statically and reside in the same process. For example, one might want the sound subsystem, DRM and VFS to run as separate processes, but going as far as converting each TU into a process would be overkill and a performance disaster.

Translating C code to another language.

I am not really sure whether this could be useful. It is likely that any complex code involving pointers and arrays would require a target language with similar capabilities (if we're speaking of generating useful and readable idiomatic code, and not just a translator such as emscripten). Therefore, the target language might very well have the same areas of unspecified behavior. Some people have instead proposed a stricter and more well-defined dialect of C.

One may note that it is not necessary to use clang for any of these tasks. In fact, one can get away with writing a custom parser or hacking GCC; these options are perfectly legal. I have had some of these ideas floating around for at least three years, but back then I didn't have the skills to build GCC and hack on it, and now I've been mostly using clang, but it's not a big deal.

Conclusion

Overall, neither improving the compiler nor switching to another language alone would save us from errors. What's often overlooked is the process of continuous integration: regression tests that show what broke, coverage gathering, and integration tests.
There are also a lot of non-obvious parameters which are difficult to measure - for example, how to set up a regression test that would detect software breakage caused by a new compiler optimization.

We could also borrow a lot of practices from the HDL/digital design industry. Beyond coverage, an interesting idea is simulating the same design in different languages with different semantics, in the hope that mistakes made at the implementation stage do not land in the same places, so that testing will show where the outputs of the two systems diverge.

P.S.

Ping me if you're interested in working on the topics mentioned above :)