Thursday, March 10, 2016

Fuzzing Vulkans, how do they work?

Introduction


Disclaimer: I have not yet fully read the specs on SPIR-V or Vulkan.

I decided to find out how hard it is to crash code working with SPIR-V.
Initially I wanted to crash the actual GPU drivers but for a start I decided
to experiment with GLSLang.

What I got


I used the "afl-fuzz" fuzzer to generate test cases that crash the parser, and I have briefly examined the generated cases.
I have uploaded the results (which contain the SPIR-V binaries causing "spirv-remap" to crash) to the [following location](https://drive.google.com/file/d/0B7wcN-tOkdeRTGItSDhFM0JYUEk/view?usp=sharing).

Some of them trigger assertions in the code (which is not bad, though returning an error code and shutting down cleanly would arguably be better).
Some of them cause the code to hang for a long time or indefinitely (which is worse, especially if someone intends to run the SPIR-V parser code at runtime inside an application).

Perhaps some of the results marked as "hangs" just cause too long compilation
time and could produce more interesting results if the timeout in "afl-fuzz"
is increased.

Two notable examples causing long compilation time are:
"out/crashes/id:000000,sig:06,src:000000,op:flip1,pos:15"
"out/hangs/id:000011,src:000000,op:flip1,pos:538" - for this one I waited for
a minute, but it still did not complete the compilation while causing 100% CPU load.

Here is a log of "glslang" output indicating that most of the error cases found are handled, but with "abort" instead of a graceful shutdown:
http://pastebin.com/BnZ63tKJ

NVIDIA

I have also tried using these shaders with the NVIDIA driver (since it was the only hardware I could run a real Vulkan driver on).

I have used the "instancing" demo from the [SaschaWillems repository](https://github.com/SaschaWillems/Vulkan).
I patched it to accept the path to the binary shader via the command line.
Next, I fed it the generated test cases. Some of them triggered segfaults inside the NVIDIA driver.
What is curious is that when I used the "hangs" examples, they also caused
the NVIDIA driver to take an extra long time to compile and eventually crash
at random places.
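
For reference, my patch essentially boils down to loading the SPIR-V blob from the path given on the command line and handing it to the driver. A rough sketch (not the exact code from the patched demo):

```c
#include <stdio.h>
#include <stdlib.h>
#include <vulkan/vulkan.h>

/* Rough sketch: load a (possibly fuzzed) SPIR-V blob from disk and hand it
 * straight to the driver. For many of the generated inputs the crash happens
 * inside vkCreateShaderModule() or later, at pipeline creation time. */
VkShaderModule load_shader_module(VkDevice dev, const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return VK_NULL_HANDLE;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);

    uint32_t *code = malloc(size);
    fread(code, 1, size, f);
    fclose(f);

    VkShaderModuleCreateInfo ci = {
        .sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO,
        .codeSize = (size_t)size,
        .pCode = code,
    };
    VkShaderModule mod = VK_NULL_HANDLE;
    vkCreateShaderModule(dev, &ci, NULL, &mod);
    free(code);
    return mod;
}
```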

I think this behavior indicates either that there is some code shared between the driver
and GLSLang (the reference implementation), or that the specification is missing
a sanity check somewhere, so the compiler can get stuck optimizing certain code.
Is there a place in the specification that mandates that all values are
checked to be within the allowed range, and that all complex structures (such as
function calls) are checked recursively?



Perhaps I should have a look at other drivers (Mali anyone?).

```
[ 3672.137509] instancing[26631]: segfault at f ip 00007fb4624adebf sp 00007ffefd72e100 error 4 in libnvidia-glcore.so.355.00.29[7fb462169000+1303000]
[ 3914.294222] instancing[26894]: segfault at f ip 00007f00b28fcebf sp 00007ffdb9bab980 error 4 in libnvidia-glcore.so.355.00.29[7f00b25b8000+1303000]
[ 4032.430179] instancing[27017]: segfault at f ip 00007f7682747ebf sp 00007fff46679bf0 error 4 in libnvidia-glcore.so.355.00.29[7f7682403000+1303000]
[ 4032.915849] instancing[27022]: segfault at f ip 00007fb4e4099ebf sp 00007fff3c1ac0f0 error 4 in libnvidia-glcore.so.355.00.29[7fb4e3d55000+1303000]
[ 4033.011699] instancing[27023]: segfault at f ip 00007f7272900ebf sp 00007ffdb54261e0 error 4 in libnvidia-glcore.so.355.00.29[7f72725bc000+1303000]
[ 4033.107939] instancing[27025]: segfault at f ip 00007fbf0353debf sp 00007ffde4387750 error 4 in libnvidia-glcore.so.355.00.29[7fbf031f9000+1303000]
[ 4033.203924] instancing[27026]: segfault at f ip 00007f0f9a6f0ebf sp 00007ffff85a9dd0 error 4 in libnvidia-glcore.so.355.00.29[7f0f9a3ac000+1303000]
[ 4033.299138] instancing[27027]: segfault at 2967000 ip 00007fcb42cab720 sp 00007ffcbad45228 error 6 in libc-2.19.so[7fcb42c26000+19f000]
[ 4033.394667] instancing[27028]: segfault at 36d2000 ip 00007efc789eb720 sp 00007fff26c636d8 error 6 in libc-2.19.so[7efc78966000+19f000]
[ 4033.490918] instancing[27029]: segfault at 167b15e170 ip 00007f3b02095ec3 sp 00007ffd768cbf68 error 4 in libnvidia-glcore.so.355.00.29[7f3b01cbe000+1303000]
[ 4033.586699] instancing[27030]: segfault at 2ffc000 ip 00007fdebcc06720 sp 00007fff4fe59bd8 error 6 in libc-2.19.so[7fdebcb81000+19f000]
[ 4033.682939] instancing[27031]: segfault at 8 ip 00007fb80e7eed50 sp 00007ffe9cd21de0 error 4 in libnvidia-glcore.so.355.00.29[7fb80e410000+1303000]
[ 4374.480872] show_signal_msg: 27 callbacks suppressed
[ 4374.480876] instancing[27402]: segfault at f ip 00007fd1fc3cdebf sp 00007ffe483ff520 error 4 in libnvidia-glcore.so.355.00.29[7fd1fc089000+1303000]
[ 4374.809621] instancing[27417]: segfault at 2e0c3910 ip 00007f39af846e96 sp 00007ffe1c6d8f10 error 6 in libnvidia-glcore.so.355.00.29[7f39af46f000+1303000]
[ 4374.905112] instancing[27418]: segfault at 2dc46a68 ip 00007f7b9ff7af32 sp 00007fff290edf00 error 6 in libnvidia-glcore.so.355.00.29[7f7b9fba2000+1303000]
[ 4375.001019] instancing[27419]: segfault at f ip 00007f5a4e066ebf sp 00007ffe0b775d70 error 4 in libnvidia-glcore.so.355.00.29[7f5a4dd22000+1303000]
[ 4375.096894] instancing[27420]: segfault at f ip 00007f7274d49ebf sp 00007ffe96fdea10 error 4 in libnvidia-glcore.so.355.00.29[7f7274a05000+1303000]
[ 4375.193165] instancing[27421]: segfault at f ip 00007fa3bf3c8ebf sp 00007ffc4117e8d0 error 4 in libnvidia-glcore.so.355.00.29[7fa3bf084000+1303000]
[ 4375.288969] instancing[27423]: segfault at f ip 00007f50e0327ebf sp 00007ffc02aa1d50 error 4 in libnvidia-glcore.so.355.00.29[7f50dffe3000+1303000]
[ 4375.385530] instancing[27424]: segfault at f ip 00007f0d9a32eebf sp 00007ffd0298eb40 error 4 in libnvidia-glcore.so.355.00.29[7f0d99fea000+1303000]
[ 4375.481829] instancing[27425]: segfault at f ip 00007f8400bc5ebf sp 00007ffef0334240 error 4 in libnvidia-glcore.so.355.00.29[7f8400881000+1303000]
[ 4375.576983] instancing[27426]: segfault at 2dec2bc8 ip 00007f52260afec3 sp 00007fffd2bd1728 error 4 in libnvidia-glcore.so.355.00.29[7f5225cd8000+1303000]
```

How to reproduce


Below are the steps I took to crash the "spirv-remap" tool.
I believe this issue is worth looking at because some vendors may
choose to build their driver internals on top of the reference implementation,
which may lead to these bugs creeping into shipping software as-is.

0. I used a Debian Linux box. I installed the "afl-fuzz" tool
and manually copied the Vulkan headers to "/usr/include".

1. Cloned the GLSLang repository
```
git clone git@github.com:KhronosGroup/glslang.git
cd glslang
```

2. Compiled it with afl-fuzz
```
mkdir build
cat SetupLinux.sh 
cd build/
cmake -DCMAKE_C_COMPILER=afl-gcc -DCMAKE_CXX_COMPILER=afl-g++ ..
cd ..
```

3. Compiled a sample shader from the GLSL to the SPIR-V format using
```
./build/install/bin/glslangValidator -V -i Test/spv.130.frag
```

4. Verified that the "spirv-remap" tool works on the binary
```
./build/install/bin/spirv-remap -v -i frag.spv -o /tmp/
```

5. Fed the SPIR-V binary to the afl-fuzz
```
afl-fuzz -i in -o out ./build/install/bin/spirv-remap -v -i @@ -o /tmp
```

6. Quickly discovered several crashes. I attach a screenshot of afl-fuzz
at work.


7. Examined them.

First, I made a hex diff of the good and bad files. The command to generate
the diff is the following:
```
for i in out/crashes/*; do hexdump in/frag.spv > in.hex && hexdump $i > out.hex && diff -Naur in.hex out.hex; done > hex.diff
```

Next, I just ran the tool on all cases and generated the log of crash messages.
```
for i in out/crashes/*; do echo $i && ./build/install/bin/spirv-remap -i $i -o /tmp/ 2>&1 ;done > abort.log
```

Conclusions

Well, there are two obvious conclusions:
1. Vulkan/SPIR-V is still a work in progress and the drivers are not yet perfect.
2. GPU drivers have always been notorious for poor compilers - not only codegen, but also parsers and validators. Maybe part of the reason is that CPU compilers simply handle more complex code, so more edge cases have already been hit.

Thursday, February 25, 2016

Notes on firmware and tools

Introduction.

In this blog post I mainly want to summarize my latest experiments with clang's static analyzer, along with some thoughts on what could be done further in the analyzer and in open-source software quality in general. These are mostly notes I've decided to put up.

Here are the references to the people involved in developing clang static analyzer at Apple. I recommend following them on twitter and also reading their presentation on developing custom checkers for the analyzer.
Ted Kremenek - https://twitter.com/tkremenek
Anna Zaks - https://twitter.com/zaks_anna
Jordan Rose - https://twitter.com/UINT_MIN

"How to Write a Checker in 24 Hours" - http://llvm.org/devmtg/2012-11/Zaks-Rose-Checker24Hours.pdf
"Checker Developer Manual" - http://clang-analyzer.llvm.org/checker_dev_manual.html - This one requires some understanding of LLVM, so I recommend to get comfortable with using the analyzer and looking at AST dumps first.

There are not many analyzer plugins that were made outside Apple.


GLibc.

Out of curiosity I ran the clang analyzer on glibc. Setting it up was not a big deal - in fact, all that was needed was to run scan-build. It did a good job of intercepting the gcc calls, and most of the sources were successfully analyzed. I did not do anything special; even running the analyzer with the default configuration revealed a few true bugs, like the one shown in the screenshot.

For example, the "iconv" function has a check to see if the argument "outbuf" is NULL, which indicates that this is a valid case expected by the authors. The manpage for the function also says that passing a NULL "outbuf" is valid. However, we can see that one of the branches is missing a similar NULL pointer check, which probably resulted from copy-paste in the past. So, passing valid pointers for "inbuf" and "inbytesleft" and a NULL pointer for "outbuf" leads to a NULL pointer dereference and consequently a SIGSEGV.

Fun fact: my example is also not quite correct, as pointed out by my Twitter followers. The third argument to iconv must be a pointer to an integer, not the integer itself. However, on my machine the example crashes when dereferencing the "outbuf" pointer and not "inbytesleft", because the argument evaluation order in C is unspecified, and on x86_64 arguments are usually pushed to the stack (and therefore evaluated) in reverse order.
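
For reference, here is a minimal sketch of the kind of call described above, with the corrected argument types (assuming a glibc version that still has the missing check; the chosen encodings are arbitrary):

```c
#include <iconv.h>
#include <string.h>

int main(void)
{
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    char in[] = "hello";
    char *inbuf = in;
    size_t inbytesleft = strlen(in);

    /* Valid inbuf/inbytesleft, NULL outbuf: the branch missing the NULL
     * check dereferences outbuf and the process receives SIGSEGV. */
    iconv(cd, &inbuf, &inbytesleft, NULL, NULL);

    iconv_close(cd);
    return 0;
}
```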

Is it a big deal? Who knows. On the one hand, it's a userspace library, and unlike OpenSSL, no one is likely to embed it into kernel-level firmware. On the other hand, I can very well imagine a device such as a router or a web kiosk where this NULL pointer dereference could be triggered, because internationalization and text manipulation is always a complex issue.


Linux Kernel.

I had this idea of trying to build LLVMLinux with clang for quite a while, but never really had the time to do it. My main interest in doing so was using the clang static analyzer.

Currently, some parts of the Linux kernel fail to build with clang, so I had to use the patches from the LLVMLinux project. They did not apply cleanly though; I had to manually edit several of them. Another problem is that "git am" does not support the "fuzzy" strategy when applying patches, so I used a script found on GitHub that uses "patch" and "git commit" to do the same thing.
https://gist.github.com/kfish/7425248

I have pushed my tree to github. I based it off the latest upstream at the time when I worked on it (which was the start of February 2016). The branch is called "4.5_with_llvmlinux".
https://github.com/astarasikov/linux/tree/4.5_with_llvmlinux

I used the following commands to get the analysis results. Note that some files failed to compile, and I had to manually stop the compilation job for one file that took over 40 minutes.

export CCC_CXX=clang++
export CCC_CC=clang
scan-build make CC=clang HOSTCC=clang -j10 -k


Ideas.

Porting clang instrumentation to major open-source projects.

Clang has a number of instrumentation plugins called "Sanitizers" which were largely developed by Konstantin Serebryany and Dmitry Vyukov at Google.

Arguably the most useful tool for C code is AddressSanitizer, which allows catching use-after-free and out-of-bounds accesses on arrays. There is a port of the tool for the Linux kernel, called Kernel AddressSanitizer or KASAN, and it has been used to uncover a lot of memory corruption bugs leading to potential vulnerabilities or kernel panics.
Another tool, also based on compile-time instrumentation, is the ThreadSanitizer for catching data races, and it has also been ported to the Linux Kernel.


I think it would be very useful to port these tools to other projects. There is a lot of system-level software driving critical aspects of the system initialization process. To name a few:
  • EDK2 UEFI Development Kit
  • LittleKernel bootloader by Travis Geiselbrecht. It has been extensively used in the embedded world instead of u-boot lately. Qualcomm (and Codeaurora, its Open-Source department) is using a customized version of LK for its SoCs, and nearly every mobile phone with Android has LK inside it. NVIDIA have recently started shipping a version of LK called "TLK" or "Trusted Little Kernel" in its K1 processors, with the largest deployment yet being the Google Nexus 9 tablet.
  • U-boot bootloader, which is still used in a lot of devices.
  • FreeBSD and other kernels. While I don't know how widely they are deployed today, it would still be useful, at least for junior and intermediate kernel hackers.
  • XNU Kernel powering Apple iOS and OS X. I have some feeling that Apple might be using some of the tools internally, though the public snapshots of the source are heavily stripped of newest features.
Btw, if anyone is struggling to get userspace AddressSanitizer working with NVIDIA drivers, take a look at this thread where I posted a workaround. Long story short, NVIDIA mmaps memory in large chunks up to a certain range, and you need to change the virtual address which ASan claims for itself.

Note that you will likely run into problems if you build a library with a customized ASan and link it against something built with an unpatched ASan, so if you want to experiment with custom instrumentation, I recommend using a distro where you can rebuild all packages simultaneously, such as NixOS, Gentoo or *BSD.

As I have mentioned before, there is a caveat with all these new instrumentation tools - most of them are made for 64-bit architectures, while the majority of embedded code stays 32-bit. There are several ways to address this, such as running memory checkers in a separate process or using a simulator such as a modified QEMU, but it remains an open problem.

Other techniques for firmware quality

In one of my previous posts I drew attention to gathering code coverage in low-level software, such as bootloaders. With GCOV being quite easy to port, I think more projects should be exposed to it. Also, AddressSanitizer now comes with its own infrastructure for gathering code coverage, which is more extensive than GCOV.

Here's a link to the post which shows how one can easily embed GCOV to get coverage using Little Kernel as an example.

While userspace applications have received quite a lot of fuzzing lately, especially with the introduction of the AFL fuzzer, the kernel side has received little attention. For Linux, there is a syscall fuzzer called Trinity.

There is also an interesting project on fuzzing the kernel through the USB stack (vUSBf).

What should be done is adapting these techniques to other firmware projects. On the one hand, it might get tricky due to the limited debugging facilities available in some kernels. On the other hand, the capabilities provided by simulators such as QEMU are virtually unlimited (pun unintended).

An interesting observation is that a system has a limited number of external data inputs: processor interrupt vectors/handlers and MMIO hardware. As for the latter, Linux and most other firmware have certain facilities for working with MMIO - functions like "readb()", "readw()" and "ioremap()". Also, if we're speaking of a simulator such as QEMU, we can identify memory regions of interest by walking the page table and checking physical addresses against external devices, and by checking the access type bits - for example, data marked as executable is more likely to be firmware code, while uncached contiguous regions of memory are likely DMA windows.
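
To make the idea concrete, here is a hedged sketch of the kind of interception one could add around an MMIO read accessor (names are made up; a real kernel would hook readl()/readw() themselves or the simulator's memory callbacks):

```c
#include <stdint.h>
#include <stdio.h>

/* Every value the firmware receives from hardware flows through accessors
 * like this one, which makes them natural points for tracing or for
 * injecting fuzzed values. */
static inline uint32_t traced_readl(const volatile void *addr)
{
    uint32_t val = *(const volatile uint32_t *)addr;
    printf("MMIO read %p -> 0x%08x\n", (void *)(uintptr_t)addr, val);
    return val;
}
```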

The ARM mbed TLS library is a good example of a project that tries to integrate as many dynamic and static tools into its testing process as possible. However, it can be built as a userspace library on a desktop, which makes it less interesting in the context of firmware security.

Another technique that has been catching my attention for a lot of time is symbolic execution. In many ways, it is a similar problem to the static analysis - you need to find a set of input values constrained by certain equations to trigger a specific execution path leading to the incorrect state (say, a NULL pointer dereference).

Rumor has it that a tool based on this technique, called SAGE, is actively used at Microsoft Research to analyze internal code, but sadly there are not many open-source, well-documented tools one can easily play with and directly plug into an existing project.

An interesting example of applying this idea to a practical problem at a large and established corporation (Intel) is presented in the paper "Symbolic execution for BIOS security", which tries to use symbolic execution with KLEE to attack firmware - the SMM handler. You can find more interesting details in the blog of one of the paper's authors, Vincent Zimmer (https://twitter.com/vincentzimmer and http://vzimmer.blogspot.com/).

Also, not directly related, but here is an interesting presentation about bug hunting on Android:
http://www.slideshare.net/jiahongfang5/qualcomm2015-jfang-nforest

I guess now I will have to focus on studying recent exploitation techniques used for hacking XNU, Linux and Chromium to have a clearer picture of what needs to be achieved :)

Further developing clang analyzer.

One feature missing from clang is the ability to perform analysis across translation units. A translation unit (TU) in clang/LLVM roughly represents a single file on disk being parsed. The clang analyzer is implemented as a pass which traverses the AST and is limited to one translation unit. Strictly speaking, that is not quite true - it will analyze included headers recursively. But if you have two separate C files which do not include one another, and a function in one file calls a function from the other, the analysis will not work.

Implementing a checker across multiple sources is not a trivial thing, as pointed out by the analyzer authors on the mailing list, though the topic is brought up often by different people.

Often it is possible to come up with a workaround, especially if one aims at implementing ad-hoc checks for their own project. The simplest one is creating a top-level "umbrella" C file which includes all the files with the implementations of the functions; I have seen some projects do exactly this. The most obvious shortcomings of this approach are that it requires redesigning the whole structure of the project and that it will not work if some of the translation units need custom C compiler options.
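
A sketch of what such an umbrella file looks like (the file names are made up):

```c
/* all_sources.c - hypothetical "umbrella" translation unit.  Including the
 * implementation files (not the headers) lets the analyzer see calls that
 * cross file boundaries as if everything were a single TU. */
#include "parser.c"
#include "codegen.c"
#include "main.c"
```

The analyzer is then pointed at this single file instead of the individual sources.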

Another option would be to dump/serialize the AST and any additional information during the compilation of each TU and process it after the whole project is built. It looks like this approach has been proposed multiple times on the mailing list, and there is at least one paper which claims to do just that.

"Towards Vulnerability Discovery Using Extended Compile-time Analysis" - http://arxiv.org/pdf/1508.04627.pdf

In fact, the analyzer part itself might very well be independent of the actual parser, and could be reused, for example, to analyze the data obtained by the disassembly, but it's a topic for another research area.

Developing ad-hoc analyzers for certain projects.

While it is very difficult to statically rule out many kinds of errors in an arbitrary project, either due to state explosion problem or the dynamic nature of the code, quite often we can design tools to verify some contracts specific to a certain project.

An example of such a tool is "sparse" from the Linux kernel. It effectively works as a parser for C code and can be made to run on every C file compiled by GCC while building the kernel. It allows attaching annotations to declarations in the C code, working much like the attributes that were later implemented in GCC and Clang.
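
Roughly, this is how the kernel defines the annotations consumed by sparse (simplified from the kernel's compiler headers); with a regular compiler they expand to nothing:

```c
#ifdef __CHECKER__
# define __user  __attribute__((noderef, address_space(1)))
# define __iomem __attribute__((noderef, address_space(2)))
#else
# define __user
# define __iomem
#endif

/* sparse will warn if this pointer is dereferenced directly or mixed with
 * plain kernel pointers without an explicit copy or cast. */
long read_config(char __user *buf, unsigned long len);
```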

One notable example of code in the Linux kernel that deals with passing void pointers around and relies on pointer trickery via the "container_of" macro is the workqueue library.
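
For reference, here is a simplified version of that macro and the kind of "climb back up from a member pointer" pattern it enables (the structures are made up; the kernel's definition also adds a type check):

```c
#include <stddef.h>

/* Recover a pointer to the enclosing structure from a pointer to one of
 * its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct list_node { struct list_node *next; };

struct work_item {
    int id;
    struct list_node node;
};

/* Callbacks typically receive only &item->node (or a void pointer) and
 * climb back up; nothing stops them from naming the wrong type here. */
static struct work_item *work_from_node(struct list_node *n)
{
    return container_of(n, struct work_item, node);
}
```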

While working on the FreeBSD kernel during GSoC 2014, I faced a similar problem when developing device drivers - in certain places pointers were cast to void and cast back to typed pointers where needed. It is certainly easy to make a mistake when coding these things.

Now, if we dump enough information during compilation, we can implement more advanced checks. For example, when facing a function pointer which is initialized dynamically, we can do two things: first, find all places where it can potentially be initialized; second, find all functions with a matching prototype. While checking all of them might be time-consuming and generate false positives, it would also allow checking more code statically at compilation time.

A notable source of problems when working with C code is that the linking stage is traditionally separated from the compilation stage. The linker usually manipulates abstract "symbols" which are just void pointers. Even though it would be possible to store enough type information in a section of the ELF (in fact, DWARF debugging data contains information about the types) and use it to type-check symbols when linking, this is not usually done.

This leads to certain funky problems. For example, (weak) aliases are a link-time feature. If one defines an alias to some function, and the type of the alias does not match the type of the original function, the compiler cannot check it (well, it could if someone wrote a checker, but it does not), and you get silent memory corruption at runtime. I once ran into this issue when porting the RUMP library, which ships a custom libc, and our C library had a different size for "off_t".
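
A made-up illustration of that kind of off_t-style mismatch (not the actual RUMP code):

```c
#include <stdint.h>

/* The real implementation takes a 64-bit offset... */
int64_t my_seek_impl(int fd, int64_t off, int whence)
{
    (void)fd; (void)whence;
    return off;
}

/* ...but the alias visible to the rest of the program is declared with a
 * 32-bit offset.  The linker only matches symbol names, so nothing
 * complains, and callers silently exchange truncated or garbage values. */
int32_t compat_seek(int fd, int32_t off, int whence)
    __attribute__((weak, alias("my_seek_impl")));
```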

Refactoring

There are two ideas I had floating around for quite a while.

Microkernelizing Linux.

An interesting research problem would be coming up with a tool to automatically convert/rewrite Linux kernel code into a microkernel, with device drivers and other complex pieces of code moved into separate protection domains (processes).

It is interesting for several reasons. One is that a lot of microkernels, such as L4, rely on DDE-Kits to run pieces of Linux and other systems (such as NetBSD in the case of RUMP) to provide drivers. Another is that there is a lot of tested, feature-rich code which could possibly be made more secure by minimizing the impact of memory corruption.

Besides the obvious performance concerns, there are a lot of research questions here:

  • Converting access to global variables into IPC accesses. The most challenging part would be dealing with function pointers and callbacks.
  • Parsing KConfig files and "#ifdef" statements to ensure all conditional compilation cases are covered when refactoring. This in itself is a huge problem for every C codebase - if you change something in one branch of an "#ifdef" statement, you cannot guarantee you didn't break it for another branch. To get whole coverage, it would be useful to come up with some automated way to ensure all configurations of "#ifdef" are built.
  • Deciding which pieces of code should be linked statically and reside in the same process. For example, one might want to make the sound subsystem, DRM and VFS run as separate processes, but going as far as converting each TU into a process would be overkill and a performance disaster.

Translating C code to another language. I am not really sure whether this could be useful. It is likely that any complex code involving pointers and arrays would require a target language with similar capabilities (if we're speaking of generating useful, readable, idiomatic code, and not just a translator such as emscripten). Therefore, the target language might very well have the same areas of unspecified behavior. Some people have proposed a stricter and more well-defined dialect of C.

One may note that it is not necessary to use clang for any of these tasks. In fact, one can get away with writing a custom parser or hacking GCC; these options are perfectly legal. I have had some of these ideas floating around for at least three years, but back then I didn't have the skills to build GCC and hack on it, and now I've mostly been using clang, but it's not a big deal.

Conclusion

Overall, neither improving the compiler nor switching to another language alone will save us from errors. What's often overlooked is the process of continuous integration: regression tests that show what became broken, gathering coverage, and doing integration tests.
There are a lot of non-obvious parameters which are difficult to measure - for example, how to set up a regression test that would detect software breakage caused by a new compiler optimization.

Besides, we could borrow a lot of practices from the HDL/digital design industry. Beyond coverage, an interesting idea is simulating the same design in different languages with different semantics, in the hope that if there are mistakes at the implementation stage, they will not be in the same places, and testing will show where the outputs of the two systems diverge.

P.S.

Ping me if you're interested in working on the topics mentioned above :)

Monday, January 25, 2016

On GPU ISAs and hacking

Recently I've been reading several interesting posts on hacking Windows OpenGL drivers that exploit the "GL_ARB_get_program_binary" extension to access raw GPU instructions.



Reverse engineering is always cool and interesting. However, a lot of this work has already been done - more than once - by people making open-source drivers for Linux, and I believe these projects are a good way to jump-start into GPU architecture without reading huge vendor datasheets.

Since I'm interested in GPU drivers, I decided to put up links to several open-source projects that could be interesting to fellow hackers and serve as a reference, potentially more useful than vendor documentation (which is not even available for some GPUs - you know, the green ones).

Please also take a look at this interesting blog post for the detailed list of GPU datasheets - http://renderingpipeline.com/graphics-literature/low-level-gpu-documentation/

* AMD R600 GPU Instruction Set Manual. It is an uber-useful artifact because many modern GPUs (older AMD desktop GPUs, Qualcomm Adreno GPUs and the Xbox 360 GPU) bear a lot of similarities to the AMD R300/R600 GPU series.
http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf

* Xenia Xbox 360 Emulator. It features a dynamic binary translation module that translates the AMD instruction stream into GLSL shaders at runtime. The Xbox 360 has an AMD Radeon GPU similar to the ones in the R600 series and Adreno.
https://github.com/benvanik/xenia

Let's look at "src/xenia/gpu". Warning: the code in this project seems to change very often and the links may quickly become stale, so take a look yourself.
One interesting file is "register_table.inc", which has register definitions for the MMIO space, allowing you to figure out a lot about how the GPU is configured.
The most interesting file is "src/xenia/gpu/ucode.h", which contains the structures describing the ISA.
"src/xenia/gpu/packet_disassembler.cc" contains the logic to disassemble the actual GPU packet stream at the binary level.
"src/xenia/gpu/shader_translator.cc" contains the logic to translate the GPU bytecode to the IR.
"src/xenia/gpu/glsl_shader_translator.cc" contains the code to translate the Xenia IR to GLSL.
Part of the Xbox->GLSL logic is also contained in the "src/xenia/gpu/gl4/gl4_command_processor.cc" file.

* FreeDreno - the open-source driver for the Qualcomm/ATI Adreno GPUs. These GPUs are based on the same architecture as the ATI R300 and consequently the AMD R600 series, and both the ISAs and the register spaces are quite similar. This driver was done by Rob Clark, a GPU driver engineer with a lot of experience who (unlike most FOSS GPU hackers) had previously worked on another GPU at a vendor and thus had enough experience to successfully reverse-engineer a GPU.

https://github.com/freedreno/mesa

Take a look at "src/gallium/drivers/freedreno/a3xx/fd3_program.c" to get an idea of how the GPU command stream packet is formed for the A3XX series GPUs used in most Android smartphones today. Well, it's the same for most GPUs - there is a ring buffer to which the CPU submits commands and from which the GPU reads them out, but the actual format of the buffer is vendor- and model-specific.

I really love this work and this driver because it was the first open-source driver for mobile ARM SoCs that provided full support for OpenGL ES and non-ES, enough to run GNOME Shell and other window managers with GPU acceleration, and it supports most Qualcomm GPUs.

* Mesa. Mesa is the open-source stack for several compute and graphics APIs on Linux: OpenGL, OpenCL, OpenVG, VAAPI and OpenMAX.

Contrary to popular belief, it is not a "slow", "software-rendering" thing. Mesa has several backends: "Softpipe" is indeed a slow CPU-side implementation of OpenGL, while "llvmpipe" is a faster one using LLVM for vectorization. But the most interesting parts are the drivers for the actual hardware. Currently, Mesa supports the most popular GPUs (with the notable exception of ARM Mali). You can peek into the Mesa source code to get insight both into how to control the 2D/ROP engines of the GPUs and how to upload the actual instructions (the ISA bytecode) to the GPU. Example links are below. It's a holy grail for GPU hacking.

"http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i915/i915_program.c" contains the code to emit I915 instructions.
The code for the latest GPUs - Broadwell - is in the "i965" driver.




While browsing the Mesa code, one will notice the mysterious name "bo". It stands for "Buffer Object" and is an abstraction for a GPU memory region. It is handled by the userspace library called "libdrm" - http://cgit.freedesktop.org/mesa/drm/tree/
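
For a rough idea of what working with a BO looks like from userspace, here is a sketch using the libdrm_intel wrappers (assuming an already-opened DRM file descriptor and a buffer with shader code; this is illustrative, not a complete driver path):

```c
#include <string.h>
#include <intel_bufmgr.h>

void upload_shader(int drm_fd, const void *isa, size_t isa_size)
{
    /* bufmgr wraps the GEM ioctls of the i915 kernel driver */
    drm_intel_bufmgr *bufmgr = drm_intel_bufmgr_gem_init(drm_fd, 4096);

    /* allocate a GPU buffer object and copy the ISA bytecode into it */
    drm_intel_bo *bo = drm_intel_bo_alloc(bufmgr, "shader", isa_size, 4096);
    drm_intel_bo_map(bo, 1 /* writable */);
    memcpy(bo->virtual, isa, isa_size);
    drm_intel_bo_unmap(bo);

    /* ...the driver would then reference this BO from a batch buffer... */
    drm_intel_bo_unreference(bo);
    drm_intel_bufmgr_destroy(bufmgr);
}
```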

Linux Kernel "DRM/KMS" drivers. KMS/DRM is a facility of Linux Kernel to handle GPU buffer management and basic GPU initialization and power management in kernel space. The interesting thing about these drivers is that you can peek at the places where the instructions are actually transferred to the GPU - the registers containing physical addresses of the code.
There are other artifacts. For example, I like this code in the Intel GPU driver (i915) which does runtime patching of GPU instruction stream to relax security requirements.

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_cmd_parser.c

* Beignet - an OpenCL implementation for Intel Ivy-Bridge and later GPUs. Well, it actually works.
http://cgit.freedesktop.org/beignet/
The interesting code to emit the actual command stream is at http://cgit.freedesktop.org/beignet/tree/src/intel/intel_gpgpu.c

* LibVA Intel H.264 encoder driver. Note that the proprietary MFX SDK also uses LibVA internally, albeit a patched version with newer code, which typically gets released and merged into upstream LibVA after it goes through Intel's internal testing.

http://cgit.freedesktop.org/vaapi/intel-driver/tree/
Interesting artifacts are the sources with the encoder pseudo-assembly:

http://cgit.freedesktop.org/vaapi/intel-driver/tree/src/shaders/h264/mc/interpolate_Y_8x8.asm
http://cgit.freedesktop.org/vaapi/intel-driver/tree/src/shaders/h264/mc/AVCMCInter.asm

Btw, If you want zero-copy texture sharing to pass OpenGL rendering directly to the H.264 encoder with either OpenGL ES or non-ES, please refer to my earlier post - http://allsoftwaresucks.blogspot.ru/2014/10/abusing-mesa-by-hooking-elfs-and-ioctl.html and to Weston VAAPI recorder.

* Intel KVM-GT and Xen-GT. These are GPU virtualization solutions from Intel that emulate the PCI GPU in the VM. They expose an MMIO register space fully compatible with the real GPU, so that the guest OS uses a vanilla or barely modified driver. The host contains the code to reinterpret the GPU instruction stream and to set up the host GPU memory space in such a way that it is partitioned equally between the clients (guest VMs).

There are useful papers, and interesting pieces of code.

* Intel GPU Tools
A lot of useful tools. The most useful one IMHO is "intel_perf_counters", which shows compute unit, shader and video encoder utilization. There are other interesting tools, like the register dumper.


It also has the Broadwell GPU assembler.

And the ugly shaders in ugly assembly, similar to those in LibVA.

I once wrote a tool complementary to the register dumper - it uploads a register dump back to the GPU registers. I used it to debug some issues with eDP panel initialization on my laptop.


* LunarGLASS is a collection of experimental GPU middleware. As far as I understand it, it is an ongoing initiative to build prototype implementations of upcoming Khronos standards before trying to get them merged into Mesa, and also to allow the code to be used with any proprietary GPU software stack, not just Mesa. Apparently it also has a "GLSL" backend, which should allow you to try out SPIR-V on most GPU stacks.

The main site - http://lunarg.com/lunar-glass/
The SPIR-V frontend source code - https://github.com/LunarG/LunarGLASS/tree/master/Frontends/SPIRV
The code to convert LunarG IR to Mesa IR - https://github.com/LunarG/LunarGLASS/blob/master/mesa/src/mesa/program/ir_to_mesa.cpp

* Lima driver project. It was a project to reverse-engineer the ARM Mali driver. The authors (Luc Verhaegen aka "libv" and Connor Abbott) decoded part of the initialization routine and the ISA. Unfortunately, the project seems to have stopped due to legal issues. However, it has had a significant impact on open-source GPU drivers, because Connor Abbott went on to implement some of the ideas (namely NIR, the "new IR") in the Mesa driver stack.

Some interesting artifacts:
The place where the Geometry Processor data (attributes and uniforms) are written to the GPU physical memory:
https://github.com/limadriver/lima/blob/master/limare/lib/gp.c#L333
"Render State" - a small structure describing the GPU memory, including the shader program address.
https://github.com/limadriver/lima/blob/master/limare/lib/render_state.h

* Etnaviv driver project. It was a project to reverse-engineer the Vivante GPU used in the Freescale i.MX SoC series, done by Wladimir J. van der Laan. The driver has reached a mature state, successfully running games like Quake, and there is now a version of Mesa with this driver built in.

https://github.com/etnaviv/etna_viv
https://github.com/laanwj/etna_viv
https://github.com/etnaviv/mesa
http://wiki.wandboard.org/index.php/Etna_viv

Monday, January 18, 2016

Abusing QEMU and mmap() or Poor Man's PCI Pass-Through

The problem.

Another tale from the endless firmware endeavours.
The other day I ended up in a weird situation: I was locked in a room with three firmwares - one I had built myself from source, while the other two came from the vendor in binary form. Naturally, the one I built was not fully working, while one of the binaries worked fine. So I set out on a quest to trace all PCI MMIO accesses.

When one thinks about it, several solutions come to mind. The first idea is to run a QEMU/Xen VM and pass the PCI device through. Unfortunately, the board we had did not have an IOMMU (or VT-d in Intel terminology). The board had the Intel BayTrail SoC, which is basically a CPU for phones.
Another option would be to binary-patch the prebuilt firmware, introduce a set of wrapper functions around the MMIO access routines and add corresponding calls to printf() for debugging, but this approach is very tedious and error-prone, and I try to avoid patching binaries because I'm extremely lazy.

So I figured that the peripherals of interest (namely, the I2C controller and a NIC) were simple devices configured by simple register writes. The host did not pass any shared memory to the devices, so I used a simple technique: I patched QEMU and introduced "fake" PCI devices with the VID and PID of the devices I wanted to "pass through". By default, MMIO reads return zero (or some other default value) and writes are ignored. Then I went on and added the real device access. Since I was running QEMU on Linux, I just blacklisted and rmmod'ed the corresponding drivers, used "lspci" to find the addresses of the PCI BARs, and used the mmap() syscall on "/dev/mem" to access the device memory.
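
The actual code is in the commit linked below; the core of the idea is roughly this (a sketch with a made-up BAR address, not the one from the real board):

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a PCI BAR (physical address and size taken from "lspci -v") through
 * /dev/mem and forward 32-bit register accesses from the fake QEMU device. */
#define BAR_PHYS 0x90000000UL   /* example address */
#define BAR_SIZE 0x1000UL

static volatile uint32_t *bar;

int bar_init(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;
    bar = mmap(NULL, BAR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, BAR_PHYS);
    close(fd);
    return bar == MAP_FAILED ? -1 : 0;
}

uint32_t bar_read32(uint32_t off)            { return bar[off / 4]; }
void bar_write32(uint32_t off, uint32_t val) { bar[off / 4] = val; }
```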

A picture is worth a thousand words


The code.

The code is quite funky, but thanks to a combination of factors (x86 being a strongly ordered architecture and the code using only 4-byte transfers), the naive implementation worked fine:
1. I can now trace all register reads and writes.
2. I have verified that the "good" firmware still boots and everything works fine.
So, after that I simply ran all three firmwares and compared the register traces, only to find out that the problems were probably unrelated to the driver code (which is a good result, because it narrowed down the amount of code to explore).

The branch with my customized QEMU is here. Also, I found that QEMU now has the ultra-useful "readconfig" option, which allows you to supply the board configuration in a text file and avoid messing with cryptic command-line switches for enabling PCI devices or HDDs.
https://github.com/astarasikov/qemu/tree/our_board
https://github.com/astarasikov/qemu/commit/2e60cad67c22dca750852346916d6da3eb6674e7

The code for mmapping the PCI memory is here. It is a marvel of YOLO-based design, but hey, X11 does the same, so it is now industry standard and unix-way :)
https://github.com/astarasikov/qemu/commit/2e60cad67c22dca750852346916d6da3eb6674e7#diff-febf7a335ad7cd658784899e875b5cc7R31

Limitations.

There is one obvious limitation, which is also the reason QEMU generally does not support pass-through of MMIO devices on systems without an IOMMU (like VT-d or SMMU).

Imagine a case where the device is not only configured via simple register writes (such as setting certain bits to enable IRQ masking or clock gating), but system memory is also passed to the device by storing a pointer in a register.

The problem is that the address coming from the guest VM (GPA, guest physical address) will not usually match the corresponding HPA (host physical address). The device, however, operates on bus addresses and is not aware of the MMU, let alone VMs. An IOMMU solves this problem by translating the addresses the device uses: the hypervisor supplies a custom page table, so the GPAs that the guest programs into the device get translated into the correct HPAs.

So if one wanted to provide a pass-through solution for sharing memory on systems without an IOMMU, one would either need to ensure that the guest's GPAs map 1:1 to the HPAs for the addresses used by the device, or come up with an elaborate scheme which detects register writes containing pointers to memory buffers and sets up corresponding buffers in the host memory space - but then one would need to keep the host and guest buffers consistent. Besides, that would require knowledge of the device registers, so it would be non-trivial to build a completely generic, plug-and-play solution.

Sunday, November 8, 2015

Porting legacy RTOS software to a custom OS.

So lately I've been working on porting a large application from a well-known RTOS to a custom kernel. The RTOS is VxWorks and the application is a router firmware. Unfortunately, I will not be able to share any code from the porting layer; I'll just briefly go over the problems I've encountered and some thoughts I came up with in the process.

The VxWorks compiler (called "diab") is, like many other proprietary C compilers, based on the Edison Design Group frontend. Unlike GCC or clang, by default it is quite lax when it comes to following the standards. Besides, the original firmware project uses the VxWorks Workbench IDE (which is of course a fork of Eclipse) and builds on Windows.

The first thing I had to do was convert the project build system to Linux. The desirable way would be to write a parser that goes over the original build system and produces a Makefile; the advantage of this approach is the ability to instantly obtain a new Makefile whenever someone changes the original project. However, during the prototyping stage it makes a lot of sense to take shortcuts wherever possible, so I ended up writing a tiny wrapper program to replace the compiler and linker binaries. It intercepts its arguments, writes them out to a log file and acts as a proxy launching the original program. After that, it was just a matter of making minor changes to the log file to convert it to a bash script and later on to a Makefile.
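
The wrapper is conceptually trivial - something along these lines (the paths and file names are made up; the real tool only needs to log its arguments and exec the original binary):

```c
#include <stdio.h>
#include <unistd.h>

/* Installed in place of the compiler/linker binary: append the command line
 * to a log, then hand control over to the real tool. */
int main(int argc, char **argv)
{
    FILE *log = fopen("/tmp/build-commands.log", "a");
    if (log) {
        for (int i = 0; i < argc; i++)
            fprintf(log, "%s ", argv[i]);
        fprintf(log, "\n");
        fclose(log);
    }

    /* assumed location of the renamed original binary */
    execv("/opt/diab/bin/dcc.orig", argv);
    return 127; /* only reached if exec failed */
}
```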

As a consequence of the development being done on Windows, the majority of files have the same set of defects preventing them from being compiled by a Linux-based compiler: paths in "#include" directives often have the wrong case, backslashes and so on. Worse still, some of the "#include" directives included multiple files in one set of brackets. Luckily, this was easy to fix with a script that parsed the compilation log for "file not found" errors, looked for the corresponding file ignoring the case and fixed up the source code. In the end I was left with about a dozen places that had to be fixed manually.

Implementing the compatibility layer.

I did some quick research into the available options and found that the only up-to-date solution implementing the VxWorks API on other OSes is Xenomai. However, it is quite intrusive, because it relies on a loadable kernel module and some patches to the Linux kernel. Since we were not interested in realtime behaviour but wanted to run on both our OS and Linux, and entirely in userspace, I decided to write yet another VxWorks emulation layer.
The original firmware comes as a single ELF file, which is reasonable because in VxWorks all processes are implemented as threads in a shared address space. Besides, VxWorks provides a POSIX-compatible API for developers, so in order to identify which functions needed implementation it was enough to try linking the compiled code into a single executable.

"One weird trick" useful for creating porting layers and DDEKits is the GCC/clang option "include" which allows you to prepend an include to absolutely all files compiled. This is super useful. You can use such an uber-header to place the definitions of the data types and function prototypes for the target platform. Besides, you can use it to hook library calls at compile-time.

One of the problems that took a great amount of time was implementing synchronization primitives, namely mutexes and semaphores. The tricky semantic difference between semaphores and mutexes in VxWorks is that the latter are recursive: once a thread has acquired a mutex, it is allowed to lock it any number of times, as long as the lock/unlock count is balanced.
Before I realized this semantic difference, I couldn't figure out why the software would always lock up, while disabling locking altogether led to totally random crashes.

Ultimately I became frustrated and ended up with a simple implementation of a recursive mutex that allowed me to move much further (Simple Recursive Mutex Implementation @ github). Later, for debugging purposes, I also added options to print a backtrace indicating the previous lock owner when trying to enter the critical section, or when the spinlock took too many attempts.
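
The required semantics can be illustrated with pthreads; this is just a sketch of the behavior, not the implementation linked above:

```c
#include <pthread.h>

/* Sketch of a VxWorks-style recursive mutex on top of pthreads: the same
 * thread may take the lock repeatedly as long as takes and gives balance. */
typedef struct {
    pthread_mutex_t lock;
} vx_mutex_t;

static void vx_mutex_init(vx_mutex_t *m)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&m->lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

static void vx_mutex_take(vx_mutex_t *m) { pthread_mutex_lock(&m->lock); }
static void vx_mutex_give(vx_mutex_t *m) { pthread_mutex_unlock(&m->lock); }
```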

Hunting for code defects

Uninitialized variables and what can we do about them

One problem I came across was that the code had a lot of uninitialized variables - hundreds of them. On the one hand, the proper solution is to inspect each case manually and write the correct initializer. On the other hand, the code works when compiled with the diab compiler, so that compiler must zero-initialize them.
So I went ahead and wrote a clang rewriter plugin to add the initializers: zero for the primitive types and a pair of curly braces for structs (clang rewriter to add initializers). However, I realized that the biggest problem is that some functions use the convention of returning zero to indicate failure, while others return a non-zero value. This means we cannot have a generic safe initializer that would make the code take the "fault" path when it reaches the rewritten code. An alternative to manual inspection could be writing sophisticated rules for the rewriter to detect the convention used.
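
The transformation itself is mechanical - roughly this kind of before/after (the struct is made up; the empty braces follow the convention described above):

```c
struct config { int timeout; char name[16]; };

void before(void)
{
    int ret;            /* uninitialized */
    struct config cfg;  /* uninitialized */
    (void)ret; (void)cfg;
}

void after(void)
{
    int ret = 0;            /* primitive type: zero initializer */
    struct config cfg = {}; /* aggregate: a pair of curly braces */
    (void)ret; (void)cfg;
}
```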

I ended up using valgrind and manually patching some of the warnings. AddressSanitizer was also useful. However, fixing each warning and creating a blacklist is too tiresome, so I ended up setting a breakpoint on the "__asan_report_error" function and writing a script that makes gdb print a backtrace, return and continue execution.

Duplicate structures

One problem I suspected could be present in the code (due to the deep hierarchy of #ifdefs) was the presence of structures with the same name but different contents. I made up an example C program to demonstrate the effect where the compiler does not warn about the type mismatch, but at runtime the code silently corrupts memory.
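
My reconstruction of the kind of program described (not the author's original example): two translation units define a struct with the same tag but different layouts, the program links without complaint, and one side writes past the other side's object.

```c
/* a.c - built with one set of #ifdefs */
struct config { int id; };

extern void set_name(struct config *c);

int check(void)
{
    struct config c = { 1 };
    set_name(&c);       /* writes past the 4-byte object defined here */
    return c.id;
}

/* b.c - built with a different set of #ifdefs: same tag, bigger layout */
struct config { int id; char name[64]; };

void set_name(struct config *c)
{
    c->name[0] = 'x';   /* out of bounds from a.c's point of view */
}
```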


I figured out an easy way of dealing with this problem: I used clang to emit LLVM bitcode for each file instead of object files with machine code, then linked the bitcode files together into a single file and disassembled it with llvm-dis.

The nice thing about LLVM is that when it finds two structures having the same name but declared differently, it appends a different numeric suffix to the struct name. Then one can just remove the suffixes and look for unique lines with different structure declarations.

Luckily for me, there was only one place where I suspected an incorrect definition, and it was not in the part of the code I was executing, so I ruled this out as a source of the incorrect behavior.

Further work

Improving tooling for the 32-bit world

It is quite unfortunate, but the project I have been working on is 32-bit, and one cannot simply convert it into a 64-bit one with a compiler flag. The problem is that the code has quite a lot of places where pointers are for some reason stored in packed structures, and some other structures implicitly rely on the structure layout, so it is very difficult to modify this fragile mess.
It is sad that two great tools, MemorySanitizer and ThreadSanitizer, are not available for 32-bit applications. It is understandable, because the 32-bit address space is too tiny to fit the shadow memory, but I am thinking of ways to make them available for the 32-bit world. So far I see two ways of solving the problem.
First, we could use the fragile and non-portable mmap flag (currently only supported on Linux) to force allocations below the 4 GiB limit. Then one could write an ELF loader that loads all the code below 4 GiB and uses the upper memory range for the shadow. Besides being non-portable, the other disadvantages of this approach include having to deal with 64-bit identifiers such as file handles.
Alternatively, we could store the shadow in a separate process and use shared memory or sockets for communication. That would likely be at least an order of magnitude slower than using the corresponding sanitizer in a 64-bit world, but probably still faster than valgrind, and besides, it is compile-time instrumentation with more internal data from the compiler.

Verifying the porting was done correctly

Now I am left with one challenging task: verifying that the ported software is identical to what would be built by the original build system.

One may notice that simply intercepting the calls to the compiler may not be enough, because the build system may copy files or export some shell variables during the build process. Besides, different compilers have different ways of handling "inline" and some other directives. It would be good to verify that the call graph of the original binary and that of the one produced by our Makefile are similar (of course, we will need to mark some library functions as leaf nodes and not analyze them). For a start I could try inspecting some of the unresolved symbols manually, but I'm thinking of automating the process. I think for this task I'll need a disassembler that can identify basic blocks; probably the Capstone engine will do the job. Any ideas on that?

P.S. Oh, and I once tried visualizing the dependency graph of the separate ".o" files (before I realized I could just link them all together and get the list of missing symbols), and trust me, those graphs grow really fast. I have found that a tool called "gephi" does a decent job of visualizing really huge graphs and supports Graphviz's dot as the input format.

EDIT 2015-11-10
The challenge is complicated by the fact that some subprojects have multiple copies (with various changes), and one should also ensure that the headers and sources are picked up from the correct copy. However, I've found an easy and acceptable solution. I wrote a tool that parses the call graph generated by IDA, and for every edge in the graph it looks up the function names of the corresponding vertices. It then prints a pair "A -> B" for every function A calling function B. After that, we can sort the file alphabetically and remove the nodes that are uninteresting to us (the OS and library functions). Next, we can compare the files side by side with a tool like kdiff3 (or do it automatically). Whenever there is a significant difference (for example, 5 callees differ for the same caller), we can inspect it manually and verify we're compiling the correct file with the correct options. Using this method I have identified several places where we chose the wrong object file for linking, and now we're only concerned with the porting layer and the OS kernel, without having to worry about the application itself.

Wednesday, May 20, 2015

on making a DDE kit, kinda

So I have this task of porting a huge piece of software running on a proprietary OS to another OS. And I don't even have a clue how to compile it (well, I do, but it builds on Windows, so that's almost irrelevant).

But luckily all the code is linked into a single ELF file, and the compilation produces intermediate object files. The first thought I had was to visualize the dependency graph of the object files to find out who calls what. You can find the script below; it recursively walks the supplied directory and tries to parse the import/export tables with objdump. There are some areas for improvement (for example, also parsing the dynsym table with -T, or parsing .a archives), but it did the job for me.

Unfortunately I realized that visualizing the graph with 30K edges was not even remotely a smart idea
https://gist.github.com/astarasikov/355bf825f130fe4b5633

What I also found was that the OS-specific code and objects were stored in a separate location (since they were part of an SDK). Even if they were not, we could just remove the object files that were present both in our project and in the SDK. After that, all the functions that the application required from the OS unsurprisingly ended up in the "UNDEFINED" node, and there were only 200 of them, which gives me some hope.

This approach can also be used for other use cases - for example, porting drivers from Linux/FreeBSD to exotic platforms: first build the binaries, then pick as many of them as you can to minimize the list of required functions. I find dealing with compiled code easier because build systems, C macros and ifdefs just drive me insane.

Friday, May 15, 2015

GCOV is amazing yet undocumented

One useful technique for maintaining software quality is code coverage. While routinely used by high-level developers, it is completely forgotten by many C hackers, especially when it comes to the kernel. In fact, Linux is the only kernel which supports being compiled with the GCOV coverage tool.



GCOV works by instrumenting your code: it inserts code to increment stats counters around each basic block. These counters reside in a special section of your ELF files. During compilation, GCC generates a ".gcno" file for each ".c" file. These files allow the "gcov" tool to look up function names and other info using the integer IDs (which are specific to each file).

At runtime, an executable built with GCOV produces a file called ".gcda" which contains the values of the counters. During the executable's initialization, the constructors (which are function pointers in the ".ctors" section) are called. One of them is "__gcov_init", which registers a given ".o" file inside libgcov.

Since we're running in a kernel or on "bare metal", we have neither libgcov nor a file system to dump the ".gcda" files to. But one should not fear GCOV! Adding it to your kernel is just a matter of adding one C file (which is mostly shamelessly copy-pasted from the Linux kernel and gcc sources) and a couple of CFLAGS. In my example I'm using the LK kernel by Travis Geiselbrecht (https://github.com/travisg/lk). I've decided to just print the contents of the ".gcda" files to the serial console (UART) as a hex dump and then use an AWK script and the "xxd" tool to convert them back to binaries on the host. This works completely fine since these files are typically below 2KB in size.

An important thing to note: if your kernel does not contain the ".ctors" section and does not call the constructors, be sure to add them to the linker script and add some code to invoke them. For example, here's how LK does that: https://github.com/travisg/lk/blob/master/top/main.c#L42
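
Roughly, the invocation boils down to walking an array of function pointers placed by the linker script around the ".ctors" section (this mirrors what the linked LK code does):

```c
/* symbols provided by the linker script around the .ctors section */
extern void (*__ctor_list)(void);
extern void (*__ctor_end)(void);

static void call_constructors(void)
{
    /* each entry is a constructor; __gcov_init() runs via one of these */
    for (void (**ctor)(void) = &__ctor_list; ctor != &__ctor_end; ctor++)
        (*ctor)();
}
```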

You can see the whole patch below.
https://github.com/astarasikov/lk/commit/2a4af09a894194dfaff3e05f6fd505241d54d074

After running the "gcovr" tool you get a nice HTML report with a summary, where you can see which lines were executed and which were not, and add tests for the latter. Woot!