Monday, January 25, 2016

On GPU ISAs and hacking

Recently I've been reading several interesting posts on hacking Windows OpenGL drivers to exploit the "GL_ARB_get_program_binary" extension to access raw GPU instructions.



Reverse engineering is always cool and interesting. However, a lot of this work has already been done more than once by people building open-source drivers for Linux, and I believe these projects are a good way to jump-start into GPU architecture without reading huge vendor datasheets.

Since I'm interested in GPU drivers, I decided to put up links to several open-source projects that could be interesting to fellow hackers and serve as a reference, potentially more useful than vendor documentation (which is not even available for some GPUs; you know, the green ones).

Please also take a look at this interesting blog post for the detailed list of GPU datasheets - http://renderingpipeline.com/graphics-literature/low-level-gpu-documentation/

* AMD R600 GPU Instruction Set Manual. It is an uber-useful artifact because many modern GPUs (older AMD desktop GPUs, Qualcomm Adreno GPUs and the Xbox 360 GPU) bear a lot of similarities to the AMD R300/R600 series.
http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf

* Xenia Xbox 360 Emulator. It features a Dynamic Binary Translation module that translates the AMD instruction stream into GLSL shaders at runtime. The Xbox 360 has an AMD Radeon GPU similar to the R600 series and Adreno.
https://github.com/benvanik/xenia

Let's look at "src/xenia/gpu". Warning: the code in this project seems to change very often and the links may quickly become stale, so take a look yourself.
One interesting file is "register_table.inc", which has register definitions for the MMIO space, allowing you to figure out a lot about how the GPU is configured.
The most interesting file is "src/xenia/gpu/ucode.h", which contains the structures describing the ISA.
"src/xenia/gpu/packet_disassembler.cc" contains the logic to disassemble the actual GPU packet stream at the binary level.
"src/xenia/gpu/shader_translator.cc" contains the logic to translate the GPU bytecode to the IR.
"src/xenia/gpu/glsl_shader_translator.cc" contains the code to translate the Xenia IR to GLSL.
Part of the Xbox->GLSL logic is also contained in the "src/xenia/gpu/gl4/gl4_command_processor.cc" file.

* FreeDreno - the open-source driver for the Qualcomm/ATI Adreno GPUs. These GPUs are based on the same architecture as the ATI R300 and, consequently, the AMD R600 series, and both the ISAs and register spaces are quite similar. This driver was written by Rob Clark, a GPU driver engineer who (unlike most FOSS GPU hackers) had previously worked on another GPU at a vendor and thus had enough experience to successfully reverse-engineer the hardware.

https://github.com/freedreno/mesa

Take a look at "src/gallium/drivers/freedreno/a3xx/fd3_program.c" to get an idea of how the GPU command stream packet is formed on the A3XX series GPUs used in most Android smartphones today. Well, it's the same for most GPUs - there is a ring buffer to which the CPU submits commands and from which the GPU reads them out, but the actual format of the buffer is vendor- and model-specific.
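To make the ring-buffer idea more concrete, here is a minimal sketch of how a CPU-side driver typically submits commands; the register offset and packet layout below are made up for illustration and do not match Adreno (or any other real) hardware:

#include <stdint.h>

/* Hypothetical MMIO register, for illustration only. */
#define RB_WPTR_REG 0x0040   /* write-pointer ("doorbell") register offset */

struct ring_buffer {
    volatile uint32_t *mmio;  /* mapped GPU register space */
    uint32_t *cmds;           /* CPU-visible ring memory shared with the GPU */
    uint32_t wptr;            /* next free dword index */
    uint32_t size_dw;         /* ring size in dwords */
};

/* Append one command dword and tell the GPU where the new end of the ring
 * is.  Real drivers batch many dwords before bumping the write pointer and
 * also check the GPU's read pointer to avoid overwriting unread commands. */
static void ring_emit(struct ring_buffer *rb, uint32_t dword)
{
    rb->cmds[rb->wptr] = dword;
    rb->wptr = (rb->wptr + 1) % rb->size_dw;
    rb->mmio[RB_WPTR_REG / 4] = rb->wptr;   /* GPU starts fetching from here */
}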

I really love this work and this driver because it was the first open-source driver for mobile ARM SoCs with enough support for both OpenGL ES and desktop OpenGL to run GNOME Shell and other window managers with GPU acceleration, and it supports most Qualcomm GPUs.

* Mesa. Mesa is the open-source stack implementing several compute and graphics APIs for Linux: OpenGL, OpenCL, OpenVG, VAAPI and OpenMAX.

Contrary to popular belief, it is not a "slow" and "software-rendering" thing. Mesa has several backends. "Softpipe" is indeed a slow, CPU-side implementation of OpenGL; "llvmpipe" is a faster one that uses LLVM for vectorization. But the most interesting parts are the drivers for the actual hardware. Currently, Mesa supports most popular GPUs (with the notable exception of ARM Mali). You can peek into the Mesa source code to get insight into how to control both the 2D/ROP engines of the GPUs and how to upload the actual instructions (the ISA bytecode) to the GPU. Example links are below. It's a holy grail for GPU hacking.

"http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i915/i915_program.c" contains the code to emit I915 instructions.
The code for the latest GPUs - Broadwell - is in the "i965" driver.




While browsing the Mesa code, one will notice the mysterious "bo" name. It stands for "Buffer Object" and is an abstraction for a GPU memory region. Buffer objects are handled by the userspace library called "libdrm" - http://cgit.freedesktop.org/mesa/drm/tree/
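To give a feeling for what a "bo" looks like from userspace, here is a small sketch using the Intel flavour of libdrm (the intel_bufmgr API); error handling is omitted and the device node path may differ on your system:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <intel_bufmgr.h>   /* from libdrm (pkg-config libdrm_intel) */

int main(void)
{
    int fd = open("/dev/dri/card0", O_RDWR);

    /* the buffer manager tracks all buffer objects for this DRM fd */
    drm_intel_bufmgr *bufmgr = drm_intel_bufmgr_gem_init(fd, 4096);

    /* allocate a 4 KiB buffer object in GPU memory */
    drm_intel_bo *bo = drm_intel_bo_alloc(bufmgr, "scratch", 4096, 4096);

    /* map it into the CPU address space and fill it */
    drm_intel_bo_map(bo, 1 /* writable */);
    memset(bo->virtual, 0, 4096);
    drm_intel_bo_unmap(bo);

    drm_intel_bo_unreference(bo);
    drm_intel_bufmgr_destroy(bufmgr);
    close(fd);
    return 0;
}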

Linux Kernel "DRM/KMS" drivers. KMS/DRM is a facility of Linux Kernel to handle GPU buffer management and basic GPU initialization and power management in kernel space. The interesting thing about these drivers is that you can peek at the places where the instructions are actually transferred to the GPU - the registers containing physical addresses of the code.
There are other artifacts as well. For example, I like this code in the Intel GPU driver (i915), which does runtime patching of the GPU instruction stream to relax security requirements.

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_cmd_parser.c

* Beignet - an OpenCL implementation for Intel Ivy Bridge and later GPUs. Well, it actually works.
http://cgit.freedesktop.org/beignet/
The interesting code to emit the actual command stream is at http://cgit.freedesktop.org/beignet/tree/src/intel/intel_gpgpu.c

* LibVA Intel H.264 Encoder driver. Note that the proprietary MFX SDK also uses LibVA internally, albeit a patched version with newer code, which typically gets released and merged into upstream LibVA after it goes through Intel's internal testing.

http://cgit.freedesktop.org/vaapi/intel-driver/tree/
An interesting artifact is the set of sources with the encoder pseudo-assembly.

http://cgit.freedesktop.org/vaapi/intel-driver/tree/src/shaders/h264/mc/interpolate_Y_8x8.asm
http://cgit.freedesktop.org/vaapi/intel-driver/tree/src/shaders/h264/mc/AVCMCInter.asm

By the way, if you want zero-copy texture sharing to pass OpenGL rendering directly to the H.264 encoder with either OpenGL ES or desktop OpenGL, please refer to my earlier post - http://allsoftwaresucks.blogspot.ru/2014/10/abusing-mesa-by-hooking-elfs-and-ioctl.html - and to the Weston VAAPI recorder.

* Intel KVM-GT and Xen-GT. These are GPU virtualization solutions from Intel that emulate the PCI GPU in the VM. They expose an MMIO register space fully compatible with the real GPU, so the guest OS can use a vanilla or barely modified driver. The host contains the code to reinterpret the GPU instruction stream and to set up the host GPU memory space in such a way that it is partitioned equally between the clients (guest VMs).

There are useful papers and interesting pieces of code there.

* Intel GPU Tools
A lot of useful tools. The most useful one IMHO is "intel_perf_counters", which shows compute unit, shader and video encoder utilization. There are other interesting tools like the register dumper.


It also has the Broadwell GPU assembler.

And the ugly shaders in ugly assembly, similar to those in LibVA.

I once wrote a tool complementary to the register dumper - one to upload a register dump back to the GPU registers. I used it to debug some issues with the eDP panel initialization in my laptop.


* LunarGLASS is a collection of experimental GPU middleware. As far as I understand, it is an ongoing initiative to build prototype implementations of upcoming Khronos standards before trying to get them merged into Mesa, and also to allow the code to be used with any proprietary GPU SW stack, not tied to Mesa. Apparently it also has a "GLSL" backend, which should allow you to try out SPIR-V on most GPU stacks.

The main site - http://lunarg.com/lunar-glass/
The SPIR-V frontend source code - https://github.com/LunarG/LunarGLASS/tree/master/Frontends/SPIRV
The code to convert LunarG IR to Mesa IR - https://github.com/LunarG/LunarGLASS/blob/master/mesa/src/mesa/program/ir_to_mesa.cpp

* Lima driver project. It was a project for reverse-engineering the ARM Mali driver. The authors (Luc Verhaegen aka "libv" and Connor Abbott) decoded part of the initialization routine and the ISA. Unfortunately, the project seems to have stopped due to legal issues. However, it has had a significant impact on open-source GPU drivers, because Connor Abbott went on to implement some of its ideas (namely NIR, the "new IR") in the Mesa driver stack.

Some interesting artifacts:
The place where the Geometry Processor data (attributes and uniforms) are written to the GPU physical memory:
https://github.com/limadriver/lima/blob/master/limare/lib/gp.c#L333
"Render State" - a small structure describing the GPU memory, including the shader program address.
https://github.com/limadriver/lima/blob/master/limare/lib/render_state.h

* Etnaviv driver project. It was a project to reverse-engineer the Vivante GPU used by the Freescale i.MX SoC series, done by Wladimir J. van der Laan. The driver has reached a mature state, successfully running games like Quake, and there is now a version of Mesa with this driver built in.

https://github.com/etnaviv/etna_viv
https://github.com/laanwj/etna_viv
https://github.com/etnaviv/mesa
http://wiki.wandboard.org/index.php/Etna_viv

Monday, January 18, 2016

Abusing QEMU and mmap() or Poor Man's PCI Pass-Through

The problem.

Another tale from the endless firmware endeavours.
The other day I ended up in a weird situation. I was locked in a room with three firmwares - one of them I had built myself from source, while the other two came in binary form from the vendor. Naturally, the one I built was not fully working, while one of the binaries worked fine. So I set out on a quest to trace all PCI MMIO accesses.

When one thinks about it, several solutions come to mind. The first idea is to run a QEMU/Xen VM and pass through the PCI device. Unfortunately, the board we had did not have an IOMMU (or VT-d in Intel terminology). The board had the Intel BayTrail SoC, which is basically a CPU for phones.
Another option would be to binary-patch the prebuilt firmware, introduce a set of wrapper functions around the MMIO access routines and add corresponding calls to printf() for debugging, but this approach is tedious and error-prone, and I try to avoid patching binaries because I'm extremely lazy.

So I figured the peripherals of interest (namely, the I2C controller and a NIC) were simple devices configured by simple register writes. The host did not pass any shared memory to the device, so I used a simple technique: I patched QEMU and introduced "fake" PCI devices with the VID and PID of the devices I wanted to "pass through". By default, MMIO reads return zero (or some other default value) while writes are ignored. Then I went on and added the real device access. Since I was running QEMU on Linux, I just blacklisted and rmmod'ed the corresponding drivers. Then I used "lspci" to find the addresses of the PCI BARs and the mmap() syscall on "/dev/mem" to access the device memory.
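The host side of the trick boils down to something like the following sketch; the BAR address and size are hard-coded here for illustration, while the real code takes them from lspci:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR_PHYS 0x90000000UL   /* physical BAR address, taken from lspci */
#define BAR_SIZE 0x1000

static volatile uint32_t *bar;

static int bar_map(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;
    bar = mmap(NULL, BAR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, BAR_PHYS);
    close(fd);
    return bar == MAP_FAILED ? -1 : 0;
}

/* every MMIO access from the guest is forwarded to the real device and logged */
static uint32_t bar_read32(uint32_t off)
{
    uint32_t val = bar[off / 4];
    fprintf(stderr, "MMIO RD %08x -> %08x\n", off, val);
    return val;
}

static void bar_write32(uint32_t off, uint32_t val)
{
    fprintf(stderr, "MMIO WR %08x <- %08x\n", off, val);
    bar[off / 4] = val;
}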

A picture is worth a thousand words


The code.

The code is quite funky, but thanks to the combination of several factors (x86 being a strongly ordered architecture and the code using only 4-byte transfers) the naive implementation worked fine.
1. I can now trace all register reads and writes.
2. I have verified that the "good" firmware still boots and everything works fine.
So, after that I simply ran all three firmwares and compared the register traces, only to find out that the problems might be unrelated to the driver code (which is a good result because it allowed me to narrow down the amount of code to explore).

The branch with my customized QEMU is here. Also, I discovered that QEMU now has an ultra-useful option called "-readconfig", which allows you to supply the board configuration in a text file and avoid messing with cryptic command line switches for enabling PCI devices or HDDs.
https://github.com/astarasikov/qemu/tree/our_board
https://github.com/astarasikov/qemu/commit/2e60cad67c22dca750852346916d6da3eb6674e7

The code for mmapping the PCI memory is here. It is a marvel of YOLO-based design, but hey, X11 does the same, so it is now an industry standard and the unix way :)
https://github.com/astarasikov/qemu/commit/2e60cad67c22dca750852346916d6da3eb6674e7#diff-febf7a335ad7cd658784899e875b5cc7R31

Limitations.

There is one obvious limitation, which is also the reason QEMU generally does not support pass-through of MMIO devices on systems without an IOMMU (like VT-d or SMMU).

Imagine a case where the device is not only configured via simple register writes (such as setting certain bits to enable IRQ masking or clock gating), but system memory is also passed to the device by storing a pointer in a register.

The problem is that the address coming from the guest VM (GPA, guest physical address) will not usually match the corresponding HPA (host physical address). The device, however, operates on bus addresses and is not aware of the MMU, let alone VMs. An IOMMU solves this problem by translating the addresses coming from the device, so the hypervisor can supply a custom page table that maps the GPAs the guest programs into the device onto the corresponding HPAs.

So if one wanted to provide a pass-through solution for sharing memory on systems without an IOMMU, one would either need to ensure that the guest has 1:1 GPA/HPA mappings for the addresses used by the device, or come up with an elaborate scheme which detects register writes containing pointers to memory buffers and sets up a corresponding buffer in the host memory space - but then one would need to deal with keeping the host and guest buffers consistent. Besides, that would require knowledge of the device registers, so it would be non-trivial to build a completely generic, plug-and-play solution.

Sunday, November 8, 2015

Porting legacy RTOS software to a custom OS.

So lately I've been working on porting a large application from a well-known RTOS to a custom kernel. The RTOS is VxWorks and the application is a router firmware. Unfortunately I will not be able to share any code of the porting layer; I'll just briefly go over the problems I've encountered and some thoughts I have come up with in the process.

The VxWorks compiler (called "diab") is, like many other proprietary C compilers, based on the Edison Design Group frontend. Unlike GCC or Clang, by default it is quite lax when it comes to following the standards. Besides, the original firmware project uses the VxWorks Workbench IDE (which is of course a fork of Eclipse) and builds on Windows.

The first thing I had to do was convert the project build system to Linux. The ideal way would be to write a parser that would go over the original build system and produce a Makefile. The advantage of this approach would be the ability to instantly obtain a new Makefile whenever someone changes the original project. However, during the prototyping stage it makes a lot of sense to take shortcuts wherever possible, so I ended up writing a tiny wrapper program to replace the compiler and linker binaries. It would intercept its arguments, write them out to a log file and act as a proxy launching the original program. After that it was just a matter of making minor changes to the log file to convert it into a bash script and later into a Makefile.
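Such a wrapper can be as small as the sketch below; the paths and log location are made up for illustration, and the idea is that the original compiler binary gets renamed while the wrapper takes its place:

#include <stdio.h>
#include <unistd.h>

#define REAL_CC  "/opt/diab/bin/dcc.real"   /* the renamed original compiler */
#define LOG_FILE "/tmp/build.log"

int main(int argc, char **argv)
{
    FILE *f = fopen(LOG_FILE, "a");
    if (f) {
        for (int i = 0; i < argc; i++)
            fprintf(f, "%s ", argv[i]);
        fprintf(f, "\n");
        fclose(f);
    }
    argv[0] = REAL_CC;      /* forward all arguments to the real compiler */
    execv(REAL_CC, argv);
    return 127;             /* reached only if execv() failed */
}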

As an effect of the development being done on Windows, the majority of files had the same set of defects that prevented them from being compiled by a Linux-based compiler: paths in "#include" directives often have incorrect case, backslashes and so on. Worse, some of the "#include" directives included multiple files in one set of brackets. Luckily, this was easy to fix with a script that parsed the compilation log for "file not found" errors, looked for the corresponding file ignoring the case and fixed up the source code. In the end I was left with about a dozen places that had to be fixed manually.

Implementing the compatibility layer.

I did some quick research into the available options and saw that the only up-to-date solution implementing the VxWorks API on other OSes is "Xenomai". However, it is quite intrusive because it relies on a loadable kernel module and some patches to the Linux kernel to function. Since we were not interested in real-time behaviour but wanted to run on both our OS and Linux, entirely in userspace, I decided to write yet another VxWorks emulation layer.
The original firmware comes as a single ELF file, which is reasonable because in VxWorks all processes are implemented as threads in a shared address space. Besides, VxWorks provides a POSIX-compatible API for developers. So in order to identify which functions needed implementation, it was enough to try linking the compiled code into a single executable.

"One weird trick" useful for creating porting layers and DDEKits is the GCC/clang option "include" which allows you to prepend an include to absolutely all files compiled. This is super useful. You can use such an uber-header to place the definitions of the data types and function prototypes for the target platform. Besides, you can use it to hook library calls at compile-time.

One of the problems that took a great amount of time was implementing synchronization primitives, namely mutexes and semaphores. The tricky semantic difference between semaphores and mutexes in VxWorks is that the latter are recursive: once a thread has acquired a mutex, it is allowed to lock it any number of times as long as the lock/unlock count is balanced.
Before I realized this semantic difference, I couldn't figure out why the software would always lock up, while disabling locking altogether led to totally random crashes.

Ultimately I became frustrated and ended up with a simple implementation of a recursive mutex that allowed me to move much further (Simple Recursive Mutex Implementation @ github). Later, for debugging purposes, I also added the option to print a backtrace indicating the previous lock owner when trying to enter the critical section, or when the spinlock took too many attempts.
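The core of such a recursive mutex is tiny. Here is a minimal sketch built on a spinning compare-and-swap; it assumes a thread id fits into a uintptr_t (true on Linux with pthreads), and the real implementation linked above additionally records backtraces of the previous owner:

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Minimal recursive mutex: a spinning CAS on the owner plus a nesting count.
 * The count is only ever touched by the thread that owns the lock. */
struct rec_mutex {
    atomic_uintptr_t owner;   /* 0 = free, otherwise the owner's pthread_self() */
    int count;                /* nesting depth */
};

void rec_mutex_lock(struct rec_mutex *m)
{
    uintptr_t self = (uintptr_t)pthread_self();

    if (atomic_load(&m->owner) == self) {   /* recursive acquisition */
        m->count++;
        return;
    }
    uintptr_t expected = 0;
    while (!atomic_compare_exchange_weak(&m->owner, &expected, self)) {
        expected = 0;
        sched_yield();                      /* spin politely until it is free */
    }
    m->count = 1;
}

void rec_mutex_unlock(struct rec_mutex *m)
{
    if (--m->count == 0)
        atomic_store(&m->owner, 0);
}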

Hunting for code defects

Uninitialized variables and what can we do about them

One problem I came across was that the code had a lot of uninitialized variables - hundreds of them. On the one hand, the proper solution is to inspect each case manually and write the correct initializer. On the other hand, the code works when compiled with the diab compiler, so that compiler must be zero-initializing them.
So I went ahead and wrote a clang rewriter plugin to add the initializers: zero for primitive types and a pair of curly braces for structs (clang rewriter to add initializers). However, I realized that the biggest problem is that some functions use the convention of returning zero to indicate a failure while others return a non-zero value. This means we cannot have a generic safe initializer that would make the code take the "fault" path when it reaches the rewritten code. An alternative to manual inspection could be writing sophisticated rules for the rewriter to detect the convention used.

I ended up using Valgrind and manually patching up some of the warnings. AddressSanitizer was also useful. However, fixing each warning and creating a blacklist is too tiresome, so I ended up setting a breakpoint on the "__asan_report_error" function and writing a script that would make gdb print a backtrace, return and continue execution.

Duplicate structures

One problem I suspected could be present in the code (due to the deep hierarchy of #ifdefs) is the presence of structures with the same name but different contents. I made up an example C program to demonstrate how the compiler does not warn about the type mismatch but at runtime the code silently corrupts memory.
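A made-up two-file example of the problem could look like this: both translation units compile cleanly and link without a single warning, yet the write in one file lands on top of a field the other file thinks it owns:

/* a.c - built with the "extended" struct layout */
struct config { int flags; int extra; char name[16]; };

void set_name(struct config *c);    /* defined in b.c */

struct config g_cfg;

int main(void)
{
    set_name(&g_cfg);
    return g_cfg.extra;   /* trashed: b.c wrote its name[] over this field */
}

/* b.c - the same struct name, but a different layout (no 'extra' field),
 * e.g. because this file was compiled with a different set of #ifdefs */
#include <string.h>

struct config { int flags; char name[16]; };

void set_name(struct config *c)
{
    strcpy(c->name, "oops");   /* lands at offset 4, which a.c calls 'extra' */
}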


I figured out an easy way of dealing with the problem. I ended up using clang and emitting LLVM bitcode for each file instead of object files with machine code. Then I linked them together into a single bitcode file and disassembled it with llvm-dis.

The nice thing about LLVM is that when it finds two structures having the same name but declared differently, it appends a different numeric suffix to the struct name. Then one can just remove the suffixes and look for unique lines with different structure declarations.

Luckily for me, there was only one place where I suspected an incorrect definition, and it was not in the part of the code I was executing, so I ruled out this option as a source of incorrect behavior.

Further work

Improving tooling for the 32-bit world

It is quite unfortunate, but the project I have been working on is 32-bit, and one cannot simply convert it into a 64-bit one with a compiler flag. The problem is that the code has quite a lot of places where pointers are for some reason stored in packed structures, and some other structures implicitly rely on the structure layout. So it is very difficult to modify this fragile mess.
It is sad that two great tools, MemorySanitizer and ThreadSanitizer, are not available for 32-bit applications. It is understandable because the 32-bit address space is too tiny to fit the shadow memory. I am thinking of ways to make them available to the 32-bit world. So far I see two ways of solving the problem.
First, we can use the fragile and non-portable mmap flag (currently only supported on Linux) to force allocations below the 4 GiB limit. Then one could write an ELF loader that would load all the code below 4 GiB and use the upper memory range for the shadow. Besides being non-portable, the other disadvantages of this approach include having to deal with 64-bit identifiers such as file handles.
Alternatively, we could store the shadow in a separate process and use shared memory or sockets for communication. That would likely be at least an order of magnitude slower than using the corresponding sanitizer in the 64-bit world, but probably still faster than Valgrind, and besides, it is compile-time instrumentation with more internal data from the compiler.

Verifying the porting was done correctly

Now I am left with one challenging task: verifying that the ported software is identical to what could be built using the original build system.

One may notice that simply intercepting the calls to the compiler may not be enough, because the build system may copy files or export some shell variables during the build process. Besides, different compilers have different ways of handling "inline" and some other directives. It would be good to verify that the call graphs of the original binary and the one produced by our Makefile are similar (of course, we will need to mark some library functions as leaf nodes and not analyze them). For a start I could try inspecting some of the unresolved symbols manually, but I'm thinking of automating the process. I think for this task I'll need a disassembler that can identify basic blocks. Probably the Capstone engine should do the job. Any ideas on that?

P.S. Oh, and I once tried visualizing the dependency graph of separate ".o" files (before I realized I could just link them all together and get the list of missing symbols), and trust me, those graphs grow really fast. I have found that a tool called "gephi" does a decent job at visualizing really huge graphs and supports Graphviz's dot as the input format.

EDIT 2015-11-10
The challenge is complicated by the fact that some subprojects exist in multiple copies (with various changes), and one should also ensure that the headers and sources are picked up from the correct copy. However, I've found an easy and acceptable solution. I just wrote a tool that parses the call graph generated by IDA, and for every edge in the graph it looks up the function names of the corresponding vertices. Then it just prints a list of pairs "A -> B" for every function A calling function B. After that, we can sort the file alphabetically and remove the nodes that are uninteresting to us (the OS and library functions). Next, we can compare the files side-by-side with a tool like kdiff3 (or we can do it automatically). Whenever there is a significant difference (for example, 5 callees are different for the same caller), we can inspect it manually and verify we're compiling the correct file with the correct options. Using this method I have identified several places where we chose the wrong object file for linking, and now we're only concerned with the porting layer and OS kernel, without having to worry about the application itself.

Wednesday, May 20, 2015

on making a DDE kit, kinda

So I have this task of porting a huge piece of software running on a proprietary OS to another OS. And I don't even have a clue how to compile it (well, I do, but it builds on Windows so it's almost irrelevant).

But luckily all the code is linked into a single ELF file and the compilation produces intermediate object files. The first thought I had was to visualize the dependency graph of the object files to find out who calls what. You can find the script below; it recursively walks the supplied directory and tries to parse the import/export tables with objdump. There are some areas for improvement (for example, also parsing the dynsym table with -T or parsing .a archives) but it did the job for me.

Unfortunately, I realized that visualizing a graph with 30K edges was not even remotely a smart idea.
https://gist.github.com/astarasikov/355bf825f130fe4b5633

What I also found out was that the OS-specific code and objects were stored in a separate location (since they were part of an SDK). Even if they were not, we could just remove the object files that were present both in our project and in the SDK. After that, all the functions that the application required from the OS unsurprisingly ended up in the "UNDEFINED" node, and there were only 200 of them, which gives me some hope.

This approach can also be used for other use cases, for example porting drivers from Linux/FreeBSD to exotic platforms: first build the binaries, then pick as many of them as you can to minimize the list of required functions. I find dealing with compiled code easier because build systems, C macros and ifdefs just drive me insane.

Friday, May 15, 2015

GCOV is amazing yet undocumented

One useful technique for maintaining software quality is code coverage. While routinely used by high-level developers, it is completely forgotten by many C hackers, especially when it comes to the kernel. In fact, Linux is the only kernel which supports being compiled with the GCOV coverage tool.



GCOV works by instrumenting your code: it inserts code to increment stats counters around each basic block. These counters reside in a special section of your ELFs. During compilation, GCC generates a ".gcno" file for each ".c" file. These files allow the "gcov" tool to look up function names and other info using the integer IDs (which are specific to each file).

At runtime, an executable built with GCOV produces a file called ".gcda" which contains the values of the counters. During the executable's initialization, constructors (which are function pointers in the ".ctors" section) are called. One of them is "__gcov_init", which registers a certain ".o" file inside libgcov.

Since we're running in a kernel or on bare metal, we have neither libgcov nor a file system to dump the ".gcda" files to. But one should not fear GCOV! Adding it to your kernel is just a matter of adding one C file (which is mostly shamelessly copy-pasted from the Linux kernel and GCC sources) and a couple of CFLAGS. In my example I'm using the LK kernel by Travis Geiselbrecht (https://github.com/travisg/lk). I've decided to just print out the contents of the ".gcda" files to the serial console (UART) as a hex dump and then use an AWK script and the "xxd" tool to convert them to binaries on the host. This works completely fine since these files are typically below 2KB in size.

An important thing to note: if your kernel does not contain the ".ctors" section and does not call the constructors, be sure to add them to the ld script and add some code to invoke them. For example, here's how LK does that: https://github.com/travisg/lk/blob/master/top/main.c#L42
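The constructor walk itself is just a loop over an array of function pointers bracketed by linker-script symbols; here is a minimal sketch (the __ctor_list/__ctor_end names follow the usual convention and must match whatever your ld script defines):

/* symbols placed around the .ctors section by the linker script */
typedef void (*ctor_t)(void);
extern ctor_t __ctor_list[];
extern ctor_t __ctor_end[];

/* call this early during kernel init, before any instrumented code runs,
 * so that __gcov_init() gets a chance to register every object file */
static void call_constructors(void)
{
    for (ctor_t *ctor = __ctor_list; ctor != __ctor_end; ctor++)
        (*ctor)();
}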

You can see the whole patch below.
https://github.com/astarasikov/lk/commit/2a4af09a894194dfaff3e05f6fd505241d54d074

After running the "gcovr" tool you get a nice HTML summary and can see which lines were executed and which were not, and add tests for the latter. Woot!

Friday, October 24, 2014

abusing Mesa by hooking ELFs and ioctl

At work I have had several occasions when I needed to hook a function on a Linux system. I will not go deep into technical details, but the examples provided in the links contain code which can be easily plugged into a project.

A simple case.
First, let us consider the case when we only need to replace a function in our application without replacing it in the dynamic libraries loaded by the application.

In Linux, functions in dynamic libraries are resolved using two tables: the PLT (procedure linkage table) and the GOT (global offset table). When you call a library function, you actually call a stub in the PLT. What the PLT stub essentially does is load the function address from the GOT. On x86_64, it does that with the "jmpq *" instruction (ff 25). Initially the GOT contains a pointer back to the PLT, which invokes the resolver, which in turn writes the correct entry to the GOT.

So, to replace a library function, we must patch its entry in the GOT. How to find it? We can decode the offset to the GOT entry from the JMPQ instruction (it is relative to the address of the PLT stub plus the JMPQ instruction length). How do we find the PLT stub then? It's simply the function pointer - that is, in C, use the function name as a pointer.

We should call "dlsym" to forcibly resolve the entry before replacing it in the GOT, for example if we want to swap two dynamic functions. A complete example is available as a github gist:
https://gist.github.com/astarasikov/9547918
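For reference, a condensed sketch of that approach (x86_64 only, no error handling, names chosen for illustration; see the gist for the complete version):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Given the address of a PLT stub (what the function name decays to in the
 * main executable), decode the "jmpq *rel32(%rip)" instruction (ff 25) and
 * return the address of the GOT slot it loads from. */
static void **got_entry_from_plt(void *plt_stub)
{
    uint8_t *insn = plt_stub;
    int32_t rel;
    /* insn[0] == 0xff && insn[1] == 0x25 for the indirect jump */
    memcpy(&rel, insn + 2, sizeof(rel));
    /* RIP-relative: the offset counts from the end of the 6-byte instruction */
    return (void **)(insn + 6 + rel);
}

/* Replace 'name' with 'replacement'; returns the real implementation. */
static void *hook(const char *name, void *plt_stub, void *replacement)
{
    /* resolve the real address first: the GOT slot may still point at the
     * lazy resolver if the function has never been called */
    void *orig = dlsym(RTLD_NEXT, name);

    void **slot = got_entry_from_plt(plt_stub);

    /* make the page writable before patching, in case of (partial) RELRO */
    long page_size = sysconf(_SC_PAGESIZE);
    uintptr_t page = (uintptr_t)slot & ~(page_size - 1);
    mprotect((void *)page, page_size, PROT_READ | PROT_WRITE);

    *slot = replacement;
    return orig;
}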

A more advanced example
The approach described above will not work when you want to replace a function that is used by some dynamically loaded binary, since that library will have its own GOT and PLT tables.

On X86_64, we can find the PLT table by examining the contents of the ELF file. The section we care about is called ".rela.plt".

We need to examine the ELF header (Ehdr), find the section header (Shdr) and find the "dynsym" and ".rela.plt" tables. The dynsym table unsurprisingly contains the pointers to dynamically resolved functions and can be used to find a function by name. You can find the link to the code below. I've mmap'ed the library file to get a pointer to its contents and use it to read the data. Some examples floating around the internet use "malloc", "read()" and "memcpy" calls for the same purpose. To get the pointer to the library in the application's virtual memory space, you can call "dlopen" and dereference the pointer returned by the function. You will need this address to convert the PLT offset into an absolute address. Technically you can get away without reading/mmap'ing the library from disk, but since some sections are not mapped for reading, you will need to use "mprotect" to access them without receiving SIGSEGV.

https://github.com/astarasikov/sxge/blob/vaapi_recorder/apps/src/sxge/apps/demo1_cube/hook-elf.c#L126

Alternatives
The most straightforward alternative to such runtime patching is using the "LD_PRELOAD" variable and specifying a library with the implementation of the hooked routine. The linker will then redirect all calls to that function in all libraries to the preloaded library. However, the obvious limitation of this approach is that it may be hard to get it working if you preload multiple libraries overriding the same symbol. It also breaks some tools like "apitrace" (which is a tool to trace and replay OpenGL calls).

Practical example with Mesa
Many GPUs nowadays contain encoders which encode video to H.264 or other popular formats. Intel GPUs (namely the HD4000 and Iris Graphics series) have an encoder for H.264. Some solutions like NVIDIA Grid utilize these hardware capabilities to stream games over the network or provide remote desktop facilities with low CPU load.

Unfortunately both the proprietary and the open-source drivers for the Intel hardware lack the ability to export an OpenGL resource into the encoder library, which is a desired option for implementing a screen recorder (actually the proprietary driver from the SDK implements the advanced encoding algorithm as a binary library but uses pre-compiled versions of vaapi and libdrm for managing resources).

If we don't share a GL texture with the encoder, we cannot use the zero-copy path. The texture contents will need to be read with "glReadPixels()", converted to the NV12 or YUV420 format (though that can be done by a GLSL shader before reading the data) and re-uploaded to the encoder. This is not an acceptable solution since each frame will take more than 30ms and cause 80% CPU load, leaving no processing power for other parts of the application. On the other hand, using a zero-copy approach allows us to have smooth 60FPS performance - the OpenGL side is not blocked by "glReadPixels()" and the CPU load never exceeds 10 percent.

To be more precise, resource sharing support is present if one uses the "EGL" windowing system and one of the Mesa extensions. Technically, all GPU/encoder memory is managed by the DRM framework in the Linux kernel. There is an extension which allows one to obtain a DRM handle (which is a uint32) from an OpenGL texture. It is used by Wayland in the Weston display server. An example of using VAAPI to encode from a DRM handle can be found in the Weston source code:
http://cgit.freedesktop.org/wayland/weston/tree/src/vaapi-recorder.c

Now, the problem is to find the handle of an OpenGL texture. Under EGL that can be done by creating an EGLImage from the texture handle and subsequently calling eglExportDRMImageMESA.
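Roughly, the EGL path looks like the sketch below, assuming the EGL_KHR_image_base and EGL_MESA_drm_image extensions are available (error checking omitted):

#include <stdint.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>

/* Turn a GL texture into a DRM buffer name/handle/stride that can be fed
 * to the VAAPI encoder.  Both entry points are extensions, so they have to
 * be looked up at runtime. */
static void export_texture(EGLDisplay dpy, EGLContext ctx, GLuint tex,
                           EGLint *name, EGLint *handle, EGLint *stride)
{
    PFNEGLCREATEIMAGEKHRPROC create_image =
        (PFNEGLCREATEIMAGEKHRPROC)eglGetProcAddress("eglCreateImageKHR");
    PFNEGLEXPORTDRMIMAGEMESAPROC export_drm =
        (PFNEGLEXPORTDRMIMAGEMESAPROC)eglGetProcAddress("eglExportDRMImageMESA");

    /* wrap the texture into an EGLImage, then ask Mesa for its DRM handle */
    EGLImageKHR image = create_image(dpy, ctx, EGL_GL_TEXTURE_2D_KHR,
                                     (EGLClientBuffer)(uintptr_t)tex, NULL);
    export_drm(dpy, image, name, handle, stride);
}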

In my case the problem was that I didn't want to use EGL because it is quite difficult to port a lot of legacy GLX codebase to EGL. Besides, GLX support is more mature and stable with the NVIDIA binary driver. So the problem is to get a DRM buffer out of a GL texture in GLX.

Unfortunately GLX does not provide an equivalent extension, and implementing one for Mesa and X11 is rather complicated due to the complexity of the code. One option would be to link against Mesa and use its internal headers to reach for the needed resources, as done by Beignet, the open-source OpenCL driver for Intel (http://cgit.freedesktop.org/beignet/tree/src/intel/intel_dri_resource_sharing.c), which is error-prone. Besides, setting up such a build environment is complicated.

Luckily, as we've seen above, the memory is managed by DRM, which means every memory allocation is actually an IOCTL call into the kernel. That means we can hook the ioctl, steal the DRM handle and later use it as the input to VAAPI. We need to look for the DRM_I915_GEM_SET_TILING ioctl code.

The obvious limitation is that you need to store a global reference that is written by the ioctl hook, which makes the code thread-unsafe. Luckily most OpenGL apps are single-threaded, and even when they're not, there are fewer than a dozen threads which can access OpenGL resources, so lock contention is not an issue and a pthread mutex can be used.
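A sketch of such a hook could look like this (it assumes the libdrm headers are on the include path and that real_ioctl has been captured via the GOT-patching trick from the earlier section; passing the third variadic argument through as a pointer is fine on x86_64):

#include <pthread.h>
#include <stdint.h>
#include <i915_drm.h>   /* libdrm's copy of the i915 uapi header */

/* the real ioctl(), captured elsewhere, e.g. via the GOT patching trick */
static int (*real_ioctl)(int fd, unsigned long req, void *arg);

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static uint32_t g_last_handle;   /* GEM handle of the last tiled buffer */

int hooked_ioctl(int fd, unsigned long req, void *arg)
{
    if (req == DRM_IOCTL_I915_GEM_SET_TILING) {
        struct drm_i915_gem_set_tiling *t = arg;
        pthread_mutex_lock(&g_lock);
        g_last_handle = t->handle;   /* remember the handle for VAAPI */
        pthread_mutex_unlock(&g_lock);
    }
    return real_ioctl(fd, req, arg);
}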

Another limitation is that to avoid memory corruption or errors, we need to carefully track the resources. One solution would be to allocate the OpenGL texture prior to using VAAPI and deallocate it after the encoder is destroyed. The other is to hook more DRM calls to get a pointer to the "drm_intel_bo" structure and increase its reference count.

https://github.com/astarasikov/sxge/blob/vaapi_recorder/apps/src/sxge/apps/demo1_cube/demo1_cube.cc#L119

Sunday, September 7, 2014

GSoC results or how I nearly wasted summer

This year I got a chance to participate in the Google Summer of Code with the FreeBSD project.
I have long wanted to learn about the FreeBSD kernel internals and port it to some ARM board but was getting sidetracked whenever I tried to do it. So it looked like a perfect chance to stop procrastinating.

Feel free to scroll down to the "other stuff" paragraph if you don't feel like reading three paragraphs of the typical whining associated with debugging C code.

Initially I wanted to bring up some real piece of hardware. My choice was the Nexus 5 phone, which has a Qualcomm SoC. However, some of the FreeBSD developers suggested that I instead port the kernel to the Android Emulator. It sounded like a sane idea since the Android Emulator is available for the major OSes and is easy to set up, which would allow exposing more people to the project. Besides, unlike QEMU, it can emulate some peripherals specific to a mobile device, such as sensors.

What is the Android Emulator? In fact, it is an emulator based on QEMU which emulates a virtual board called "Goldfish" that includes a set of devices such as an interrupt controller, a timer and input devices. It is worth noting that the Android Emulator is based on an ancient version of QEMU, though starting with Android L one can obtain binaries of an emulator based on a more recent build from git.


I started by implementing the basics: the interrupt controller, the timer and the UART driver. That allowed me to see the boot console, but then I got stuck for literally months with the kernel crashing at various points in what seemed like a totally random fashion. I spent nights single-stepping the kernel, my sight slowly turning from barely red eyes to emitting hard radiation. It was definitely a problem with caching, but ultimately I gave up trying to fix it. Since I knew that the FreeBSD kernel worked fine on ARM in a recent version of QEMU, it was clearly a bug in the Android Emulator, even though it had not manifested itself in Linux. Since the Android Emulator is eventually going to be updated to a new QEMU base and I was running out of time, I decided to just use a nightly build of the emulator instead of the one coming with the Android SDK, and that fixed this particular issue.
But hey! At least I've learnt about the FreeBSD virtual memory management and the UMA allocator the hard way.

Next up was fighting the timer and random freezes while initializing the random number generator (sic!). That was easy - it turns out I had just forgotten to call a function. Anyway, it would make sense to move that call out of the board files and call it for every board that does not define the "NO_EVENTTIMERS" option in the kernel config.

Fast forward to the start of August, when there were only a couple of days left and I still didn't have a working block device driver to boot a rootfs! Writing an MMC driver turned out to be very easy, and I started celebrating when I got the SD card image from a Raspberry Pi booting in the Android Emulator and showing the login prompt.

It worked, but with a huge kludge in the OpenFirmware bindings. Somehow one of the functions which should have returned the driver bus name returned (seemingly) random crap, so instead of a NULL pointer check I had to check that the address points into the kernel VM range. I have since tracked down the real source of the problem.
I was basing my MMC driver on the LPC SoC driver. The LPC driver itself is suspicious - for example, the place where DMA memory is allocated says "allocating Framebuffer memory", which may be an indicator that it was written in a hurry and possibly only barely tested. In the MMC bus attach routine, it was calling the "device_set_ivars" function and setting the ivars pointer to the mmc host structure. I have observed a similar pattern in some other drivers. In the OpenFirmware code, the ivars structure was dereferenced as a completely different structure, and a string was taken from a member of that structure.

How the hell did this happen in a world where people are using a recent Clang compiler and compiling with "-Wall" and "-Werror"? Uhh, well, if you're dealing with some kind of OOP in plain C, casting to a void pointer and back and other ill-typed wicked tricks are inevitable. Just look at that prototype and cry with me:

>> device_set_ivars(device_t dev, void *ivar);

For now, I'm pushing the changes to the following branch. Please note that I'm rebasing against master and force-pushing in the process, so keep that in mind if you're willing to test.
https://github.com/astarasikov/freebsd/tree/android_goldfish_arm_master
So, what has been achieved and what am I still going to do? Well, debugging the MMU issues and the OpenFirmware bug delayed me substantially, so I have not done everything I planned. Still:
  • Basic hardware works in Goldfish: Timer, IRQs, Ethernet, MMC, UART
  • It is using NEWCONS for Framebuffer Console (which is cool!)
  • I've verified that Android Emulator works in FreeBSD with linuxulator
  • I have fixed Android Emulator to compile natively on FreeBSD and it mostly works (at least, console output) except graphics (which I think should be an easy fix; there are a lot of places where "ifdef __linux__" is completely unjustified and should rather be enabled for all unices)
How to test and contribute?
You can use the Linux build of the Android Emulator via the Linuxulator. Unfortunately, the Linuxulator only supports 32-bit binaries, so we cannot run the nightly binaries which have the MMU issues fixed.
Therefore, I have fixed the emulator to compile natively on FreeBSD! I've pushed the branch named "l-preview-freebsd" to github. You will also need gtest. I cloned the repo from CyanogenMod and pushed a tag named "l-preview-freebsd", which is actually "cm-11.0".

Compiling the emulator:
>> git clone https://github.com/astarasikov/qemu.git
>> git clone https://github.com/astarasikov/android_external_gtest
>> cd android_external_gtest
>> git checkout cm-11.0
>> cd ../qemu
>> git checkout l-preview-freebsd
>> bash android-build-freebsd.sh


Please refer to the README file in git for detailed instructions on building the kernel.

The only thing that really bothers me about low-level and embedded stuff is that it is extremely difficult to estimate how long it may take to implement a certain feature, because every time you end up debugging obscure stuff and the "useful stuff you learn or do" / "time wasted" ratio is very small. On the bright side, while *BSD lags a bit behind Linux in terms of device drivers and performance optimizations, reading and debugging *BSD code is much easier than GNU projects like EGLIBC, GCC and GLib.

Other stuff.
Besides GSoC, at the start of summer I had a chance to work with Chris Wade (who is an amazing hacker, by the way). We started working on an ARM virtualization project and spent nice days debugging caching issues and instruction decoding while trying to emulate a particular Cortex-A8 SoC on an A15 chip. Unfortunately, working on GSoC, going to a daily job and doing this project simultaneously turned out to be surprisingly difficult, and I had to quit at least one of the activities. Still, I wish Chris good luck with this project, and if you're interested in virtualization and iPhones, sign up for an early demo at virtu.al

Meanwhile I'm planning to learn the ARMv8 ISA. It's a pity there is no hackable hardware available at reasonable prices yet. QEMU works fine with VirtIO peripherals, though. But I'm getting increasingly worried about devkits costing more and more, essentially making it harder for a novice to become an embedded software engineer.