Friday, October 24, 2014

abusing Mesa by hooking ELFs and ioctl

At work I have had several occasions where I needed to hook a function on a Linux system. I will not go deep into the technical details, but the linked examples contain code that can easily be plugged into a project.

A simple case.
First, let us consider the case where we only need to replace a function in our own application, without replacing it in the dynamic libraries the application loads.

In Linux, calls to functions in dynamic libraries are resolved using two tables: the PLT (procedure linkage table) and the GOT (global offset table). When you call a library function, you actually call a stub in the PLT. The stub essentially loads the function address from the GOT and jumps to it. On x86_64 it does that with the "jmpq *" instruction (ff 25). Initially the GOT entry points back into the PLT, at code that invokes the resolver, which then writes the real function address into the GOT.

So, to replace a library function, we must patch its entry in the GOT. How do we find it? We can decode the offset to the GOT entry from the JMPQ instruction: the offset is relative to the address of the PLT stub plus the length of the JMPQ instruction. How do we find the PLT stub then? It's simply the function pointer - that is, in C, use the function name as a pointer.
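The decoding step can be sketched as follows. This is a minimal illustration, not the code from the gist: the helper name "plt_stub_got_slot" is hypothetical, and it assumes the classic stub layout that starts directly with ff 25. Newer toolchains may emit a different stub layout (for example with a prefix before the jump), in which case the helper simply returns NULL. It also assumes the function pointer really is the PLT stub address, which is typically the case in a non-PIE executable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Given the address of an x86_64 PLT stub, return the address of the GOT
 * slot it jumps through, or NULL if the stub does not begin with the
 * "jmpq *rel32(%rip)" encoding (ff 25).
 * Hypothetical helper, for illustration only.
 */
void **plt_stub_got_slot(const uint8_t *stub)
{
    int32_t rel32;

    assert(stub != NULL);
    if (stub[0] != 0xff || stub[1] != 0x25)
        return NULL;

    /* The signed 32-bit displacement follows the two opcode bytes. */
    memcpy(&rel32, stub + 2, sizeof rel32);

    /* RIP-relative addressing: the base is the address of the next
     * instruction, i.e. the stub address plus the 6-byte jmpq. */
    return (void **)((uintptr_t)stub + 6 + (intptr_t)rel32);
}
```

Under those assumptions, something like plt_stub_got_slot((const uint8_t *)(uintptr_t)&puts) would yield the slot to patch for "puts".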

We should call "dlsym" to forcibly resolve the entry in the GOT before replacing it, for example if we want to swap two dynamic functions. A complete example is available as a GitHub gist:
https://gist.github.com/astarasikov/9547918
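Once the GOT slot is known, the patch itself is a pointer store. The sketch below is my own illustration rather than the gist's code: "patch_got_slot" is a hypothetical name, and the "mprotect" call is an assumption that the page holding the slot may have been made read-only by the linker (RELRO), so it is made writable first.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Overwrite one GOT slot with `replacement`. The previous value is stored
 * in *old_out (if non-NULL) so the hook can still call the original.
 * Returns 0 on success, -1 if the page could not be made writable.
 * Hypothetical helper, for illustration only.
 */
int patch_got_slot(void **slot, void *replacement, void **old_out)
{
    assert(slot != NULL);

    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    /* A pointer-aligned slot never straddles a page boundary. */
    uintptr_t start = (uintptr_t)slot & ~(page - 1);

    if (mprotect((void *)start, page, PROT_READ | PROT_WRITE) != 0)
        return -1;

    if (old_out != NULL)
        *old_out = *slot;
    *slot = replacement;
    return 0;
}
```

Swapping two functions is then two calls to this helper, each storing the other function's resolved address.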

A more advanced example
The approach described above will not work when you want to replace a function that is called from a dynamically loaded library, since each library has its own PLT and GOT.

On x86_64, we can find the PLT entries by examining the contents of the ELF file. The section we care about is called ".rela.plt"; it holds the relocation entries for the GOT slots that the PLT jumps through.

We need to examine the ELF header (Ehdr), find the section headers (Shdr) and locate the ".dynsym" and ".rela.plt" tables. The dynsym table, unsurprisingly, describes the dynamically resolved symbols and can be used to find a function by name. You can find the link to the code below. I've mmap'ed the library file to get a pointer to its contents and use it to read the data. Some examples floating around the internet use the "malloc", "read()" and "memcpy" calls for the same purpose. To get the address at which the library is loaded in the application's virtual memory space, you can call "dlopen" and dereference the pointer it returns (this relies on the handle pointing at a structure whose first field is the load address, as is the case with glibc). You will need this address to convert the offset from ".rela.plt" into an actual address. Technically you can get away without reading or mmap'ing the library from disk, but since some sections are not mapped for reading, you would need to use "mprotect" to access them without receiving SIGSEGV.

https://github.com/astarasikov/sxge/blob/vaapi_recorder/apps/src/sxge/apps/demo1_cube/hook-elf.c#L126
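For readers who want the gist of the lookup without opening the link, here is a condensed sketch of the same idea. It is not the linked code: the function name "find_plt_reloc_offset" is hypothetical, it handles 64-bit ELF files only, and it trusts the file to be well-formed (no bounds checks on offsets), which a real tool should not do.

```c
#include <assert.h>
#include <elf.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Find the ".rela.plt" entry that refers to symbol `name` in the ELF file
 * at `path` and store its r_offset (the GOT slot offset relative to the
 * load address) in *out. Returns 0 on success, -1 otherwise.
 * Hypothetical helper; assumes a well-formed 64-bit ELF file.
 */
int find_plt_reloc_offset(const char *path, const char *name, uint64_t *out)
{
    assert(path != NULL && name != NULL && out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size < (off_t)sizeof(Elf64_Ehdr)) {
        close(fd);
        return -1;
    }
    uint8_t *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (base == MAP_FAILED)
        return -1;

    int rc = -1;
    const Elf64_Ehdr *eh = (const Elf64_Ehdr *)base;
    if (memcmp(eh->e_ident, ELFMAG, SELFMAG) == 0 &&
        eh->e_ident[EI_CLASS] == ELFCLASS64) {
        const Elf64_Shdr *sh = (const Elf64_Shdr *)(base + eh->e_shoff);
        const char *shstr = (const char *)base + sh[eh->e_shstrndx].sh_offset;
        const Elf64_Shdr *dynsym = NULL, *relaplt = NULL;

        /* Walk the section headers looking for the two tables we need. */
        for (int i = 0; i < eh->e_shnum; i++) {
            if (sh[i].sh_type == SHT_DYNSYM)
                dynsym = &sh[i];
            else if (sh[i].sh_type == SHT_RELA &&
                     strcmp(shstr + sh[i].sh_name, ".rela.plt") == 0)
                relaplt = &sh[i];
        }

        if (dynsym != NULL && relaplt != NULL) {
            const Elf64_Sym *syms = (const Elf64_Sym *)(base + dynsym->sh_offset);
            /* sh_link of .dynsym is the index of its string table. */
            const char *strtab = (const char *)base + sh[dynsym->sh_link].sh_offset;
            const Elf64_Rela *rela = (const Elf64_Rela *)(base + relaplt->sh_offset);
            size_t count = relaplt->sh_size / sizeof(Elf64_Rela);

            for (size_t i = 0; i < count; i++) {
                const Elf64_Sym *s = &syms[ELF64_R_SYM(rela[i].r_info)];
                if (strcmp(strtab + s->st_name, name) == 0) {
                    *out = rela[i].r_offset;
                    rc = 0;
                    break;
                }
            }
        }
    }

    munmap(base, (size_t)st.st_size);
    return rc;
}
```

Adding the returned offset to the library's load address gives the address of the GOT slot to patch.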

Alternatives
The most straightforward alternative to such runtime patching is to set the "LD_PRELOAD" variable to a library that implements the hooked routine. The dynamic linker will then redirect all calls to that function, from all libraries, into the preloaded library. However, the obvious limitation of this approach is that it may be hard to get working if you preload multiple libraries that override the same symbol. It also breaks some tools like "apitrace" (a tool to trace and replay OpenGL calls).

Practical example with Mesa
Many GPUs nowadays contain encoders which will encode video to H.264 or other popular formats. Intel GPUs (namely HD4000 and Iris Graphics series) have an encoder for H.264. Some solutions like NVIDIA Grid utilize hardware capabilities to stream games over the network or provide the remote desktop facilities with low CPU load.

Unfortunately, both the proprietary and the open-source drivers for Intel hardware lack the ability to export an OpenGL resource into the encoder library, which is a desirable feature for implementing a screen recorder (actually the proprietary driver from the SDK implements the advanced encoding algorithm as a binary library but uses pre-compiled versions of vaapi and libdrm for managing resources).

If we don't share a GL texture with the encoder, we cannot use the zero-copy path. The texture contents will need to be read back with "glReadPixels()", converted to the NV12 or YUV420 format (though that can be done by a GLSL shader before reading the data) and re-uploaded to the encoder. This is not an acceptable solution, since each frame will take more than 30ms and cause 80% CPU load, leaving no processing power for other parts of the application. On the other hand, a zero-copy approach allows a smooth 60FPS: the OpenGL side will not be blocked by "glReadPixels()" and the CPU load will never exceed 10 percent.
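To give an idea of the per-pixel work the copy path adds on the CPU, here is a rough sketch of an RGBA to NV12 conversion. It is not taken from the project: the function name is hypothetical, it uses the common BT.601 limited-range integer approximation, subsamples chroma by taking the top-left pixel of each 2x2 block, and relies on arithmetic right shift of negative values (as gcc and clang implement it).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert a tightly packed RGBA8 image to NV12: a full-resolution Y plane
 * plus a half-resolution plane of interleaved U,V pairs.
 * Hypothetical helper; width and height must be even.
 */
void rgba_to_nv12(const uint8_t *rgba, int width, int height,
                  uint8_t *y_plane, uint8_t *uv_plane)
{
    assert(width % 2 == 0 && height % 2 == 0);

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            const uint8_t *p = rgba + 4 * ((size_t)y * width + x);
            int r = p[0], g = p[1], b = p[2];

            y_plane[(size_t)y * width + x] =
                (uint8_t)(((66 * r + 129 * g + 25 * b + 128) >> 8) + 16);

            /* One U,V pair per 2x2 block, taken from its top-left pixel. */
            if ((y & 1) == 0 && (x & 1) == 0) {
                uint8_t *uv = uv_plane + (size_t)(y / 2) * width + x;
                uv[0] = (uint8_t)(((-38 * r - 74 * g + 112 * b + 128) >> 8) + 128);
                uv[1] = (uint8_t)(((112 * r - 94 * g - 18 * b + 128) >> 8) + 128);
            }
        }
    }
}
```

Every frame pays for this loop on top of the readback and the re-upload, which is what the zero-copy path avoids.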

To be more precise, resource sharing support is present if one uses the "EGL" windowing system and one of the MESA extensions. Technically, all GPU/encoder memory is managed by the DRM framework in the Linux kernel. There is an extension that allows obtaining a DRM handle (which is a uint32) from an OpenGL texture. It is used by Wayland in the Weston display server. An example of using VAAPI to encode a DRM handle can be found in the Weston source code:
http://cgit.freedesktop.org/wayland/weston/tree/src/vaapi-recorder.c

Now, the problem is to obtain the DRM handle for an OpenGL texture. Under EGL that can be done by creating an EGLImage from the texture handle and subsequently calling eglExportDRMImageMESA.

In my case the problem was that I didn't want to use EGL, because it is quite difficult to port a large legacy GLX codebase to EGL. Besides, GLX support is more mature and stable with the NVIDIA binary driver. So the problem is to get a DRM buffer out of a GL texture in GLX.

Unfortunately, GLX does not provide an equivalent extension, and implementing one for Mesa and X11 is rather complicated due to the complexity of the code. One option would be to link against Mesa and use its internal headers to reach the needed resources, as done by Beignet, the open-source OpenCL driver for Intel (http://cgit.freedesktop.org/beignet/tree/src/intel/intel_dri_resource_sharing.c), but that is error-prone. Besides, setting up such a build environment is complicated.

Luckily, as we've seen above, the memory is managed by DRM, which means every memory allocation is actually an ioctl call to the kernel. That means we can hook ioctl, steal the DRM handle and later use it as the input to VAAPI. We need to look for the DRM_I915_GEM_SET_TILING ioctl code.

The obvious limitation is that you need a global variable that the ioctl hook writes to, which makes the code thread-unsafe. Luckily most OpenGL apps are single-threaded, and even when they're not, there are fewer than a dozen threads that can access OpenGL resources, so lock contention is not an issue and a pthread mutex can be used.
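One way to guard that global is sketched below. The names are hypothetical and the single-slot design is an assumption of mine: it only remembers the most recent handle, so the consumer is expected to pick it up right after creating the texture.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-slot mailbox shared by the ioctl hook (producer)
 * and the encoder setup code (consumer). */
static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;
static uint32_t slot_handle;
static bool slot_valid;

/* Called from the hook: remember the most recent handle. */
void slot_publish(uint32_t handle)
{
    pthread_mutex_lock(&slot_lock);
    slot_handle = handle;
    slot_valid = true;
    pthread_mutex_unlock(&slot_lock);
}

/* Called by the consumer: fetch and clear the handle.
 * Returns false if nothing has been published since the last take. */
bool slot_take(uint32_t *out)
{
    assert(out != NULL);

    pthread_mutex_lock(&slot_lock);
    bool had_value = slot_valid;
    if (had_value) {
        *out = slot_handle;
        slot_valid = false;
    }
    pthread_mutex_unlock(&slot_lock);
    return had_value;
}
```

With only a handful of threads touching OpenGL, the lock is held for a few instructions at a time and should not show up in profiles.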

Another limitation is that, to avoid memory corruption or errors, we need to track the resources carefully. One solution would be to allocate the OpenGL texture before using VAAPI and deallocate it after the encoder is destroyed. The other is to hook more DRM calls to get a pointer to the "drm_intel_bo" structure and increase its reference count.

https://github.com/astarasikov/sxge/blob/vaapi_recorder/apps/src/sxge/apps/demo1_cube/demo1_cube.cc#L119