I have had several occasions at work when I needed to hook a function on a Linux system. I will not go deep into technical details, but the examples provided in the links contain code which can easily be plugged into a project.
A simple case.
First let us consider a case when we only need to replace a function in our application without replacing it in dynamic libraries loaded by the application.
In Linux, functions in dynamic libraries are resolved using two tables: the PLT (procedure linkage table) and the GOT (global offset table). When you call a library function, you actually call a stub in the PLT. What the PLT stub essentially does is load the function address from the GOT. On x86_64, it does that with the "jmpq *" instruction (ff 25). Initially the GOT contains a pointer back into the PLT, which invokes the resolver, which in turn writes the correct entry to the GOT.
So, to replace a library function, we must patch its entry in the GOT. How do we find it? We can decode the offset to the GOT entry from the JMPQ instruction (the offset is relative to the address of the PLT stub plus the length of the JMPQ instruction). How do we find the PLT stub, then? It is simply the function pointer: in C, use the function name as a pointer.
We should call "dlsym" to forcibly resolve the entry in GOT before replacing it. For example, if we want to swap two dynamic functions. A complete example is available as a github gist:
https://gist.github.com/astarasikov/9547918
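To make that recipe concrete, here is a minimal sketch of the idea (my own illustration rather than the code from the gist; it assumes x86_64, a non-PIE build and a writable GOT, and helper names like hook_got_entry are made up):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Decode "jmpq *disp32(%rip)" (ff 25 xx xx xx xx) at the PLT stub and
 * return the address of the GOT slot it loads the target from. */
static void **got_entry_for(void *plt_stub)
{
    uint8_t *stub = (uint8_t *)plt_stub;
    if (stub[0] != 0xff || stub[1] != 0x25)
        return NULL; /* not a PLT stub we recognize (e.g. a PIE build) */
    int32_t disp;
    memcpy(&disp, stub + 2, sizeof(disp));
    /* RIP-relative: the displacement is counted from the next instruction,
     * which starts 6 bytes after the beginning of the stub. */
    return (void **)(stub + 6 + disp);
}

/* Overwrite the GOT slot so that all later calls go to "replacement". */
static void *hook_got_entry(void *plt_stub, void *replacement)
{
    void **slot = got_entry_for(plt_stub);
    if (!slot)
        return NULL;
    void *old = *slot;
    *slot = replacement;
    return old; /* the previous target, handy for calling the original */
}

static int fake_puts(const char *s)
{
    return printf("hooked: %s\n", s);
}

int main(void)
{
    puts("before hook"); /* also forces the GOT entry to be resolved */
    hook_got_entry((void *)puts, (void *)fake_puts);
    puts("after hook");  /* now routed to fake_puts() */
    return 0;
}

Note that binaries linked with full RELRO map the GOT read-only, so an mprotect() call is needed before writing to the slot.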
A more advanced example
The approach described above will not work when you want to replace a function that is used by some dynamically loaded binary, since that library will have its own GOT and PLT tables.
On x86_64, we can find the PLT table by examining the contents of the ELF file. The section we care about is called ".rela.plt".
We need to examine the ELF header (Ehdr), find the section header table (Shdr) and locate the "dynsym" and "rela.plt" tables. The dynsym table unsurprisingly contains the pointers to dynamically resolved functions and can be used to find a function by name. You can find the link to the code below. I've mmap'ed the library file to get a pointer to its contents and used it to read the data. Some examples floating around the internet use "malloc", "read()" and "memcpy" for the same purpose. To get the pointer to the library in the application's virtual memory space, you can call "dlopen" and dereference the pointer returned by it. You will need this address to convert the PLT offset into an actual address. Technically you can get away without reading/mmap'ing the library from disk, but since some sections are not mapped for reading, you would need to use "mprotect" to access them without receiving SIGSEGV.
https://github.com/astarasikov/sxge/blob/vaapi_recorder/apps/src/sxge/apps/demo1_cube/hook-elf.c#L126
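As a rough sketch of what that lookup boils down to (my own simplified illustration, not the linked code; it assumes a 64-bit ELF that has already been mmap'ed at "base" and loaded at "load_addr", with error handling omitted, and the function name find_got_slot is made up):

#include <elf.h>
#include <stdint.h>
#include <string.h>

/* Walk .rela.plt of the mmap'ed ELF image at "base" and return the address
 * of the GOT slot for "name" inside the loaded copy of the library, or NULL
 * if the symbol is not imported through the PLT. */
static void **find_got_slot(const void *base, uintptr_t load_addr, const char *name)
{
    const Elf64_Ehdr *eh = base;
    const Elf64_Shdr *sh = (const void *)((const char *)base + eh->e_shoff);
    const char *shstr = (const char *)base + sh[eh->e_shstrndx].sh_offset;

    const Elf64_Shdr *rela_plt = NULL, *dynsym = NULL;
    for (int i = 0; i < eh->e_shnum; i++) {
        const char *sname = shstr + sh[i].sh_name;
        if (sh[i].sh_type == SHT_RELA && !strcmp(sname, ".rela.plt"))
            rela_plt = &sh[i];
        else if (sh[i].sh_type == SHT_DYNSYM)
            dynsym = &sh[i];
    }
    if (!rela_plt || !dynsym)
        return NULL;

    const Elf64_Sym *syms = (const void *)((const char *)base + dynsym->sh_offset);
    /* sh_link of .dynsym points to the string table holding the symbol names */
    const char *strtab = (const char *)base + sh[dynsym->sh_link].sh_offset;

    const Elf64_Rela *rel = (const void *)((const char *)base + rela_plt->sh_offset);
    size_t count = rela_plt->sh_size / sizeof(*rel);
    for (size_t i = 0; i < count; i++) {
        const Elf64_Sym *sym = &syms[ELF64_R_SYM(rel[i].r_info)];
        if (!strcmp(strtab + sym->st_name, name))
            /* r_offset is the GOT slot, relative to the library load address */
            return (void **)(load_addr + rel[i].r_offset);
    }
    return NULL;
}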
Alternatives
The most straightforward alternative to such runtime patching is to use the "LD_PRELOAD" variable and specify a library with the implementation of the hooked routine. The linker will then redirect all calls to that function, in all libraries, to the preloaded library. The obvious limitation of this approach is that it may be hard to get working if you preload multiple libraries overriding the same symbol. It also breaks some tools like "apitrace" (a tool to trace and replay OpenGL calls).
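For comparison, a minimal LD_PRELOAD interposer looks roughly like this (a generic sketch, unrelated to the code above):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* Build with: gcc -shared -fPIC -o libhook.so hook.c -ldl
 * Run with:   LD_PRELOAD=./libhook.so ./your_app */
int puts(const char *s)
{
    /* RTLD_NEXT resolves to the next definition in search order, i.e. libc */
    static int (*real_puts)(const char *);
    if (!real_puts)
        real_puts = (int (*)(const char *))dlsym(RTLD_NEXT, "puts");
    real_puts("(intercepted)");
    return real_puts(s);
}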
Practical example with Mesa
Many GPUs nowadays contain encoders which can encode video to H.264 or other popular formats. Intel GPUs (namely the HD4000 and Iris Graphics series) have an encoder for H.264. Some solutions like NVIDIA Grid use these hardware capabilities to stream games over the network or to provide remote desktop facilities with low CPU load.
Unfortunately both the proprietary and the open-source drivers for Intel hardware lack the ability to export an OpenGL resource into the encoder library, which is a desirable option for implementing a screen recorder (the proprietary driver from the SDK actually implements the advanced encoding algorithm as a binary library, but uses pre-compiled versions of vaapi and libdrm for managing resources).
If we don't share a GL texture with the encoder, we cannot use the zero-copy path. The texture contents have to be read with "glReadPixels()", converted to the NV12 or YUV420 format (though that can be done by a GLSL shader before reading the data) and re-uploaded to the encoder. This is not an acceptable solution, since each frame takes more than 30ms and causes 80% CPU load, leaving no processing power for other parts of the application. A zero-copy approach, on the other hand, allows us to have smooth 60FPS performance: the OpenGL side is not blocked by "glReadPixels()" and CPU load never exceeds 10 percent.
To be more precise, resource sharing support is present if one uses the "EGL" windowing system and one of the Mesa extensions. Technically, all GPU/encoder memory is managed by the DRM framework in the Linux kernel. There is an extension which allows one to obtain a DRM handle (which is a uint32) from an OpenGL texture. It is used by Wayland in the Weston display server. An example of using VAAPI to encode from a DRM handle can be found in the Weston source code:
http://cgit.freedesktop.org/wayland/weston/tree/src/vaapi-recorder.c
Now, the problem is to find a handle of an OpenGL texture. Under EGL that can be done by creating an EGLImage from the texture handle and subsequently calling eglExportDRMImageMESA.
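The EGL sequence looks roughly like the sketch below (based on the EGL_KHR_gl_texture_2D_image and EGL_MESA_drm_image extensions; error handling is omitted and the attribute lists may need adjusting for a particular Mesa version):

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <stdint.h>

/* Obtain the DRM buffer name/handle backing an OpenGL texture. */
static void export_texture(EGLDisplay dpy, EGLContext ctx, GLuint tex)
{
    PFNEGLCREATEIMAGEKHRPROC create_image =
        (PFNEGLCREATEIMAGEKHRPROC)eglGetProcAddress("eglCreateImageKHR");
    PFNEGLEXPORTDRMIMAGEMESAPROC export_drm_image =
        (PFNEGLEXPORTDRMIMAGEMESAPROC)eglGetProcAddress("eglExportDRMImageMESA");

    /* Wrap the texture into an EGLImage... */
    EGLImageKHR img = create_image(dpy, ctx, EGL_GL_TEXTURE_2D_KHR,
                                   (EGLClientBuffer)(uintptr_t)tex, NULL);

    /* ...and ask Mesa for the underlying DRM buffer. */
    EGLint name, handle, stride;
    export_drm_image(dpy, img, &name, &handle, &stride);
    /* "handle" (a uint32) is what the VAAPI/libdrm side needs. */
}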
In my case the problem was that I didn't want to use EGL, because it is quite difficult to port a large legacy GLX codebase to EGL. Besides, GLX support is more mature and stable with the NVIDIA binary driver. So the problem is to get a DRM buffer out of a GL texture in GLX.
Unfortunately GLX does not provide an equivalent extension, and implementing one for Mesa and X11 is rather complicated due to the complexity of the code. One option would be to link against Mesa and use its internal headers to reach for the needed resources, as is done by Beignet, the open-source OpenCL driver for Intel (http://cgit.freedesktop.org/beignet/tree/src/intel/intel_dri_resource_sharing.c), but that is error-prone. Besides, setting up such a build environment is complicated.
Luckily, as we've seen above, the memory is managed by DRM, which means every memory allocation is actually an IOCTL call to the kernel. That means we can hook the ioctl, steal the DRM handle and later use it as the input to VAAPI. We need to look for the DRM_I915_GEM_SET_TILING ioctl code.
The obvious limitation is that you need to store a global reference which is written by the ioctl hook, which makes the code thread-unsafe. Luckily most OpenGL apps are single-threaded, and even when they're not, there are fewer than a dozen threads which can access OpenGL resources, so lock contention is not an issue and a pthread mutex can be used.
Another limitation is that to avoid memory corruption or errors, we need to carefully track the resources. One solution would be to allocate the OpenGL texture prior to using VAAPI and deallocate it after the encoder is destroyed. The other is to hook more DRM calls to get a pointer to the "drm_intel_bo" structure and increase its reference count.
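The hook handler itself boils down to something like the following sketch (the structure and the ioctl code come from the kernel's i915_drm.h; the global variable guarded by a mutex is the compromise described above, and the real implementation is linked below):

#include <sys/ioctl.h>
#include <pthread.h>
#include <stdarg.h>
#include <stdint.h>
#include <drm/i915_drm.h> /* may be <libdrm/i915_drm.h> depending on the setup */

static uint32_t last_bo_handle; /* GEM handle of the most recent allocation */
static pthread_mutex_t bo_lock = PTHREAD_MUTEX_INITIALIZER;
static int (*real_ioctl)(int fd, unsigned long request, ...); /* set up by the GOT patching code */

int hooked_ioctl(int fd, unsigned long request, ...)
{
    va_list ap;
    va_start(ap, request);
    void *arg = va_arg(ap, void *);
    va_end(ap);

    if (request == DRM_IOCTL_I915_GEM_SET_TILING) {
        struct drm_i915_gem_set_tiling *t = arg;
        pthread_mutex_lock(&bo_lock);
        last_bo_handle = t->handle; /* steal the handle to hand over to VAAPI */
        pthread_mutex_unlock(&bo_lock);
    }
    return real_ioctl(fd, request, arg);
}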
https://github.com/astarasikov/sxge/blob/vaapi_recorder/apps/src/sxge/apps/demo1_cube/demo1_cube.cc#L119
Friday, October 24, 2014
Sunday, September 7, 2014
GSoC results or how I nearly wasted summer
This year I got a chance to participate in the Google Summer of Code with the FreeBSD project.
I have long wanted to learn about the FreeBSD kernel internals and port it to some ARM board but was getting sidetracked whenever I tried to do it. So it looked like a perfect chance to stop procrastinating.
Feel free to scroll down to the "other stuff" paragraph if you don't feel like reading three paragraphs of the typical whining associated with debugging C code.
Initially I wanted to bring up some real piece of hardware. My choice was the Nexus 5 phone, which has a Qualcomm SoC. However, some of the FreeBSD developers suggested that I instead port the kernel to the Android Emulator. It sounded like a sane idea, since the Android Emulator is available for all major OSes and is easy to set up, which would allow exposing more people to the project. Besides, unlike QEMU, it can emulate some peripherals specific to a mobile device, such as sensors.
What is the Android Emulator? It is an emulator based on QEMU which emulates a virtual board called "Goldfish" that includes a set of devices such as an interrupt controller, a timer and input devices. It is worth noting that the Android Emulator is based on an ancient version of QEMU, though starting with Android L one can obtain binaries of the emulator from git which are based on a more recent build.
I started by implementing the basics: the interrupt controller, the timer and the UART driver. That allowed me to see the boot console, but then I got stuck for literally months with the kernel crashing at various points in what seemed like a totally random fashion. I spent nights single-stepping the kernel, my eyes slowly going from merely red to emitting hard radiation. It was definitely a problem with caching, but ultimately I gave up trying to fix it. Since I knew that the FreeBSD kernel worked fine on ARM in a recent version of QEMU, it was clearly a bug in the Android Emulator, even though it had not manifested itself with Linux. Since the Android Emulator is going to eventually get updated to a new QEMU base and I was running out of time, I decided to just use a nightly build of the emulator instead of the one coming with the Android SDK, and that fixed this particular issue. But hey! At least I've learnt about FreeBSD virtual memory management and the UMA allocator the hard way.
Next up was fighting the timer and random freezes while initializing the random number generator (sic!). That was easy: it turned out I had simply forgotten to call the function. Anyway, it would make sense to move that call out of the board files and call it for every board that does not define the "NO_EVENTTIMERS" option in the kernel config.
Fast forward to the start of August, when there were only a couple of days left and I still didn't have a working block device driver to boot the rootfs! Writing an MMC driver turned out to be very easy, and I started celebrating when I got the SD card image from the Raspberry Pi booting in the Android Emulator and showing the login prompt.
It worked, but with a huge kludge in the OpenFirmware bindings. Somehow one of the functions which should have returned the driver bus name returned (seemingly) random crap, so instead of a NULL pointer check I had to add a check that the address points into the kernel VM range. I then tracked down the real source of the problem.
I was basing my MMC driver on the LPC SoC driver. The LPC driver itself is suspicious: for example, the place where DMA memory is allocated says "allocating Framebuffer memory", which may be an indicator that it was written in a hurry and possibly only barely tested. In the MMC bus attach routine, it was calling the "device_set_ivars" function and setting the ivars pointer to the mmc host structure. I have observed a similar pattern in some other drivers. In the OpenFirmware code, the ivars structure was dereferenced as a completely different structure, and a string was taken from a member of that structure.
How the hell did this happen in a world where people are using a recent Clang compiler and compile with "-Wall" and "-Werror"? Uhh, well, if you're dealing with some kind of OOP in plain C, casting to a void pointer and back and other ill-typed wicked tricks are inevitable. Just look at this prototype and cry with me:
>> device_set_ivars(device_t dev, void *ivar);
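To see why -Wall and -Werror are powerless here, consider this contrived sketch (not the actual FreeBSD code, just the same pattern with made-up stand-in types):

#include <stdio.h>

typedef struct device *device_t;
struct device { void *ivars; };

struct mmc_host { int f_max; };             /* what the MMC driver stores...   */
struct ofw_bus_devinfo { char *obd_name; }; /* ...and what the OFW code expects */

static void device_set_ivars(device_t dev, void *ivar) { dev->ivars = ivar; }
static void *device_get_ivars(device_t dev) { return dev->ivars; }

int main(void)
{
    struct device dev = { 0 };
    struct mmc_host host = { 52000000 };

    device_set_ivars(&dev, &host); /* driver attach: perfectly "type-safe" */

    /* elsewhere: the void pointer silently comes back as the wrong struct */
    struct ofw_bus_devinfo *di = device_get_ivars(&dev);
    printf("%s\n", di->obd_name); /* garbage dereferenced, not a single warning */
    return 0;
}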
For now, I'm pushing the changes to the following branch. Please note that I'm rebasing against master and force-pushing in the process, in case you're willing to test.
https://github.com/astarasikov/freebsd/tree/android_goldfish_arm_master
So, what has been achieved and what am I still going to do? Well, debugging the MMU issues and the OpenFirmware bug delayed me substantially, so I have not done everything I planned. Still:
- Basic hardware works in Goldfish: Timer, IRQs, Ethernet, MMC, UART
- It is using NEWCONS for Framebuffer Console (which is cool!)
- I've verified that Android Emulator works in FreeBSD with linuxulator
- I have fixed Android Emulator to compile natively on FreeBSD and it mostly works (at least, console output) except graphics (which I think should be an easy fix. there are a lot of places where "ifdef __linux__" is completely unjustified and should rather be enabled for all unices)
You can use the Linux build of the Android Emulator via the Linuxulator. Unfortunately, the Linuxulator only supports 32-bit binaries, so we cannot run the nightly binaries which have the MMU issues fixed.
Therefore, I have fixed the emulator to compile natively on FreeBSD! I've pushed the branch named "l-preview-freebsd" to github. You will also need gtest. I cloned the repo from CyanogenMod and pushed a tag named "l-preview-freebsd", which is actually "cm-11.0".
Compiling the emulator:
>> git clone https://github.com/astarasikov/qemu.git
>> git clone https://github.com/astarasikov/android_external_gtest
>> cd android_external_gtest
>> git checkout cm-11.0
>> cd ../qemu
>> git checkout l-preview-freebsd
>> bash android-build-freebsd.sh
Please refer to the README file in git for detailed instructions on building the kernel.
The only thing that really bothers me about low-level and embedded stuff is that it is extremely difficult to estimate how long it may take to implement a certain feature, because every time you end up debugging obscure stuff, and the "useful stuff you learn or do" / "time wasted" ratio is very small. On the bright side, while the BSDs lag a bit behind Linux in terms of device drivers and performance optimizations, reading and debugging *BSD code is much easier than GNU projects like EGLIBC, GCC and GLib.
Other stuff.
Besides GSoC, at the start of summer I had a chance to work with Chris Wade (who is an amazing hacker, by the way). We started working on an ARM virtualization project and spent nice days debugging caching issues and instruction decoding while trying to emulate a particular Cortex-A8 SoC on an A15 chip. Unfortunately, working on GSoC, going to a daily job and doing this project simultaneously turned out to be surprisingly difficult, and I had to quit at least one of the activities. Still, I wish Chris good luck with this project, and if you're interested in virtualization and iPhones, sign up for an early demo at virtu.al
Meanwhile I'm planning to learn the ARMv8 ISA. It's a pity there is no hackable hardware available at reasonable prices yet. QEMU works fine with VirtIO peripherals, though. But I'm getting increasingly worried about devkits costing more and more, which essentially makes it harder for a novice to become an embedded software engineer.
Friday, June 6, 2014
On Free Software and if proprietary software should be considered harmful
I communicate with a lot of people on the internet and they have various opinions on FOSS ranging from "only proprietary software written by a renowned company can deliver quality" to "if you're using binary blobs, you're a dick". Since these issues arise very often in discussions, I think I need to write it up so I can just shove the link next time.
On the one hand, I am a strong proponent of free and open-source software and I have contributed to several projects related to porting Linux to embedded hardware, especially mobile phones (htc-linux.org, Replicant and I also consulted some of the FreeSmartPhone.org people). Here are the reasons I like free software (as well as free hardware and free knowledge):
- The most important one for me is that you can learn a lot. I mostly learnt C, and subsequently other languages, by hacking on the Linux kernel and following interesting projects done by fellow developers
- Practically speaking, it is just easier to maintain and develop software when you have the source code
- When you want to share a piece of data or an app with someone, if you deal with closed software, you force them into buying a paid app which may compromise their security
- You can freely distribute your contributions, your cool hacks and research results without being afraid of pursuit by some patent troll
- "Oh, it may threaten your privacy, how can you run untrusted code"? My opinion here is that running untrusted firmwares or drivers for devices is not a big deal because unless you manufacture the SoC and all the peripherals yourself, you can not be sure of what code is running in your system. For example, most X86 CPUs have SMM mode with a proprietary hypervisor, most ARMs have TrustZone mode. If your peripheral does not require you to load the firmware, it just means that the firmware is stored in some non-volatile memory chip in hardware, and you don't have the chance to disable the device by providing it with a null or fake binary. On the other hand, if your device uses some bus like I2C or USB which does not have DMA capabilities or uses IOMMU to restrict DMA access, you are relatively safe even if it runs malicious code.
- "You can examine open source software and find backdoors in it". Unfortunately this is a huge fallacy. First of all, some minor errors which can lead to huge vulnerabilities can go unnoticed for years. Recent discoveries in OpenSSL and GnuTLS came as a surprise to many of us. And then again, have you ever tried to develop a piece of legacy code with dozens of nested ifdefs when you have no clue which of them get enabled in the final build? In this case, analyzing disassembled binaries may even be easier.
- "By developing or using non-free software you support it". In the long run it would be better for humanity to have all knowledge to be freely accessible. However, if we consider a single person, their priorities may differ. For example, until basic needs (which are food, housing, safety) are satisfied, one may resort to developing close-sourced software for money. I don't think it's bad. For me, the motivation is knowledge. In the end, even if you develop stuff under an NDA, you understand how it works and can find a way to implement a free analog. This is actually the same reason I think using proprietary software is not bad in itself. For example, how could you expect someone to write a good piece of free software - an OpenGL driver, an OS kernel, a CAD until they get deeply familiar with existing solutions and their limitations?
In the end, it boils down to individual people and communities. Even proprietary platforms like Palm, Windows Mobile or Apple iOS had huge communities of helpful people who were ready to develop software, reverse-engineer hardware and of course help novices. And there are quite some douchebags around free software. Ultimately, just find the people you feel comfortable around, it is all about trust.
Minor notes about recent stuff
I've been quite busy with work and university recently, so I have not had much time to work on my projects or write rants. Hence this short post discussing several unrelated ideas.
On deprecating Linux (not really)
Recently several Russian IT bloggers have been writing down their ideas about what's broken with Linux and how it should be improved. Basically, one guy started by saying we need to throw away POSIX and write everything from scratch and the other two are trying to find a compromise. You may fire up Google Translate and follow the discussion at http://lionet.livejournal.com/131533.html
I personally think that these discussions are a bit biased because all these guys are writing from the point of view of an engineer working on distributed web systems. At the same time, there are domains like embedded software, computer graphics, high-performance computations which have other challenges. What's common is that people are realizing the obvious idea: generic solutions designed to work for everyone (like POSIX) limit the performance and flexibility of your system, while domain-specific solutions make your job easier but they may not fit well into what other people are doing.
Generally, both in high-performance networking and in graphics, people are trying to do the same thing: reduce the number of context switches and move all execution into userspace. (It is true that on modern x86 a context switch is relatively cheap and we could use a segmented memory model like in Windows CE 5.0 instead of the common user/kernel split, but the real problem is not the CPU execution mode switch: it is that system calls like "flush" and "ioctl", and library calls like "glFlush()", are used as synchronization points where a lot of additional work is done besides the call itself.) Examples include asynchronous network servers using epoll and cooperative scheduling, Intel's userland networking stacks (Netmap, DPDK) and modern graphics APIs (Microsoft DirectX 12, AMD Mantle, Apple Metal). The cost of maintaining abstractions - managing buffers and contexts in drivers - has risen so high that hardware vendors neither want to write complex drivers nor can they deliver performance that way. So, we'll have to step back and learn to use hardware wisely once again.
Actually I like the idea of using minimalistic runtimes on top of hypervisors, like Erlang on Xen, from the point of view of simplicity. Still, security and access control are open questions. For example, capability-based security, as in L4, has always looked interesting, but whether the promises of cheap system calls and a dramatically reduced "trusted code base" have been fulfilled is very arguable. Then again, despite the complexity of Linux, its security is improving thanks to control groups, which are heavily utilized by Docker and systemd distros. Another issue is that lightweight specific solutions are rarely reusable. Well, from the economic point of view a cheap solution that's done overnight and does its job is just perfect, but the amount of NIH, and of engineers writing basically the same stuff - drivers, applications, libraries and servers - in an absolutely identical way but for a dozen identical OSes, is amazing and rather uncreative.
Anyway, it looks like we're there again: the rise of Worse is Better.
Work (vaapi, nix).
At work, among other things, I was asked to figure out how to use the Intel GPU for H.264 video encoding. It turns out there are two alternatives: the open-source VAAPI library and the proprietary Intel Media SDK. Actually, the latter still uses a modified version of VAAPI, but I feel that fitting it into our usual deployment routine is going to be hard, because even basic critical components of the driver stack, such as the kernel KMS module and libdrm.so, are provided in binary form.
VAAPI is actually very low-level. One thing that puzzled me initially is that it does not generate the H.264 bitstream headers; you have to pack them yourself and feed them into the encoder via a special buffer type. Besides, you have to manually take the reconstructed picture and feed it as a reference for subsequent frames. There are several implementations using this API for encoding: gstreamer, ffmpeg, vlc. I spent quite some time until I got it to encode a sample chunk of YUV data. Ultimately my code ended up looking identical to the "avcenc.c" sample from libva, except that the encoder data is stored in a "context" structure instead of global variables.
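The per-frame submission roughly follows the pattern below (a heavily abbreviated sketch of the libva calls; the setup, the actual parameter structures and the packed header buffers are all omitted, so treat avcenc.c as the real reference):

#include <va/va.h>

/* Submit one frame to the encoder context. "params" stands for a filled-in
 * VAEncPictureParameterBuffer (sequence, slice and packed header buffers are
 * created and rendered the same way). */
static void encode_frame(VADisplay dpy, VAContextID ctx, VASurfaceID input,
                         void *params, unsigned params_size)
{
    VABufferID buf;

    /* Each parameter blob is wrapped into a VA buffer... */
    vaCreateBuffer(dpy, ctx, VAEncPictureParameterBufferType,
                   params_size, 1, params, &buf);

    /* ...and rendered into the context between Begin/EndPicture. */
    vaBeginPicture(dpy, ctx, input);
    vaRenderPicture(dpy, ctx, &buf, 1);
    vaEndPicture(dpy, ctx);

    /* Wait for the hardware; the coded buffer can then be mapped with
     * vaMapBuffer() to retrieve the compressed data (not shown). */
    vaSyncSurface(dpy, input);
}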
My advice: if you want to learn about video decoding APIs on Linux and Android, take a look at the Chromium source code (you may expect to find a solution to any relevant computer science or engineering problem there, given how much code it contains). Also take a look at GStreamer, FFmpeg and VLC. Notice how each of them has its own way of managing buffers (and poor error handling, by the way).
Another thing we're using at work is the Nix package manager. I had always wanted to try it but did not really do so until coming to work at this place. Nix is a fantastic concept (even +Norman Feske got inspired by it for their Genode OS). However, we've noticed a slowdown when compiling software under Nix. Here are some of my observations:
- Compiling a simple C file with just a "main" function takes 30ms on Linux but >300ms in a Nix chroot. Nix installs each library to a separate prefix and uses LDFLAGS and CFLAGS to direct the compiler to use them. Gcc then iterates over each of these directories trying to find each library, which ends up being slow. Does anyone know a workaround?
- Running gcc under the "perf" profiler shows that it spends most of its time in multibyte string collation functions. I find it ridiculous that exporting "LC_ALL=C" magically makes the compilation time fall down to 100ms.
Hacking on FreeBSD
As a part of my GSoC project I got the FreeBSD kernel booting on the Android emulator, and I just have to write the virtual ethernet and MMC drivers. Unfortunately my university has distracted me a lot from this project, but now that I have time I hope to finish it by the midterm. And then I'll probably port FreeBSD to the Qualcomm MSM8974 SoC. Yay, the red Nexus 5 is an awesome phone!
My little project (hacking is magic)
A long time ago I decided to give Windows Mobile emulation a shot and got the kernel to start booting in QEMU. Since Microsoft's Device Emulator was open-source and emulated a real Samsung S3C2410 SoC, it was easy. I still plan to get it fully working one day.
But QEMU is a boring target, actually. What's more interesting is developing a bare-metal hypervisor for A15 and A7 CPUs. Given the progress made by Rob Clark on reverse-engineering the Qualcomm Adreno GPU driver, I think it would be doable with reasonable effort to emulate the GPU, and consequently enough hardware to run Windows Mobile and Windows RT. A very interesting thing is that ARM traps almost all coprocessor registers for guest accesses (privilege levels 0 and 1), meaning you can fake any CPU ID or change memory access behaviour by modifying caching and buffering settings.
What is really interesting is that there are a lot of Chinese phones which copy Nokia smartphones, iPhones and Android phones. Recent revisions of Mediatek MTK SoCs are ARM A7, meaning they support virtualization. Ultimately it means someone could make a high-quality replica of any phone and virtualize any SoC without a trace, which has interesting security implications.
Software sucks!
The other day some systemd update came out which totally broke my Debian bootup. Now there's a race condition between the encrypted disk password prompt and the root password prompt for "recovery". Then, the latest kernel (3.15-rc3) oopses and crashes randomly on ACPI events, which subsequently breaks networking and spawning new processes.
Ultimately, after a day of struggle, my rootfs broke, and after an fsck everything was gone. So now I'm deciding which distro I should try instead. Maybe Arch Linux? Either way, I have to build a lot of software from source and install it to /usr/local or custom prefixes for development.
The easiest way would be to install into a VM in OS X. But then, I want to fix issues with drivers, especially GPU switching in the MacBook. On the other hand, I spent ages fixing switchable graphics on an older Sony VAIO laptop and resorted to an ugly hack to force the Intel GPU. Since GPU switching is not working properly in Linux, maybe I should write a graphics multiplexer driver for FreeBSD and ditch Linux altogether? FreeBSD looks much more attractive these days, with clang and no systemd and no FreeDesktop.org and no Lennart Poettering.
Tuesday, March 18, 2014
A story about clang tools
I've always wanted to try writing a toy compiler, but I have not made myself actually learn the theory of parsing (I plan to do that and post some notes to the blog soon, though). However, recently I've been playing with Clang and LLVM. I have not yet used them for compiling, but I want to share my experience of using and extending Clang's error detection tools.
LLVM is a specification of platform-independent bytecode and a set of tools aimed at making the development of JITed interpreters and portable native binaries easier. A significant portion of the work on LLVM was done by Apple, and nowadays it is widely used in industry. For example, NVIDIA uses it for compiling CUDA code, and AMD uses it to generate shaders in its open-source driver. Clang is a parser and a compiler for a set of C-like languages, which includes C, C++ and Objective-C. This compiler has several properties that may be interesting for developers:
- It allows you to traverse the parsed AST and transform it. For example, add or remove the curly brackets around if-else conditionals to troll your colleagues.
- It allows you to define custom annotations via the __attribute__ extension which again can be intercepted after the AST is generated but is not yet compiled.
- It supports nearly all the features of all revisions of C and C++ and is compatible with the majority of GCC compiler options, which allows using it as a drop-in replacement. By the way, FreeBSD has switched to Clang, and on Apple OS X gcc is actually a symlink to clang!
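As a taste of the first point, the C API (libclang) makes walking the AST almost trivial; the minimal sketch below just prints every function declaration in a file:

#include <clang-c/Index.h>
#include <stdio.h>

static enum CXChildVisitResult visit(CXCursor c, CXCursor parent, CXClientData data)
{
    if (clang_getCursorKind(c) == CXCursor_FunctionDecl) {
        CXString name = clang_getCursorSpelling(c);
        printf("function: %s\n", clang_getCString(name));
        clang_disposeString(name);
    }
    return CXChildVisit_Recurse; /* keep descending into the tree */
}

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    CXIndex idx = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        idx, argv[1], NULL, 0, NULL, 0, CXTranslationUnit_None);
    clang_visitChildren(clang_getTranslationUnitCursor(tu), visit, NULL);
    clang_disposeTranslationUnit(tu);
    clang_disposeIndex(idx);
    return 0;
}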
LLVM provides several frameworks for finding bugs at runtime, for example AddressSanitizer and MemorySanitizer, which catch accesses to uninitialized or unallocated memory.
I was given the following interesting problem at work: build a solution that would allow detecting where an application is leaking memory. Sounds like a common problem with no satisfying answer:
- Using Valgrind is prohibitively slow: a program running under it can easily be 50 times slower than without it.
- Using named SLAB areas (like the Linux kernel does) is not an option. First off, in the worst case using SLAB means only half of the memory is available for allocation. Secondly, such an approach tells you objects of which class are occupying the memory, but not where and why they were allocated
- Using TCMalloc, which hooks malloc/free calls, also turned out to be slow enough to cause different behaviour in the release and debugging environments, so some lightweight solution had to be designed.
Anyway, while thinking of a good way to do it, I found out that Clang 3.4 has something called LeakSanitizer (also lsan and liblsan), which has already been ported to GCC 4.9. In short, it is a lightweight version of the tcmalloc code used in Google Perftools. It collects information about memory allocations and prints leak locations when the application exits. It can use the LLVM symbolizer or GCC libbacktrace to print human-readable locations instead of addresses. However, it has some issues:
- It has an explicit check in the __lsan::DoLeakCheck() function which disallows calling it twice. Therefore, we cannot use it to print leaks at runtime without shutting down the process
- Leak detection cannot be turned off when it is not needed. Hooked malloc/memalign functions are always used, and the __lsan_enable/disable function pair only controls whether statistics should be ignored or not.
The first idea was to patch the PLT/GOT tables in ELF to dynamically choose between the functions from libc and lsan. It is a very dirty approach, though it will work. You can find a code example at https://gist.github.com/astarasikov/9547918.
However, by patching the GOT we only divert the functions for a single binary, and we'd have to patch the GOT for each loaded shared library, which is, well, boring. So I decided to patch liblsan instead. I had to patch it either way, to remove the dreaded limitation in DoLeakCheck. I figured it should be safe to do. Though there is a potential deadlock while parsing the ELF header (as indicated by a comment in the lsan source), you can work around it by disabling leak checking for global variables.
What I did was set up a number of function pointers to the hooked functions, initialized with the lsan wrappers (to avoid false positives for memory allocations during libc constructors), and add two functions, __lsan_enable_interceptors and __lsan_disable_interceptors, to switch between the libc and lsan implementations. This should allow using leak detection both for our code and for third-party loadable shared libraries. Since lsan does not have extra dependencies on clang/gcc, it was enough to add a new CMakeLists.txt, and it can now be built standalone. So now one can load the library with LD_PRELOAD and query the new functions with "dlsym". If they're present, it is possible to selectively enable/disable leak detection; if not, the application is probably using vanilla lsan from clang.
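Conceptually, the dispatch looks like the snippet below (a simplified illustration of the idea rather than the actual liblsan patch; lsan_malloc/lsan_free here are made-up stand-ins that only do placeholder accounting):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

static void *(*libc_malloc)(size_t);
static void (*libc_free)(void *);

static void resolve_libc(void)
{
    /* the real interceptors resolve these once, early, to avoid re-entering
     * the allocator from inside dlsym */
    if (!libc_malloc) {
        libc_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
        libc_free = (void (*)(void *))dlsym(RTLD_NEXT, "free");
    }
}

/* Stand-ins for liblsan's own hooked allocators. */
static size_t tracked_bytes;
static void *lsan_malloc(size_t size)
{
    resolve_libc();
    tracked_bytes += size; /* placeholder for the real leak accounting */
    return libc_malloc(size);
}
static void lsan_free(void *ptr)
{
    resolve_libc();
    libc_free(ptr);
}

/* Dispatch pointers start at the lsan wrappers so that allocations made from
 * libc constructors are tracked and do not turn into false positives later. */
static void *(*malloc_impl)(size_t) = lsan_malloc;
static void (*free_impl)(void *) = lsan_free;

void __lsan_enable_interceptors(void)
{
    malloc_impl = lsan_malloc;
    free_impl = lsan_free;
}

void __lsan_disable_interceptors(void)
{
    resolve_libc();
    malloc_impl = libc_malloc; /* bypass leak accounting entirely */
    free_impl = libc_free;
}

/* The exported malloc/free simply forward to whatever is currently selected. */
void *malloc(size_t size) { return malloc_impl(size); }
void free(void *ptr) { free_impl(ptr); }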
There are some issues, though:
- LSAN may have some considerable memory overhead. It looks like it doesn't make much sense to disable leak checking since the memory consumed by LSAN won't be reclaimed until the process exits. On the other hand, we can disable leak detection at application startup and only enable it when we need to trace a leak (for example, an app has been running continuously for a long time, and we don't want to stop it to relaunch in a debug configuration).
- We need to ensure that calling a non-hooked free() on a hooked malloc() and vice-versa does not lead to memory corruption. This needs to be looked into, but it seems that both lsan and libc just print a warning in that case, and corruption does not happen (but a memory leak does, therefore it is impractical to repeatedly turn leak detection on and off)
We plan to release the patched library once we perform some evaluation and understand whether it is a viable approach.
Some ideas worth looking into may be:
- Add a module to annotate and type-check inline assembly. Would be good for Linux kernel
- Add a module to trace all pointer puns. For example, in the Linux kernel and many other pieces of C code, casting to a void pointer and using the container_of macro is often used to emulate OOP. Using clang, one could check the types when some data is registered during initialization, cast to void and then used in some other function and cast back, or even generate the intermediate code programmatically.
- Automatically replace shared variables/function pointer calls with IPC messages. That is interesting if one would like to experiment with porting Linux code to other systems or turning Linux into a microkernel
Monday, January 27, 2014
porting XNU to ARM (OMAP5 CPU)
Two weeks ago I took some time to look at the XNU port to the ARM architecture done by the developer who goes by the handle "winocm". Today I've decided to summarize my experience.
Here is a brief checklist if you want to start porting XNU to your board:
- Start from reading Wiki https://github.com/darwin-on-arm/wiki/wiki/Building-a-bootable-system
- Clone the DeviceTrees repository: https://github.com/darwin-on-arm/DeviceTrees . Basically, you can use slightly modified DTS files from Linux, but because the DTC compiler is unfinished, you'll have to rename and comment out some entries. For example, macros in included C headers are not expanded, so you'll have to hardcode values for stuff like the A15 VGIC IRQs
- Get image3maker which is a tool to make images supported both by GenericBooter and Apple's iBoot bootloaders https://github.com/darwin-on-arm/image3maker
- Use the DTC from https://github.com/winocm/dtc-AppleDeviceTree to compile the abovementioned DTS files
- Take a look at init/main.c . You may need to add a hackish entry the way it's done for the "HD2" board to limit the amount of RAM available.
- I have built all the tools on OS X, but then found out that it's easier to use the prebuilt linux chroot image available at: https://github.com/stqism/xnu-chroot-x86_64
The most undocumented step is actually using the image3maker tool and building bootable images. You have to put the generated images into the "images" directory in the GenericBooter-next source. As for the ramdisk, you may find some on github or unpack the iPhone firmware, but I simply created an empty file, which is OK since I have not got that far in booting yet.
Building GenericBooter-next is straightforward, but you need to export the path to the toolchain, and possibly edit the Makefile to point to the correct CROSS_COMPILE prefix:
rm images/Mach.*
../image3maker/image3maker -f ../xnu/BUILD/obj/DEBUG_ARM_OMAP5432_UEVM/mach_kernel -t krnl -o images/Mach.img3
make clean
make
For the ramdisk, you should use the "rdsk" type, and "dtre" for the Device Tree (apparently there's also xmdt for an XML device tree, but I did not try that).
Booting on the OMAP5 via fastboot (without u-boot):
./usbboot -f &
fastboot -c "-no-cache serial=0 -v -x cpus=1 maxmem=256 mem=256M console=ttyO2" -b 0x83ff8040 boot vmlinux.raw
Some notes about the current organization of the XNU kernel and what can be improved:
- All makefiles containing board identifiers should be sorted alphabetically, just for convenience when adding newer boards.
- Look into memory size limit. I had to limit the RAM to 256Mb in order to boot. If one enables the whole 2Gb available on the OMAP5432 uEVM board, the kernel fails to "steal pages" during the initialization (as can be seen in the screenshot).
- I have encountered an issue
OMAP5 is an ARM Cortex-A15 core. Currently the XNU port only has support for the A9 CPU, but if we're booting without SMP and L2 cache, the differences between these architectures are not a big problem. OMAP5 has a generic ARM GIC (Generic Interrupt Controller), which is actually compatible with the GIC in the MSM8xxx CPUs, namely with the APQ8060 in the HP TouchPad, support for which was added by winocm. The UART is a generic 16550, compatible with the one in the OMAP3 chip. Given all this, I have managed to get the kernel basically booting and printing some messages to the UART.
Unfortunately, I have not managed to bring up the timer yet. The reason is that I was booting the kernel via USB directly from the OMAP5 internal loader, skipping u-boot and hardware initialization. Somehow the internal eMMC chip in my board does not allow me to overwrite the u-boot partition (although I did manage to do it once, when I received the board).
I plan to look into it once again, now with the hardware pre-initialized by u-boot, and write a detailed post. Looks like the TODO is the following:
- Mirror linux rootfs (relying on 3rd party github repos is dangerous)
- Bringing up OMAP5 Timer
- Building my own RAMDisk
- Enabling eMMC support
- Playing with IOKit