Sunday, November 8, 2015

Porting legacy RTOS software to a custom OS.

So lately I've been working on porting a large application from a well-known RTOS to a custom kernel. The RTOS is VxWorks and the application is a router firmware. Unfortunately I will not be able to share any code of the porting layer; I'll just briefly go over the problems I've encountered and some thoughts I've come up with in the process.

The VxWorks compiler ("diab"), like many other proprietary C compilers, is based on the Edison Design Group (EDG) frontend. Unlike GCC or Clang, by default it is quite lax about following the standards. Besides, the original firmware project uses the VxWorks Workbench IDE (which is of course a fork of Eclipse) and builds on Windows.

The first thing I had to do was convert the project build system to Linux. The ideal way would be to write a parser that walks the original build system and produces a Makefile. The advantage of that approach is the ability to instantly obtain a fresh Makefile whenever someone changes the original project. However, during the prototyping stage it makes a lot of sense to take shortcuts wherever possible, so I ended up writing a tiny wrapper program to replace the compiler and linker binaries: it intercepts its arguments, writes them out to a log file and acts as a proxy launching the original program. After that it was just a matter of making minor changes to the log file to convert it into a bash script and later into a Makefile.
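A minimal sketch of such a wrapper (the name "dcc.real" and the log path are assumptions; the real compiler binary is assumed to have been renamed out of the way):

>> /* cc-wrap.c - stand-in for the compiler: log the arguments, then
>>  * exec the real binary. A sketch; "dcc.real" and the log path are
>>  * made up, and append mode is enough for a mostly-serial build. */
>> #include <stdio.h>
>> #include <unistd.h>
>>
>> int main(int argc, char **argv)
>> {
>>     FILE *log = fopen("/tmp/build.log", "a");
>>     if (log) {
>>         for (int i = 0; i < argc; i++)
>>             fprintf(log, "%s ", argv[i]);
>>         fputc('\n', log);
>>         fclose(log);
>>     }
>>     argv[0] = "dcc.real";      /* the renamed original compiler */
>>     execvp(argv[0], argv);     /* replaces this process on success */
>>     perror("execvp");
>>     return 127;
>> }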

Because development was done on Windows, the majority of files share the same set of defects that prevent them from being compiled by a Linux-based toolchain: paths in "#include" directives often have incorrect case, backslashes instead of forward slashes, and so on. Worse, some "#include" directives included multiple files in one set of brackets. Luckily, this was easy to fix with a script that parsed the compilation log for "file not found" errors, looked for the corresponding file ignoring case, and fixed up the source code. In the end I was left with about a dozen places that had to be fixed manually.

Implementing the compatibility layer.

I did some quick research into the available options and found that the only up-to-date solution implementing the VxWorks API on other OSes is "Xenomai". However, it is quite intrusive: it relies on a loadable kernel module and some patches to the Linux kernel to function. Since we were not interested in realtime behaviour but wanted to run on both our OS and Linux, entirely in userspace, I decided to write yet another VxWorks emulation layer.
The original firmware comes as a single ELF file, which is reasonable because in VxWorks all processes are implemented as threads in a shared address space. Besides, VxWorks provides a POSIX-compatible API for developers. So, to identify which functions needed implementation, it was enough to try linking the compiled code into a single executable and collect the unresolved symbols.

"One weird trick" useful for creating porting layers and DDEKits is the GCC/clang option "include" which allows you to prepend an include to absolutely all files compiled. This is super useful. You can use such an uber-header to place the definitions of the data types and function prototypes for the target platform. Besides, you can use it to hook library calls at compile-time.

One of the problems that took a great amount of time was implementing synchronization primitives, namely mutexes and semaphores. The tricky semantic difference between semaphores and mutexes in VxWorks is that the latter are recursive: once a thread has acquired a mutex, it is allowed to lock it any number of times, as long as the lock/unlock count is balanced.
Before I realized this semantic difference, I couldn't figure out why the software would always lock up, while disabling locking altogether led to totally random crashes.

Ultimately I became frustrated and ended up with a simple implementation of a recursive mutex that allowed me to move much further (Simple Recursive Mutex Implementation @ github). Later, for debugging purposes, I also added options to print a backtrace identifying the previous lock owner when a thread tries to enter a critical section, or when the spinlock takes too many attempts.
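The gist of it is tracking the owning thread and a lock depth. A minimal pthreads-based sketch (not the actual code from the repository, which is spinlock-based; pthreads can also do this natively with PTHREAD_MUTEX_RECURSIVE):

>> #include <pthread.h>
>>
>> typedef struct {
>>     pthread_mutex_t lock;
>>     pthread_t       owner;
>>     int             depth;     /* 0 == free */
>> } rmutex_t;
>>
>> void rmutex_lock(rmutex_t *m)
>> {
>>     /* Only the owner can observe depth > 0 with owner == self,
>>      * so re-entry never touches the underlying lock. */
>>     if (m->depth > 0 && pthread_equal(m->owner, pthread_self())) {
>>         m->depth++;
>>         return;
>>     }
>>     pthread_mutex_lock(&m->lock);
>>     m->owner = pthread_self();
>>     m->depth = 1;
>> }
>>
>> void rmutex_unlock(rmutex_t *m)
>> {
>>     if (--m->depth == 0)
>>         pthread_mutex_unlock(&m->lock);
>> }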

Hunting for code defects

Uninitialized variables and what we can do about them

One problem I came across was that the code had a lot of uninitialized variables, hundreds of them. On the one hand, the proper solution is to inspect each case manually and write the correct initializer. On the other hand, the code works when compiled with the diab compiler, so that compiler must be zero-initializing them.
So I went ahead and wrote a clang rewriter plugin to add the initializers: zero for the primitive types and a pair of curly braces for structs (clang rewriter to add initializers). However, I realized that the biggest problem is that some functions use the convention of returning zero to indicate failure while others return zero for success. This means we cannot have one generic safe initializer that would make the code take the "fault" path when it reaches rewritten code. An alternative to manual inspection could be writing sophisticated rules for the rewriter to detect the convention used.
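A made-up illustration of why a blanket zero initializer cannot be safe:

>> /* Hypothetical example. After the rewriter inserts "= 0", the first
>>  * function fails safely on a path that forgets to set ret, while
>>  * the second silently reports success on the same kind of path. */
>> int parse_packet(void) { int ret = 0; /* ... */ return ret; } /* 0 == error   */
>> int send_packet(void)  { int ret = 0; /* ... */ return ret; } /* 0 == success */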

I ended up using Valgrind and manually patching some of the warnings. AddressSanitizer was also useful. However, fixing each warning and maintaining a blacklist is too tiresome, so I set a breakpoint on the "__asan_report_error" function and used a script that made gdb print a backtrace, return, and continue execution.
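Something along these lines (a sketch of the gdb command file, not my exact script):

>> set confirm off
>> break __asan_report_error
>> commands
>>   bt
>>   return
>>   continue
>> end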

Duplicate structures

One problem I suspected could be present in the code (due to the deep hierarchy of #ifdefs) is structures with the same name but different contents. I made up an example C program to demonstrate how the compiler fails to warn about the type mismatch while at runtime the code silently corrupts memory.
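The effect can be reproduced with something like this (a made-up two-file example):

>> /* a.c */
>> struct msg { int type; };               /* the #ifdef-less variant */
>> void fill(struct msg *m);
>> int main(void) { struct msg m; fill(&m); return 0; }
>>
>> /* b.c - sees a different definition behind some #ifdef */
>> struct msg { int type; char payload[64]; };
>> void fill(struct msg *m) { m->payload[63] = 0; } /* smashes a.c's stack */

The linker happily accepts both files, nothing warns, and the corruption only shows up at runtime.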

I figured out an easy way of detecting the problem: I used clang to emit LLVM bitcode for each file instead of object files with machine code, then linked the bitcode files together into a single module and disassembled it with llvm-dis.
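The pipeline looked roughly like this (file names are made up):

>> clang -c -emit-llvm -o foo.bc foo.c    # bitcode instead of machine code
>> clang -c -emit-llvm -o bar.bc bar.c
>> llvm-link -o merged.bc foo.bc bar.bc   # one module with all the types
>> llvm-dis merged.bc                     # textual IR lands in merged.ll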

The nice thing about LLVM is that when it finds two structures with the same name but different declarations, it appends a distinct numeric suffix to the struct name. One can then strip the suffixes and look for unique lines containing different declarations of the same structure.

Luckily for me, there was only one place where I suspected an incorrect definition, and it was not in the part of the code I was executing, so I ruled this out as a source of the incorrect behavior.

Further work

Improving tooling for the 32-bit world

It is quite unfortunate, but the project I have been working on is 32-bit, and one cannot simply convert it into a 64-bit one with a compiler flag: the code has quite a lot of places where pointers are stored in packed structures, and other structures implicitly rely on the memory layout. So it is very difficult to modify this fragile mess.
It is sad that two great tools, MemorySanitizer and ThreadSanitizer, are not available for 32-bit applications. It is understandable: the 32-bit address space is too tiny to fit the shadow memory. I am thinking of ways to make them available in the 32-bit world. So far I see two ways of solving the problem.
First, we can use the fragile and non-portable mmap flag (currently supported only on Linux) to force allocations into the low part of the address space. One could then write an ELF loader that loads all the code down there and uses the upper memory range for the shadow. Besides being non-portable, this approach has the disadvantage of having to deal with 64-bit identifiers such as file handles.
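The flag in question is presumably MAP_32BIT (Linux on x86-64; note it actually maps into the low 2 GiB). A sketch:

>> #define _GNU_SOURCE
>> #include <sys/mman.h>
>>
>> /* Allocate memory that 32-bit pointers inside the instrumented
>>  * program can still address; everything above stays for shadow. */
>> void *low_alloc(size_t size)
>> {
>>     return mmap(NULL, size, PROT_READ | PROT_WRITE,
>>                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
>> }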
Alternatively, we could store the shadow in a separate process and use shared memory or sockets for communication. That would likely be at least an order of magnitude slower than the corresponding sanitizer in the 64-bit world, but probably still faster than Valgrind, and it would remain compile-time instrumentation with access to more of the compiler's internal data.

Verifying the porting was done correctly

Now I am left with one challenging task: verifying that the ported software is identical to what would be built by the original build system.

One may notice that simply intercepting the calls to the compiler may not be enough, because the build system may copy files or export shell variables during the build. Besides, different compilers handle "inline" and some other directives differently. It would be good to verify that the call graphs of the original binary and the one produced by our Makefile are similar (of course, we will need to mark some library functions as leaf nodes and not analyze them). For a start I could try inspecting some of the unresolved symbols manually, but I'm thinking of automating the process. For this task I'll need a disassembler that can identify basic blocks; the Capstone engine should probably do the job. Any ideas on that?

P.S. Oh, and I once tried visualizing the dependency graph of the separate ".o" files (before I realized I could just link them all together and get the list of missing symbols), and trust me, those graphs grow really fast. I have found that a tool called "gephi" does a decent job of visualizing really huge graphs and supports Graphviz's dot as an input format.

EDIT 2015-11-10
The challenge is complicated by the fact that some subprojects exist in multiple copies (with various changes), and one should also ensure that the headers and sources are picked up from the correct copy. However, I've found an easy and acceptable solution: I wrote a tool that parses the call graph generated by IDA and, for every edge in the graph, looks up the function names of the corresponding vertices. It then prints a pair "A -> B" for every function A calling function B. After that, we can sort the file alphabetically and remove the nodes that are uninteresting to us (the OS and library functions). Next, we can compare the files side by side with a tool like kdiff3 (or do it automatically). Whenever there is a significant difference (for example, 5 callees differing for the same caller), we can inspect it manually and verify we're compiling the correct file with the correct options. Using this method I have identified several places where we chose the wrong object file for linking, and now we're only concerned with the porting layer and OS kernel, without having to worry about the application itself.

Wednesday, May 20, 2015

on making a DDE kit, kinda

So I have this task of porting a huge piece of software running on a proprietary OS to another OS. And I don't even have a clue how to compile it (well, I do, but it builds on Windows so that's almost irrelevant).

But luckily all the code is linked into a single ELF file and the compilation produces intermediate object files. The first thought I had was to visualize the dependency graph of the object files to find out who calls what. You can find the script below; it recursively walks the supplied directory and tries to parse the import/export tables with objdump. There are some areas for improvement (for example, also parsing the dynsym table with -T, or parsing ".a" archives), but it did the job for me.
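The script was attached to the original post; a minimal sketch of the idea (dot output, undefined symbols only):

>> #!/bin/sh
>> # Walk a tree of .o files and print each object's undefined symbols
>> # as graph edges, ready to be fed to Graphviz.
>> echo "digraph deps {"
>> find "$1" -name '*.o' | while read -r obj; do
>>     objdump -t "$obj" | awk -v o="$obj" '
>>         $2 == "*UND*" { printf "  \"%s\" -> \"%s\";\n", o, $NF }'
>> done
>> echo "}"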

Unfortunately, I realized that visualizing a graph with 30K edges was not even remotely a smart idea.

What I also found was that the OS-specific code and objects were stored in a separate location (since they were part of an SDK). Even if they were not, we could just remove the object files present both in our project and in the SDK. After that, all the functions the application required from the OS unsurprisingly ended up in the "UNDEFINED" node, and there were only about 200 of them, which gives me some hope.

This approach can also be used for other cases, for example, porting drivers from Linux/FreeBSD to exotic platforms: first build the binaries, then pick as many of them as you can to minimize the list of required functions. I find dealing with compiled code easier because build systems, C macros and ifdefs just drive me insane.

Friday, May 15, 2015

GCOV is amazing yet undocumented

One useful technique for maintaining software quality is code coverage. While routinely used by high-level developers, it is often forgotten by C hackers, especially when it comes to kernel code. In fact, Linux is the only kernel I know of which supports being compiled with the GCOV coverage tool.

GCOV works by instrumenting your code: it inserts code to increment stats counters around each basic block. These counters reside in a special section of your ELF binaries. During compilation, GCC also generates a ".gcno" file for each ".c" file; these files allow the "gcov" tool to look up function names and other info using integer IDs (which are specific to each file).

At runtime, an executable built with GCOV produces ".gcda" files which contain the counter values. During executable initialization, the constructors (function pointers in the ".ctors" section) are called; one of them is "__gcov_init", which registers each ".o" file inside libgcov.

Since we're running in a kernel or on bare metal, we have neither libgcov nor a file system to dump the ".gcda" files to. But one should not fear GCOV! Adding it to your kernel is just a matter of adding one C file (mostly shamelessly copy-pasted from the Linux kernel and GCC sources) and a couple of CFLAGS. In my example I'm using the LK kernel by Travis Geiselbrecht. I've decided to just print out the contents of the ".gcda" files to the serial console (UART) as a hex dump and then use an AWK script and the "xxd" tool to convert them back to binaries on the host. This works completely fine since these files are typically below 2KB in size.

An important thing to note: if your kernel does not contain the ".ctors" section and does not call the constructors, be sure to add the section to the ld script and add some code to invoke them.
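LK's exact snippet isn't reproduced here; the usual pattern is to let the linker script define symbols around the section and walk the resulting array of function pointers (a sketch, not LK's exact code):

>> /* Provided by the linker script around .ctors:
>>  *   __ctor_list = .; *(.ctors); __ctor_end = .; */
>> extern void (*__ctor_list)(void);
>> extern void (*__ctor_end)(void);
>>
>> static void call_constructors(void)
>> {
>>     void (**ctor)(void);
>>     for (ctor = &__ctor_list; ctor < &__ctor_end; ctor++)
>>         (*ctor)();    /* __gcov_init ends up being called here */
>> }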

You can see the whole patch below.

After running the "gcovr" tool you can get a nice HTML with summary and see which lines were executed and which were not and add the tests for the latter. Woot!

Friday, October 24, 2014

abusing Mesa by hooking ELFs and ioctl

At work I've had several occasions when I needed to hook a function on a Linux system. I will not go deep into technical details, but the examples provided in the links contain code which can easily be plugged into a project.

A simple case.
First let us consider the case where we only need to replace a function in our application without replacing it in the dynamic libraries loaded by the application.

In Linux, functions in dynamic libraries are resolved using two tables: the PLT (procedure linkage table) and the GOT (global offset table). When you call a library function, you actually call a stub in the PLT. What the PLT stub essentially does is load the function address from the GOT; on x86_64, it does that with the "jmpq *" instruction (ff 25). Initially the GOT entry points back to the PLT stub, which invokes the resolver, which in turn writes the correct entry to the GOT.

So, to replace a library function, we must patch its entry in the GOT. How do we find it? We can decode the offset to the GOT entry from the JMPQ instruction (the offset is relative to the address of the PLT stub plus the instruction length). And how do we find the PLT stub? It's simply the function pointer: in C, use the function name as a pointer.

We should call "dlsym" to forcibly resolve the entry in GOT before replacing it. For example, if we want to swap two dynamic functions. A complete example is available as a github gist:

A more advanced example
The approach described above will not work when you want to replace a function that is used by some dynamically loaded binary, since that library has its own GOT and PLT tables.

On x86_64, we can find the PLT table by examining the contents of the ELF file. The section we care about is called ".rela.plt".

We need to examine the ELF header (Ehdr), find the section header table (Shdr) and locate the ".dynsym" and ".rela.plt" sections. The dynsym table unsurprisingly contains the pointers to dynamically resolved functions and can be used to find a function by name. You can find the link to the code below. I've mmap'ed the library file to get a pointer to its contents and used it to read the data; some examples floating around the internet use "malloc", "read()" and "memcpy" for the same purpose. To get the pointer to the library in the application's virtual memory, you can call "dlopen" and dereference the pointer it returns. You will need this address to convert the PLT offset into an actual address. Technically you can get away without reading/mmap'ing the library from disk, but since some sections are not mapped readable, you would need "mprotect" to access them without receiving SIGSEGV.
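The lookup boils down to something like this (64-bit, error handling omitted; "base" is the pointer to the mmap'ed file):

>> #include <elf.h>
>> #include <stdint.h>
>> #include <string.h>
>>
>> /* Return the GOT slot offset (r_offset, relative to the library's
>>  * load base) of the named function, or 0 if not found. */
>> static Elf64_Addr find_got_offset(const uint8_t *base, const char *name)
>> {
>>     const Elf64_Ehdr *eh = (const Elf64_Ehdr *)base;
>>     const Elf64_Shdr *sh = (const Elf64_Shdr *)(base + eh->e_shoff);
>>     const char *shstr = (const char *)(base + sh[eh->e_shstrndx].sh_offset);
>>     const Elf64_Sym *dynsym = NULL;
>>     const char *dynstr = NULL;
>>     const Elf64_Rela *rela = NULL;
>>     size_t i, nrela = 0;
>>
>>     for (i = 0; i < eh->e_shnum; i++) {
>>         const char *sname = shstr + sh[i].sh_name;
>>         if (!strcmp(sname, ".dynsym"))
>>             dynsym = (const Elf64_Sym *)(base + sh[i].sh_offset);
>>         else if (!strcmp(sname, ".dynstr"))
>>             dynstr = (const char *)(base + sh[i].sh_offset);
>>         else if (!strcmp(sname, ".rela.plt")) {
>>             rela = (const Elf64_Rela *)(base + sh[i].sh_offset);
>>             nrela = sh[i].sh_size / sizeof(*rela);
>>         }
>>     }
>>     for (i = 0; i < nrela; i++) {
>>         const Elf64_Sym *s = &dynsym[ELF64_R_SYM(rela[i].r_info)];
>>         if (!strcmp(dynstr + s->st_name, name))
>>             return rela[i].r_offset;   /* add the load base to get the slot */
>>     }
>>     return 0;
>> }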

The most straightforward alternative to such runtime patching is using the "LD_PRELOAD" variable to specify a library with the implementation of the hooked routine. The linker will then redirect all calls to that function in all libraries to the preloaded one. However, the obvious limitation is that it may be hard to get working if you preload multiple libraries overriding the same symbol, and it breaks tools like "apitrace" (a tool to trace and replay OpenGL calls).

Practical example with Mesa
Many GPUs nowadays contain encoders which encode video to H.264 or other popular formats. Intel GPUs (namely the HD 4000 and Iris Graphics series) have an H.264 encoder. Solutions like NVIDIA GRID use these hardware capabilities to stream games over the network or provide remote desktop facilities with low CPU load.

Unfortunately both the proprietary and the open-source drivers for Intel hardware lack the ability to export an OpenGL resource into the encoder library, which is the desired option for implementing a screen recorder (actually the proprietary driver from the SDK implements the advanced encoding algorithm as a binary library, but uses pre-compiled versions of vaapi and libdrm for managing resources).

If we don't share a GL texture with the encoder, we cannot use the zero-copy path: the texture contents have to be read back with "glReadPixels()", converted to the NV12 or YUV420 format (though that can be done by a GLSL shader before reading the data) and re-uploaded to the encoder. This is not an acceptable solution, since each frame takes more than 30ms and causes 80% CPU load, leaving no processing power for the other parts of the application. A zero-copy approach, on the other hand, gives us smooth 60 FPS performance: the OpenGL side is not blocked by "glReadPixels()" and the CPU load never exceeds 10 percent.

To be more precise, resource sharing support is present if one uses the EGL windowing system API and one of the Mesa extensions. Technically, all GPU/encoder memory is managed by the DRM framework in the Linux kernel, and there is an extension which allows one to obtain a DRM handle (a uint32) from an OpenGL texture. It is used by the Weston display server (the Wayland reference compositor). An example of using VAAPI to encode from a DRM handle can be found in the Weston source code.

Now, the problem is to find the handle of an OpenGL texture. Under EGL, that can be done by creating an EGLImage from the texture handle and subsequently calling eglExportDRMImageMESA.
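In code, roughly (a sketch; the entry points come from the EGL_MESA_drm_image and EGL_KHR_gl_texture_2D_image extensions, and "dpy", "ctx" and "texture" are assumed to come from the caller):

>> #include <EGL/egl.h>
>> #include <EGL/eglext.h>
>> #include <stdint.h>
>>
>> /* Wrap a GL texture in an EGLImage, then ask Mesa for its DRM
>>  * name, handle and stride. Error checking omitted. */
>> static EGLint texture_to_drm_handle(EGLDisplay dpy, EGLContext ctx,
>>                                     unsigned texture)
>> {
>>     PFNEGLCREATEIMAGEKHRPROC create_image = (PFNEGLCREATEIMAGEKHRPROC)
>>         eglGetProcAddress("eglCreateImageKHR");
>>     PFNEGLEXPORTDRMIMAGEMESAPROC export_drm = (PFNEGLEXPORTDRMIMAGEMESAPROC)
>>         eglGetProcAddress("eglExportDRMImageMESA");
>>
>>     EGLImageKHR img = create_image(dpy, ctx, EGL_GL_TEXTURE_2D_KHR,
>>                                    (EGLClientBuffer)(uintptr_t)texture, NULL);
>>     EGLint name, handle, stride;
>>     export_drm(dpy, img, &name, &handle, &stride);
>>     return handle;   /* or "name" (flink), depending on the consumer */
>> }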

In my case the problem was that I didn't want to use EGL, because porting a lot of legacy GLX code to EGL is quite difficult. Besides, GLX support is more mature and stable with the NVIDIA binary driver. So the problem was to get a DRM buffer out of a GL texture under GLX.

Unfortunately GLX does not provide an equivalent extension, and implementing one for Mesa and X11 is rather complicated due to the complexity of the code. One option would be to link against Mesa and use its internal headers to reach for the needed resources, as done by Beignet, the open-source OpenCL driver for Intel, but that is error-prone. Besides, setting up such a build environment is complicated.

Luckily, as we've seen above, the memory is managed by DRM, which means every memory allocation is actually an IOCTL call to the kernel. That means we can hook ioctl, steal the DRM handle and later use it as the input to VAAPI. The ioctl code to look for is DRM_I915_GEM_SET_TILING.
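A sketch of the hook (interposed via LD_PRELOAD or the GOT patching described earlier; variadic forwarding is simplified to a single pointer argument, which is what DRM ioctls use):

>> #define _GNU_SOURCE
>> #include <stdarg.h>
>> #include <stdint.h>
>> #include <dlfcn.h>
>> #include <drm/i915_drm.h>
>>
>> static uint32_t last_bo_handle;   /* the global reference discussed below */
>>
>> int ioctl(int fd, unsigned long request, ...)
>> {
>>     static int (*real_ioctl)(int, unsigned long, ...);
>>     va_list ap;
>>     void *arg;
>>
>>     if (!real_ioctl)
>>         real_ioctl = (int (*)(int, unsigned long, ...))
>>             dlsym(RTLD_NEXT, "ioctl");
>>     va_start(ap, request);
>>     arg = va_arg(ap, void *);
>>     va_end(ap);
>>
>>     int ret = real_ioctl(fd, request, arg);
>>     if (request == DRM_IOCTL_I915_GEM_SET_TILING && ret == 0)
>>         last_bo_handle = ((struct drm_i915_gem_set_tiling *)arg)->handle;
>>     return ret;
>> }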

The obvious limitation is that you need to store a global reference written by the ioctl hook, which makes the code thread-unsafe. Luckily, most OpenGL apps are single-threaded, and even when they're not, there are fewer than a dozen threads that can access OpenGL resources, so lock contention is not an issue and a pthread mutex can be used.

Another limitation is that, to avoid memory corruption or errors, we need to track the resources carefully. One solution is to allocate the OpenGL texture before using VAAPI and deallocate it after the encoder is destroyed. Another is to hook more DRM calls to get a pointer to the "drm_intel_bo" structure and increase its reference count.

Sunday, September 7, 2014

GSoC results or how I nearly wasted summer

This year I got a chance to participate in the Google Summer of Code with the FreeBSD project.
I have long wanted to learn about the FreeBSD kernel internals and port it to some ARM board but was getting sidetracked whenever I tried to do it. So it looked like a perfect chance to stop procrastinating.

Feel free to scroll down to the "other stuff" paragraph if you don't feel like reading three paragraphs of the typical whining associated with debugging C code.

Initially I wanted to bring up some real piece of hardware. My choice was the Nexus 5 phone, which has a Qualcomm SoC. However, some of the FreeBSD developers suggested that I instead port the kernel to the Android Emulator. It sounded like a sane idea, since the Android Emulator is available for all the major OSes and is easy to set up, which would expose more people to the project. Besides, unlike stock QEMU, it can emulate peripherals specific to a mobile device, such as sensors.

What is the Android Emulator? It is an emulator based on QEMU which emulates a virtual board called "Goldfish" that includes a set of devices such as an interrupt controller, timer and input devices. It is worth noting that the Android Emulator is based on an ancient version of QEMU, though starting with Android L one can obtain emulator binaries from git that are based on a more recent build.

I started by implementing the basics: the interrupt controller, the timer and the UART driver. That allowed me to see the boot console, but then I got stuck for literally months with the kernel crashing at various points in what seemed like a totally random fashion. I spent nights single-stepping the kernel, my eyes slowly going from barely red to emitting hard radiation. It was definitely a problem with caching, but ultimately I gave up trying to fix it. Since I knew the FreeBSD kernel worked fine on ARM in a recent version of QEMU, it was clearly a bug in the Android Emulator, even though it had not manifested itself with Linux. Since the Android Emulator is eventually going to be updated to a new QEMU base and I was running out of time, I decided to just use a nightly build of the emulator instead of the one coming with the Android SDK, and that fixed this particular issue.
But hey! At least I've learnt about the FreeBSD virtual memory management and the UMA allocator the hard way.

Next up was fighting the timer and random freezes while initializing the random number generator (sic!). That was easy: it turned out I had just forgotten to call a function. Anyway, it would make sense to move that call out of the board files and call it for every board that does not define the "NO_EVENTTIMERS" option in the kernel config.

Fast forward to the start of August, when there were only a couple of days left and I still didn't have a working block device driver to boot the rootfs from! Writing an MMC driver turned out to be very easy, and I started celebrating when I got the SD card image from a Raspberry Pi booting in the Android Emulator and showing the login prompt.

It worked, but with a huge kludge in the OpenFirmware bindings. Somehow one of the functions which should have returned the driver bus name returned (seemingly) random crap, so instead of a NULL pointer check I had to add a check that the address points into the kernel VM range. I have since tracked down the real source of the problem.
I based my MMC driver on the LPC SoC driver. The LPC driver itself is suspicious: for example, the place where DMA memory is allocated says "allocating Framebuffer memory", which may indicate it was written in a hurry and possibly only barely tested. In its MMC bus attach routine, it calls the "device_set_ivars" function and sets the ivars pointer to the mmc host structure; I have observed a similar pattern in some other drivers. In the OpenFirmware code, the ivars structure was dereferenced as a completely different structure, and a string was taken from a member of that structure.

How the hell did it happen in a world where people use a recent Clang compiler and build with "-Wall" and "-Werror"? Well, if you're doing some kind of OOP in plain C, casting to a void pointer and back and other ill-typed wicked tricks are inevitable. Just look at that prototype and cry with me:

>> device_set_ivars(device_t dev, void *ivar);

For now, I'm pushing the changes to the following branch. Please note that I'm rebasing against master and force-pushing in the process, if you're willing to test.
So, what has been achieved, and what am I still going to do? Well, debugging the MMU issues and the OpenFirmware bug delayed me substantially, so I haven't done everything I planned. Still:
  • Basic hardware works in Goldfish: Timer, IRQs, Ethernet, MMC, UART
  • It is using NEWCONS for Framebuffer Console (which is cool!)
  • I've verified that Android Emulator works in FreeBSD with linuxulator
  • I have fixed the Android Emulator to compile natively on FreeBSD and it mostly works (at least, console output), except graphics (which I think should be an easy fix; there are a lot of places where "ifdef __linux__" is completely unjustified and should rather be enabled for all unices)
How to test and contribute?
You can use the Linux build of the Android Emulator via the Linuxulator. Unfortunately, the Linuxulator only supports 32-bit binaries, so we cannot run the nightly binaries which have the MMU issues fixed.
Therefore, I have fixed the emulator to compile natively on FreeBSD! I've pushed a branch named "l-preview-freebsd" to github. You will also need gtest; I cloned the repo from CyanogenMod and pushed a tag named "l-preview-freebsd", which is actually "cm-11.0".

Compiling the emulator:
>> git clone
>> cd android_external_gtest
>> git checkout cm-11.0
>> cd ../qemu.git
>> git checkout l-preview-freebsd
>> bash

Please refer to README file in git for the detailed instructions on building kernel.

The only thing that really bothers me about low-level and embedded work is that it is extremely difficult to estimate how long a certain feature may take to implement, because every time you end up debugging obscure stuff, and the "useful stuff you learn or do" to "time wasted" ratio is very small. On the bright side, while the BSDs lag a bit behind Linux in terms of device drivers and performance optimizations, reading and debugging BSD code is much easier than GNU projects like EGLIBC, GCC and GLib.

Other stuff.
Besides GSoC, at the start of summer I had a chance to work with Chris Wade (who is an amazing hacker, by the way). We started on an ARM virtualization project and spent nice days debugging caching issues and instruction decoding while trying to emulate a particular Cortex-A8 SoC on an A15 chip. Unfortunately, working on GSoC, going to a daily job and doing this project simultaneously turned out to be surprisingly difficult, and I had to quit at least one of the activities. Still, I wish Chris good luck with the project, and if you're interested in virtualization and iPhones, sign up for an early demo at

Meanwhile I'm planning to learn the ARMv8 ISA. It's a pity there is no hackable hardware available at reasonable prices yet; QEMU works fine with VirtIO peripherals, though. But I'm getting increasingly worried about devkits costing more and more, essentially making it harder for a novice to become an embedded software engineer.

Friday, June 6, 2014

On Free Software and if proprietary software should be considered harmful

I communicate with a lot of people on the internet, and they have various opinions on FOSS, ranging from "only proprietary software written by a renowned company can deliver quality" to "if you're using binary blobs, you're a dick". Since these issues come up very often in discussions, I think I need to write them up so I can just share the link next time.

On the one hand, I am a strong proponent of free and open-source software, and I have contributed to several projects related to porting Linux to embedded hardware, especially mobile phones (Replicant among them, and I have also consulted some of the people involved). Here are the reasons I like free software (as well as free hardware and free knowledge):

  • The most important one for me is that you can learn a lot. I mostly learnt C, and subsequently other languages, by hacking on the Linux kernel and following interesting projects done by fellow developers
  • Practically speaking, it is just easier to maintain and develop software when you have the source code
  • When you want to share a piece of data or an app with someone, closed software forces them into buying a paid app, which may also compromise their security
  • You can freely distribute your contributions, your cool hacks and research results without being afraid of pursuit by some patent troll
But you know, some people are quite radical in their FOSS love. They insist that using anything non-free is a sin. While I highly respect them for their attitude, I have a different point of view, and I want to comment on some common arguments against using or developing non-free software:
  • "Oh, it may threaten your privacy, how can you run untrusted code"? My opinion here is that running untrusted firmwares or drivers for devices is not a big deal because unless you manufacture the SoC and all the peripherals yourself, you can not be sure of what code is running in your system. For example, most X86 CPUs have SMM mode with a proprietary hypervisor, most ARMs have TrustZone mode. If your peripheral does not require you to load the firmware, it just means that the firmware is stored in some non-volatile memory chip in hardware, and you don't have the chance to disable the device by providing it with a null or fake binary. On the other hand, if your device uses some bus like I2C or USB which does not have DMA capabilities or uses IOMMU to restrict DMA access, you are relatively safe even if it runs malicious code.
  • "You can examine open source software and find backdoors in it". Unfortunately this is a huge fallacy. First of all, some minor errors which can lead to huge vulnerabilities can go unnoticed for years. Recent discoveries in OpenSSL and GnuTLS came as a surprise to many of us. And then again, have you ever tried to develop a piece of legacy code with dozens of nested ifdefs when you have no clue which of them get enabled in the final build? In this case, analyzing disassembled binaries may even be easier.
  • "By developing or using non-free software you support it". In the long run it would be better for humanity to have all knowledge to be freely accessible. However, if we consider a single person, their priorities may differ. For example, until basic needs (which are food, housing, safety) are satisfied, one may resort to developing close-sourced software for money. I don't think it's bad. For me, the motivation is knowledge. In the end, even if you develop stuff under an NDA, you understand how it works and can find a way to implement a free analog. This is actually the same reason I think using proprietary software is not bad in itself. For example, how could you expect someone to write a good piece of free software - an OpenGL driver, an OS kernel, a CAD until they get deeply familiar with existing solutions and their limitations?
Regarding privacy, I am more concerned about governments and "security" agencies like the NSA, which have enough power to change laws and fake documents. Officials change, policies change, and even people who considered themselves safe and loyal patriots may suddenly be labeled traitors.

In the end, it boils down to individual people and communities. Even proprietary platforms like Palm, Windows Mobile or Apple iOS had huge communities of helpful people who were ready to develop software, reverse-engineer hardware and of course help novices. And there are quite some douchebags around free software. Ultimately, just find the people you feel comfortable around, it is all about trust.

Minor notes about recent stuff

I've been quite busy with work and university recently, so I haven't had much time to work on my projects or write rants; hence this short post discussing several unrelated ideas.

On deprecating Linux (not really)
Recently, several Russian IT bloggers have been writing down their ideas about what's broken with Linux and how it should be improved. Basically, one guy started by saying we need to throw away POSIX and write everything from scratch, and two others are trying to find a compromise. You may fire up Google Translate and follow the discussion at

I personally think these discussions are a bit biased, because all these guys write from the point of view of an engineer working on distributed web systems. At the same time, there are domains like embedded software, computer graphics and high-performance computing which have other challenges. What's common is that people are realizing the obvious: generic solutions designed to work for everyone (like POSIX) limit the performance and flexibility of your system, while domain-specific solutions make your job easier but may not fit well with what other people are doing.

Generally, both in high-performance networking and in graphics, people are trying to do the same two things: reduce the number of context switches, and move all execution into userspace. (It is true that on modern x86 a context switch is relatively cheap, and we could use a segmented memory model like Windows CE 5.0 did instead of the common user/kernel split; the real problem is not the CPU mode switch itself but that system calls like "flush" and "ioctl", and library calls like "glFlush()", are used as synchronization points where a lot of additional work is done besides the call itself.) Examples include asynchronous network servers using epoll and cooperative scheduling, userland networking stacks (netmap, Intel's DPDK), and modern graphics APIs (Microsoft DirectX 12, AMD Mantle, Apple Metal). The cost of maintaining abstractions - managing buffers and contexts in drivers - has risen so high that hardware vendors neither want to write complex drivers nor can deliver performance that way. So we'll have to step back and learn to use hardware wisely once again.

Actually, I like the idea of minimalistic runtimes on top of hypervisors, like Erlang on Xen, for its simplicity. Still, security and access control are open questions. For example, capability-based security, as in L4, has always looked interesting, but whether the promises of cheap system calls and a dramatically reduced trusted code base have been fulfilled is very arguable. Then again, despite the complexity of Linux, its security is improving thanks to control groups, which are heavily utilized by docker and systemd distros. Another issue is that lightweight specific solutions are rarely reusable. From the economic point of view, a cheap solution that's done overnight and does its job is just perfect, but the amount of NIH, with engineers writing the same stuff - drivers, applications, libraries and servers - in absolutely identical ways for a dozen identical OSes, is amazing and rather uncreative.

Anyway, it looks like we're there again: the rise of Worse is Better.

Work (vaapi, nix).
At work, among other things, I was asked to figure out how to use the Intel GPU for H.264 video encoding. It turns out there are two alternatives: the open-source VAAPI library and the proprietary Intel Media SDK. Actually, the latter still uses a modified version of VAAPI, but I feel that fitting it into our usual deployment routine is going to be hard, because even basic critical components of the driver stack, such as the kernel KMS module, are provided in binary form.

Actually, VAAPI is very low-level. One thing that puzzled me initially is that it does not generate the H.264 bitstream for you: you have to build it yourself and feed it into the encoder via a special buffer type. Besides, you have to manually take the reconstructed picture and feed it as a reference for subsequent frames. There are several implementations using this API for encoding: gstreamer, ffmpeg, vlc. I spent quite some time until I got it to encode a sample chunk of YUV data, and ultimately my code ended up looking identical to the "avcenc.c" sample from libva, except that the encoder data is stored in a "context" structure instead of global variables.

My advice: if you want to learn about video decoding APIs on Linux and Android, take a look at the Chromium source code (you can expect to find a solution for almost any relevant computer science or engineering problem there, given how much code it contains). Also take a look at GST, FFMPEG and VLC, and notice how each of them has its own way of managing buffers (and poor error handling, btw).
Another thing we're using at work is the Nix package manager. I had always wanted to try it but didn't really do so until coming to this job. Nix is a fantastic concept (even Norman Feske got inspired by it for the Genode OS). However, we've noticed a slowdown when compiling software under Nix. Here are some of my observations:
  • Compiling a simple C file with just a "main" function takes 30ms on Linux but >300ms in a Nix chroot. Nix installs each library into a separate prefix and uses LDFLAGS and CFLAGS to point the compiler at them. Gcc then iterates over each of these directories trying to find each library, which ends up being slow. Does anyone know a workaround?
  • Running gcc under the "perf" profiler shows that it spends most of its time in multibyte string collation functions. I find it ridiculous that exporting "LC_ALL=C" magically makes the compilation time drop to 100ms.
Hacking on FreeBSD
As a part of my GSoC project I got the FreeBSD kernel booting in the Android emulator, and I just have to write the virtual ethernet and MMC drivers. Unfortunately my university has distracted me a lot from this project, but now that I have time I hope to finish it by the midterm. And then I'll probably port FreeBSD to the Qualcomm MSM8974 SoC. Yay, the red Nexus 5 is an awesome phone!

My little project (hacking is magic)
A long time ago I decided to give Windows Mobile emulation a shot and got the kernel to start booting in QEMU. Since Microsoft's Device Emulator was open-source and emulated a real Samsung S3C2410 SoC, it was easy. I still plan to get it fully working one day.

But QEMU is actually a boring target. What's more interesting is developing a bare-metal hypervisor for A15 and A7 CPUs. Given the progress made by Rob Clark on reverse-engineering the Qualcomm Adreno GPU, I think it should be doable, with reasonable effort, to emulate the GPU and consequently enough hardware to run Windows Mobile and Windows RT. A very interesting detail is that ARM traps almost all coprocessor registers for guest accesses (privilege levels 0 and 1), meaning you can fake any CPU ID or change memory access behavior by modifying caching and buffering settings.

What is really interesting is that there are a lot of Chinese phones which copy Nokia smartphones, iPhones and Android phones. Recent revisions of Mediatek MTK SoCs are ARM A7, meaning they support virtualization. Ultimately that means someone could make a high-quality replica of any phone and virtualize any SoC without a trace, which has interesting security implications.

Software sucks!
The other day a systemd update came out which totally broke my Debian bootup. Now there's a race condition between the encrypted disk password prompt and the root password prompt for "recovery". On top of that, the latest kernel (3.15-rc3) OOPSes and crashes randomly on ACPI events, which subsequently breaks networking and spawning new processes.

Ultimately, after a day of struggle, my rootfs broke, and after an fsck everything was gone. So now I'm deciding which distro to try instead. Maybe ArchLinux? Either way, I have to build a lot of software from source and install it to /usr/local or custom prefixes for development.

The easiest way would be to install into a VM on OS X. But then, I want to fix issues with drivers, especially GPU switching on the macbook. On the other hand, I spent ages fixing switchable graphics on an older Sony VAIO laptop and resorted to an ugly hack to force the Intel GPU. Since GPU switching is not working properly in Linux, maybe I should write a graphics multiplexer driver for FreeBSD and ditch Linux altogether? FreeBSD looks much more attractive these days, with clang, no systemd and no Lennart Poettering.