Saturday, October 5, 2013

Sleeping considered harmful


Synchronization is a very controversial topic in computer science and software engineering. We're caught up in a vicious circle where everyone is told syncronization is difficult and therefore tries to stay away from the topic, and when one has to deal with threading, they prefer to stay away from reading about synchronization because reading is even more difficult.

One thing I have noticed while developing linux software for various devices is that drivers written by hardware vendors inevitably suck. Today I will tell you about one thing that freaks me out (sure there's a whole universe of things that freak me out but that's beyond the scope of this post).

Now, sometimes, when dealing with hardware, one has to wait (sleep) some time after writing to registers to let the device change its state. One prominent example are the drivers for the LED controllers. Yeah, your fancy new Android paperweight probably has a color LED to notify when your friends sign on to FaceBoob. LEDs can have various triggers, for example, one can set a LED to blink when memory is accessed, just like on old desktop boxes with HDDs. Somehow every time you try to set a trigger to a LED in Android kernel (as opposed to using the Android's proprietary ioctls), the device just freezes dead. What actually happens?

If we take a look at the LED class definition in the linux kernel, we'll see that the functions controlling the LED state (which are to be implemented by the hardware vendor) must not use sleeping primitives like msleep or usleep. Why is such a requirement posed? There are two reasons. One is that LEDs can be used in a context that does not allow sleeping - for example, inside the scheduler code to indicate the CPU load. Sleeping while scheduling could be handled by kernel developers, of course, but such an operation will cause time drifting and increase the complexity of the scheduler. Another reason is that LED control code may be called thousands of times per second, and any delay will cause the huge CPU load.

Anyway, even the performance critical code needs to be correct, therefore the need for the synchronization. The lightweight synchronization primitive used throughout the whole kernel is the spinlock. As you may guess from its name, it does not block the calling thread (like a semaphore or a mutex would do), but busy-waits till it can get into critical section. Polling (or busy-waiting) has always been discouraged because it consumes CPU time. However, it allows the algorithm to progress instead of blocking (therefore, most lock-free algorithms use memory barriers and busy waiting which is equivalent to using a spin lock in kernel). Either way, what you should need about spinlocks is that in most cases they are used to guard writing to an IO register which is a very fast operation, and while a spinlock is held, interrupts are disabled, therefore if you called sleep(), your thread would have no way to resume.

And yeah, most newbie driver developers don't know that some external busses (like I2C) require sleeping due to the nature of hardware. So, when one issues an I2C transfer inside the spinlock, they just want to see the kernel freeze.

Now that we've figured out what causes freezing, let's think of what we can do about it. On the one hand, if we have enough money, we can just assume customers will accept any crap we sell since they have no alternative. And that does work for Samsung, Qualcomm, Sony and many other vendors. On the other hand, we may want to fix it. We know we can't sleep in a lock, but we need a lock to keep data consistent. Linux has several solutions to the rescue and I'll give an overview of the two of them that are easy and suitable for a typical embedded/desktop system
(i.e., an SMP processor with a small number of cores).


First solution is using a workqueue. A workqueue is an abstraction (on top of kernel threads) which allows to schedule jobs (work) to be executed at some moment later. You can think of a workqueue as of a "future" or "async" concept used in userland.

I will use the example of a LED driver I have written for one smartphone (HTC Kovsky also known as Sony Ericsson Xperia X1) to illustrate the workqueue.

First, we need to include the relevant header. Don't trust me here, stuff changes all the time in linux, so you'd rather use grep to find the actual header.
  1. #include <linux/workqueue.h>

We need to declare a "work" that we want to schedule. A work is essentially a routine which would be called, that accepts the instance of a "struct work_struct" class. We can use the "DECLARE_WORK" macro to define a work.

  1. static DECLARE_WORK(colorled_wq, htckovsky_update_color_leds);

Now, let's look at how to implement the work function.

  1. static void htckovsky_update_button_light(struct work_struct* work) {
  2. char buffer[3] = {MICROP_KEYPAD_BRIGHTNESS_KOVS, 0, 0};
  3. char brightness = kovsky_leds[BUTTONS].brightness;
  4. if (brightness) {
  5. buffer[1] = 0x94;
  6. buffer[2] = brightness >> 2;
  7. }
  8. microp_ng_write(client, buffer, 3);
  9. }

As one may notice, we're using the microp_ng_write function which is actually initiating an I2C transfer. We could also add an msleep() call or even add a mutex and that would still work.

Now, how do we call the code? Simple, in your non-sleeping routine (which may be an interrupt handler or an implementation of a function from the LED class), call the "schedule_work" routine.

  1. static void htckovsky_set_brightness_color(struct led_classdev *led_cdev,
  2. enum led_brightness brightness)
  3. {
  4. schedule_work(&colorled_wq);
  5. }

Now, what if you wanted to pass some data to the work routine without using the global variables? The macro "container_of" comes to rescue! Container_of is essentially an implementation of the OOP-like classes in C based on routines and composition instead of inheritance. It works like the following: assume you have a routine which only accepts the "struct work_struct" pointer, and you want to pass arbitrary data. Modifying linux code and creating a copy of the workqueue implementation each time would be silly and boring (although that's what windows developers would do anyway). Instead, we create a structure to hold our data, and add the pointer to the workqueue as a member to it. The container_of macro calculates the offset of a workqueue in our structure, and substracts it from the workqueue pointer. Woops, we have a pointer to our structure and can unmarshal our data from it. This works like the following:

struct work_struct *our_work = kmalloc(sizeof(struct work_struct), GFP_KERNEL);
/* Init the work structure */
INIT_WORK(our_work, handler_routine);
struct my_work_data {
struct work_struct *cool_work;
} work = {
cool_work = our_work;

The full source code of the driver is available at the corresponding linux tree at gitorious. It does not synchronize the LED state, though. One should be careful when writing code without synchronization: even though for a LED which updates thousands of times per second, displaying a mixed state is not an issue (a human eye will not notice glitches), if you go this way, only keep your "virtual", driver state global and concurrently accessible, but make sure that the data written to the hardware registers can not be modified concurrently!

Kernel Threads
Another alternative is using the kernel threads. In fact, kernel threads are almost like a userspace process running in kernel mode. It can receive signals and do many other things, but programming kernel threads is more difficult and almost all the time you can get away with using workqueues. Here's some sample code showing how to spawn a thread and do your dirty business inside:

static int my_thread(void *v) {
 struct completion exit;
 siginfo_t info;


 while (1) {
  if (signal_pending(current)) {
   if (dequeue_signal_lock(current, &current->blocked, &info) == SIGKILL) {
    goto cleanup;
  else {
   //Do your very bad stuff here

 complete_and_exit(&exit, 0);
 return 0;

//should be somewhere in the driver data
struct task_struct *my_task = NULL;

static void start_thread(void) {
 void *arg = NULL; //Can be used to pass data to the handler
 my_task = kthread_create(my_thread, arg, "mydriver_baddaemond");

static void stop_thread(void) {
 send_sig(SIGKILL, my_task, 0);
 my_thread_task = NULL;

Now you know how to make your drivers more stable and less power-hungry. Almost every time you are thinking of an ad-hoc implementation of a common concept, be sure linux has a clean and working solution already.

Friday, October 4, 2013

[russian] Уязвимость загрузки APK в Android

Это - перевод статьи-анализа уязвимости. Оригинал -

Не успели интернет-массы отбурлить после недавно обнаруженной группой Bluebox Security уязвимости в ОС Android, позволяющей заменить код установленного приложения без нарушения цифровой подписи, как в системе загрузки Apk была обнаружена еще одна дырка.

Немного теории.

Многие думают, что сам факт написания программы в управляемой среде типа Java автоматически делает их код неуязвимым и безопасным. Но забывают о том, что у JVM есть довольно строгая спецификация на низкоуровневые взаимодействия с памятью и арифметику. Давайте рассмотрим простой пример. Скорее всего, внимательные программисты сразу поймут детали уязвимости, а для тех, кто не поймет - будет пояснение.

public class JavaTest {
    public static void main (String [] args) {
        short a = (short) 0xFFFF;
        int b;
        b = a;
        System.out.println (b);
        b = a & 0xFFFF;
        System.out.println (b);

Просто, да? Если нет, то давайте посмотрим, во что это реально исполняется. Я тут быстренько написал в REPL Clojure - он на JVM, да и результаты будут одинаковы в большинстве не-JVM языков.

Clojure 1.1.0
user=> (def a (short 0xffff))
user=> (def b (int a))
user=> (def c (int (and a 0xffff)))
user=> a
user=> b
user=> c

В чем дело? В расширении знаковых типов (sign extension). Казалось бы, основа основ, которую вбивают со школьной скамьи, ан нет - эта бяка ответственна за половину известных уязвимостей (за остальную половину - неинициализированная память). 0xffff, когда мы кладем его в 16-битный short, превращается в -1 (дополнительный код). Согласно правилам расширения знака, если мы присваиваем более широкому типу (в данном случае - 32-битному int) отрицательное значение, оно преобразуется так, чтобы представление в новом формате хранило то же значение, что исходная переменная. То есть, мы получаем int b = 0xFFFFFFFF. Старший (знаковый) бит стоит => лополнительный код => -1. Когда мы отсекаем старшие биты (b & 0xffff), старший бит обнуляется, и мы получаем 0xffff. Но у нас же int, 32 бита - это всего лишь 65535 в прямом коде.

Формат ZIP

Давайте рассмотрим формат архивов ZIP, которым на самом деле являются Jar и Apk архивы (примечание переводчика - на платформе Android Apk обычно часть данных типа картинок не сжимается, и выравнивается на 4 байта утилитой ZipAlign, чтобы можно было отображать ресурсы в память через системный вызов mmap, но это не влияет на совместимость с другими реализациями ZIP). Каждая запись в Zip архиве (абстракцией для которой является Java класс ZipEntry) имеет заголовок следующего вида:

        local file header signature     4 bytes  (0x04034b50)
        version needed to extract       2 bytes
        general purpose bit flag        2 bytes
        compression method              2 bytes
        last mod file time              2 bytes
        last mod file date              2 bytes
        crc-32                          4 bytes
        compressed size                 4 bytes
        uncompressed size               4 bytes
        filename length                 2 bytes
        extra field length              2 bytes

        filename (variable size)
        extra field (variable size)
Обратите внимание на то, что два последних поля (имя файла и дополнительное поле, которое хранится после данных файла), имеют переменную длину.

Проверка Apk в системе Android

При загрузке apk файла происходит проверка контрольной суммы. Функция getInputStream, которая позволяет получить поток для чтения данных из ZipEntry, выглядит примерно следующим образом:

RAFStream rafstrm = new RAFStream(raf, entry.mLocalHeaderRelOffset + 28);
DataInputStream is = new DataInputStream(rafstrm);
int localExtraLenOrWhatever = Short.reverseBytes(is.readShort());
//skip the name and this "extra" data or whatever it is: 
rafstrm.skip(entry.nameLength + localExtraLenOrWhatever);
rafstrm.mLength = rafstrm.mOffset + entry.compressedSize;
if (entry.compressionMethod == ZipEntry.DEFLATED) {
    int bufSize = Math.max(1024, (int) Math.min(entry.getSize(), 65535L));
    return new ZipInflaterInputStream(rafstrm, new Inflater(true), bufSize, entry);
else {
    return rafstrm;

Обратите внимание на выделенные фрагменты кода. Как мы видим, максимальная длина "экстра" данных - 65535 байт (для хранения используется двухбайтовое беззнаковое число). Проблема в том, что в Java нет беззнаковых типов данных, и максимальная длина - 32768 (2 ^ 15). А что, если записать в заголовок Zip большее значение? Правильно, старший бит будет интерпретирован как знаковый, и в соответствии с расширением знаковых типов, переменная localExtraLenOrWhatever получит отрицательное значение.

Таким образом, вектор атаки очевиден - записать отрицательное значение в длину поля "extra", тем самым получив возможность перезаписать имя DEX файла. Небольшое ограничение - для проведения такой атаки необходимо, чтобы размер DEX кода в исходном APK был меньше 64кб.

На сегодняшний день уязвимость исправлена, но ошибки, связанные с переполнением целочисленных типов, по-прежнему являются причиной многих уязвимостей. С учётом того, что больше и больше программистов не знают не только о различиях big/little endian, а даже о знаковых типах данных - ситуация будет только ухудшаться :)