tag:blogger.com,1999:blog-14169696561676592892024-03-18T09:34:01.875+03:00I hate softwareI hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.comBlogger40125tag:blogger.com,1999:blog-1416969656167659289.post-42931643542415076682022-11-03T07:50:00.004+03:002022-11-04T05:01:10.508+03:00Why CVE-2022-3602 was not detected by fuzz testing<p>So recently a very hyped memory corruption security vulnerability was discovered in the OpenSSL punycode parser.</p><p>Some folks including Hanno (https://twitter.com/hanno/status/1587775675397726209) asked why this is still happenning, why no one wrote a fuzzer for the punycode parser and if we as the security community have learned nothing from Heartbleed.</p><p>I think we should give the developers the benefit of doubt and assume they were acting in good faith and try to see what could be improved.</p><p>In fact, there already exists a fuzz testing harness for the X.509 in the OpenSSL source code.</p><p>All of the fuzzers from the OpenSSL source tree are also supposedly automatically deployed to ClusterFuzz via OSS-Fuzz: https://github.com/google/oss-fuzz/blob/master/projects/openssl/build.sh</p><h3 style="text-align: left;">Examining call chains</h3><p>Let’s start by examining the call chain for the vulnerable function.</p><p></p><ul style="text-align: left;"><li>ossl_punycode_decode: called by ossl_a2ulabel</li><li>ossl_a2ulabel: ossl_a2ucompare is not really referenced anywhere in C code, only mentioned in documentation.</li></ul><p></p><p>Let's examine who calls "ossl_a2ulabel" then.</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p>openssl/crypto$ grep -rI ossl_a2ulabel .</p><p>./x509/v3_ncons.c: if (ossl_a2ulabel(baseptr, ulabel + 1, &size) <= 0) {</p></blockquote><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1S5DXCOzxUNgBr4HEmkkzR2PJd-lA4NC4wHZ9FabJ5eZ4W34STScCbEOsCotD3p966G5ssLKujy4UGjQUlghOwOamBcVgX9ZMV4KT4C2HY89-jbaNxgL5drxt1LwmDoLrvm3U26sODL-D0ztSWfWj6aCmLzFg1jhlswx_osxvWYRaokVenMsnLP7b/s1352/punycode.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1352" data-original-width="1236" height="454" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1S5DXCOzxUNgBr4HEmkkzR2PJd-lA4NC4wHZ9FabJ5eZ4W34STScCbEOsCotD3p966G5ssLKujy4UGjQUlghOwOamBcVgX9ZMV4KT4C2HY89-jbaNxgL5drxt1LwmDoLrvm3U26sODL-D0ztSWfWj6aCmLzFg1jhlswx_osxvWYRaokVenMsnLP7b/w416-h454/punycode.png" width="416" /></a></div><br /><p> </p><p>Let's remember the name of this file and examine the coverage produced by the corpus shipped with the OpenSSL source for the X.509 fuzzing harness.</p><p></p><ul style="text-align: left;"><li>nc_email_eai <- nc_match_single <- nc_match <- NAME_CONSTRAINTS_check, NAME_CONSTRAINTS_check_CN</li><li>NAME_CONSTRAINTS_check, NAME_CONSTRAINTS_check_CN <- check_name_constraints in crypto/x509/x509_vfy.c</li><li>check_name_constraints <- verify_chain <- X509_verify_cert</li><li>X509_verify_cert: this one has A LOT of callers in the OpenSSL code, but was not reached by the fuzzing harness.</li><li>X509_verify_cert: (other ways to reach it are circular - looks like we have to call it directly): check_crl_path <- check_crl <- check_cert <- check_revocation <- verify_chain</li></ul><p></p><h3 style="text-align: left;">Examining coverage</h3><p>Here is what I did:</p><p></p><ul style="text-align: left;"><li>Compiled the fuzzing harness with coverage (added 
-fprofile-instr-generate -fcoverage-mapping) before -DPEDANTIC when building fuzzers.</li><li>Minimized the x.509 fuzzing corpus to speed up the next step:</li><ul><li>./fuzz/x509 -merge=1 fuzz/corpora/x509_min fuzz/corpora/x509</li></ul><li>Ran the executable on all input vectors. This is very slow because while parsing is fast, executable takes time to start up. One solution here could be to use the toolchain from OSS-Fuzz which replaces libFuzzer with a library which triages inputs somewhat like AFL persistent mode.</li><ul><li>for i in corpora/x509_min/*; do ./x509 $i; mv default.profraw $(basename $i).profraw; done</li><li>llvm-profdata-10 merge -sparse *.profraw -o default.profdata</li><li>llvm-cov-10 show --output-dir=cov_html --format=html -instr-profile=default.profdata x509</li></ul><li><b>Update (regarding the persistent mode/perf comment above)</b>: once you build the harness with coverage flags, there is no need to execute each input file separately, one can just use the "runs=N" option of libFuzzer:</li><ul><li>./x509 -runs=3000 ./corpora/x509_min </li></ul></ul><h3 style="text-align: left;">So, why did this all happen?</h3><p>My first (un)educated guess was: fuzzing will waste time in ASN.1 deserialization with little time spent on parsing decoded fields. Turns out, it's slightly worse.</p><p><b>Short answer: the code is not reachable by the current corpus and harness. As there exists an X.509 fuzzer, perhaps developers and other folks assumed it could theoretically reach all parsers, but this is not the case.</b></p><p>The file through which it’s reachable (v3_ncons.c) has little coverage.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZmPkdPrQbYrlvnkTUX8UUTAAtOO3dENjCITAYs2SVOR_l6X4llF5Ffmq5ZCO35qqujbzGnnOssAcAqEaxrJJqzntaRNduQ4bjgwFW6Lc0J_Fifps0Ew7j4MtRGFEyco21iIufDJId_09nS-CaqwInAlO41liCfWNlM2LQZGsDOV3uffbsdnJEACY1/s2197/x509_coverage_overall.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1594" data-original-width="2197" height="461" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZmPkdPrQbYrlvnkTUX8UUTAAtOO3dENjCITAYs2SVOR_l6X4llF5Ffmq5ZCO35qqujbzGnnOssAcAqEaxrJJqzntaRNduQ4bjgwFW6Lc0J_Fifps0Ew7j4MtRGFEyco21iIufDJId_09nS-CaqwInAlO41liCfWNlM2LQZGsDOV3uffbsdnJEACY1/w637-h461/x509_coverage_overall.png" width="637" /></a></div><p><br /></p><p>The specific call chain which we traced to "check_name_constraints" ends up in "crypto/x509/x509_vfy.c" which has ZERO coverage.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtwa7KZtjbOXt_4QIGAqKGkBFdfzDaVx3x5k6s6dl1biSESowF4xDI4c7Ee64_Z_LZ4kXbDQ0YUxYxKuVzRNHnExaXPLm4pQqyVDkdrLr3jv7w7EgTDQ7ElQ8kLYyP3dzT-odLjr9z2k9ThcqfXeFtt5qL3oGGSZMfxHgdDcSLnSL7e_c1Cine3NeO/s1817/x509_coverage_vfy.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="851" data-original-width="1817" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtwa7KZtjbOXt_4QIGAqKGkBFdfzDaVx3x5k6s6dl1biSESowF4xDI4c7Ee64_Z_LZ4kXbDQ0YUxYxKuVzRNHnExaXPLm4pQqyVDkdrLr3jv7w7EgTDQ7ElQ8kLYyP3dzT-odLjr9z2k9ThcqfXeFtt5qL3oGGSZMfxHgdDcSLnSL7e_c1Cine3NeO/w662-h310/x509_coverage_vfy.png" width="662" /></a></div><br /><h3 style="text-align: left;">(Update): "<span style="background-color: #f0f0f0; font-family: monospace; white-space: pre;">verify_chain" is reachable in other fuzzers, but it's still not 
enough.</span></h3><p>Jonathan Metzman from Google's Open Source Security Team pointed me to an OSS-Fuzz dashboard where "verify_chain" is reachable.</p><p></p><ul style="text-align: left;"><li>https://twitter.com/metzmanj/status/1588174176229199873</li><li>https://storage.googleapis.com/oss-fuzz-coverage/openssl/reports/20221031/linux/src/openssl/crypto/x509/x509_vfy.c.html#L221</li></ul><p></p><p>Unfortunately, this is still NOT enough:</p><p></p><ul style="text-align: left;"><li>"verify_chain" is covered 4.6K times whereas the inner loop of the X.509 fuzzer is invoked 12.4 times so it's reachable but NOT by the X.509 fuzzer.</li><li>Whichever harness reaches "verify_chain" (most likely the "server" ssl test but not the X.509 one) needs to be modified to either set up a valid certificate chain for verification or mark the certificate as self-signed so that "build_chain" does not return error</li><li>https://twitter.com/astarasikov/status/1588175841615261702</li></ul><h4 style="text-align: left;">(Update 2): making "verify_chain" and "build_chain" pass (still not there).</h4><div>I modified the X.509 test by adding some code from the "test_self_signed" function I took from "tests". With that, we can pass the "build_chain" and exercise most of "verify_chain". Unfortunately, name verification still requires a well-formed proxy certificate.</div><div><br /></div><div>I think the way to go could be to take the code from "test/sslapitest.c", function "test_client_cert_verify_cb", use the provided certificates as input vectors and fuzz them.</div><div>Ultimately, one needs to add a custom certificate to the untrusted chain and sign the certificate to be verified with it. As one can see, it's a lot of work which requires getting familiar with using OpenSSL.<br /><ul style="text-align: left;"><li>https://gist.github.com/astarasikov/4a60bb17499d4351bb27189e5e8ba8f4</li></ul><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYEcjTMwP1vI02UhmbQ8TLwSFZouIJiiYpOIubw5mwukwdG5D328yCC7p1zb4B75dtvVKRWKLM4umF6yNClDD0QZEjLu0ryQmdytsVqrXejjenwJXG_Ce5uUv9zTOpdO5RJWTGwXn8AJbIuLwkVHFOg9WDJkeMUtp2yoHquojdHB-MOItaqMuWZRWw/s1531/cert_working_but_no_proxy.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1419" data-original-width="1531" height="554" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYEcjTMwP1vI02UhmbQ8TLwSFZouIJiiYpOIubw5mwukwdG5D328yCC7p1zb4B75dtvVKRWKLM4umF6yNClDD0QZEjLu0ryQmdytsVqrXejjenwJXG_Ce5uUv9zTOpdO5RJWTGwXn8AJbIuLwkVHFOg9WDJkeMUtp2yoHquojdHB-MOItaqMuWZRWw/w598-h554/cert_working_but_no_proxy.png" width="598" /></a></div><br /><div><br /></div></div><p></p><h3 style="text-align: left;">What could we try improving?</h3><p></p><ul style="text-align: left;"><li>Write separate parsers for each function (like Hanno did) - for that, it'd be necessary to examine coverage to see 1. where coverage is low and 2. where the code processes decoded ASN.1 elements</li><li>Write a harness to cover X509_verify_cert. Looks like this is currently only called from "test" but not "fuzz" tests. While it may be slow to fuzz the verification, it will definitely cover a larger attack surface.</li><ul><li><b>Update: while this function is reachable via "client" and "server" tests, it returns early. 
To really cover it and the "punycode" parsing, it's necessary to set up certificate chains in these tests, as well as generate valid "proxy" certificates and add them to the corpus.</b></li></ul><li>Periodically examine coverage reports. This is a bit tedious to do manually, but if there was a public access to the OSS-Fuzz coverage report from Google, this would be much easier. Additionally, the OSS-Fuzz Introspector tool could be helpful in identifying roadblocks/unreachable code.</li></ul><p></p><p>Generally, the Introspector does not always work perfectly and is easy to break - it's a static analysis tool so it gets confused by function pointers and cannot infer anything that happens at runtime, like when your code is heavily using C++ classes or a hashtable for lookups. In case of X.509 code, however, it may work fine - function pointers are mostly used by the ASN.1 code for its internals (which we actualy do NOT want to fuzz or review in most cases) whereas the core top-level logic is reachable by direct function calls from the entrypoint - a good candidate for static analysis.</p><p></p><ul style="text-align: left;"><li>https://github.com/ossf/fuzz-introspector</li><li>https://openssf.org/blog/2022/06/09/introducing-fuzz-introspector-an-openssf-tool-to-improve-fuzzing-coverage/</li></ul><p></p><h4 style="text-align: left;">If we had the harness which could theoretically reach X509_verify_cert</h4><p></p><ul style="text-align: left;"><li>Add well-formed (encoded) ASN.1 elements to the dictionary file (-dict= option for libFuzzer). This one is currently not used by the OpenSSL fuzz/helper.py, but at least oids.txt is used by OSS-Fuzz as the dictionary.</li><li>Add well-formed X.509 certificates which make use of the "name constraints" field. And strictly speaking, all other fields too - instead of just storing the libFuzzer-generated corpus in the tree, it would be better to manually provide various inputs exercising difficult functionality. However, as libFuzzer is spending far too much time on ASN.1 and is overloaded with "features", this will likely only uncover new issues during long (days) runs on ClusterFuzz. Whereas parsers for individual leaf functions, as demonstrated by Hanno, can find (some) bugs in mere seconds.</li></ul><p></p><h2 style="text-align: left;">Thanks to:</h2><p>Hanno for his twitter thread for motivating me to look into this.</p><p>My colleagues for introducing me to the Introspector tool.</p><p>P.S. Linking to my evergreen tweet https://twitter.com/astarasikov/status/1122364899362054144</p>I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-78767094643375481192021-08-13T01:31:00.004+03:002021-08-13T01:42:12.462+03:00building CLVK OpenCL support for Android phones + OpenCV notes<h3>
Compiling CLVK for Android.</h3>
<div>Many Android devices, especially Google Pixel, ship without the OpenCL library.</div><div>At some point I needed OpenCL for my OpenCV prototyping, and I was also interested in using either CU2CL or a similar project to run CUDA code.</div><div>Needless to say, as soon as I saw a project which promised to implement OpenCL on top of Vulkan, I decided to see if I can run it on Android.</div><div><br /></div><div>It worked fine, although my approach was kind of nasty: integrating the project along with its LLVM library into the app code. That was good enough for prototyping, although the debug binaries took a few hundred megabytes.</div><div><ul style="text-align: left;"><li>The original project: <a href="https://github.com/kpet/clvk">https://github.com/kpet/clvk</a></li><li>I put my notes here: <a href="https://github.com/astarasikov/clvk/blob/android_test/README_android.md">https://github.com/astarasikov/clvk/blob/android_test/README_android.md</a></li></ul></div>
I've added instructions to cross-compile this project for Android.<br />
Additionally, I wrote a simple Android app to demonstrate how to integrate the pre-built<br />
OpenCL library and how to deploy the OpenCL compiler ("clspv" binary) to the device.<br />
<br />
Two example OpenCL apps are compiled: "clinfo", which prints some basic information, and "BitonicSort" from the Intel OpenCL demos.<br />
FWIW both of them work so it's a good start.<br />
<a href="https://github.com/astarasikov/clvk/tree/android_test">https://github.com/astarasikov/clvk/tree/android_test</a><br />
<br />
Also, it is interesting to compare the run times for this app on the phone with CLVK against the desktop, both with native OpenCL drivers and with CLVK + RADV.<br />
It seems that on the desktop, CLVK with RADV is about 10 times slower than the native driver.<br />
<br />
However, since RADV or any other Vulkan driver uses most of the same LLVM-based codegen as the native OpenCL drivers, this difference is very likely caused by some hardcoded allocation size or another similar parameter rather than some major architectural issue.<br />
I have not looked into it yet for lack of time though.<br />
<br />
<ul>
<li><span style="font-family: inherit;">ARM Mali G72 MP18: <span style="background-color: white; color: #24292e; white-space: pre;">96.818924 ms</span></span></li><li><span style="font-family: inherit;"><span style="background-color: white; color: #24292e; white-space: pre;">Qualcomm Adreno 630: 33.18 ms</span></span></li>
<li><span style="font-family: inherit;"><span style="background-color: white; color: #24292e; white-space: pre;">AMD RX480 with CLVK and RADV: </span><span style="background-color: white; color: #24292e; white-space: pre;">7.491112 ms.</span></span></li>
<li><span style="font-family: inherit;">AMD RX480 with ROCm OpenCL driver: <span style="background-color: white; color: #24292e; white-space: pre;">0.651121 ms.</span></span></li>
</ul>
<br />
<a href="https://github.com/astarasikov/clvk/blob/android_test/results.txt">https://github.com/astarasikov/clvk/blob/android_test/results.txt</a><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3gwEnigJYRmaWlVThfI6DLh1QVyEaSJnBs9KtCzd7Sa4rQWY5BxNHC4mF5TzYBixOK4r4wFqs8g52jPoCZosJ2PaavG8a41S0LQzZqMw7qKWO5BBYC8tjr31H7pl07z55gdPcyh0m0QM/s2048/clvk_second.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1272" data-original-width="2048" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3gwEnigJYRmaWlVThfI6DLh1QVyEaSJnBs9KtCzd7Sa4rQWY5BxNHC4mF5TzYBixOK4r4wFqs8g52jPoCZosJ2PaavG8a41S0LQzZqMw7qKWO5BBYC8tjr31H7pl07z55gdPcyh0m0QM/w640-h398/clvk_second.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">OpenCL with CLVK on top of Android GL driver on a device with no OpenCL</td></tr></tbody></table><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhH0VAoEOnFCNAqGwBpH5ZxMTBTXme6Y31le9WRXh6B_ostlfqvvTFrrQ7nOEb6FrXhXPoqVT82ytywsSIgvpcXOztWUjbV2lkmCl-uex92QDgqblqT_bu_MetjXAYjKgar-UX42vkHxmk/s2048/clvk_first.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1395" data-original-width="2048" height="436" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhH0VAoEOnFCNAqGwBpH5ZxMTBTXme6Y31le9WRXh6B_ostlfqvvTFrrQ7nOEb6FrXhXPoqVT82ytywsSIgvpcXOztWUjbV2lkmCl-uex92QDgqblqT_bu_MetjXAYjKgar-UX42vkHxmk/w640-h436/clvk_first.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">OpenCL with CLVK on top of Android GL driver on a device with no OpenCL<br /><br /></td></tr></tbody></table><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjN2qhCmihTRX6fNCku8tsYgchUQUTwh8BQmbl_3AZbpnwv7d9B-6VIiKfnhtklTJjsqtucm0CFHlqtLtijveU53lPp2yebgfYiYo5nxAyoqUagLKdldKbBcLykiE51Qt81OfmmZ-D6V0Y/s1132/clvk_diff.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="311" data-original-width="1132" height="88" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjN2qhCmihTRX6fNCku8tsYgchUQUTwh8BQmbl_3AZbpnwv7d9B-6VIiKfnhtklTJjsqtucm0CFHlqtLtijveU53lPp2yebgfYiYo5nxAyoqUagLKdldKbBcLykiE51Qt81OfmmZ-D6V0Y/s320/clvk_diff.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Not a real fix, but that's enough to make most OpenCL samples run.</td></tr></tbody></table><br /><div><br />
<h3>
Building OpenCV with OpenCL support for Android.</h3>
For a personal project in 2016 I needed to check whether it was possible to use the GPGPU-accelerated version of OpenCV, which is implemented using OpenCL, on Android phones.<br />
<h3>
OpenCL SDK for Android.</h3>
Where to get the SDK? Welp. Build one yourself!<br />
To make OpenCV recognize our SDK and build successfully, we need the following things:<br />
<ul>
<li>OpenCV.mk - can be empty, but the OpenCV build system needs it to be present</li>
<li>Khronos OpenCL headers - can be gathered from the Khronos OpenCL headers and the C++ bindings (cl.hpp).</li>
<li>The loadable dynamic libraries - can be pulled from the device. Generally you can use one from any other device with the same architecture, because the ABI and API are the same, as they are defined by the OpenCL standard.</li>
</ul>
Here is how out "SDK" tree should look like. For the time being I only used the "armeabi-v7a" architecture, but one can also add the 64-bit binaries to the "arm64-v8a" directory. I've put up the "SDK" to GitHub, but only the header part (which are publicly available from Khronos). You will need to find the "libOpenCL.so" yourself. (If you're using Google Pixel with MSM8996, you can take the proprietary binaries from Xiaomi Mi5 or Zuk Z2 Pro).<br />
<br />
<div class="p1">
<span class="s1"><b>.</b></span></div>
<div class="p1">
<span class="s1"><b>├── OpenCV.mk</b></span></div>
<div class="p1">
<span class="s1"><b>├── include</b></span></div>
<div class="p1">
<span class="s1"><b>│ └── CL</b></span></div>
<div class="p1">
<span class="s1"><b>│ ├── cl.h</b></span></div>
<div class="p1">
<span class="s1"><b>│ ├── cl.hpp</b></span></div>
<div class="p1">
<span class="s1"><b>│ ├── cl_d3d10.h</b></span></div>
<div class="p1">
<span class="s1"><b>│ ├── cl_d3d11.h</b></span></div>
<div class="p1">
<span class="s1"><b>│ ├── cl_dx9_media_sharing.h</b></span></div>
<div class="p1">
<span class="s1"><b>│ ├── cl_dx9_media_sharing_intel.h</b></span></div>
<div class="p1">
<span class="s1"><b>│ ├── cl_egl.h</b></span></div>
<div class="p1">
<span class="s1"><b>│ ├── cl_ext.h</b></span></div>
<div class="p1">
<span class="s1"><b>│ ├── cl_ext_intel.h</b></span></div>
<div class="p1">
<span class="s1"><b>│ ├── cl_gl.h</b></span></div>
<div class="p1">
<span class="s1"><b>│ ├── cl_gl_ext.h</b></span></div>
<div class="p1">
<span class="s1"><b>│ ├── cl_platform.h</b></span></div>
<div class="p1">
<span class="s1"><b>│ ├── cl_va_api_media_sharing_intel.h</b></span></div>
<div class="p1">
<span class="s1"><b>│ └── opencl.h</b></span></div>
<div class="p1">
<span class="s1"><b>└── lib</b></span></div>
<div class="p1">
<span class="s1"><b> └── armeabi-v7a</b></span></div>
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 17.0px 'Ubuntu Mono'; color: #91db9f; background-color: #000000}
span.s1 {font-variant-ligatures: no-common-ligatures}
</style>
<br />
<div class="p1">
<span class="s1"><b> └── libOpenCL.so</b></span></div>
<br />
<h3>
Building OpenCV with OpenCL support.</h3>
<a href="https://gist.github.com/astarasikov/9088745a49401fced5f1a3503b07e593">https://gist.github.com/astarasikov/9088745a49401fced5f1a3503b07e593</a><br />
<br />
The most important change is to actually enable OpenCL on Android in CMakeLists.txt:<br />
<span face="SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace" style="background-color: #ffeef0; color: #b31d28; font-size: 12px; white-space: pre;">OCV_OPTION(WITH_OPENCL "Include OpenCL Runtime support" NOT ANDROID IF (NOT IOS AND NOT WINRT) )</span><br />
<table class="highlight tab-size js-file-line-container" data-tab-size="8" style="background-color: white; border-collapse: collapse; border-spacing: 0px; box-sizing: border-box; color: #24292e; font-family: -apple-system, system-ui, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 14px; tab-size: 8;"><tbody style="box-sizing: border-box;">
<tr style="box-sizing: border-box;"></tr>
<tr style="box-sizing: border-box;"><td class="blob-code blob-code-inner js-file-line" id="file-opencv_android-diff-LC16" style="box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 12px; line-height: 20px; overflow-wrap: normal; overflow: visible; padding: 0px 10px; position: relative; vertical-align: top; white-space: pre;"><span class="pl-mi1" style="background-color: #f0fff4; box-sizing: border-box; color: #22863a;"><span class="pl-mi1" style="box-sizing: border-box;">+</span>OCV_OPTION(WITH_OPENCL "Include OpenCL Runtime support" (NOT IOS AND NOT WINRT) )</span></td></tr>
<tr style="box-sizing: border-box;"><td class="blob-num js-line-number" data-line-number="17" id="file-opencv_android-diff-L17" style="box-sizing: border-box; color: rgba(27, 31, 35, 0.3); cursor: pointer; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 12px; line-height: 20px; min-width: 50px; padding: 0px 10px; text-align: right; user-select: none; vertical-align: top; white-space: nowrap; width: 50px;"></td></tr>
</tbody></table>
I also had to disable some warning flags and unsupported compiler options in "OpenCVCompilerOptions.cmake".<br />
<br />
The remaining changes fall into three groups: compiler options, extra debugging, and CMake options.<br />
<br />
Please also see the following blog post, which describes building OpenCV with OpenCL, although it does not cover the case where you don't have a ready-made SDK: <a href="http://www.ysagade.nl/2014/11/02/opencv-android-setup/">http://www.ysagade.nl/2014/11/02/opencv-android-setup/</a><br />
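To sanity-check the resulting build on the device, something along these lines can be used (a sketch assuming an OpenCV build with OpenCL enabled; the OpenCL path is exercised through the transparent UMat API):<br />
<pre>
// Quick check that OpenCV actually dispatches to OpenCL at runtime.
#include &lt;opencv2/core.hpp&gt;
#include &lt;opencv2/core/ocl.hpp&gt;
#include &lt;opencv2/imgproc.hpp&gt;
#include &lt;cstdio&gt;

int main() {
    if (!cv::ocl::haveOpenCL()) {
        std::printf("OpenCV did not find an OpenCL runtime\n");
        return 1;
    }
    cv::ocl::setUseOpenCL(true);
    std::printf("OpenCL device: %s\n",
                cv::ocl::Device::getDefault().name().c_str());

    // Operations on UMat are dispatched to the OpenCL backend when available.
    cv::UMat src(1080, 1920, CV_8UC1, cv::Scalar(0)), dst;
    cv::GaussianBlur(src, dst, cv::Size(5, 5), 1.5);
    return 0;
}
</pre>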
<h3>
OpenCL without root.</h3>
<div>
Many vendors who ship phones with official Android (which passes the CTS, the Compatibility Test Suite) do not ship the OpenCL drivers. Some vendors (most Chinese ones such as Xiaomi and Zuk, and also Sony), however, do ship the drivers. If you have root on your phone or are building a custom firmware, you can just take the binaries from another device's ROM and that's it.<br />
<h4>
Dynamic linking banned by Google?</h4>
Currently it seems that one can still use mmap() and mprotect() to write their own dynamic library loader, but this might get patched in the future because Google is tightening both security and control.<br />
<br />
What could we do then?<br />
In principle, we could develop an application that would take a bunch of ".so" libraries and produce a single object file (.o) containing the data and code from all the libraries, with all dynamic symbols resolved. In other words, write a static compile-time linker.<br />
<br />
The ultimate obstacle to this approach would be if device vendors re-worked the driver architecture in such a way that the OpenCL frontend would not be loaded into the application address space but into a separate server process. If that happens, the only way forward would be using a device that ships the OpenCL driver or building a custom firmware.<br />
<br /></div>
Package all the relevant ".so" objects directly to your application APK.<br />
Paths. For this prototype, I manually edited the paths to the libraries using a hex editor. This part can be automated, but it was good enough for the proof of concept stage.<br />
Essentially packaging a good half of the other phone's firmware into your app :)<br />
<br />
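A minimal sketch of how the packaged library can then be picked up at runtime with dlopen()/dlsym() (the library path below is a made-up example; the real one depends on where your APK's native libraries end up):<br />
<pre>
/* Load a vendor libOpenCL.so that was packaged into the APK and resolve
 * one entry point manually. The path passed by the caller is hypothetical. */
#include &lt;dlfcn.h&gt;
#include &lt;stdio.h&gt;
#include &lt;CL/cl.h&gt;

typedef cl_int (*clGetPlatformIDs_fn)(cl_uint, cl_platform_id *, cl_uint *);

int probe_opencl(const char *path) {
    void *lib = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!lib) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }
    clGetPlatformIDs_fn get_platforms =
        (clGetPlatformIDs_fn)dlsym(lib, "clGetPlatformIDs");
    cl_uint count = 0;
    if (get_platforms && get_platforms(0, NULL, &count) == CL_SUCCESS)
        printf("OpenCL platforms: %u\n", count);
    dlclose(lib);
    return 0;
}

/* e.g. probe_opencl("/data/data/com.example.gpgpu/lib/libOpenCL.so"); */
</pre>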
<h4>
Tweaking the application makefiles.</h4>I edited Android.mk to specify the STL version and disable exceptions.<br /><h4>The fail.</h4>
<div>The OpenCL library loaded fine, but now we had two independent EGL/OpenGL contexts - one from GL and one from CL - with no standard way of sharing textures between them.</div><div>(2020 edit) In retrospect, I could have hooked the "open" routine to steal the KGSL file descriptor from GL for CL, but there still might have been some globals shared between the libraries.</div>
<h4>
Zero-copy memory sharing.</h4>
It should be noted that all modern SoCs share RAM between the CPU, GPU and other units (such as the camera frontend). Moreover, on more recent models, there is guaranteed cache coherency between the units (typically maintained by the AXI bus). Android uses the kernel-side API called ION to manage memory buffers shared between the devices.<br />
<br />
It should be possible to map the GPU texture into CPU memory using either an OpenGL extension or the underlying GraphicBuffer API, which is a higher-level interface to ION. I may look into this option in the future; for now, I've decided to use a phone that ships with the OpenCL driver.<br />
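For what it's worth, on modern Android the same idea is exposed through the NDK AHardwareBuffer API (a wrapper over gralloc/ION, available since API level 26). A rough sketch of allocating a buffer that both the CPU and GPU can see:<br />
<pre>
/* Allocate a gralloc/ION-backed buffer and map it for CPU writes.
 * Binding it to a GL or CL image is a separate step (e.g. via EGLImage). */
#include &lt;android/hardware_buffer.h&gt;
#include &lt;string.h&gt;

int fill_shared_buffer(void) {
    AHardwareBuffer_Desc desc = {};
    desc.width = 1920;
    desc.height = 1080;
    desc.layers = 1;
    desc.format = AHARDWAREBUFFER_FORMAT_R8G8B8A8_UNORM;
    desc.usage = AHARDWAREBUFFER_USAGE_CPU_WRITE_OFTEN |
                 AHARDWAREBUFFER_USAGE_GPU_SAMPLED_IMAGE;

    AHardwareBuffer *buf = NULL;
    if (AHardwareBuffer_allocate(&desc, &buf) != 0)
        return -1;

    AHardwareBuffer_describe(buf, &desc); /* refresh: stride may differ from width */

    void *addr = NULL;
    if (AHardwareBuffer_lock(buf, AHARDWAREBUFFER_USAGE_CPU_WRITE_OFTEN,
                             -1 /* no fence */, NULL /* whole buffer */, &addr) == 0) {
        memset(addr, 0, (size_t)desc.stride * desc.height * 4); /* CPU-side write */
        AHardwareBuffer_unlock(buf, NULL);
    }
    AHardwareBuffer_release(buf);
    return 0;
}
</pre>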
<br />
One other direction to explore might be using the Java API. I was able to create a GraphicBuffer from Java using reflection <a href="https://gist.github.com/astarasikov/2ebd216fcafa389174b58c6d9913e397">https://gist.github.com/astarasikov/2ebd216fcafa389174b58c6d9913e397</a> .<br />
Apparently John Carmack has managed to do it the other way round, using the SurfaceTexture API ("SurfaceTexture -> Surface, pass the surface in an intent to the other process, turn that into an ANativeWindow -> EGLSurface") <a href="https://twitter.com/ID_AA_Carmack/status/776099691586998272">https://twitter.com/ID_AA_Carmack/status/776099691586998272</a>.</div><div><br /></div><div>EDIT 2021:</div><div>You really should just use the ImageReader class, that's it.</div><div><a href="http://nezarobot.blogspot.com/2016/03/android-surfacetexture-camera2-opencv.html">http://nezarobot.blogspot.com/2016/03/android-surfacetexture-camera2-opencv.html</a></div><div><h3>
Further thoughts.</h3>
Sadly, it looks like there's not much we (the users/independent developers) can do to change the situation. Some projects like OpenCV indeed contain a lot of code that would be difficult to port to other languages, especially if you don't want to introduce additional bugs in the process.<br />
<br />
Perhaps as a developer the best plan is to limit your app to the phones that support OpenCL out of the box, or to build a custom firmware for the phone to add the drivers. While this limits your app to a very narrow set of devices, it will allow you to reuse existing OpenCL code and build a working prototype quickly, which is crucial for many projects at the early stage.<br />
<br />
P.S.<br />
This line is from the blog draft in 2017 before CLVK came out :)<br />
<br />
I think an interesting direction to explore would be to create the custom OpenCL/CUDA driver and runtime that would generate code in GLSL (for OpenCL ES with Compute Shaders) or Metal.</div>I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-11160507681696597982020-08-05T17:22:00.000+03:002020-08-05T17:22:29.975+03:00SVE-2019-15230: A bug collision<div>Researchers from Team T5 recently published their write-up on exploiting a bug in S-Boot and obtaining code execution in the Samsung Secure Bootloader (S-Boot).</div><div>This week, they're going to present it at the BlackHat 2020 conference.</div><div>Their write-up contains a lot of technical details and I recommend you to read it.</div><div><a href="https://teamt5.org/en/posts/blackhat-s-talk-breaking-samsung-s-root-of-trust-exploiting-samsung-secure-boot/">https://teamt5.org/en/posts/blackhat-s-talk-breaking-samsung-s-root-of-trust-exploiting-samsung-secure-boot/</a></div><div><br /></div><div>In their report, they say "Another security research team found this vulnerability at the same time and report it to Samsung. ID: SVE-2019-15230".</div><div>This one-man security research team was me.</div><div>As I described in the previous two posts, in 2019 I got myself a Samsung Galaxy S10 phone with an Exynos SoC and decided to hunt for security bugs.</div><div>After finding the first issue (which is also my first SVE and my first report rated "Critical"), "SVE-2019-14371", I decided to carefully review the code around the location where I found the first bug.</div><div><br /></div><div>I found an integer overflow which could potentially lead to memory corruption, overriding the entirety of S-Boot code and data.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXGS8qV8DRwcZFhyKCyd1hFNny54_usN1WXqaO2etyW0xMA09AovkK6M44xBegoipw9inMqRhJK5y6SKl5TioRptKsGkfurSZ9VPzM09QwOBLWWDCLSZ7GrBHevjChK6fFB9So6_FtICU/s2048/ghidra_code.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1280" data-original-width="2048" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXGS8qV8DRwcZFhyKCyd1hFNny54_usN1WXqaO2etyW0xMA09AovkK6M44xBegoipw9inMqRhJK5y6SKl5TioRptKsGkfurSZ9VPzM09QwOBLWWDCLSZ7GrBHevjChK6fFB9So6_FtICU/s640/ghidra_code.png" width="640" /></a></div><div><br /></div><div><br /></div><div>Funnily enough, I got the SVE even though I have not submitted the POC which achieves code execution (well, to be fair, I reported it almost two months earlier).</div><div>I have submitted the one which demonstrates that the device handles non-underflowed values correctly whereas a "huge" buffer size causes it to freeze.</div><div>I came to the conclusion it's hard to exploit because I could not find a device with the good memory layout.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNyj4VP-qQZCBqbbwEu1rOaC3jPWwTc7Pu9t5DHwO2cDa3TYoyLoPb-CBVfFuAmlUYCoka4SBQU_f01pOoh6yos9HfwdwKFH_UMZZcABUEWuVhZM7TwPlhMo1hlmKhWDZFFFhTgLB7nyc/s2048/sve_credit.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1063" data-original-width="2048" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNyj4VP-qQZCBqbbwEu1rOaC3jPWwTc7Pu9t5DHwO2cDa3TYoyLoPb-CBVfFuAmlUYCoka4SBQU_f01pOoh6yos9HfwdwKFH_UMZZcABUEWuVhZM7TwPlhMo1hlmKhWDZFFFhTgLB7nyc/s640/sve_credit.png" width="640" /></a></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyqkD1UGSs5Btrjy7IDlbEJu0Edy5sxkFmrXWbVmmaQKVeJgVckYnGxIVlg0OlHqvv30XdxJ_MWHA4LeXYR7CiyZ8jgTztIbvNaReSpUa0ijPsr-i97yC6-kiSPLzL79TjoDlPi3SQzOU/s1396/sve.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="407" data-original-width="1396" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyqkD1UGSs5Btrjy7IDlbEJu0Edy5sxkFmrXWbVmmaQKVeJgVckYnGxIVlg0OlHqvv30XdxJ_MWHA4LeXYR7CiyZ8jgTztIbvNaReSpUa0ijPsr-i97yC6-kiSPLzL79TjoDlPi3SQzOU/s640/sve.png" width="640" /></a></div><div><br /></div><div><br /></div><div>I have downloaded a ton of images: for S10, S9, S7, A50, J series phones.</div><div>Team T5's trick was to find a condition where the error handler code will make the memory layout good.</div><div>I have unfortunately overlooked that in S8 the download buffer is right before the S-Boot code AND it was not using the newer ("compressed" or "smp") download modes.</div><div>Since I never realized how to make that S-Boot falls back to the "legacy" buffer at 0xc0000000, I was focusing on the first underflow here, and came to the conclusion that there's not much I could do about it.</div><div><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnqNR8HL-dLYbMma7c46wQDN__OYHcT035C8_4xbwxykGua9H5XDtZlYctmbyZOmJexR4rxbGNVUyWqI1Jw8dNFj-J-9lzrytKDmNvBD99lfPzZn5iEobElKpmKG_citKZANe4OCj_0bA/s1351/giveup.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="249" data-original-width="1351" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnqNR8HL-dLYbMma7c46wQDN__OYHcT035C8_4xbwxykGua9H5XDtZlYctmbyZOmJexR4rxbGNVUyWqI1Jw8dNFj-J-9lzrytKDmNvBD99lfPzZn5iEobElKpmKG_citKZANe4OCj_0bA/s640/giveup.png" width="640" /></a></div><div><br /></div><div>I wrote in my report to Samsung that USB transfer is done with DMA and I have not seen S-Boot initialize SMMU so it's surely exploitable.</div><div>If we could make the buffer point before S-Boot, of course (which I have not found out how to do).</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDHKDlFlc7YKKoryYyYD4wbGy2seZd8THkOFctoFKh3zt1RQMZn64M0DAcQSb0732d9Qx5w9leQ5BmyqMYskUeIgsq2vwLsF7KxrTsQRhBnj0tjE6qjfJB9ZlaJPU8L2HMKkc_OGrvzMY/s1343/dma.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="307" data-original-width="1343" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDHKDlFlc7YKKoryYyYD4wbGy2seZd8THkOFctoFKh3zt1RQMZn64M0DAcQSb0732d9Qx5w9leQ5BmyqMYskUeIgsq2vwLsF7KxrTsQRhBnj0tjE6qjfJB9ZlaJPU8L2HMKkc_OGrvzMY/s640/dma.png" width="640" /></a></div><div>I like how Team T5 discusses that normally it would be hard to exploit the bug as most code and data could be cached, but as they found a bunch of pointers in an uncached area, overwriting them works, even if it's done by the USB controller (which is not necessarily cache-coherent) and the CPU is unaware.</div><div><br /></div><div>Although it was already the third Critical issue I found in S-Boot by that time and I 
was completely burnt out on trying to develop yet another POC.</div><h3 style="text-align: left;">Is the security of Samsung phones bad? NO.</h3><div>I think it's quite on par with the competitors. While the Android Security Updates page (https://security.samsungmobile.com/securityUpdate.smsb) regularly lists "High" and "Critical" issues, very few of them happen in the S-Boot bootloader (which is one of the earliest pieces of code which execute on the device). What it means is that the attacker needs to first unlock the phone. So keeping your phone locked and BT/WLAN off should give a reasonable level of protection in settings where you can't keep an eye on your phone.</div><div><br /></div><div>There are of course things that could be improved, like adding some mitigations to the bootloader (stack cookies, heap guard pages). However, in this case this would not help at all, because the memory copy (and thus overwriting code/data) is not done by the S-Boot code, but by the USB DMA controller.</div><div><br /></div><div>Bootloaders almost never implement ASLR, and even kernels which do only randomize virtual addresses; the physical address remains constant or predictable. In fact, as it's overwriting uncached code (exception handlers), the attack could even work if the CPU supported ARM MTE. So this is in some sense the mother of all S-Boot bugs.</div><div><br /></div><div>In some respects, the root cause here is very similar to the IROM bug found by Frederic Basse: an integer overflow and copying data from USB, although in the case of IROM it seems that IROM's code is copying USB data in small chunks.</div><div><a href="https://fredericb.info/2020/06/exynos-usbdl-unsigned-code-loader-for-exynos-bootrom.html#exynos-usbdl-unsigned-code-loader-for-exynos-bootrom">https://fredericb.info/2020/06/exynos-usbdl-unsigned-code-loader-for-exynos-bootrom.html#exynos-usbdl-unsigned-code-loader-for-exynos-bootrom</a></div><div><br /></div><div>Of course it would not be fair to judge the design decisions now that we know about these bugs. It would be nice to add some checks to ensure DMA regions don't intersect with code/data. Enabling the SysMMU would not hurt. This is becoming somewhat worrisome now that USB4 has been announced with Thunderbolt-like DMA capabilities. It's unfortunate that most bootloaders do not focus on this.</div><div><br /></div><div>Now, the problem is that adding mitigations is quite hard, as is reasoning about their effectiveness. Without memory safety you can never be sure that the code is not exploitable. And as we've just seen, even hardware features oriented at memory safety are likely to be bypassed. Maybe a combination of MTE/KASAN and instrumenting all DMA memory management would work, but again it relies on the individual developers thinking of all corner cases. 
At some point bootloaders/firmwares could become as complex as the Linux kernel itself.</div><div>To this end, an interesting approach is moving firmware update, including USB download, to user-space and indeed running it from Linux, as Google started doing recently.</div><div><a href="https://source.android.com/devices/bootloader/fastbootd">https://source.android.com/devices/bootloader/fastbootd</a></div><div><br /></div><h3 style="text-align: left;">Why are the bugs present then?</h3><div>It often happens that bugs appear in two areas: the just-written code with new functionality (which not so many people had a chance to review) or the old code (people got tired of trying to find bugs there and gave up).</div><div><br /></div><div>I think we can draw two conclusions from this.</div><div><ul style="text-align: left;"><li>As a vendor, one should not assume that security or code review is a one-off effort and they need to re-review their stuff once in a while, especially bringing in the perspective of new team members.</li><li>As the user, if you expect maximum security, perhaps don't switch to the new tech immediately, give it 2-3 months for the most obvious and annoying issues (not only security bugs) to get ironed out.</li></ul></div><h3 style="text-align: left;">What can I do to protect myself?</h3><div>Enable the "Find My Mobile" feature of your Samsung phone.</div><div><br /></div><div>Before I and TeamT5 reported a few issues to Samsung, S-Boot implemented one mechanism of preventing the phone from being tampered with: when MDM (device administration by a workspace or school) is enabled, certain partitions were not allowed to be flashed.</div><div>This, unfortunately, did not protect the user in two scenarios: when they were not using MDM (which is the majoriy of users) or when the vulnerability is in the USB stack before the flashing code (in which case even enabling MDM would buy you nothing).</div><div>In mid-2019, Samsung changed the "Find My Mobile" in such way that it would disallow any USB operations (such as ODIN mode) when the device is not unlocked and FMM is enabled.</div><div>This should provide a reasonable level of protection for most users.</div><div>It is now much harder for an attacker to hack your phone in a few seconds when you're not looking at it (such as at a conference or in the hotel room). Of course, if is still possible to re-program the storage chip by desoldering it, which could expose other potential vulnerabilities, but it's hard to do it quickly and without tamper evidence.</div><h3 style="text-align: left;">Will you blog about other findings?</h3><div>Not sure. It's a tricky question.</div><div>In general, Samsung is asking to not disclose the issues, at least until the patching and reward process is done, which is understandable. I am grateful that Samsung allowed me to participate in their security reporting program, even though I work for another mobile SoC company, so I'm not particularly interested in making this relationship go sour.</div><div><br /></div><div>I do, however, see the immense value in describing the bug contents, because personally for me write-ups on Phrack or later Google Project Zero have been useful for understanding how attackers think, which came handy when both writing code and later working as a security engineer.</div><div><br /></div><div>I guess vendors can have their own reasons to not like when bugs are disclosed. Before working as a security engineer myself, I thought negative PR/news was the major reason. 
Turns out, it's impossible to predict this factor, and often the minor bugs are over-hyped but serious ones go unnoticed. So maybe this factor is not <b>that</b> important after all.</div><div><br /></div><div>Another thing I've noticed is that there is a significant interest in some hacking and "mobile repair" forums in bugs even for older firmware revisions, which means there are regions where people rarely update (expensive internet) or... a source of phones which for <b>some</b> reason remain unused and therefore not updated for a while.</div><div><br /></div><div>Given that mobile phones are supported (security updates and carrier contracts), maybe half of this term (1.5-2 years from the time the bug is patched) is a reasonable delay for holding off disclosure.</div><div><br /></div>I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com8tag:blogger.com,1999:blog-1416969656167659289.post-88082777889029786872020-05-05T03:14:00.002+03:002020-05-05T03:53:29.111+03:00On Samsung and Exynos hacking, again<h2>
Introduction.</h2>
Last year I published a post (<a href="http://allsoftwaresucks.blogspot.com/2019/05/reverse-engineering-samsung-exynos-9820.html">http://allsoftwaresucks.blogspot.com/2019/05/reverse-engineering-samsung-exynos-9820.html</a>) about reverse-engineering TEEGRIS and S-Boot<br />
on Samsung Exynos Galaxy S10. This is kind of a follow-up to that post<br />
which has received a lot of attention and led to interesting conversations<br />
with fellow security researchers.<br />
<br />
Funnily enough, this very blog with its distinctive URL got into academic papers and<br />
conference talks.<br />
I guess that counts as a success because that's more citations than all my<br />
previous academic work combined. Slowly but steadily I'm progressing on track to receive my PhD from the Shitposting University.<br />
<h3>
Citations.</h3>
(All links have been retrieved on 2020-04-17).<br />
<a href="https://gsec.hitb.org/materials/sg2019/D2%20-%20Launching%20Feedback-Driven%20Fuzzing%20on%20TrustZone%20TEE%20-%20Andrey%20Akimov.pdf">https://gsec.hitb.org/materials/sg2019/D2%20-%20Launching%20Feedback-Driven%20Fuzzing%20on%20TrustZone%20TEE%20-%20Andrey%20Akimov.pdf</a><br />
Andrey Akimov: Launching feedback-driven fuzzing on TrustZone TEE (HITB GSEC 2019 Singapore).<br />
<br />
<a href="https://zeronights.ru/wp-content/themes/zeronights-2019/public/materials/5_ZN2019_andrej_akimovLaunching_feedbackdriven_fuzzing_on_TrustZone_TEE.pdf">https://zeronights.ru/wp-content/themes/zeronights-2019/public/materials/5_ZN2019_andrej_akimovLaunching_feedbackdriven_fuzzing_on_TrustZone_TEE.pdf</a><br />
Andrey Akimov: Launching Feedback-Driven Fuzzing on TrustZone TEE (ZeroNights 2019).<br />
<br />
<a href="https://blog.quarkslab.com/a-deep-dive-into-samsungs-trustzone-part-1.html">https://blog.quarkslab.com/a-deep-dive-into-samsungs-trustzone-part-1.html</a><br />
Alexandre Adamski, Joffrey Guilbon, Maxime Peterlin of Quarkslab: A Deep Dive Into Samsung's TrustZone (Part 1).<br />
<br />
<a href="https://www.usenix.org/system/files/sec20summer_harrison_prepub.pdf">https://www.usenix.org/system/files/sec20summer_harrison_prepub.pdf</a><br />
Lee Harrison, Hayawardh Vijayakumar, Rohan Padhye, Koushik Sen, and Michael Grace: PARTEMU: Enabling Dynamic Analysis of Real-World TrustZone Software Using Emulation.<br />
<br />
<a href="https://www.ndss-symposium.org/wp-content/uploads/2020/04/bar2020-23014.pdf">https://www.ndss-symposium.org/wp-content/uploads/2020/04/bar2020-23014.pdf</a><br />
Marcel Busch and Kalle Dirsch: Finding 1-Day Vulnerabilities in Trusted Applications using Selective Symbolic Execution.<br />
<h3>
Follow-up on reverse-engineering and security research.</h3>
I also found a few bugs in TEEGRIS and S-Boot that got assigned CVEs by Samsung (check 2019/2020 <a href="https://security.samsungmobile.com/securityUpdate.smsb">here</a>).<br />
I'm somewhat happy about this achievement. Prior to that, I mostly worked on<br />
the defense side, first implementing mitigations/OS kernels and then debugging<br />
security issues submitted by other researchers. So I was glad to receive this<br />
external validation of my ability to find bugs on my own, although I was a little<br />
bit surprised at how easy it was to find them by reviewing the code decompiled<br />
with Ghidra.<br />
<br />
I have not really found any bugs with fuzzing using the QEMU emulators for<br />
S-Boot and TEEGRIS described in my previous blog post. However, these came in<br />
handy for debugging proof-of-concepts, as I could use GDB and dump memory as if<br />
it was just a regular Linux app on the PC.<br />
<br />
I would also like to draw your attention to this paper on Phrack<br />
about emulating RKP (the Samsung hypervisor) with QEMU, by Aris Thallas.<br />
<a href="http://phrack.org/papers/emulating_hypervisors_samsung_rkp.html">http://phrack.org/papers/emulating_hypervisors_samsung_rkp.html</a><br />
<br />
I have used a similar approach with full-system QEMU emulation for debugging some RKP bugs.<br />
However, after having spent so much effort on emulating S-Boot and TEEGRIS,<br />
I was not in the mood to boot Linux in EL1 and put all the pieces together.<br />
I used a different approach for testing Hypervisor Calls (HVCs). Instead<br />
of having a proper EL1 client, I wrote a piece of C code that invoked the<br />
EL2 exception handler directly. I then linked it to the address of some<br />
uninteresting function in RKP and used GDB to overwrite the code in QEMU<br />
memory and jump to my stub.<br />
<br />
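A sketch of the idea follows; the handler address and command number are made-up placeholders, and in practice the stub was linked against the RKP image rather than using a hardcoded address:<br />
<pre>
/* Call the EL2 exception handler directly instead of issuing a real HVC.
 * RKP_HANDLER_ADDR and RKP_CMD_TEST are hypothetical placeholders taken from
 * the disassembled RKP image, not real values. */
typedef unsigned long long u64;
typedef u64 (*rkp_handler_fn)(u64 cmd, u64 arg0, u64 arg1, u64 arg2);

#define RKP_HANDLER_ADDR 0x87001000ULL
#define RKP_CMD_TEST     0x42ULL

u64 call_rkp(u64 arg0, u64 arg1, u64 arg2) {
    rkp_handler_fn handler = (rkp_handler_fn)RKP_HANDLER_ADDR;
    return handler(RKP_CMD_TEST, arg0, arg1, arg2);
}
</pre>
<br />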
I especially like the part about using QEMU instrumentation to provide<br />
coverage information to AFL.<br />
I have also implemented a similar approach (based on the QEMU and Unicorn modes<br />
from the AFL source tree) for my TEEGRIS QEMU emulator.<br />
<a href="https://github.com/astarasikov/qemu/commits/teegris_usermode_persist_rewriteafl">https://github.com/astarasikov/qemu/commits/teegris_usermode_persist_rewriteafl</a><br />
<a href="https://twitter.com/astarasikov/status/1187902865710428160">https://twitter.com/astarasikov/status/1187902865710428160</a><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpJpAVmSEgZnveYZhDbBf6690Izu78IWqwq9ZiGMJbeETSecVqzq1KqHY-qiHU23chUDNK0bEAd0RHNKUlnfPL0RnPP6xgxnf4va2YEjKYOIRIT5koNo6cH-9WM4qgTxXV_12sxwWdvHI/s1600/teegris_afl_persistent.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1489" data-original-width="1213" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpJpAVmSEgZnveYZhDbBf6690Izu78IWqwq9ZiGMJbeETSecVqzq1KqHY-qiHU23chUDNK0bEAd0RHNKUlnfPL0RnPP6xgxnf4va2YEjKYOIRIT5koNo6cH-9WM4qgTxXV_12sxwWdvHI/s320/teegris_afl_persistent.png" width="260" /></a></div>
<br />
<br />
Unfortunately, I have not found any bugs with fuzzing (although I have with code review).<br />
I believe better results could be achieved with the CompareCoverage plugin which<br />
would prevent the fuzzer from getting stuck on magic values/constants.<br />
<a href="https://andreafioraldi.github.io/articles/2019/07/20/aflpp-qemu-compcov.html">https://andreafioraldi.github.io/articles/2019/07/20/aflpp-qemu-compcov.html</a><br />
Additionally, please check out this blog post about implementing ASAN (AddressSanitizer)<br />
for binary-mode QEMU within the TCG interpreter/JIT.<br />
<a href="https://andreafioraldi.github.io/articles/2019/12/20/sanitized-emulation-with-qasan.html">https://andreafioraldi.github.io/articles/2019/12/20/sanitized-emulation-with-qasan.html</a><br />
<br />
Finally, if you're interested in fuzzing at the source-code level and are<br />
getting stuck with magic values/constants, please check out this<br />
post from 2016 about a strategy for splitting comparisons (which is related<br />
to CompareCoverage).<br />
<a href="https://lafintel.wordpress.com/">https://lafintel.wordpress.com/</a><br />
<br />
This is already implemented in libFuzzer, but<br />
if you have to use AFL, consider using AFL++ which maintains LLVM plugins<br />
for these strategies. In any case, check out AFL++ because it attempts to unify<br />
most of the forks developed in academia.<br />
<a href="https://github.com/AFLplusplus/AFLplusplus">https://github.com/AFLplusplus/AFLplusplus</a><br />
<h2>
Other interesting news.</h2>
<h3>
I9100 (Samsung Galaxy S2) upstream work.</h3>
I was surprised when I got a GitHub notification in 2020 about a project I have<br />
not worked on before. Turns out, people have been resurrecting the work I had<br />
done back in 2012, which was a nice surprise.<br />
<br />
In 2012 I was doing some work on getting FOSS software to run on<br />
the Samsung Galaxy S2 phone. It was a hobby project: I got this phone<br />
after completing my work on porting Linux and Android to the Sony Xperia X1 and<br />
hoped that starting with a device which ran Linux out of the box would be<br />
advantageous for this goal.<br />
<br />
So the first problem I solved was getting multi-boot working.<br />
I did it by porting the U-Boot bootloader.<br />
This eventually resulted in a weird chain of events that landed me several interesting<br />
jobs and gigs.<br />
<br />
Anyway, the U-Boot links:<br />
<br />
<ul>
<li><a href="https://forum.xda-developers.com/galaxy-s2/general/uboot-bootloader-true-multiboot-t1680898">https://forum.xda-developers.com/galaxy-s2/general/uboot-bootloader-true-multiboot-t1680898</a></li>
<li><a href="https://forum.xda-developers.com/galaxy-nexus/development/bootloader-boot-multi-boot-support-t2201146">https://forum.xda-developers.com/galaxy-nexus/development/bootloader-boot-multi-boot-support-t2201146</a></li>
<li><a href="https://github.com/astarasikov/i9100-uboot">https://github.com/astarasikov/i9100-uboot</a></li>
<li><a href="https://github.com/astarasikov/uboot-tuna">https://github.com/astarasikov/uboot-tuna</a></li>
</ul>
<br />
I then attempted porting the Galaxy S2 board support to the mainline kernel tree.<br />
I was using the latest Linaro tree. I had some limited success in getting most<br />
hardware working with upstream drivers (WIFI, Camera with V4L2) and by porting<br />
some non-upstream ones (Sound, Modem).<br />
<a href="https://github.com/astarasikov/i9100-proper-linux-kernel/commits/i9100_linaro_33">https://github.com/astarasikov/i9100-proper-linux-kernel/commits/i9100_linaro_33</a><br />
<br />
Eventually I had to resort to using the Android kernel with some changes<br />
but I got dual-boot working with Ubuntu on the SD Card.<br />
<br />
Native Ubuntu (with X11) on Samsung Galaxy S2 (2012)<br />
<a href="https://www.youtube.com/watch?v=VHl8PytVt50">https://www.youtube.com/watch?v=VHl8PytVt50</a><br />
<br />
Back in 2012 I made a post to summarize my efforts related to S2.<br />
<a href="https://www.mail-archive.com/smartphones-userland@linuxtogo.org/msg02865.html">https://www.mail-archive.com/smartphones-userland@linuxtogo.org/msg02865.html</a><br />
<h3>
Mainline linux port by Sekil</h3>
Fast-forward to 2020: I was surprised to learn that not only are people still<br />
using the device, they are also using my U-Boot port, and one developer even<br />
went as far as resurrecting the attempts to run the mainline Linux tree.<br />
They made great progress and independently authored patches for the mainline<br />
tree which have a high chance of being accepted.<br />
<br />
See this port by Evgeniy Stenkin.<br />
<br />
<ul>
<li><a href="https://forum.xda-developers.com/galaxy-s2/general/port-run-mainline-linux-kernel-t3901190">https://forum.xda-developers.com/galaxy-s2/general/port-run-mainline-linux-kernel-t3901190</a></li>
<li><a href="https://github.com/Sekilsgs2/i9100_kernel_mainline_port">https://github.com/Sekilsgs2/i9100_kernel_mainline_port</a></li>
<li><a href="https://github.com/Sekilsgs2/i9100-uboot">https://github.com/Sekilsgs2/i9100-uboot</a></li>
<li><a href="https://lkml.org/lkml/2020/3/18/1160">https://lkml.org/lkml/2020/3/18/1160</a></li>
</ul>
<br />
This effort is acknowledged and is used by the PostmarketOS project.<br />
<a href="https://wiki.postmarketos.org/wiki/Samsung_Galaxy_SII_(samsung-i9100)">https://wiki.postmarketos.org/wiki/Samsung_Galaxy_SII_(samsung-i9100)</a><br />
<h3>
FOSS RIL for Samsung Galaxy S2, Galaxy Nexus</h3>
Later, my focus switched to reverse-engineering the userspace libraries<br />
in order to provide a fully open-source build of Android for Samsung Galaxy Nexus,<br />
a device which shared the modem with Galaxy S2.<br />
<br />
For the previous-generation phone (Galaxy S1, I9000) an open-source implementation<br />
of the Radio Interface Layer (RIL) was provided by the engineers from the Replicant<br />
and OpenMoko projects (Paul Kocialkowski, Simon Busch and morphis).<br />
<br />
In 2012 I was asked by Ksys Labs to provide an open-source RIL for Samsung<br />
Galaxy Nexus which happened to have the same modem as Galaxy S2.<br />
So I have done the following:<br />
<br />
<ul>
<li>Firmware loader for these modems (based on reverse-engineering and a C++ implementation by another engineer)</li>
<li>Fixing SMS character encoding so that we could receive SMS in Russian</li>
<li>Fixing some edge cases for USSD support</li>
<li>Providing some rudimentary socket callback protocol so that a proprietary GPS library could be used by those who really wanted to.</li>
</ul>
<br />
These changes have been fully integrated into the Replicant project<br />
and served as the basis for supporting many more Samsung modems.<br />
Some builds of LineageOS for Galaxy S3 also use these libraries from the<br />
Replicant project to avoid the overhead of supporting the ABI for the<br />
proprietary driver libraries from 2012.<br />
<br />
<ul>
<li><a href="https://git.replicant.us/replicant/hardware_replicant_libsamsung-ipc/">https://git.replicant.us/replicant/hardware_replicant_libsamsung-ipc/</a></li>
<li><a href="https://github.com/astarasikov/libsamsung-ipc">https://github.com/astarasikov/libsamsung-ipc</a></li>
<li><a href="https://github.com/astarasikov/samsung-xmm6260-fw-loader">https://github.com/astarasikov/samsung-xmm6260-fw-loader</a></li>
</ul>
<br />
I even saw the Replicant stand at the CCC last year, so these phones<br />
are living on.<br />
However, the dream of supporting them in a non-Android setting such as oFono<br />
seems to have never materialized. Oh well.<br />
<h3>
Summary</h3>
I am happy to see that my work on both U-Boot and RIL got reused by many projects.<br />
Back in 2012 having your phone run upstream software was a very ambitious goal,<br />
especially for a single developer. It usually took around a year and a half<br />
to get familiar with all the hardware and reverse-engineer it to a decent level<br />
in order to develop all the support, by which time the device would already be obsolete.<br />
However, if you're more interested in using upstream SW than using the latest<br />
HW, there is some hope.<br />
<br />
Oh, and Pinephone looks like a nice alternative these days. The hardware is similar to Galaxy S2, but the CPU is 64-bit and it's FOSS out of the box.<br />
<h3>
U-Boot without the proprietary bootloader.</h3>
Here's another interesting development that happened in those years to another<br />
related Exynos device (Galaxy S3 I9300).<br />
Simon Shields ported U-Boot to Galaxy S3, but unlike my port this one<br />
does not rely on the Samsung bootloader in any way and allows booting the phone<br />
with even fewer proprietary components.<br />
https://blog.forkwhiletrue.me/posts/u-boot-on-galaxy-s3/<br />
<br />
Back when I was porting U-Boot to S2, I flashed it into the Linux kernel<br />
partition and made it so that it's loaded by the phone's original bootloader.<br />
My motivation was to avoid bricking the device (back when it was not known<br />
how to use Exynos USB recovery mode) and it was assumed that the bootloader<br />
needed to be signed. As it turned out later around 2014, on these early<br />
Exynos chips the initial bootloader shared the same signing key and device<br />
ID with development boards and it was possible to work around the signing<br />
requirement and replace the original bootloader by using the stage-1 bootloader<br />
from a development board.<br />
<h2>
KVM on the phone.</h2>
Ever since working on the ARM para-virtualization with L4/Genode I wanted<br />
to use real virtualization.<br />
I was very enthusiastic about the first (32-bit) ARM boards with the HYP extension<br />
when they arrived in 2013.<br />
<a href="http://allsoftwaresucks.blogspot.com/2013/11/kvm-on-arm-cortex-a15-omap5432-uevm.html">http://allsoftwaresucks.blogspot.com/2013/11/kvm-on-arm-cortex-a15-omap5432-uevm.html</a><br />
<br />
Since then, I've always wanted to get virtualization working on a mobile phone<br />
for the fun of running multiple operating systems.<br />
Unfortunately, most of them enable "secure" booting and require that the EL2<br />
hypervisor image is signed by the OEM.<br />
<br />
Some early phones did not implement a hypervisor or left it writable by the OS,<br />
but I was wondering if I could do the same on a fairly recent and powerful phone.<br />
<br />
Here's some small showcase of an attempt to run Windows 10 in KVM on a Samsung<br />
A50 phone with the Exynos9610 CPU.<br />
<br />
The bug I found works only on the unlocked phone (with KNOX tripped/fuse blown) before Linux MMU<br />
is on. In principle one might be able to find a variant that works with MMU on,<br />
but even passing arbitrary arguments to RKP would require compromising (rooting)<br />
Linux first. Therefore, this bug does not (IMHO) have a big security impact<br />
(because on older generation Exynos RKP/EL2 was only used for the kernel<br />
memory protection and ROPP/JOPP but not for IOMMU) but is interesting for research purposes.<br />
<br />
This is in no way a statement on the security of Samsung devices. I think<br />
their efforts are definitely above average for Android. However, given enough time<br />
any system can be broken, even the ones previously regarded as unbreakable such<br />
as the PS4 or the iPhone with PAC. What matters is patching in a timely manner before the issues<br />
get disclosed, and it looks like things have improved a lot in the Android world recently.<br />
<a href="https://www.zdnet.com/article/android-oem-patch-rates-have-improved-with-nokia-and-google-leading-the-charge/">https://www.zdnet.com/article/android-oem-patch-rates-have-improved-with-nokia-and-google-leading-the-charge/</a><br />
<br />
The bug has been patched in October 2019 anyway so users with the latest updates<br />
should not be affected (SVE-2019-15221, SVE-2019-15143).<br />
<br />
What I've also learnt from watching a lot of talks and following the discussions<br />
by other researchers is that security issues often concentrate in two areas:<br />
where no one has looked before, and where many people have looked and then gave<br />
up because they decided that they had found all the low-hanging fruit. So RKP seemed<br />
like an interesting target given the previous research from Google Project Zero<br />
in 2017 (<a href="https://googleprojectzero.blogspot.com/2017/02/lifting-hyper-visor-bypassing-samsungs.html">https://googleprojectzero.blogspot.com/2017/02/lifting-hyper-visor-bypassing-samsungs.html</a>).<br />
<br />
I will not be providing additional details on that bug but here are some nice<br />
screenshots and videos:<br />
<br />
Ubuntu X11 running on Samsung Galaxy A50. KVM guest runs Windows 10.<br />
Here, we can see that the colors are swapped as the framebuffer driver is configured<br />
to output BGR instead of RGB by default in Android.<br />
<br />
Video of UEFI booting Windows 10 installer in KVM.<br />
<a href="https://twitter.com/astarasikov/status/1249904283098796033">https://twitter.com/astarasikov/status/1249904283098796033</a><br />
<br />
A mysterious BSOD (yes it's actually supposed to be blue) in the USB controller<br />
driver, possibly related to how the controller is emulated in QEMU.<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEyP1jtxHY09xTTLqAiJQVCrb6gvEXbI0wjku5Kjj_h9R5tt29k109e0NXRvnVLZxWWAS8NsLMlentc-az3XMRDbkQcg2UUbD9GAPU0juuLemviO2Mg6NZDoBJca7SSxL7LXZhXshvD1U/s1600/a505f_kvm_w10_bsod.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1594" data-original-width="1196" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEyP1jtxHY09xTTLqAiJQVCrb6gvEXbI0wjku5Kjj_h9R5tt29k109e0NXRvnVLZxWWAS8NsLMlentc-az3XMRDbkQcg2UUbD9GAPU0juuLemviO2Mg6NZDoBJca7SSxL7LXZhXshvD1U/s320/a505f_kvm_w10_bsod.jpg" width="240" /></a></div>
<br />
<br />
Unfortunately for now I had to stop further work on this project because I accidentally<br />
upgraded the phone to the latest firmware revision and now due to the rollback protection<br />
I can no longer install the vulnerable RKP image.<br />
<br />
If you're interested in this kind of stuff, there is good news.<br />
Recently a few open-source phones have appeared which do not enforce secure boot/<br />
signature verification and you can run KVM (or any other hypervisor) out of the box.<br />
<br />
For example, multiple people have reported getting KVM and Windows 10 working<br />
on the Pinephone and Pinebook.<br />
Pinephone has a Cortex-A53 CPU with an old Mali GPU so in terms of hardware<br />
it's very similar to the Galaxy S2 discussed above, but it's more<br />
FOSS-friendly.<br />
<br />
<a href="https://twitter.com/RealDanct12/status/1231607283412426756">https://twitter.com/RealDanct12/status/1231607283412426756</a><br />
<a href="https://twitter.com/Manawyrm/status/1197981073101271040">https://twitter.com/Manawyrm/status/1197981073101271040</a><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy8sM_awRBCW6lJE-RlSODgXvKHMtXg1ZkPPbMZCOzu-V7Qa46-EHU4e_Ngw89W3tUh6MSQ6woUlH_8LGFina16hcdMNMeWzWPWz8gTnwm7uOMlWvPtMuqVKxWM0VCdO63YEjHLPJAd0c/s1600/pinephone1.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1600" data-original-width="900" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy8sM_awRBCW6lJE-RlSODgXvKHMtXg1ZkPPbMZCOzu-V7Qa46-EHU4e_Ngw89W3tUh6MSQ6woUlH_8LGFina16hcdMNMeWzWPWz8gTnwm7uOMlWvPtMuqVKxWM0VCdO63YEjHLPJAd0c/s320/pinephone1.jpeg" width="180" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLTqndzhvSWw2Hjrp6_HsYlVqJCrmiNCUp6UkbvEW2JGzdWOZIu_hZ5scAUX06V013ayM3rtLjmQ0elda95sxSKf6KRKDs_MBQo7LxAYTiqn9NcK2wMBCMV5GYYAvKn5t1r5yvXIvJhM0/s1600/w10_pinephone.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1248" data-original-width="1200" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLTqndzhvSWw2Hjrp6_HsYlVqJCrmiNCUp6UkbvEW2JGzdWOZIu_hZ5scAUX06V013ayM3rtLjmQ0elda95sxSKf6KRKDs_MBQo7LxAYTiqn9NcK2wMBCMV5GYYAvKn5t1r5yvXIvJhM0/s320/w10_pinephone.jpg" width="307" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiczGazZJj9fe7SFkm8L-hJpBHPDTBWgLTFdPvyHkiQy3uG2f2dsg3G6T-ePYVyznHPIPEMkZZRbfb90EnCJAhRRviC9q9axz-qdT62XtNIUGKhNYzc5hLmR1zcUWdqXjbK0ccVTQ4sqCY/s1600/w10_pinebook.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1408" data-original-width="1204" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiczGazZJj9fe7SFkm8L-hJpBHPDTBWgLTFdPvyHkiQy3uG2f2dsg3G6T-ePYVyznHPIPEMkZZRbfb90EnCJAhRRviC9q9axz-qdT62XtNIUGKhNYzc5hLmR1zcUWdqXjbK0ccVTQ4sqCY/s320/w10_pinebook.jpg" width="273" /></a></div>
I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com12tag:blogger.com,1999:blog-1416969656167659289.post-6592891377134371572019-05-29T05:04:00.002+03:002019-05-29T05:04:26.376+03:00Reverse-engineering Samsung Exynos 9820 bootloader and TZ<h2>
Reverse-engineering Samsung S10 TEEGRIS TrustZone OS</h2>
It's been a while since my last post, huh?<br />
Even though I have quite a lot of stuff I'm planning to write about, time is very limited.<br />
<br />
Lately I've been working on reverse engineering and documenting<br />
the S-Boot bootloader and TrustZone OS from the Exynos version<br />
of Samsung Galaxy S10.<br />
TLDR: I can now run S-Boot and TEEGRIS TrustZone TAs in QEMU but too lazy to find bugs.<br />
<br />
It's been a while since I had a Samsung phone, my last was Galaxy S2.<br />
It's also been a while since I last looked into bootloader binaries.<br />
<br />
Last year I got an Exynos S9 model, mostly because I was impressed by its<br />
CPU benchmark scores and wanted to run my own code to measure it.<br />
This year I got some spare time but since S10 came out and a lot of people<br />
have already looked at S9 software, I've decided to start reverse engineering<br />
the software from S10.<br />
<br />
<h3>
S-Boot bootloader image layout.</h3>
<a href="https://gist.github.com/astarasikov/f47cb7f46b5193872f376fa0ea842e4b#file-exynos9810_sboot_layout-txt">github gist</a><br />
<br />
<br />
<ul>
<li>0x0: probably EPBL (early primitive bootloader) with some USB support</li>
<li>0x13C00: ACPM (Access Control and Power Management?)</li>
<li>0x27800: some PM-related code</li>
<li>0x4CC00: some tables with PM parameters</li>
<li>... -> either charger mode code or PMIC firmware</li>
<li>0xA4000: BL2, the actual s-boot</li>
<li>0x19E000: TEEGRIS SPKG (CSMC)</li>
<li>0x19E02B: TEEGRIS SPKG ELF start (cut from here to load into the disassembler). This probably stands for "Crypto SMC" or "Checkpoint SMC". This handles some SMC calls from the bootloader as part of Secure Boot for Linux.</li>
<li>0x1ACE00: TEEGRIS SPKG (FP_CSMC)</li>
<li>0x1ACE2B: TEEGRIS FP_CSMC (ELF header). My guess is that it's related to the Fingerprint sensor because all it does is set some registers in the GPIO block and USI block (whatever it is).</li>
<li>0x264000: TEEGRIS kernel, relocate to 0xfffffffff0000000 to resolve relocations</li>
<li>0x29e000: EL1 VBAR for TEEGRIS kernel. fffffffff0041630: syscall table, first entry is zero.</li>
<li>0x2D4000: startup_loader package</li>
<li>0x2D4028: startup_loader ELF start. This one's invoked by S-Boot to read the TEEGRIS kernel either from Linux kernel via shared memory or from the LZ4 archive compiled into S-Boot.</li>
</ul>
<br />
There's also one encrypted region containing ARM Trusted Firmware, which is the EL3 monitor code. It's located right after a bunch of Rijndael substitution box constants.<br />
<h3>
Running S-Boot in QEMU.</h3>
I've long wanted to run S-Boot in QEMU for reverse engineering it.<br />
I think I mentioned this idea to my colleague Fred two years ago, which kind of motivated him to write this great post about the Exynos4210 early bootloader in SROM.<br />
Check out his blog if you're interested in Samsung, btw.<br />
https://fredericb.info/2018/03/emulating-exynos-4210-bootrom-in-qemu.html<br />
<br />
Long story short, with a bit of hacks to emulate some MMIO peripherals I've prepared the patch for QEMU to run S-Boot from Exynos9820.<br />
<a href="https://github.com/astarasikov/qemu/commit/a2d3780055a37d07105c43deec97c59d38106c39#diff-8a8e18df006d4278f34711f0ef6a4041R1900">QEMU Support for Exynos9820 S-Boot</a><br />
<h4>
SCTLR_EL3 register</h4>
According to the ARM ARM, the top half of SCTLR_EL3 is Undefined.<br />
Samsung reused those bits to store the base address of the S-Boot bootloader.<br />
When running in EL3, part of SCTLR is used when computing the value written to the VBAR registers, which point to the Exception Table.<br />
I initially attempted running S-Boot in EL3, but it checks the EL at runtime and I believe it actually runs at EL1, although the binary supports EL1, EL2 and EL3.<br />
<h4>
Re-enabling debugging prints</h4>
Turns out, early in the boot process the bootloader disables most of the debug logging.<br />
I've prepared a GDB script to work around that:<br />
gdbscript:<br />
set *(int*)0x8f16403c = 0<br />
<h4>
UART</h4>
<a href="https://github.com/astarasikov/qemu/blob/exynos9820/hw/arm/virt.c#L1900">https://github.com/astarasikov/qemu/blob/exynos9820/hw/arm/virt.c#L1900</a><br />
As usual (WM5 blog [http://allsoftwaresucks.blogspot.com/2016/10/running-without-arm-and-leg-windows.html]), we can solve it by making the MMIO Read request return different data on subsequent reads.<br />
We simply invert the value in the cache on each invocation.<br />
Using this trick we can bypass busy loops which wait for some bits to be set or cleared.<br />
<br />
In fact, emulating two UART registers, status and TX, is enough to get debugging output from the bootloader.<br />
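<br />
Here is a minimal sketch of the idea (this is illustrative and not the actual code from the exynos9820 branch; the register offset and state layout are assumptions):<br />
<br />
#include &lt;stdint.h&gt;<br />
<br />
typedef struct {<br />
    uint32_t utrstat; /* cached UART status register value */<br />
} FakeUartState;<br />
<br />
/* Shaped like a QEMU MemoryRegionOps .read callback. */<br />
static uint64_t fake_uart_read(void *opaque, uint64_t offset, unsigned size)<br />
{<br />
    FakeUartState *s = opaque;<br />
<br />
    switch (offset) {<br />
    case 0x10: /* hypothetical UTRSTAT offset polled by the bootloader */<br />
        s->utrstat = ~s->utrstat; /* invert the cached value on each read */<br />
        return s->utrstat;<br />
    default:<br />
        return 0; /* TX data register and everything else: ignore */<br />
    }<br />
}<br />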
<br />
<h4>
Peripherals</h4>
We can identify some peripherals either by looking up their addresses in Linux Device Tree files<br />
or by analysing what is done by the code that accesses them.<br />
For example, we can easily identify Timer registers.<br />
<br />
<h4>
EL3 Monitor emulation.</h4>
<br />
S-Boot calls into the Monitor code (ARM Trusted Firmware) to do some crypto and EFUSE-related operations.<br />
These calls have function IDs of the form 0xFFFFFxxx (a lot of leading 0xF nibbles).<br />
It was necessary to enable the "PSCI conduit" in QEMU, which intercepts some SMC calls, and to add a simple<br />
handler to allow S-Boot to start properly without crashing.<br />
<a href="https://github.com/astarasikov/qemu/blob/exynos9820/target/arm/psci.c#L55">arm_is_psci_call</a><br />
<span style="background-color: black; color: lime;"><span style="white-space: pre;"> </span>if ((param & 0xfffff000) == 0xfffff000) {</span><br />
<span style="background-color: black; color: lime;"><span style="white-space: pre;"> </span>//Exynos SROM </span><br />
<span style="background-color: black; color: lime;"><span style="white-space: pre;"> </span>return true;</span><br />
<br />
<h4>
Putting all the pieces together: running it.</h4>
<span style="background-color: black; color: lime;">./aarch64-softmmu/qemu-system-aarch64 -m 2048 -M virt -serial stdio -bios ~/Downloads/sw/pda/s10/BL/sboot_bl2.bin -s -S 2>/dev/null</span><br />
<br />
At this point, we're not emulating most peripherals like I2C, PMIC, USB.<br />
However, the bootloader gets to the point where the memory allocator and the printing subsystem are initialized, which should be enough<br />
to fuzz-test some parsers if we hook UFS/MMC access functions.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyjc6LDOrK9VzT6q2E4mAY8_u12bxjYUc3wi-ub750-IQR746bMv5ca3spYdsQ-0J4pG8k4GX3Z1a4xZvUBRApjidfV3IwBbZHSfOWdmJmJi3zsAa68jAPIIwLDxehN868qNu0lASKpmU/s1600/sboot_run.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyjc6LDOrK9VzT6q2E4mAY8_u12bxjYUc3wi-ub750-IQR746bMv5ca3spYdsQ-0J4pG8k4GX3Z1a4xZvUBRApjidfV3IwBbZHSfOWdmJmJi3zsAa68jAPIIwLDxehN868qNu0lASKpmU/s640/sboot_run.jpg" width="640" /></a></div>
<br />
<br />
<h2>
General approach to reverse-engineering</h2>
Samsung leaves a lot of debugging prints in their binaries.<br />
Even in the RKP hypervisor, although most strings are obfuscated by getting replaced with their hashes,<br />
some strings in the exception handler are not obfuscated at all.<br />
With this knowledge, it's easy to identify the logging function, then snprintf,<br />
and then strcpy and memcpy. Memcpy and strcpy are often located near malloc and free.<br />
Knowing these functions, it's trivial to reverse-engineer the rest.<br />
<br />
<h3>
TEEGRIS intro</h3>
In the Exynos version of Galaxy S10, Samsung have replaced<br />
the TrustZone OS from MobiCore with their solution called TEEGRIS.<br />
<br />
As we've seen before, TEEGRIS kernel and loader are located inside<br />
the BL image along with S-Boot.<br />
The userspace portion - dynamic libraries and TAs (Trusted Applications) -<br />
resides in two locations:<br />
<br />
<ul>
<li>System partition ("/system/tee"):</li>
<li>A TAR-like archive linked into the Linux Kernel</li>
</ul>
<div>
Here is what we can find:</div>
<br />
<ul>
<li>00000000-0000-0000-0000-4b45594d5354 (notice how 4b 45 59 4d 53 54 are the ASCII codes for "KEYMST" (Key Master))</li>
<li>00000000-0000-0000-0000-564c544b5052 VLTKPR (Vault Keeper)</li>
<li>00000005-0005-0005-0505-050505050505 - TSS (TEE Shared Memory Server?)</li>
<li>00000007-0007-0007-0707-070707070707 - ACSD (Access Control and Signing Driver?) basically the loader for TAs with a built-in X.509 parser </li>
</ul>
<br />
<br />
I wrote a <a href="https://gist.github.com/astarasikov/f47cb7f46b5193872f376fa0ea842e4b#file-startup_tzar_file_list-txt">Python script</a> to unpack the (uncompressed) TZAR files.<br />
https://gist.github.com/astarasikov/f47cb7f46b5193872f376fa0ea842e4b#file-unpack_startup_tzar-py<br />
After unpacking the file "startup.tzar" from S10 kernel tree (LINK)<br />
we can see that it contains a bunch of libraries as well as two TEE applications<br />
which can be identified by their file names resembling GUIDs.<br />
<h3>
Security mechanisms</h3>
<br />
<ul>
<li>Boot Time: TEEGRIS kernel and startup_loader reside in the same partition as S-Boot so their integrity should be checked by the early bootloader (in SROM).</li>
<li>Run Time: TrustZone applets (TAs) are authenticated using either built-in hashes or X.509 certificates.</li>
<li>Trustlets and the TEEGRIS kernel have stack cookies, and the cookies are randomized.</li>
</ul>
<br />
All TAs are ELF files that export the symbol "TA_InvokeCommandEntryPoint",<br />
which is where requests from Non-Secure EL1 (and from other Secure EL0 TAs) are processed.<br />
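<br />
Assuming TEEGRIS follows the GlobalPlatform TEE Internal Core API (which the exported symbol name suggests), the entry point has roughly the following shape; TEE_Result and TEE_Param come from the GlobalPlatform headers, and the command IDs and parameter layout are specific to each TA:<br />
<br />
/* GlobalPlatform-style TA command handler (sketch; TEEGRIS-specific types may differ) */<br />
TEE_Result TA_InvokeCommandEntryPoint(void *sessionContext,<br />
                                      uint32_t commandID,<br />
                                      uint32_t paramTypes,<br />
                                      TEE_Param params[4]);<br />
<br />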
Additionally, some extra TZ applets can be found in the "system" partition.<br />
<br />
<h3>
Identifying TEEGRIS syscalls</h3>
<h4>
Attempt 1 (stupid)</h4>
Look for the syscall number and a compare instruction.<br />
For example, for the "recv" syscall, let's search for 0x38, filter results by "cmp".<br />
No Luck. Ok, it's probably using a jump table or a function pointer array instead.<br />
<br />
<h4>
Attempt 2</h4>
Let's locate AArch64 exception table and go from there.<br />
We can find it by a bunch of NOPs (1f 20 03 d5) immediately after a block of zero-filled memory.<br />
We can then find the actual exception handler for EL0 by knowing the offset from the ARM ARM.<br />
https://developer.arm.com/docs/100933/latest/aarch64-exception-vector-table<br />
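<br />
A rough sketch of such a scan over the raw kernel image (the zero-block and NOP-run thresholds are arbitrary heuristics of mine; the 2KB alignment and the 0x400 offset of the lower-EL synchronous handler come from the ARM ARM):<br />
<br />
#include &lt;stdint.h&gt;<br />
#include &lt;stddef.h&gt;<br />
#include &lt;string.h&gt;<br />
<br />
/* Look for AArch64 NOPs (bytes 1f 20 03 d5) right after a run of zero bytes,<br />
 * at a 2KB-aligned offset (VBAR_EL1 must be 2KB-aligned). */<br />
static size_t find_vbar_candidate(const uint8_t *img, size_t len)<br />
{<br />
    static const uint8_t nop[4] = { 0x1f, 0x20, 0x03, 0xd5 };<br />
<br />
    for (size_t off = 0x800; off + 8 <= len; off += 0x800) {<br />
        int zeros = 1;<br />
        for (size_t i = off - 0x40; i < off; i++) {<br />
            if (img[i] != 0) { zeros = 0; break; }<br />
        }<br />
        if (zeros && !memcmp(img + off, nop, 4) && !memcmp(img + off + 4, nop, 4)) {<br />
            /* the handler for synchronous exceptions taken from a lower EL<br />
             * (AArch64) would then be at off + 0x400 */<br />
            return off;<br />
        }<br />
    }<br />
    return (size_t)-1;<br />
}<br />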
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgak391vWbIGe51vQ7-weEd7yZgt2TnBrdDFUZiOA0Uhiox4oM1kSzgfXVWT2Y9vKTw4FHcMuvxdT_LQOHmaECrGk2stMkaMUp_3TTbmIsMQKszkamHCIDOfGI8qUOcOKSX_ljDUMgR_KM/s1600/TEEGRIS_Syscall_handler_exc.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="499" data-original-width="1212" height="163" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgak391vWbIGe51vQ7-weEd7yZgt2TnBrdDFUZiOA0Uhiox4oM1kSzgfXVWT2Y9vKTw4FHcMuvxdT_LQOHmaECrGk2stMkaMUp_3TTbmIsMQKszkamHCIDOfGI8qUOcOKSX_ljDUMgR_KM/s400/TEEGRIS_Syscall_handler_exc.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2ool0pdetDrF298LM_d3xCLyhvx_pKeeQnaJffy4hf009pTxAdGCJ80jUxeRmTYUn0m8eHSXmsCcioqP3r6Z9xU9R78SGcxaBZZEQy0jUBkmf8R-p8E4yJ8hzKlrMhJRonGsKREbWeK8/s1600/TEEGRIS_Syscall_Range_Check.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2ool0pdetDrF298LM_d3xCLyhvx_pKeeQnaJffy4hf009pTxAdGCJ80jUxeRmTYUn0m8eHSXmsCcioqP3r6Z9xU9R78SGcxaBZZEQy0jUBkmf8R-p8E4yJ8hzKlrMhJRonGsKREbWeK8/s640/TEEGRIS_Syscall_Range_Check.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<br />
P.S.<br />
In fact, the code which launches "startup_loader" sets VBAR_EL1 to the same<br />
address which we've identified before.<br />
<br />
<h4>
Syscalls</h4>
Luckily for us, Samsung put wrappers for each syscall into the library called "Libtzsl.so"<br />
so we can easily recover the syscall names from the index in the table.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7LLT_uWDdRsXsu7OoWsYLRSvvhaGbKXEH7b64oLXOcSVDYRZ6KJfA5yLFxYjfz-t8Bad7jpAK088aHJ7TplpL_ryNtVf9F8XUyHqeSiD9YK4MVmZcxP3NrZBQ-s47XO_sonA6QL3QwkY/s1600/teegris_syscall_socket.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7LLT_uWDdRsXsu7OoWsYLRSvvhaGbKXEH7b64oLXOcSVDYRZ6KJfA5yLFxYjfz-t8Bad7jpAK088aHJ7TplpL_ryNtVf9F8XUyHqeSiD9YK4MVmZcxP3NrZBQ-s47XO_sonA6QL3QwkY/s640/teegris_syscall_socket.png" width="640" /></a></div>
<br />
<h4>
TEEGRIS IPC</h4>
Curiously, Samsung chose to implement two popular POSIX APIs for communication<br />
between TAs as well as between TAs and the REE (Linux): "epoll" and "sendmsg/recvmsg".<br />
<br />
Peripherals such as I2C and RPMB are of course handled by file paths with magic<br />
names, like on most UNIX-like kernels.<br />
<h4>
List of (most) TEEGRIS syscalls</h4>
<a href="https://github.com/astarasikov/qemu/blob/teegris_usermode/linux-user/syscall.c#L11590">https://github.com/astarasikov/qemu/blob/teegris_usermode/linux-user/syscall.c#L11590</a><br />
<h3>
TEEGRIS emulator</h3>
Since I'm better at reverse engineering than at exploitation<br />
and I like writing emulators but hate code review, I decided to<br />
find a way to run TAs on the Linux laptop instead of the actual<br />
device.<br />
<br />
Besides doing full-system emulation, QEMU supports the "user" target.<br />
In this case it loads the target ELF binary into memory and translates<br />
instructions to the host architecture, but instead of blindly passing<br />
syscall arguments to real syscalls it can patch them and do any kind of<br />
emulation.<br />
<br />
Here are the changes that I needed to make in order to run TEEGRIS binaries instead of Linux ones:<br />
<br />
<ul>
<li>ELF Entrypoint: setup AUXVALs in a specific order that "bin/libtzld.so" expects</li>
<li>Slightly different ABI: register X7 is used for the syscall number for both ARM32 and ARM64 (see the sketch after this list)</li>
<li>https://github.com/astarasikov/qemu/blob/teegris_usermode/linux-user/syscall.c#L11785</li>
<li>TLS handling (QEMU bug?)</li>
</ul>
<br />
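Roughly, the syscall dispatch in the AArch64 cpu_loop changes like this (a simplified fragment written from memory, not the exact code - the real change is in the teegris_usermode branch linked above):<br />
<br />
case EXCP_SWI:<br />
    ret = do_syscall(env,<br />
                     env->xregs[7], /* TEEGRIS: syscall number in X7, not X8 */<br />
                     env->xregs[0], env->xregs[1], env->xregs[2],<br />
                     env->xregs[3], env->xregs[4], env->xregs[5],<br />
                     0, 0);<br />
    if (ret == -TARGET_ERESTARTSYS) {<br />
        env->pc -= 4; /* restart the SVC instruction */<br />
    } else if (ret != -TARGET_QEMU_ESIGRETURN) {<br />
        env->xregs[0] = ret; /* return value goes back in X0 */<br />
    }<br />
    break;<br />
<br />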
<h4>
Current Status.</h4>
<br />
<ul>
<li>Boots TAs, both 32-bit and 64-bit</li>
<li>Currently does not support launching TAs from TAs (thread_create)</li>
<li>Currently only the invalid command handler is reached. Need to improve recvmsg or patch the library code as a workaround.</li>
<li>But overall it should be possible to build a fuzzer for TAs in less than a week of work now.</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrXxuX0nelgkNLdW2Xe3ZL_Ulsff8cMF1o5idJemj6DS0wTrNyp8HoEl6syR-CNHIKEsBhhw2v_HMH9ID4HUtqLK5PcZy6KeefkZ8u0U2G70xpwqFyOzJPDdLsBfi3M3qyeLyLofX-bQE/s1600/teegris_mmap.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrXxuX0nelgkNLdW2Xe3ZL_Ulsff8cMF1o5idJemj6DS0wTrNyp8HoEl6syR-CNHIKEsBhhw2v_HMH9ID4HUtqLK5PcZy6KeefkZ8u0U2G70xpwqFyOzJPDdLsBfi3M3qyeLyLofX-bQE/s640/teegris_mmap.png" width="640" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<br />
Here's one idea: now that we know we've emulated enough of syscalls<br />
for a TA to boot and start message processing, we can just override the<br />
return address and arguments for one of the syscalls which are invoked<br />
in the message processing loop and redirect the execution directly<br />
to TA_InvokeCommandEntryPoint.<br />
<br />
For this proof of concept I've manually identified the entry point address<br />
and adjusted it according to the ELF base load address and QEMU-specific load<br />
offset. Of course it would be better to automate this part so that TA loader<br />
is more generic but as every software engineer knows, those who write<br />
good code get scooped by those who don't.<br />
<br />
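A sketch of that idea (everything here is hypothetical - the entry point address is per-TA and was found manually, and a 32-bit TA would use env->regs and env->regs[15] instead):<br />
<br />
/* Hack inside one of the emulated syscalls hit in the message loop:<br />
 * instead of returning to the loop, resume the guest directly at<br />
 * TA_InvokeCommandEntryPoint with crafted arguments. */<br />
static void redirect_to_invoke_command(CPUARMState *env)<br />
{<br />
    /* ELF vaddr of TA_InvokeCommandEntryPoint + base load address + QEMU load offset */<br />
    const uint64_t ta_invoke_cmd = 0x12345678; /* made-up address */<br />
<br />
    env->xregs[0] = 0;       /* sessionContext */<br />
    env->xregs[1] = 0x41;    /* commandID under test */<br />
    env->xregs[2] = 0;       /* paramTypes */<br />
    env->xregs[3] = 0;       /* params */<br />
    env->pc = ta_invoke_cmd; /* resume execution at the handler */<br />
}<br />
<br />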
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimihdvzRp3R7fovbhVOhrRD8R4YPYL7Z4fI100bBCGsxeKdkxigEdpRawLC4iNphYBYV4vaLnbWAoWKKbgrI7QZN5HY24RIwT8ynUMa-sTHVPFnfFBjhFQ54xtkyYWhyphenhyphenKaGdAzOSmFzKw/s1600/KEYMASTER_Entry.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="554" data-original-width="1329" height="265" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimihdvzRp3R7fovbhVOhrRD8R4YPYL7Z4fI100bBCGsxeKdkxigEdpRawLC4iNphYBYV4vaLnbWAoWKKbgrI7QZN5HY24RIwT8ynUMa-sTHVPFnfFBjhFQ54xtkyYWhyphenhyphenKaGdAzOSmFzKw/s640/KEYMASTER_Entry.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhh1HHEilIyolnJAvKFVHXLcIjN51OIBKH2zUJuIgQ9y5eWdknObak_U7ThO4VKkV9p9HMmVHKBHyIQPM0lqhPWJ8I2tiXig3h2vMploS9d7svze9h2k6LzQK1CKuj1SKVr8uZaJyxJnT0/s1600/KEYMASTER_handler2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhh1HHEilIyolnJAvKFVHXLcIjN51OIBKH2zUJuIgQ9y5eWdknObak_U7ThO4VKkV9p9HMmVHKBHyIQPM0lqhPWJ8I2tiXig3h2vMploS9d7svze9h2k6LzQK1CKuj1SKVr8uZaJyxJnT0/s640/KEYMASTER_handler2.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnTUo2R21xMLi9UDh_k7No_vkIfzU-YeplidvTBqEiOxiqapIpVk_TAWDwJel1mF8chIFQ5acjaC1ZHUtN9i_NE63_1Oi0SEajgqaoxEM8q8MWjhmsMfmw1MJaW8Y_kicBj7nAkzOazDc/s1600/KEYMASTER_swd.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="311" data-original-width="1046" height="190" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnTUo2R21xMLi9UDh_k7No_vkIfzU-YeplidvTBqEiOxiqapIpVk_TAWDwJel1mF8chIFQ5acjaC1ZHUtN9i_NE63_1Oi0SEajgqaoxEM8q8MWjhmsMfmw1MJaW8Y_kicBj7nAkzOazDc/s640/KEYMASTER_swd.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAB_-NvtPPn9K04JSfXg93kdge-wB8CMaHW8mJ6uvmzxnJnsOHrp9qrogzPYWTy8Q1-UvurshNLMfeEjnwHQqrEJDAHjqpOagkXKweisPhpqlSmWGTjXw7CgxRqtJckGAGZCCpJYEsoRM/s1600/TEEGRIS_ASN1_Templates.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAB_-NvtPPn9K04JSfXg93kdge-wB8CMaHW8mJ6uvmzxnJnsOHrp9qrogzPYWTy8Q1-UvurshNLMfeEjnwHQqrEJDAHjqpOagkXKweisPhpqlSmWGTjXw7CgxRqtJckGAGZCCpJYEsoRM/s640/TEEGRIS_ASN1_Templates.png" width="640" /></a></div>
<br />
<br />
This kind of works in that we're getting messages from inside the TA: check the full log at <a href="https://pastebin.com/sVtWk5CD">https://pastebin.com/sVtWk5CD</a> and search for "<span style="background-color: white; color: #333333; font-family: Consolas, Menlo, Monaco, "Lucida Console", "Liberation Mono", "DejaVu Sans Mono", "Bitstream Vera Sans Mono", monospace, serif; font-size: 12px;">keymaster [ERR]".</span><br />
However, it fails early when validating the message contents.<br />
We need to generate the correct ASN.1 payload which should be doable<br />
since ASN.1 grammar templates are compiled into the binary.<br />
<h3>
Ideas for future research</h3>
<br />
<ul>
<li>Hook malloc/free and some other functions and invoke native system C library calls.</li>
<li>Hook the QEMU JIT (TCG) or interpreter to check memory accesses against ASAN shadow memory (see the sketch after this list). This way we can enable Address Sanitizer for binary blobs, similarly to how Valgrind does memory debugging. Since QEMU user mode runs TAs in the same address space as itself, we can use the ASAN allocator or libdislocator to detect OOB memory accesses. Unicorn is kind of hard to use for this because it does not allow easily setting up MMIO traps; it only allows registering chunks of normal memory.</li>
<li>Finish reverse-engineering ASN.1 format for Keymaster and fuzz this TA.</li>
<li>Run TEEGRIS kernel in QEMU as well to fuzz syscalls.</li>
<li>A Ghidra script to rename functions according to the debug strings passed to invocations of print callees</li>
<li>Look at the ring buffer implementation in the shared memory.</li>
</ul>
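<br />
For the ASAN idea above, the per-access check itself is cheap; here is a sketch of the classic 8-bytes-to-1 shadow lookup that a TCG hook could perform on every guest load and store (the shadow offset is whatever the host ASAN runtime uses, and this only covers accesses of up to 8 bytes):<br />
<br />
#include &lt;stdint.h&gt;<br />
#include &lt;stddef.h&gt;<br />
<br />
/* Returns non-zero if [addr, addr + size) touches poisoned (redzone) memory. */<br />
static int asan_access_is_poisoned(uintptr_t addr, size_t size, uintptr_t shadow_offset)<br />
{<br />
    int8_t shadow = *(int8_t *)((addr >> 3) + shadow_offset);<br />
    if (shadow == 0)<br />
        return 0; /* the whole 8-byte granule is addressable */<br />
    /* partially addressable granule: only the first 'shadow' bytes are valid */<br />
    return (int8_t)((addr & 7) + size - 1) >= shadow;<br />
}<br />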
<h3>
Running TEEGRIS Emulator</h3>
export TEE_CMD=777<br />
qemu/teegris$ ../arm-linux-user/qemu-arm -z fuzz_keymaster/in/test0.bin -cpu max ./00000000-0000-0000-0000-4b45594d5354.elf<br />
<br />
<h4>
Debugging panics with GDB</h4>
<br />
<ul>
<li><a href="https://github.com/astarasikov/qemu/blob/teegris_usermode/linux-user/syscall.c#L11929">https://github.com/astarasikov/qemu/blob/teegris_usermode/linux-user/syscall.c#L11929</a></li>
<li>Uncomment the code in the "panic" syscall which sets PC to 0x41414141 so that the exception is delivered to GDB.</li>
<li>Add -g 1234 to qemu-arm arguments to listen for GDB.</li>
<li>Run the GDB script in root folder: "qemu$ arm-none-eabi-gdb -x gdbinit"</li>
</ul>
<br />
<h3>
Related Projects</h3>
Post from Daniel Komaromy on reverse-engineering Galaxy S8 which mostly focuses on the other part of the picture: getting from Linux into Secure EL0.<br />
<br />
<ul>
<li><a href="https://medium.com/taszksec/unbox-your-phone-part-i-331bbf44c30c">https://medium.com/taszksec/unbox-your-phone-part-i-331bbf44c30c</a></li>
</ul>
<br />
Blog from Blue Frost Security on reverse-engineering S9 TrustZone. The OS kernel is different but actual TAs are the same.<br />
<br />
<ul>
<li><a href="https://labs.bluefrostsecurity.de/blog/2019/05/27/tee-exploitation-on-samsung-exynos-devices-introduction/">https://labs.bluefrostsecurity.de/blog/2019/05/27/tee-exploitation-on-samsung-exynos-devices-introduction/</a></li>
<li><a href="https://labs.bluefrostsecurity.de/files/TEE.pdf">https://labs.bluefrostsecurity.de/files/TEE.pdf</a></li>
</ul>
I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com9tag:blogger.com,1999:blog-1416969656167659289.post-63965103913518670182016-10-24T14:11:00.002+03:002016-10-24T14:11:35.892+03:00running without an ARM and a leg: Windows Mobile in QEMU<h2>
Introduction.</h2>
One project I had in mind a long time ago was getting Windows Mobile to run in QEMU.<br />
I think it's a lovely OS with a long history and the project seemed like a nice technical challenge.<br />
<br />
Initially I started working on it two years ago back in 2014 and the plan was to later run it in KVM on Cortex-A15 with Virtualization Extensions. However, I had to suspend it because I started working on two other challenges - running XNU on Xen (aka Virtu.al LLC) and later doing GSoC (running FreeBSD in ARM emulator).<br />
<br />
Now since I've got some free time on my hands, I decided to finally get back to this project and cross it off my TODO list.<br />
<h3>
Choosing the emulation target.</h3>
In order to run Windows CE on QEMU (or any OS for that matter) it would be necessary to either develop a Board Support Package with all the drivers for a specific virtual machine or take the opposite approach and emulate some machine for which there already exists a ROM image.<br />
<br />
For Windows, there is the emulator developed by Microsoft which is unsurprisingly called just Device Emulator. It emulates a real board - MINI2440 based on Samsung S3C2440 SoC which is an ancient ARMv4 CPU. Turns out, this is the same SoC that's used in OpenMoko so there is an old fork of QEMU with the support for most of the peripherals. So the choice of the platform seemed a no-brainer.<br />
<br />
So I took the QEMU fork supporting MINI2440 and tried to adapt it to running the unmodified Windows Mobile images from Microsoft. Needless to say, I made sure the images are placed into memory at the correct addresses but the code seemed to crash spontaneously and never got past enabling MMU.<br />
<br />
The first idea that comes to mind is of course to take the latest QEMU and see if it fixes anything. However, trying random changes until something works is actually quite a crappy approach.<br />
<br />
So, I decided to single-step the execution and see what happens. QEMU provides the very useful GDB interface (which can be activated with the "-s -S" switches) for this purpose.<br />
<h3>
Caches are hard.</h3>
The first surprise came from the bootloader code. Before launching the kernel, the Windows CE bootloader disables the MMU. At this moment QEMU crashes spectacularly. Initially I tried hacking around the issue by adding code to translate the addresses in "exec-all.h". However, it didn't solve the problem and the heuristic started looking too complex, which suggested I was on the wrong track.<br />
<br />
I started thinking about it and realized that disabling the MMU is a tricky thing because on ARM the Program Counter is usually ahead of the current instruction, so the CPU has to fetch a couple of instructions ahead. So we have a caching problem. In this Windows CE code, there is a NOP instruction between disabling the MMU and jumping to the new PA from a register. The hypothesis was that QEMU did not fetch the needed instructions correctly and it was necessary to add a translation cache for them. The reality was funnier.<br />
<br />
As it turned out, QEMU contained a hack to cache a couple instructions because... because Linux kernel for PXA270 had the same problem in the opposite scenario - when the MMU was enabled. I decided to comment out the hack and it made the boot-up progress further. (<a href="https://github.com/astarasikov/qemu/blob/52667dc0fab56b0b2f9f317f9d706b6c1c723034/target-arm/translate.c#L2637">PXA Cache Hack</a>).<br />
<h3>
Stacks are not easy either.</h3>
Next thing I know is that soon after enabling the MMU the code crashes when trying to access the stack at a very peculiar virtual address of 0xffffce70. I examined the QEMU translation code and found out that it correctly locates the physical address but the permission bits are incorrect. I decided to force it to return RW access for the particular address range and voila - Windows Mobile boots to Home Screen successfully. (<a href="https://github.com/astarasikov/qemu/blob/wm5_with_hacks/target-arm/helper.c#L1234">Hack to force RW stack</a>).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH0yY4llGSzFQdge5Nff0B-NqoQU3nbTl1Gk3HyDx-fzEXKbtKNqkE4jCoIgZ-gaa_l-cMauU8nXuOjRFNbfAfjxYxN1GRsaSj-JtmWbo3i5tYHHWLJwFYQHQumBe1jk13xdlPPVEt_yE/s1600/sp.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="258" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH0yY4llGSzFQdge5Nff0B-NqoQU3nbTl1Gk3HyDx-fzEXKbtKNqkE4jCoIgZ-gaa_l-cMauU8nXuOjRFNbfAfjxYxN1GRsaSj-JtmWbo3i5tYHHWLJwFYQHQumBe1jk13xdlPPVEt_yE/s400/sp.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Windows Mobile 5.0 Smartphone Edition on QEMU on Mac OS X.</td></tr>
</tbody></table>
Now that everything seemed to boot, I decided to take another look at the MMU issue and fix it properly. The first idea I had was to compare the ARM920T (ARMv4) and ARM1136 (ARMv6) page table formats. It is worth noting that ARMv4 did not have TEX remap bits, and also last-level page tables had a different type (last two bits). It turned out that QEMU (probably like real ARM CPUs) supported all types of pages, even ARMv4 ones on an ARMv6 target, and the TEX/caching bits were simply ignored. After careful examination of the code and all the page table parsing I found a typo that was already fixed in the upstream QEMU (<a href="https://github.com/astarasikov/qemu/commit/52667dc0fab56b0b2f9f317f9d706b6c1c723034#diff-94d702870edadd5b54b9c045b93ad0b8R1000">MMU Typo</a>).<br />
<br />
You can grab the cleaner tree without the ugly intermediate hacks: <a href="https://github.com/astarasikov/qemu/tree/wm5">https://github.com/astarasikov/qemu/tree/wm5</a> . If you need to have a look at the older WIP stuff, it's at <a href="https://github.com/astarasikov/qemu/tree/wm5_with_hacks">https://github.com/astarasikov/qemu/tree/wm5_with_hacks</a> .<br />
<h3>
Running Pocket PC version.</h3>
<h4>
Preparing PocketPC Image.</h4>
Windows Mobile Standard aka Smartphone comes shipped as the NOR flash image and it is XIP (Execute-in-Place). Pocket PC Image comes in a different form. It comes as an image already relocated to the RAM so we need to launch it directly via the "-kernel" parameter. We need to create the NOR memory image for it. We can use the smartphone image as a base, but an empty 32MB file should work as well.<br />
<br />
<ul>
<li>Grab the "romtools" scripts at github: <span style="background-color: white; font-family: "arial" , sans-serif; line-height: 16px; white-space: nowrap;"><a href="https://github.com/pinkavaj/romtools">https://github.com/pinkavaj/romtools</a></span></li>
<li>python b000ff-to-bin.py ../ppc_50/_208PPC_USA_bin</li>
<li>dd if=_208PPC_USA_bin-80070000-81491ed0.bin of=SPHIDPI_BIN bs=$((0x30000)) seek=1 conv=notrunc</li>
<li>./arm-softmmu/qemu-system-arm -show-cursor -m 256 -M mini2440 -serial stdio -kernel ../wm5/50_ppc/_208PPC_USA_bin-80070000-81491ed0.bin -pflash ../wm5/50_ppc/SPHIDPI_BIN</li>
</ul>
<br />
This one did not boot and kept hanging at the boot splash screen. The host CPU usage was high and it indicated there was some busy activity inside the VM like an IRQ storm. After hooking up the debugger it turned out that the VM was hanging in two busy loops.<br />
<br />
One of them was in the Audio driver - during the splash screen Windows plays a welcome sound. This was worked around by setting the "TX FIFO Ready" status in the sound codec. The second freeze was in the touchscreen driver, but that looks like a MINI2440 emulation bug - once the touchscreen is disabled, the workqueue timer in QEMU is permanently disabled and the status bits are not updated properly. I commented out the routine which disabled the touchscreen controller since it's a VM anyway (not a real device where it would have a power-saving impact).<br />
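<br />
The audio workaround amounts to a one-liner in the emulated codec; here is a sketch of the kind of change involved (the actual register name, bit position and the mini2440 model code differ):<br />
<br />
/* Always report "TX FIFO ready" from the codec status register so the<br />
 * welcome-sound playback loop in the audio driver terminates. */<br />
#define FAKE_TX_FIFO_READY (1 << 7) /* made-up bit position */<br />
<br />
static uint32_t fake_codec_status_read(void *opaque)<br />
{<br />
    return FAKE_TX_FIFO_READY; /* the driver polls this before writing samples */<br />
}<br />
<br />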
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBAQNEg0s_vOTiGRRUs1jMzB4PGWAuloHofreHtDhW0qCmMFS_uqRFSagIX9uW2_-8Dg3GTai4g-TjAx1sWNSqOIdnfkUcSB-bxQRM_0Ls4PKD_owNG_p-XaYYxcquLqzkIves9llhtSk/s1600/ppc.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="216" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBAQNEg0s_vOTiGRRUs1jMzB4PGWAuloHofreHtDhW0qCmMFS_uqRFSagIX9uW2_-8Dg3GTai4g-TjAx1sWNSqOIdnfkUcSB-bxQRM_0Ls4PKD_owNG_p-XaYYxcquLqzkIves9llhtSk/s320/ppc.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Obligatory screensot: Windows Mobile 5.0 Pocket PC</td></tr>
</tbody></table>
<h4>
Retarded idea.</h4>
I was fantasizing about adding logic to QEMU which would detect that the emulation was stuck invoking an MMIO device handler in a loop and then fuzz the returned register value until it got unstuck. While not very practical, one could reverse-engineer the register bits blindly this way.<br />
<h3>
Further ideas.</h3>
I don't think I'll invest more time into this project because there's little value but I'm considering developing an Android port of this QEMU fork just for the fun of it. Perhaps a better option would be to emulate a newer SoC such as Qualcomm 8250 and run an ARMv7 image from HTC HD2.<br />
<br />
It's quite sad that most if not all Android phones ship with HYP mode (the Hypervisor mode) disabled, so KVM is a no-go. On the other hand, allowing custom hypervisors to run opens up a Pandora's box of security issues, so it might be a good decision. Luckily for those of us who like to tinker with hardware, most TrustZone implementations also contain exploitable bugs, so there are some possibilities to uncover the potential :)<br />
<br />
Today, many companies working with embedded SoCs seek GPU virtualization solutions. The demand is particularly high in the automotive sector, where people are forced to use eclectic mixes of relatively old SoCs while retaining binary compatibility. So it would be interesting to prototype a solution similar to Intel's GVT.<br />
<br />
I think it would be nice to use Qualcomm Adreno GPU as an emulation target. It is relatively well reverse-engineered - there exists Mesa FreeDreno driver for it, Qualcomm commits patches to the upstream KMS driver and most importantly the GPU ISA is similar to older AMD Radeon GPUs which is extensively documented by AMD. Besides, a similar GPU is used in Xbox 360 so one more place to learn is Xenia emulator which simulates the Xbox GPU.I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com4tag:blogger.com,1999:blog-1416969656167659289.post-74772572653023102232016-09-28T00:55:00.000+03:002016-09-28T00:55:28.962+03:00Trying out and debugging Open-Source OpenCL on AMD Radeon RX480<h2>
Introduction.</h2>
I have decided to check out the state of OpenCL support for AMD GPUs in Linux Open-Source stack. In this post I will describe the experience of debugging issues across the whole stack trying to get a few basic apps up and running.<br />
<br />
I have the AMD RX480 GPU which supports GCN 8.0.1 instruction set and has the code name "polaris10". At the time of writing, both Linux kernel and the Mesa in Debian were too old and did not support this GPU. Besides, as we will see later, the OpenCL in Mesa does not work out of the box and we would need to learn to build it from source anyway.<br />
<h2>
Building the software.</h2>
A page at FreeDesktop website has some relatively outdated instructions but the general approach is the same. <a href="https://dri.freedesktop.org/wiki/GalliumCompute/">https://dri.freedesktop.org/wiki/GalliumCompute/</a><br />
You will need to install a relatively fresh kernel (I built Linux 4.7.0-rc7). I also installed the polaris10 firmware manually but now it seems to be shipped in Linux/Debian.<br />
<br />
I'm posting the steps I went through to build LLVM and Mesa with OpenCL support. After writing it I realized that perhaps everything here is redundant and I should write a repo manifest to clone everything with one command.<br />
<h3>
Getting the sources.</h3>
I'm also posting the folder tree and git hashes just in case.<br />
<h4>
Build CLC</h4>
<div>
CLC is the runtime library for the OpenCL. It contains code that is compiled to LLVM bitcode and linked against your apps. It provides intrinsics and functions for the functions defined by the OpenCL standard such as <span style="background-color: #cfe2f3;">"get_global_id"</span> and <span style="background-color: #cfe2f3;">"mad"</span>. In case you're wondering, CUDA works exactly the same way and the binary SDK from NVIDIA ships the relevant bitcodes (and you can disassemble them with llvm-dis if you're interested).</div>
<div>
<br /></div>
<div>
Actually, some of the definitions failed to compile with latest LLVM and I needed to add the explicit type casting. You can use the patch (from this github gist <a href="https://gist.github.com/astarasikov/9f00dee718f217c6a9715510dc09d300">https://gist.github.com/astarasikov/9f00dee718f217c6a9715510dc09d300</a>) and apply it on top of <b>88b82a6f70012a903b10dfc1e2304d3ef2e76dbc </b>to fix it.</div>
<div>
<br /></div>
<div>
<span style="background-color: #d9ead3;">git clone https://github.com/llvm-mirror/libclc.git</span></div>
<div>
<span style="background-color: #d9ead3;">./configure.py -g ninja --with-llvm-config=/home/alexander/Documents/workspace/builds/llvm/build/bin/llvm-config</span></div>
<div>
<span style="background-color: #d9ead3;">ninja</span></div>
<div>
<span style="background-color: #d9ead3;">ninja install</span></div>
<h4>
Get LLVM Sources.</h4>
<div>
<span style="background-color: #d9ead3;">mkdir -p ~/Documents/workspace/builds/llvm/</span></div>
<div>
<span style="background-color: white;">I have a useful script to check out the latest llvm, clang and libcxx. </span><a href="https://gist.github.com/astarasikov/a2e9287a34381f680d58">https://gist.github.com/astarasikov/a2e9287a34381f680d58</a></div>
<div>
<h4>
Get LLVM ROC branch.</h4>
</div>
<div>
This is optional. I used the ROC branch initially because I thought it would fix the issue with codegen (FLAT instructions), but it did not, and otherwise it seems to behave identically to vanilla LLVM.</div>
<div>
<br /></div>
<div>
<span style="background-color: #d9ead3;">git remote add roc https://github.com/RadeonOpenCompute/llvm.git</span></div>
<div>
<span style="background-color: #d9ead3;">git checkout roc/amd-common</span></div>
<div>
<span style="background-color: #d9ead3;">git fetch roc</span></div>
<div>
<span style="background-color: #d9ead3;">cd tools/clang/</span></div>
<div>
<span style="background-color: #d9ead3;">git remote add roc https://github.com/RadeonOpenCompute/clang.git</span></div>
<div>
<span style="background-color: #d9ead3;">git fetch roc</span></div>
<div>
<span style="background-color: #d9ead3;">git checkout roc/amd-common</span></div>
<h4>
<div style="font-weight: normal;">
List of git hashes:</div>
<div style="font-weight: normal;">
<div class="p1">
</div>
<ul>
<li>llvm - 1819637 Merge branch amd-master into amd-common</li>
<li>llvm/tools/clang/tools/extra - 079dd6a [clang-rename] Add comment after namespace closing</li>
<li>llvm/tools/clang - f779a93 Merge branch amd-master into amd-common</li>
<li>llvm/projects/libcxx - d979eed Fix Bug 30240 - std::string: append(first, last) error when aliasing.</li>
<li>llvm/projects/compiler-rt - 5a27c81 asan: allow __asan_{before,after}_dynamic_init without registered globals</li>
<li>llvm/projects/libcxxabi - 9f08403 [lit] Replace print with lit_config.note().</li>
</ul>
</div>
</h4>
<h4>
<span style="background-color: white;">Build LLVM.</span></h4>
<span style="white-space: pre;">#</span>create the build directory<br />
<span style="background-color: #d9ead3;">mkdir -p ~Documents/workspace/builds/llvm/build/</span><br />
<span style="background-color: #d9ead3;">cd cd ~Documents/workspace/builds/llvm/build/</span><br />
<br />
<div class="p1">
<span class="s1" style="background-color: #d9ead3;">cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DLLVM_TARGETS_TO_BUILD="AMDGPU;X86" -DLLVM_INCLUDE_TESTS=OFF -DLLVM_VERSION_SUFFIX="" ../llvm/ -DBUILD_SHARED_LIBS=ON</span></div>
<span style="background-color: #d9ead3;">ninja</span><br />
<br />
#add the symlink to make LLVM pick up internal headers when building Mesa<br />
<span style="background-color: #d9ead3;">cd ~Documents/workspace/builds/llvm/build/include</span><br />
<span style="background-color: #d9ead3;">ln -s $(echo $PWD/../tools/clang/include/clang) clang</span><br />
<br />
In principle, it is not necessary to install llvm, it's enough to add it to the PATH and clang will pick up the necessary libraries itself.<br />
<h4>
Build MESA</h4>
Before building Mesa, we need to prepend the path to the "bin" directory of our custom LLVM build to the PATH variable so that clang is picked up as the compiler. I also had to add a symlink to the source code in the build directory because some headers were not getting picked up but I think there's a cleaner way to add it to CFLAGS.<br />
<br />
I was using Mesa git <b>0d7ec8b7d0554382d5af6c59a69ca9672d2583cd.</b><br />
<br />
<span style="background-color: #d9ead3;">git clone git://anongit.freedesktop.org/mesa/mesa</span><br />
<br />
The configure.ac seems to have an incorrect regex for detecting the LLVM version, which causes compilation to fail with the latest LLVM (4.0.0). Here is a patch to fix it and also force the radeonsi chip class to VI (Volcanic Islands). The latter is not strictly necessary but I used it during debugging to ensure the correct code path is always hit. Grab the diff at <a href="https://gist.github.com/astarasikov/6146dbbd07d0dc3bea2ee6a8b979eaa8">https://gist.github.com/astarasikov/6146dbbd07d0dc3bea2ee6a8b979eaa8</a><br />
<br />
<span style="background-color: #d9ead3;">export PATH=~/Documents/workspace/builds/llvm/build/bin:$PATH</span><br />
<span style="background-color: #d9ead3;">cd ~/Documents/workspace/builds/mesa/mesa/</span><br />
<span style="background-color: #d9ead3;">make clean</span><br />
<br />
<div class="p1">
<span class="s1" style="background-color: #d9ead3;">./autogen.sh --enable-texture-float --enable-dri3 --enable-opencl --enable-opencl-icd --enable-sysfs --enable-gallium-llvm --with-gallium-drivers=radeonsi --prefix=/opt/my_mesa --with-egl-platforms=drm --enable-glx-tls</span></div>
<span style="background-color: #d9ead3;"><span class="Apple-tab-span" style="white-space: pre;"> </span>make install</span><br />
<span style="background-color: #d9ead3;"><br /></span>
<span style="background-color: white;">Now, before running any OpenCL application, we'll need to override the library path to point to our custom Mesa.</span><br />
<span style="background-color: #d9ead3;">export LD_LIBRARY_PATH=/opt/my_mesa/lib:/home/alexander/Documents/workspace/builds/llvm/build/lib</span><br />
<h3>
Useful Links</h3>
<h3>
AMD Presentations about GCN ISA.</h3>
<ul>
<li><a href="http://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah">http://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah</a></li>
<li><a href="http://www.slideshare.net/DevCentralAMD/gs4152-michael-mantor">http://www.slideshare.net/DevCentralAMD/gs4152-michael-mantor</a></li>
</ul>
<h3>
GCN ISA Manual</h3>
<ul>
<li>Southern Islands <a href="http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf">http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf</a></li>
<li>Volcanic Islands <a href="http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf">http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf</a></li>
</ul>
<h3>
Intel OpenCL Samples</h3>
<ul>
<li><a href="https://software.intel.com/en-us/intel-opencl-support/code-samples">https://software.intel.com/en-us/intel-opencl-support/code-samples</a></li>
<li><a href="https://software.intel.com/sites/default/files/managed/43/b8/intel_ocl_bitonic_sort.zip">https://software.intel.com/sites/default/files/managed/43/b8/intel_ocl_bitonic_sort.zip</a></li>
<li><a href="https://software.intel.com/sites/default/files/managed/db/51/intel_ocl_montecarlo.zip">https://software.intel.com/sites/default/files/managed/db/51/intel_ocl_montecarlo.zip</a></li>
<li><a href="https://software.intel.com/sites/default/files/managed/b3/1d/intel_ocl_god_rays.zip">https://software.intel.com/sites/default/files/managed/b3/1d/intel_ocl_god_rays.zip</a></li>
</ul>
<h3>
AMD App SDK</h3>
I've used the older version because I thought it was in the tarball and the latest one seemed to be an executable file (though actually it was a tarball with an executable script).<br />
<ul>
<li><a href="http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/">http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/</a> - <span style="font-family: inherit;">Look for "<span style="background-color: white; font-size: 14px;">AMD-APP-SDK-linux-v2.9-1.599.381-GA-x64.tar.bz2".</span></span></li>
</ul>
<h2>
Trying it out.</h2>
<div>
<h4>
Assertion in LLVM codegen.</h4>
So we can try running any OpenCL application now and we'll hit the assertion.<br />
./BitonicSort -p Clover<br />
<br />
After reading the source code and the LLVM git log, it turns out the assertion is related to the "FLAT" atomic instructions, which the backend does not emit for this target.<br />
So what can we do? Let's try to see if we can force LLVM to emit those FLAT atomics.<br />
Turns out the code is already there, and there is a flag which is enabled by default when the LLVM target is "AMD HSA".<br />
<br />
Now, let's think about what the possible limitations of this approach could be.<br />
<h4>
FLAT Instructions</h4>
Let's see the description from the GCN Manual. "Flat memory instructions let the kernel read or write data in memory, or perform atomic operations on data already in memory. These operations occur through the texture L2 cache.".<br />
<br />
I have not fully understood the difference between these new FLAT instructions and older MUBUF/MTBUF. As it seems to me, before GCN3 it used to be the case that different address spaces (global, private etc) could only be accessed through different instructions and FLAT instructions allow accessing any piece of GPU (and host) memory by the virtual address (hence the name since virtual address space is flat). So it seems that as long as the kernel driver sets up GPU page tables correctly and we're only using the memory allocated through OpenCL API we should be fine.<br />
<br />
FWIW, let's try running the code and see if it works.</div>
<div>
<h3>
LLVM Flat Instructions hack</h3>
As it has been mentioned before, we need to force LLVM to generate the "FLAT" instructions to access memory in GPU code.<br />
A proper way to fix this would be to add the option to the Mesa source code, in the location where it instantiates the LLVM compiler (clang).<br />
To save us some time we can hack the relevant piece of the code generator in the LLVM directly (see the patch at the end of the article).</div>
<h2>
Trying it out again.</h2>
I have tried the following samples from the AMD App SDK and the Intel samples. I didn't want to write sample code myself; besides, OpenCL claims to be portable, so running code from other IHVs' SDKs should be a good stress test of the toolchain.<br />
<h4>
AMD</h4>
<ul>
<li>BlackScholesDP</li>
<li>BinomialOption</li>
<li>BitonicSort</li>
<li>BinarySearch</li>
<li>BufferBandwidth</li>
</ul>
<h4>
Intel</h4>
<ul>
<li>Bitonic Sort</li>
<li>Montecarlo</li>
<li>God Rays</li>
</ul>
<div>
All of them worked, and the "verification" step, which computes the data on the host and compares it to the GPGPU result, passed! You can take a look at the screenshots and logs at the end of the post.</div>
<br />
The "God Rays" demo even produces the convinceable picture.<br />
<h2>
Running Protonect (libfreenect2)</h2>
<div>
One of the apps I'm particularly interested in running is Protonect, the demo app for libfreenect2, the open-source driver for the Microsoft Kinect v2 RGBD ToF camera. Let's build it with OpenCL support and invoke it from the shell via "./Protonect cl".</div>
<div>
<br /></div>
<div>
And we're hitting an assertion!</div>
<div>
<span style="background-color: #f4cccc;">'llvm::AsmPrinter::~AsmPrinter(): Assertion `!DD && Handlers.empty() && "Debug/EH info didn't get finalized"'</span>.</div>
<div>
Since it happens in a destructor, for the purposes of testing we can simply comment it out, because the worst thing that could happen is a memory leak.</div>
<div>
<br /></div>
<div>
Let's try running it again!</div>
<div>
And we're hitting another error. This time it is <span style="background-color: #f4cccc;">"unsupported initializer for address space"</span>. Okay, let's debug it. First, let's grep the string verbatim.</div>
<div>
<br /></div>
<div>
Good, we're hitting it in only one place. Now, the debug message is not particularly helpful, because it does not give us the precise location or the name of the variable which caused the error; it only shows the prototype of the function. Let's just print the address space type and try to find out what might be causing it (see the patch at the end of the article).</div>
<div>
<br /></div>
<div>
What's this? Good, good, we see that the address space enum value is "0". Looking it up reveals that it's the private address space. Okay, what could cause the private address space to be used? Function-local arrays! Let's look for one! Great, here it is - see the "const float gaussian[9]" array in the "filterPixelStage1" function? Let's try commenting it out and replacing the "gaussian[j]" access with some constant (let it be 1.0f since it's a weighted average; if we chose 0.0f we'd see nothing in the output buffer). Yay, it worked!</div>
<div>
<br /></div>
<div>
Since we can no longer use the private address space, we need to find a way to get rid of the locally-declared array. OpenCL 1.1 does not support the static storage class.<br />
One option would be to add another kernel argument and just pass the array there (see the sketch below).<br />
It might be slower, though, because the data would end up in a slower region of cache-coherent memory.<br />
<br />
Another option would be to compute the values in place, and since it's just a 3x3 convolution kernel for a Gaussian blur, it's easy to come up with a crude approximation formula, which is what I've done (see the patch at the end of the post).<br />
<br />
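For illustration, here is a minimal sketch of the first option. The kernel signature follows the libfreenect2 code, but the extra argument is hypothetical - the idea is that the host fills a small buffer with the 9 weights and binds it with clSetKernelArg(), so no private-address-space array is needed:<br />
<br />
<span style="font-size: xx-small;">// hypothetical sketch: the Gaussian weights arrive from the host instead of a private array</span><br />
<span style="font-size: xx-small;">void kernel filterPixelStage1(__global const float3 *a, __global const float3 *b, __global const float3 *n,</span><br />
<span style="font-size: xx-small;">                              __global float3 *a_out, __global float3 *b_out, __global uchar *max_edge_test,</span><br />
<span style="font-size: xx-small;">                              __global const float *gaussian /* 9 weights supplied by the host */)</span><br />
<span style="font-size: xx-small;">{</span><br />
<span style="font-size: xx-small;">    const uint i = get_global_id(0);</span><br />
<span style="font-size: xx-small;">    /* ... the gaussian[j] accesses in the filter loop stay exactly as before ... */</span><br />
<span style="font-size: xx-small;">}</span><br />
<br />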
So far, Protonect works. Performance is subpar but not completely awful. It's around 4 times slower than the NVIDIA GTX970 with the binary driver (in most real-world scenarios the GTX970 and the RX480 are quite close). I think that with a bit of profiling it can be sped up drastically. In fact, one of the contributing factors might be that my display is connected to the Intel iGPU and the PCIe bandwidth is saturated by the framebuffer blitting (it's 4K after all). I'll try with OpenGL running on the Radeon next time.</div>
<h2>
RadeonOpenCompute Initiative.</h2>
<div>
Since OpenCL is crap and CUDA is all the hype, it's no wonder many people want to run CUDA on all GPUs.</div>
<div>
In principle, to run CUDA on non-NVIDIA hardware, one needs to implement the CUDA runtime and the lowering from the NVPTX intermediate language to the native ISA. The challenging part is actually building the source code, because one would need the proprietary headers from the NVIDIA toolchain. One could create fake headers to mimic the CUDA toolchain, but that is a legally shady area, especially when it comes down to predefined macros and constants.</div>
<div>
<br /></div>
<div>
In fact, there's not much need to emit NVPTX at all, and you can lower straight to the ISA. What AMD has done to work around the legal issues is come up with their own language called "HIP", which mimics most of the CUDA design but names the keywords and predefined macros differently. Therefore, porting is straightforward even with a search/replace, and there is also an automated translator based on clang.</div>
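<div>
To give an idea of how mechanical the port is, here is a hypothetical sketch using only the HIP runtime API (no kernel launch); each call is a direct rename of its CUDA counterpart:</div>
<div>
<span style="font-size: xx-small;">#include <hip/hip_runtime.h></span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">int main(void)</span><br />
<span style="font-size: xx-small;">{</span><br />
<span style="font-size: xx-small;">    float host[1024] = {0};</span><br />
<span style="font-size: xx-small;">    float *dev = NULL;</span><br />
<span style="font-size: xx-small;">    hipMalloc((void **)&dev, sizeof(host));                     /* was: cudaMalloc */</span><br />
<span style="font-size: xx-small;">    hipMemcpy(dev, host, sizeof(host), hipMemcpyHostToDevice);  /* was: cudaMemcpy, cudaMemcpyHostToDevice */</span><br />
<span style="font-size: xx-small;">    hipFree(dev);                                               /* was: cudaFree */</span><br />
<span style="font-size: xx-small;">    return 0;</span><br />
<span style="font-size: xx-small;">}</span><br />
</div>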
<h3>
GCN ISA versions: RX480 vs R9 390</h3>
It is interesting that Polaris10 (RX480) seems to use an older version of the ISA (8.0.1) while the older R9 390 uses 8.0.3. Not sure if it's a bug in the documentation. However, it's interesting that AMD GPUs consist of multiple units (such as the video encoder/decoder), and their versions seem to be picked somewhat arbitrarily when a new chip is designed.<br />
<h3>
HIP without Mesa.</h3>
HIP/ROC ships its own runtime, and since all memory access is done through DRM via ioctls on <span style="background-color: #cfe2f3;">"/dev/dri/cardX"</span>, Mesa is technically not needed to implement OpenCL or any other compute API.<br />
<br />
However, the open question is buffer sharing between different APIs. I came across this issue before when dealing with an Intel GPU. The good news is that on Linux there's an EGL extension to export the DRM buffer object (BO) from a GLES context (but not from GLX). You can read my old article about it here: <a href="https://allsoftwaresucks.blogspot.com/2014/10/abusing-mesa-by-hooking-elfs-and-ioctl.html">https://allsoftwaresucks.blogspot.com/2014/10/abusing-mesa-by-hooking-elfs-and-ioctl.html</a><br />
<h2>
Outstanding issues.</h2>
<h3>
Slow Compilation time for Kernels.</h3>
While running the OpenCL code samples and Protonect, I've noticed that they take several seconds to start up, compared to an immediate start when using the Intel OpenCL driver (Beignet). I think it might be caused by LLVM. It would be a good idea to profile everything using the Linux "perf" tool.<br />
<h3>
Private Address Spaces.</h3>
As we've seen with the libfreenect2 kernels, private address spaces do not work. It is necessary to figure out whether they are not supported at all or whether it is just a simple bug.<br />
Until this is resolved, it effectively renders a lot of GPGPU applications unusable.<br />
<h3>
LLVM Assertions (Bugs).</h3>
As mentioned before, running Protonect and compiling libfreenect2 kernels yields the following assertion:<br />
<span style="background-color: #f4cccc;">'llvm::AsmPrinter::~AsmPrinter(): Assertion `!DD && Handlers.empty() && "Debug/EH info didn't get finalized"'</span>.<br />
<br />
While trying to run one of the OpenCL samples I hit yet another assertion:<br />
<span style="background-color: #f4cccc;">'clang::Sema::~Sema(): Assertion `DelayedTypos.empty() && "Uncorrected typos!"'</span> failed.<br />
<h2>
Conclusions.</h2>
As we can see, the OpenCL support is still not mature with the Mesa/Clover stack.<br />
However given the RadeonOpenCompute initiative I'm sure most of the missing features will be in place soon.<br />
So far I'm glad that support for the just-released GPU is on par with the older models, and working around the issues was not too hard for me.<br />
I've also satisfied part of my desire to understand the interaction between different components involved in the OpenCL/Compute pipeline.<br />
<br />
I think for a start I will look at the LLVM assertions and see if I can debug them or prepare the test cases to submit upstream.<br />
Next up I'll be trying out HIP to build some CUDA samples.<br />
<br />
One idea I had in mind for quite some time was virtualizing a mobile GPU. I think Qualcomm Adreno is a good target because it's relatively well supported by the FreeDreno driver and the ISA is similar to other AMD chips. The plan is to add the ISA decoder and MMIO space emulation to QEMU so that it can be used both in KVM on ARM and in emulation mode on Intel. Of course, the most nerdy way to do it would be to make a translator from the guest ISA to the host ISA. But for a start we could reuse the Virgil driver as a target.<br />
I think it would be a very useful thing for running legacy applications in a virtualized environment (such as Windows RT or automotive IVI systems) and could aid in security engineering.<br />
Hopefully I will have enough motivation and time to do it before I'm bound by an NDA :)<br />
<h2>
Latest update!</h2>
Also, check out the latest news! It looks like Mesa has now switched to using the HSA ABI by default, which means that the hack for the FLAT instructions will not be needed with more recent versions - they will be enabled automagically! <a href="https://www.phoronix.com/scan.php?page=news_item&px=RadeonSI-HSA-Compute-Shaders">https://www.phoronix.com/scan.php?page=news_item&px=RadeonSI-HSA-Compute-Shaders</a><br />
<br />
I started trying OpenCL on the RX480 around two weeks ago, then I spent one week debugging and one week I was away. Meanwhile, some changes seem to have landed upstream, and some of the hacking described here may be redundant. I urge you to check with the latest source code, but I decided to keep this post to describe the debugging process I went through.<br />
<h2>
Extra: Logs and Screenshots.</h2>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCNH5HqC_agOO4ZMqPak089SggwlVyZ52JL5Kp9eDyK1g3D_a9eEOYhl1a5vigPFn5WDAZ2HrMKQsY3tTSsboWlrfOShpfy2cV_kHCTiQW2HMd9cKhTBVS8p9OQy3PVuNFS2XQowzMKKg/s1600/GodRaysOutput.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCNH5HqC_agOO4ZMqPak089SggwlVyZ52JL5Kp9eDyK1g3D_a9eEOYhl1a5vigPFn5WDAZ2HrMKQsY3tTSsboWlrfOShpfy2cV_kHCTiQW2HMd9cKhTBVS8p9OQy3PVuNFS2XQowzMKKg/s400/GodRaysOutput.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">God Rays from Intel OpenCL samples.</td></tr>
</tbody></table>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtjiiSGvravwQTfnYHUHHBW_mZhz-K41UlQ3OlZdgKHz1joMZ1SIwHpWhLJbwcbJjJOXKjl1XOmOcH4HZ5i8Kp-HNI82jmrvhtSirf51QWQcYAukW5kDCJSSV8nZQXs3ab8QJKcxZpttw/s1600/protonect_amd_radeon.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="475" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtjiiSGvravwQTfnYHUHHBW_mZhz-K41UlQ3OlZdgKHz1joMZ1SIwHpWhLJbwcbJjJOXKjl1XOmOcH4HZ5i8Kp-HNI82jmrvhtSirf51QWQcYAukW5kDCJSSV8nZQXs3ab8QJKcxZpttw/s640/protonect_amd_radeon.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Protonect running on AMD FOSS OpenCL stack.</td></tr>
</tbody></table>
<h4>
LLVM Force FLAT instructions hack.</h4>
<div>
<span style="font-size: xx-small;">diff --git a/lib/Target/AMDGPU/AMDGPUSubtarget.cpp b/lib/Target/AMDGPU/AMDGPUSubtarget.cpp</span><br />
<span style="font-size: xx-small;">index 3c4b5e7..f6d500c 100644</span><br />
<span style="font-size: xx-small;">--- a/lib/Target/AMDGPU/AMDGPUSubtarget.cpp</span><br />
<span style="font-size: xx-small;">+++ b/lib/Target/AMDGPU/AMDGPUSubtarget.cpp</span><br />
<span style="font-size: xx-small;">@@ -46,7 +46,7 @@ AMDGPUSubtarget::initializeSubtargetDependencies(const Triple &TT,</span><br />
<span style="font-size: xx-small;"> // disable it.</span><br />
<br />
<span style="font-size: xx-small;"> SmallString<256> FullFS("+promote-alloca,+fp64-denormals,+load-store-opt,");</span><br />
<span style="font-size: xx-small;">- if (isAmdHsaOS()) // Turn on FlatForGlobal for HSA.</span><br />
<span style="font-size: xx-small;">+ if (1 || isAmdHsaOS()) // Turn on FlatForGlobal for HSA.</span><br />
<span style="font-size: xx-small;"> FullFS += "+flat-for-global,+unaligned-buffer-access,";</span><br />
<span style="font-size: xx-small;"> FullFS += FS;</span></div>
<h4>
Patch for Libfreenect2</h4>
<span style="font-size: xx-small;">@@ -102,8 +102,8 @@ void kernel processPixelStage1(global const short *lut11to16, global const float</span><br />
<span style="font-size: xx-small;"> /*******************************************************************************</span><br />
<span style="font-size: xx-small;"> * Filter pixel stage 1</span><br />
<span style="font-size: xx-small;"> ******************************************************************************/</span><br />
<span style="font-size: xx-small;">-void kernel filterPixelStage1(global const float3 *a, global const float3 *b, global const float3 *n,</span><br />
<span style="font-size: xx-small;">- global float3 *a_out, global float3 *b_out, global uchar *max_edge_test)</span><br />
<span style="font-size: xx-small;">+void kernel filterPixelStage1(__global const float3 *a, __global const float3 *b, __global const float3 *n,</span><br />
<span style="font-size: xx-small;">+ __global float3 *a_out, __global float3 *b_out, __global uchar *max_edge_test)</span><br />
<span style="font-size: xx-small;"> {</span><br />
<span style="font-size: xx-small;"> const uint i = get_global_id(0);</span><br />
<br />
<span style="font-size: xx-small;">@@ -113,7 +113,7 @@ void kernel filterPixelStage1(global const float3 *a, global const float3 *b, gl</span><br />
<span style="font-size: xx-small;"> const float3 self_a = a[i];</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"> const float3 self_b = b[i];</span><br />
<br />
<span style="font-size: xx-small;">- const float gaussian[9] = {GAUSSIAN_KERNEL_0, GAUSSIAN_KERNEL_1, GAUSSIAN_KERNEL_2, GAUSSIAN_KERNEL_3, GAUSSIAN_KERNEL_4, GAUSSIAN_KERNEL_5, GAUSSIAN_KERNEL_6, GAUSSIAN_KERNEL_7, GAUSSIAN_KERNEL_8};</span><br />
<span style="font-size: xx-small;">+ //const float gaussian[9] = {GAUSSIAN_KERNEL_0, GAUSSIAN_KERNEL_1, GAUSSIAN_KERNEL_2, GAUSSIAN_KERNEL_3, GAUSSIAN_KERNEL_4, GAUSSIAN_KERNEL_5, GAUSSIAN_KERNEL_6, GAUSSIAN_KERNEL_7, GAUSSIAN_KERNEL_8};</span><br />
<br />
<span style="font-size: xx-small;"> if(x < 1 || y < 1 || x > 510 || y > 422)</span><br />
<span style="font-size: xx-small;"> {</span><br />
<span style="font-size: xx-small;">@@ -155,7 +155,9 @@ void kernel filterPixelStage1(global const float3 *a, global const float3 *b, gl</span><br />
<span style="font-size: xx-small;"> const int3 c1 = isless(other_norm * other_norm, threshold);</span><br />
<br />
<span style="font-size: xx-small;"> const float3 dist = 0.5f * (1.0f - (self_normalized_a * other_normalized_a + self_normalized_b * other_normalized_b));</span><br />
<span style="font-size: xx-small;">- const float3 weight = select(gaussian[j] * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);</span><br />
<span style="font-size: xx-small;">+ //const float3 weight = 1.0f;//select(gaussian[j] * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);</span><br />
<span style="font-size: xx-small;">+ const float gj = exp(0.6 - (0.3 * (abs(yi) + abs(xi))));</span><br />
<span style="font-size: xx-small;">+ const float3 weight = select(gj * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);</span><br />
<br />
<h4>
LLVM Patch for assertion in AsmPrinter destructor.</h4>
<span style="font-size: xx-small;">diff --git a/lib/CodeGen/AsmPrinter/AsmPrinter.cpp b/lib/CodeGen/AsmPrinter/AsmPrinter.cpp</span><br />
<span style="font-size: xx-small;">index 0fed4e9..0d63a2a 100644</span><br />
<span style="font-size: xx-small;">--- a/lib/CodeGen/AsmPrinter/AsmPrinter.cpp</span><br />
<span style="font-size: xx-small;">+++ b/lib/CodeGen/AsmPrinter/AsmPrinter.cpp</span><br />
<span style="font-size: xx-small;">@@ -114,6 +114,7 @@ AsmPrinter::AsmPrinter(TargetMachine &tm, std::unique_ptr<MCStreamer> Streamer)</span><br />
<span style="font-size: xx-small;"> }</span><br />
<br />
<span style="font-size: xx-small;"> AsmPrinter::~AsmPrinter() {</span><br />
<span style="font-size: xx-small;">+ return;</span><br />
<span style="font-size: xx-small;"> assert(!DD && Handlers.empty() && "Debug/EH info didn't get finalized");</span><br />
<br />
<span style="font-size: xx-small;"> if (GCMetadataPrinters) {</span><br />
<br />
<h4>
Patch for debugging Address Space issues.</h4>
<span style="font-size: xx-small;">diff --git a/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/lib/Target/AMDGPU/AMDGPUISelLowering.cpp</span><br />
<span style="font-size: xx-small;">index 682157b..d2a5c4a 100644</span><br />
<span style="font-size: xx-small;">--- a/lib/Target/AMDGPU/AMDGPUISelLowering.cpp</span><br />
<span style="font-size: xx-small;">+++ b/lib/Target/AMDGPU/AMDGPUISelLowering.cpp</span><br />
<span style="font-size: xx-small;">@@ -766,6 +766,8 @@ SDValue AMDGPUTargetLowering::LowerGlobalAddress(AMDGPUMachineFunction* MFI,</span><br />
<span style="font-size: xx-small;"> unsigned Offset = MFI->allocateLDSGlobal(DL, *GV);</span><br />
<span style="font-size: xx-small;"> return DAG.getConstant(Offset, SDLoc(Op), Op.getValueType());</span><br />
<span style="font-size: xx-small;"> }</span><br />
<span style="font-size: xx-small;">+ default:</span><br />
<span style="font-size: xx-small;">+ printf("%s: address space type=%d\n", __func__, G->getAddressSpace());</span><br />
<span style="font-size: xx-small;"> }</span><br />
<br />
<span style="font-size: xx-small;"> const Function &Fn = *DAG.getMachineFunction().getFunction();</span></div>
<h3>
OpenCL Bandwidth Test (AMD App SDK)</h3>
<h3>
Intel (Beignet GPGPU Driver)</h3>
<span style="font-size: xx-small;">Platform found : Intel</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">Device 0 Intel(R) HD Graphics Haswell GT2 Desktop</span><br />
<span style="font-size: xx-small;">Build: release</span><br />
<span style="font-size: xx-small;">GPU work items: 32768</span><br />
<span style="font-size: xx-small;">Buffer size: 33554432</span><br />
<span style="font-size: xx-small;">CPU workers: 1</span><br />
<span style="font-size: xx-small;">Timing loops: 20</span><br />
<span style="font-size: xx-small;">Repeats: 1</span><br />
<span style="font-size: xx-small;">Kernel loops: 20</span><br />
<span style="font-size: xx-small;">inputBuffer: CL_MEM_READ_ONLY </span><br />
<span style="font-size: xx-small;">outputBuffer: CL_MEM_WRITE_ONLY </span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">Host baseline (naive):</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">Timer resolution 256.11 ns</span><br />
<span style="font-size: xx-small;">Page fault 531.44 ns</span><br />
<span style="font-size: xx-small;">CPU read 15.31 GB/s</span><br />
<span style="font-size: xx-small;">memcpy() 15.54 GB/s</span><br />
<span style="font-size: xx-small;">memset(,1,) 26.54 GB/s</span><br />
<span style="font-size: xx-small;">memset(,0,) 27.06 GB/s</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">AVERAGES (over loops 2 - 19, use -l for complete log)</span><br />
<span style="font-size: xx-small;">--------</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">1. Host mapped write to inputBuffer</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> clEnqueueMapBuffer -- WRITE (GBPS) | 9513.290</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> memset() (GBPS) | 24.746</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> clEnqueueUnmapMemObject() (GBPS) | 6176.168</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">2. GPU kernel read of inputBuffer</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> clEnqueueNDRangeKernel() (GBPS) | 38.225</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"> Verification Passed!</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">3. GPU kernel write to outputBuffer</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> clEnqueueNDRangeKernel() (GBPS) | 26.198</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">4. Host mapped read of outputBuffer</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> clEnqueueMapBuffer -- READ (GBPS) | 9830.400</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> CPU read (GBPS) | 15.431</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> clEnqueueUnmapMemObject() (GBPS) | 10485.760</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"> Verification Passed!</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">Passed!</span><br />
<h4>
AMD Radeon (OpenCL) </h4>
<span style="font-size: xx-small;">Platform found : Mesa</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">Device 0 AMD POLARIS10 (DRM 3.2.0 / 4.7.0-rc7-meow+, LLVM 4.0.0)</span><br />
<span style="font-size: xx-small;">Build: release</span><br />
<span style="font-size: xx-small;">GPU work items: 32768</span><br />
<span style="font-size: xx-small;">Buffer size: 33554432</span><br />
<span style="font-size: xx-small;">CPU workers: 1</span><br />
<span style="font-size: xx-small;">Timing loops: 20</span><br />
<span style="font-size: xx-small;">Repeats: 1</span><br />
<span style="font-size: xx-small;">Kernel loops: 20</span><br />
<span style="font-size: xx-small;">inputBuffer: CL_MEM_READ_ONLY </span><br />
<span style="font-size: xx-small;">outputBuffer: CL_MEM_WRITE_ONLY </span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">Host baseline (naive):</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">Timer resolution 256.12 ns</span><br />
<span style="font-size: xx-small;">Page fault 538.31 ns</span><br />
<span style="font-size: xx-small;">CPU read 12.19 GB/s</span><br />
<span style="font-size: xx-small;">memcpy() 11.38 GB/s</span><br />
<span style="font-size: xx-small;">memset(,1,) 20.93 GB/s</span><br />
<span style="font-size: xx-small;">memset(,0,) 22.98 GB/s</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">AVERAGES (over loops 2 - 19, use -l for complete log)</span><br />
<span style="font-size: xx-small;">--------</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">1. Host mapped write to inputBuffer</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> clEnqueueMapBuffer -- WRITE (GBPS) | 7586.161</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> memset() (GBPS) | 6.369</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> clEnqueueUnmapMemObject() (GBPS) | 12822.261</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">2. GPU kernel read of inputBuffer</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> clEnqueueNDRangeKernel() (GBPS) | 113.481</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"> Verification Passed!</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">3. GPU kernel write to outputBuffer</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> clEnqueueNDRangeKernel() (GBPS) | 105.898</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">4. Host mapped read of outputBuffer</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> clEnqueueMapBuffer -- READ (GBPS) | 9.559</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> CPU read (GBPS) | 17.179</span><br />
<span style="font-size: xx-small;"> ---------------------------------------|---------------</span><br />
<span style="font-size: xx-small;"> clEnqueueUnmapMemObject() (GBPS) | 4060.750</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"> Verification Passed!</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">Passed!</span>I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com1tag:blogger.com,1999:blog-1416969656167659289.post-73650929734108551932016-03-10T06:24:00.003+03:002016-03-10T06:32:16.767+03:00Fuzzing Vulkans, how do they work?<h2>
Introduction</h2>
<br />
Disclaimer: I have not yet fully read the specs on SPIR-V or Vulkan.<br />
<br />
I decided to find out how hard it is to crash code working with SPIR-V.<br />
Initially I wanted to crash the actual GPU drivers but for a start I decided<br />
to experiment with GLSLang.<br />
<br />
<h2>
What I got</h2>
<br />
I used the "afl-fuzz" fuzzer to generate test cases that could crash<br />
the parser. I have briefly examined the generated cases.<br />
I have uploaded the results (which contain the SPIR-V binaries causing<br />
the "spirv-remap" to crash) to the [following location](<a href="https://drive.google.com/file/d/0B7wcN-tOkdeRTGItSDhFM0JYUEk/view?usp=sharing">https://drive.google.com/file/d/0B7wcN-tOkdeRTGItSDhFM0JYUEk/view?usp=sharing</a>)<br />
<br />
Some of them trigger assertions in the code (which is not bad, but perhaps<br />
returning an error code and shutting down cleanly would be better).<br />
Some of them cause the code to hang for a long time or indefinitely (which is worse,<br />
especially if someone intends to use the SPIR-V parser code at runtime in their application).<br />
<br />
Perhaps some of the results marked as "hangs" just cause a very long compilation<br />
time and could produce more interesting results if the timeout in "afl-fuzz"<br />
were increased.<br />
<br />
Two notable examples causing long compilation time are:<br />
"out/crashes/id:000000,sig:06,src:000000,op:flip1,pos:15"<br />
"out/hangs/id:000011,src:000000,op:flip1,pos:538" - for this one I waited for<br />
a minute but it stil did not complete the compilation while causing 100% CPU load.<br />
<br />
A log output of "glslang" indicating that most of the error cases found are handled, but with "abort" instead of graceful shutdown.<br />
<a href="http://pastebin.com/BnZ63tKJ">http://pastebin.com/BnZ63tKJ</a><br />
<br />
<h3>
NVIDIA</h3>
I have also tried using these shaders with the NVIDIA driver (since it was the only hardware I could run a real Vulkan driver on).<br />
<br />
I have used the "instancing" demo from [SaschaWillems Repository](<a href="https://github.com/SaschaWillems/Vulkan">https://github.com/SaschaWillems/Vulkan</a>) .<br />
I patched it to accept the path to the binary shader via the command line.<br />
Next, I fed it with the generated<br />
test cases. Some of them triggered segfaults inside the NVIDIA driver.<br />
What is curious is that when i used the "hangs" examples, they also caused<br />
the NVIDIA driver to take extra long time to compile and eventually crash<br />
at random places.<br />
<br />
I think it indicates either that there is some common code between the driver<br />
and GLSLang (the reference implementation), or that the specification is missing<br />
some sanity check somewhere and the compiler can get stuck optimizing certain<br />
code.<br />
Is there a place in the specification that mandates that all values are<br />
checked to be within the allowed range, and that all complex structures (such as<br />
function calls) are checked recursively?<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQ-uUyqIQT-i66ji5T8CZhr4kSlvf10JZSa8Eqpqx-79XH1_rfb-erll7L41lBrKnVmFVFpIbHLOAI-bSwxErJvrQDBsq9iVySts48Ch8vt85FgLfnteH1K79pYqO7s33kn00bkKAVLeI/s1600/Screen+Shot+2016-03-10+at+05.59.05.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="60" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQ-uUyqIQT-i66ji5T8CZhr4kSlvf10JZSa8Eqpqx-79XH1_rfb-erll7L41lBrKnVmFVFpIbHLOAI-bSwxErJvrQDBsq9iVySts48Ch8vt85FgLfnteH1K79pYqO7s33kn00bkKAVLeI/s400/Screen+Shot+2016-03-10+at+05.59.05.png" width="400" /></a></div>
<br />
<br />
Perhaps I should have a look at other drivers (Mali anyone?).<br />
<br />
<span style="background-color: #cccccc;">```</span><br />
<span style="background-color: #cccccc;">[ 3672.137509] instancing[26631]: segfault at f ip 00007fb4624adebf sp 00007ffefd72e100 error 4 in libnvidia-glcore.so.355.00.29[7fb462169000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 3914.294222] instancing[26894]: segfault at f ip 00007f00b28fcebf sp 00007ffdb9bab980 error 4 in libnvidia-glcore.so.355.00.29[7f00b25b8000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4032.430179] instancing[27017]: segfault at f ip 00007f7682747ebf sp 00007fff46679bf0 error 4 in libnvidia-glcore.so.355.00.29[7f7682403000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4032.915849] instancing[27022]: segfault at f ip 00007fb4e4099ebf sp 00007fff3c1ac0f0 error 4 in libnvidia-glcore.so.355.00.29[7fb4e3d55000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4033.011699] instancing[27023]: segfault at f ip 00007f7272900ebf sp 00007ffdb54261e0 error 4 in libnvidia-glcore.so.355.00.29[7f72725bc000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4033.107939] instancing[27025]: segfault at f ip 00007fbf0353debf sp 00007ffde4387750 error 4 in libnvidia-glcore.so.355.00.29[7fbf031f9000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4033.203924] instancing[27026]: segfault at f ip 00007f0f9a6f0ebf sp 00007ffff85a9dd0 error 4 in libnvidia-glcore.so.355.00.29[7f0f9a3ac000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4033.299138] instancing[27027]: segfault at 2967000 ip 00007fcb42cab720 sp 00007ffcbad45228 error 6 in libc-2.19.so[7fcb42c26000+19f000]</span><br />
<span style="background-color: #cccccc;">[ 4033.394667] instancing[27028]: segfault at 36d2000 ip 00007efc789eb720 sp 00007fff26c636d8 error 6 in libc-2.19.so[7efc78966000+19f000]</span><br />
<span style="background-color: #cccccc;">[ 4033.490918] instancing[27029]: segfault at 167b15e170 ip 00007f3b02095ec3 sp 00007ffd768cbf68 error 4 in libnvidia-glcore.so.355.00.29[7f3b01cbe000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4033.586699] instancing[27030]: segfault at 2ffc000 ip 00007fdebcc06720 sp 00007fff4fe59bd8 error 6 in libc-2.19.so[7fdebcb81000+19f000]</span><br />
<span style="background-color: #cccccc;">[ 4033.682939] instancing[27031]: segfault at 8 ip 00007fb80e7eed50 sp 00007ffe9cd21de0 error 4 in libnvidia-glcore.so.355.00.29[7fb80e410000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4374.480872] show_signal_msg: 27 callbacks suppressed</span><br />
<span style="background-color: #cccccc;">[ 4374.480876] instancing[27402]: segfault at f ip 00007fd1fc3cdebf sp 00007ffe483ff520 error 4 in libnvidia-glcore.so.355.00.29[7fd1fc089000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4374.809621] instancing[27417]: segfault at 2e0c3910 ip 00007f39af846e96 sp 00007ffe1c6d8f10 error 6 in libnvidia-glcore.so.355.00.29[7f39af46f000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4374.905112] instancing[27418]: segfault at 2dc46a68 ip 00007f7b9ff7af32 sp 00007fff290edf00 error 6 in libnvidia-glcore.so.355.00.29[7f7b9fba2000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4375.001019] instancing[27419]: segfault at f ip 00007f5a4e066ebf sp 00007ffe0b775d70 error 4 in libnvidia-glcore.so.355.00.29[7f5a4dd22000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4375.096894] instancing[27420]: segfault at f ip 00007f7274d49ebf sp 00007ffe96fdea10 error 4 in libnvidia-glcore.so.355.00.29[7f7274a05000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4375.193165] instancing[27421]: segfault at f ip 00007fa3bf3c8ebf sp 00007ffc4117e8d0 error 4 in libnvidia-glcore.so.355.00.29[7fa3bf084000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4375.288969] instancing[27423]: segfault at f ip 00007f50e0327ebf sp 00007ffc02aa1d50 error 4 in libnvidia-glcore.so.355.00.29[7f50dffe3000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4375.385530] instancing[27424]: segfault at f ip 00007f0d9a32eebf sp 00007ffd0298eb40 error 4 in libnvidia-glcore.so.355.00.29[7f0d99fea000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4375.481829] instancing[27425]: segfault at f ip 00007f8400bc5ebf sp 00007ffef0334240 error 4 in libnvidia-glcore.so.355.00.29[7f8400881000+1303000]</span><br />
<span style="background-color: #cccccc;">[ 4375.576983] instancing[27426]: segfault at 2dec2bc8 ip 00007f52260afec3 sp 00007fffd2bd1728 error 4 in libnvidia-glcore.so.355.00.29[7f5225cd8000+1303000]</span><br />
<span style="background-color: #cccccc;">```</span><br />
<br />
<h2>
How to reproduce</h2>
<br />
Below are the steps I have taken to crash the "spirv-remap" tool.<br />
I believe this issue is worth looking at because some vendors may<br />
choose to build their driver internals on top of the reference implementation,<br />
which may let such bugs creep directly into the shipped software as-is.<br />
<br />
0. I have used a Debian Linux box. I have installed the "afl-fuzz" tool,<br />
and also manually copied the Vulkan headers to "/usr/include".<br />
<br />
1. cloned the GLSLang repository<br />
<span style="background-color: yellow;">```</span><br />
<span style="background-color: yellow;">git clone git@github.com:KhronosGroup/glslang.git</span><br />
<span style="background-color: yellow;">cd glslang</span><br />
<span style="background-color: yellow;">```</span><br />
<br />
2. Compiled it with afl-fuzz<br />
<span style="background-color: yellow;">```</span><br />
<span style="background-color: yellow;"><span class="Apple-tab-span" style="white-space: pre;"> </span>mkdir build</span><br />
<span style="background-color: yellow;"><span class="Apple-tab-span" style="white-space: pre;"> </span>cat SetupLinux.sh </span><br />
<span style="background-color: yellow;"><span class="Apple-tab-span" style="white-space: pre;"> </span>cd build/</span><br />
<span style="background-color: yellow;"><span class="Apple-tab-span" style="white-space: pre;"> </span>cmake -DCMAKE_C_COMPILER=afl-gcc -DCMAKE_CXX_COMPILER=afl-g++ ..</span><br />
<span style="background-color: yellow;"><span class="Apple-tab-span" style="white-space: pre;"> </span>cd ..</span><br />
<span style="background-color: yellow;">```</span><br />
<br />
3. Compiled a sample shader from the GLSL to the SPIR-V format using<br />
<span style="background-color: yellow;">```</span><br />
<span style="background-color: yellow;"><span class="Apple-tab-span" style="white-space: pre;"> </span>./build/install/bin/glslangValidator -V -i Test/spv.130.frag</span><br />
<span style="background-color: yellow;">```</span><br />
<br />
4. Verified that the "spirv-remap" tool works on the binary<br />
<span style="background-color: yellow;">```</span><br />
<span style="background-color: yellow;"><span class="Apple-tab-span" style="white-space: pre;"> </span>./build/install/bin/spirv-remap -v -i frag.spv -o /tmp/</span><br />
<span style="background-color: yellow;">```</span><br />
<br />
5. Fed the SPIR-V binary to the afl-fuzz<br />
<span style="background-color: yellow;">```</span><br />
<span style="background-color: yellow;"><span class="Apple-tab-span" style="white-space: pre;"> </span>afl-fuzz -i in -o out ./build/install/bin/spirv-remap -v -i @@ -o /tmp</span><br />
<span style="background-color: yellow;">```</span><br />
<br />
6. Quickly discovered several crashes. I attach a screenshot of afl-fuzz<br />
at work.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2untV_nlTtvum3Wcg__vNYeaxAmRTn7UQ9aKqXkKfsK-d0zJ5RXFHri4bWoHXMnykYWEPWBFQsqAvKB1pyCQgSny8gxgfUI3X_PQgG8y4WfXd88d-l3Hd6XbOpo-CQ4cxUFuS5l17fwI/s1600/Screen+Shot+2016-03-10+at+03.36.11.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="204" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2untV_nlTtvum3Wcg__vNYeaxAmRTn7UQ9aKqXkKfsK-d0zJ5RXFHri4bWoHXMnykYWEPWBFQsqAvKB1pyCQgSny8gxgfUI3X_PQgG8y4WfXd88d-l3Hd6XbOpo-CQ4cxUFuS5l17fwI/s320/Screen+Shot+2016-03-10+at+03.36.11.png" width="320" /></a></div>
<br />
<br />
7. Examined them.<br />
<br />
First, I made a hex diff of the good and bad files. The command to generate<br />
the diff is the following:<br />
<span style="background-color: yellow;">```</span><br />
<span style="background-color: yellow;"><span class="Apple-tab-span" style="white-space: pre;"> </span>for i in out/crashes/*; do hexdump in/frag.spv > in.hex && hexdump $i > out.hex && diff -Naur in.hex out.hex; done > hex.diff</span><br />
<span style="background-color: yellow;">```</span><br />
<br />
Next, I just ran the tool on all cases and generated the log of crash messages.<br />
<span style="background-color: yellow;">```</span><br />
<span style="background-color: yellow;"><span class="Apple-tab-span" style="white-space: pre;"> </span>for i in out/crashes/*; do echo $i && ./build/install/bin/spirv-remap -i $i -o /tmp/ 2>&1 ;done > abort.log</span><br />
<span style="background-color: yellow;">```</span><br />
<div>
<br /></div>
<h2>
Conclusions</h2>
<div>
Well, there are two obvious conclusions:</div>
<div>
1. Vulkan/SPIR-V is still a WIP and drivers are not yet perfect</div>
<div>
2. GPU drivers have always been notorious for poor compilers - not only codegen, but parsers and validators. Maybe part of the reason is that CPU compilers simply handle more complex code and therefore more edge cases have been hit already.</div>
I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-85402014337263707342016-02-25T06:07:00.001+03:002016-02-25T06:11:08.784+03:00Notes on firmware and tools<h4>
Introduction.</h4>
In this blog post I mainly want to summarize my latest experiments with clang's static analyzer, along with some thoughts on what could be done further with the analyzer and with open-source software quality in general. It's mostly some notes I've decided to put up.<br />
<br />
Here are the references to the people involved in developing clang static analyzer at Apple. I recommend following them on twitter and also reading their presentation on developing custom checkers for the analyzer.<br />
Ted Kremenek - <a href="https://twitter.com/tkremenek">https://twitter.com/tkremenek</a><br />
Anna Zaks - <a href="https://twitter.com/zaks_anna">https://twitter.com/zaks_anna</a><br />
Jordan Rose - <a href="https://twitter.com/UINT_MIN">https://twitter.com/UINT_MIN</a><br />
<br />
"How to Write a Checker in 24 Hours" - <a href="http://llvm.org/devmtg/2012-11/Zaks-Rose-Checker24Hours.pdf">http://llvm.org/devmtg/2012-11/Zaks-Rose-Checker24Hours.pdf</a><br />
"Checker Developer Manual" - <a href="http://clang-analyzer.llvm.org/checker_dev_manual.html">http://clang-analyzer.llvm.org/checker_dev_manual.html</a> - This one requires some understanding of LLVM, so I recommend to get comfortable with using the analyzer and looking at AST dumps first.<br />
<br />
There are not so many analyzer plugins that were not made at Apple.<br />
<br />
<ul>
<li>The "Tartan" which is intended to check the GLib conventions used in GNOME projects. <a href="https://www.freedesktop.org/software/tartan/">https://www.freedesktop.org/software/tartan/</a></li>
<li>A presentation called "MPI-Checker – Static Analysis for MPI" <a href="https://llvm-hpc2-workshop.github.io/slides/Droste.pdf">https://llvm-hpc2-workshop.github.io/slides/Droste.pdf</a></li>
</ul>
<br />
<h4>
GLibc.</h4>
Out of curiosity I ran the clang analyzer on glibc. Setting it up was not a big deal - in fact, all that was needed was to run scan-build. It did a good job of intercepting the gcc calls, and most of the sources were successfully analyzed. I did not do anything special, but even running the analyzer with the default configuration revealed a few true bugs, like the one shown in the screenshot.<br />
<br />
For example, the "iconv" function has a check to see if the argument "outbuf" is NULL, which indicates that it is a valid case expected by the authors. The manpage for the function also says that passing a NULL argument as "outbuf" is valid. However, we can see that one of the branches is missing the similar check for NULL pointer, which probably resulted from a copy-paste in the past. So, passing valid pointers to "inbuf" and "inbytesleft" and a NULL pointer for "outbuf" leads to the NULL pointer dereference and consequently a SIGSEGV.<br />
<br />
Fun fact: my example is also not quite correct, as pointed out by my Twitter followers. The third argument to iconv must be a pointer to the integer, not the integer. However, on my machine my example crashes at dereferencing the "outbuf" pointer and not the "inbytesleft", because the argument evaluation order in C is not specified, and on x86_64 arguments are usually pushed to the stack (and therefore evaluated) in the reverse order.<br />
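For reference, here is a minimal corrected sketch of such a call (valid "inbuf"/"inbytesleft", NULL "outbuf", and the lengths passed as pointers this time); whether it actually crashes of course depends on the glibc version in question:<br />
<br />
<span style="font-size: xx-small;">#include <iconv.h></span><br />
<span style="font-size: xx-small;">#include <stddef.h></span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">int main(void)</span><br />
<span style="font-size: xx-small;">{</span><br />
<span style="font-size: xx-small;">    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");</span><br />
<span style="font-size: xx-small;">    char in[] = "hello";</span><br />
<span style="font-size: xx-small;">    char *inbuf = in;</span><br />
<span style="font-size: xx-small;">    size_t inleft = sizeof(in) - 1;</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">    if (cd == (iconv_t)-1)</span><br />
<span style="font-size: xx-small;">        return 1;</span><br />
<span style="font-size: xx-small;">    /* a NULL outbuf/outbytesleft pair is documented as valid, but the branch</span><br />
<span style="font-size: xx-small;">       described above dereferences outbuf without checking it first */</span><br />
<span style="font-size: xx-small;">    iconv(cd, &inbuf, &inleft, NULL, NULL);</span><br />
<span style="font-size: xx-small;">    iconv_close(cd);</span><br />
<span style="font-size: xx-small;">    return 0;</span><br />
<span style="font-size: xx-small;">}</span><br />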
<br />
Is it a big deal? Who knows. On the one hand, it's a userspace library, and unlike OpenSSL, no one is likely to embed it into kernel-level firmware. On the other hand, I may very well imagine a device such as a router or a web kiosk where this NULL pointer dereference could be triggered, because internationalization and text manipulation are always a complex issue.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoumfstWAm29sMrPfR1-FkYk04zKqcyjc50ya6QTYs__TsC4y4AentA0reJoPGD37Dq-yrF4OSazxu8qjByrO4aWPYMArbv_XCdtlWG8wHTEJmCV8QXWuCaDSNgR1mswDxu4ma1uNpLKs/s1600/screenshot_glibc_clang.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="555" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoumfstWAm29sMrPfR1-FkYk04zKqcyjc50ya6QTYs__TsC4y4AentA0reJoPGD37Dq-yrF4OSazxu8qjByrO4aWPYMArbv_XCdtlWG8wHTEJmCV8QXWuCaDSNgR1mswDxu4ma1uNpLKs/s640/screenshot_glibc_clang.png" width="640" /></a></div>
<br />
<h4>
Linux Kernel.</h4>
I had this idea of trying to build LLVMLinux with clang for quite a while, but never really had the time to do it. My main interest in doing so was using the clang static analyzer.<br />
<br />
Currently, some parts of the Linux Kernel fail to build with clang, so I had to use the patches from the LLVMLinux project. They failed to apply cleanly, though, and I had to manually edit several patches. Another problem is that "git am" does not support the "fuzzy" strategy when applying patches, so I had to use a script found on GitHub which uses "patch" and "git commit" to do the same thing.<br />
<a href="https://gist.github.com/kfish/7425248">https://gist.github.com/kfish/7425248</a><br />
<br />
I have pushed my tree to github. I based it off the latest upstream at the time when I worked on it (which was the start of February 2016). The branch is called "4.5_with_llvmlinux".<br />
<a href="https://github.com/astarasikov/linux/tree/4.5_with_llvmlinux">https://github.com/astarasikov/linux/tree/4.5_with_llvmlinux</a><br />
<br />
I've used the following commands to get the analysis results. Note that some files have failed to compile, and I had to manually stop the compilation job for one file that took over 40 minutes.<br />
<span style="background-color: #666666;"><span style="color: cyan;"><br /></span></span>
<span style="background-color: #666666; color: lime;">export CCC_CXX=clang++</span><br />
<span style="background-color: #666666; color: lime;">export CCC_CC=clang</span><br />
<div class="p1">
<span style="background-color: #666666; color: lime;"><span class="s1">scan-build</span><span class="s2"> make CC=clang HOSTCC=clang -j10 -k</span></span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHSzvfPGz3O3EhlGf_k8YL71XDO7Np8s7vItj9ocMrORnA3L0H0OiJKmLFuY6oHNaIXPe0HtxdRthrumlEO-8yqoegSCe0TL0Mg5Z11UEAyTGrJ1S9uA8EiX-DFP1AVs1-9dOenLxHKJU/s1600/scan_build_linux_kernel.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="127" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHSzvfPGz3O3EhlGf_k8YL71XDO7Np8s7vItj9ocMrORnA3L0H0OiJKmLFuY6oHNaIXPe0HtxdRthrumlEO-8yqoegSCe0TL0Mg5Z11UEAyTGrJ1S9uA8EiX-DFP1AVs1-9dOenLxHKJU/s320/scan_build_linux_kernel.png" width="320" /></a></div>
<br />
<h3>
Ideas.</h3>
<h4>
Porting clang instrumentations to major OpenSource projects.</h4>
<div>
Clang has a number of instrumentation plugins called "Sanitizers" which were largely developed by Konstantin Serebryany and Dmitry Vyukov at Google.</div>
<div>
<br /></div>
<div>
Arguably the most useful tool for C code is AddressSanitizer which allows to catch Use-After-Free and Out-of-Bounds access on arrays. There exists a port of the tool for the Linux Kernel, called Kernel AddressSanitizer, or KASAN, and it has been used to uncover a lot of memory corruption bugs, leading to potential vulnerabilities or kernel panics.</div>
<div>
Another tool, also based on compile-time instrumentation, is the ThreadSanitizer for catching data races, and it has also been ported to the Linux Kernel.</div>
<div>
<br /></div>
<div>
<a href="https://github.com/google/kasan/wiki">https://github.com/google/kasan/wiki</a></div>
<div>
<a href="https://github.com/google/kasan/wiki/Found-Bugs">https://github.com/google/kasan/wiki/Found-Bugs</a></div>
<div>
<br /></div>
<div>
I think it would be very useful to port these tools to the other projects. There are a lot of system-level software driving the critical aspects of system initialization process. To name a few:</div>
<div>
<ul>
<li>EDK2 UEFI Development Kit</li>
<li>LittleKernel bootloader by Travis Geiselbrecht. It has been extensively used in the embedded world instead of u-boot lately. Qualcomm (and Codeaurora, its Open-Source department) is using a customized version of LK for its SoCs, and nearly every mobile phone with Android has LK inside it. NVIDIA have recently started shipping a version of LK called "TLK" or "Trusted Little Kernel" in its K1 processors, with the largest deployment yet being the Google Nexus 9 tablet.</li>
<li>U-boot bootloader, which is still used in a lot of devices.</li>
<li>FreeBSD and other kernels. While I don't know how widely they are deployed today, it would still be useful, at least for junior and intermediate kernel hackers.</li>
<li>XNU Kernel powering Apple iOS and OS X. I have some feeling that Apple might be using some of the tools internally, though the public snapshots of the source are heavily stripped of newest features.</li>
</ul>
</div>
<div>
Btw, if anyone is struggling to get the userspace AddressSanitizer working with the NVIDIA drivers, take a look at this thread where I posted a workaround. Long story short, NVIDIA mmaps memory in large chunks up to a certain range, and you need to change the virtual address which ASan claims for itself.</div>
<div>
<a href="http://marc.info/?l=llvm-dev&m=141168078303757&w=2">http://marc.info/?l=llvm-dev&m=141168078303757&w=2</a><br />
<a href="http://marc.info/?l=llvm-dev&m=141169586008101&w=2">http://marc.info/?l=llvm-dev&m=141169586008101&w=2</a></div>
<div>
<br /></div>
<div>
Note that you will likely get problems if you build a library with a customized ASan and link it to something with unpatched ASan, so I recommend using a distro where you can rebuild all packages simultaneously if you want to experiment with custom instrumentation, such as NixOS, Gentoo or *BSD.</div>
<div>
<br /></div>
<div>
As I have <a href="http://allsoftwaresucks.blogspot.ru/2015/11/porting-legacy-rtos-software-to-custom.html">mentioned before</a>, there is a caveat with all these new instrumentation tools - most of them are made for the 64-bit architecture while the majority of the embedded code stays 32-bit. There are several ways to address it, such as running memory checkers in a separate process or using a simulator such as a modified QEMU, but it's an open problem.</div>
<div>
<div>
<h4>
Other techniques for firmware quality</h4>
<div>
In one of my previous posts I have drawn attention to gathering code coverage in low-level software, such as bootloaders. With GCOV being quite easy to port, I think more projects should be exposed to it. Also, AddressSanitizer now comes with a custom infrastructure for gathering code coverage, which is more extensive than GCOV.</div>
</div>
<div>
<br /></div>
<div>
Here's a link to the post which shows how one can easily embed GCOV to get coverage using Little Kernel as an example.</div>
<div>
<a href="http://allsoftwaresucks.blogspot.ru/2015/05/gcov-is-amazing-yet-undocumented.html">http://allsoftwaresucks.blogspot.ru/2015/05/gcov-is-amazing-yet-undocumented.html</a></div>
<div>
<br /></div>
<div>
While userspace applications have received quite a lot of fuzzing recently, especially with the introduction of the <a href="http://lcamtuf.coredump.cx/afl/">AFL fuzzer tool</a>, the kernel side has received little attention. For Linux, there is a syscall fuzzer called Trinity.</div>
<div>
<a href="http://codemonkey.org.uk/projects/trinity/">http://codemonkey.org.uk/projects/trinity/</a></div>
<div>
<a href="https://github.com/kernelslacker/trinity">https://github.com/kernelslacker/trinity</a></div>
<div>
<br /></div>
<div>
There is also an interesting project at fuzzing kernel through USB stack (vUSBf).</div>
<div>
<a href="https://github.com/schumilo/vUSBf">https://github.com/schumilo/vUSBf</a></div>
<div>
<a href="https://www.blackhat.com/docs/eu-14/materials/eu-14-Schumilo-Dont-Trust-Your-USB-How-To-Find-Bugs-In-USB-Device-Drivers.pdf">https://www.blackhat.com/docs/eu-14/materials/eu-14-Schumilo-Dont-Trust-Your-USB-How-To-Find-Bugs-In-USB-Device-Drivers.pdf</a></div>
<div>
<br /></div>
<div>
What should be done is adapting these techniques to other firmware projects. On the one hand, it might get tricky due to the limited debugging facilities available in some kernels. On the other hand, the capabilities provided by simulators such as QEMU are virtually unlimited (pun unintended).</div>
<div>
<br /></div>
<div>
An interesting observation might be that there are limited sources of external data input on a system. They include processor interrupt vectors/handlers and MMIO hardware. As for the latter, in Linux and most other firmware there are certain facilities for working with MMIO - functions like "readb()", "readw()" and "ioremap()". Also, if we're speaking of a simulator such as QEMU, we can identify memory regions of interest by walking the page table and checking the physical address against external devices, and also by checking the access type bits - for example, data marked as executable is more likely to be firmware code, while uncached contiguous regions of memory are likely DMA windows.</div>
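<div>
To make this concrete, here is a rough sketch of what such an input point looks like in Linux driver code (the physical address and register offsets are made up); every value obtained this way ultimately comes from the device, which makes it a natural place for fault injection in a simulator:</div>
<div>
<span style="font-size: xx-small;">#include <linux/io.h></span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">/* sketch of a typical MMIO read path; addresses are purely illustrative */</span><br />
<span style="font-size: xx-small;">static u32 demo_read_status(void)</span><br />
<span style="font-size: xx-small;">{</span><br />
<span style="font-size: xx-small;">    void __iomem *regs = ioremap(0xfedc0000, 0x1000);</span><br />
<span style="font-size: xx-small;">    u32 status;</span><br />
<span style="font-size: xx-small;"><br /></span>
<span style="font-size: xx-small;">    if (!regs)</span><br />
<span style="font-size: xx-small;">        return 0;</span><br />
<span style="font-size: xx-small;">    status = readl(regs + 0x04); /* this value comes from (possibly emulated) hardware */</span><br />
<span style="font-size: xx-small;">    iounmap(regs);</span><br />
<span style="font-size: xx-small;">    return status;</span><br />
<span style="font-size: xx-small;">}</span><br />
</div>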
<div>
<br /></div>
<div>
The ARM mbed TLS library is a good example of a project that tries to integrate as many dynamic and static analysis tools as possible into its testing process. However, it can be built as a userspace library on a desktop, which makes it less interesting in the context of firmware security.</div>
<div>
<a href="https://blog.gdssecurity.com/labs/2015/9/21/fuzzing-the-mbed-tls-library.html">https://blog.gdssecurity.com/labs/2015/9/21/fuzzing-the-mbed-tls-library.html</a></div>
<div>
<a href="https://tls.mbed.org/kb/generic/what-tests-and-checks-are-run-for-mbedtls">https://tls.mbed.org/kb/generic/what-tests-and-checks-are-run-for-mbedtls</a></div>
<div>
<a href="https://www.mbed.com/en/development/software/mbed-tls/">https://www.mbed.com/en/development/software/mbed-tls/</a></div>
<div>
<br /></div>
<div>
Another technique that has been catching my attention for a lot of time is symbolic execution. In many ways, it is a similar problem to the static analysis - you need to find a set of input values constrained by certain equations to trigger a specific execution path leading to the incorrect state (say, a NULL pointer dereference).</div>
<div>
<br /></div>
<div>
Rumors are, a tool based on this technique called SAGE is actively used at Microsoft Research to analyze internal code, but sadly there are not so many open-source and well-documented tools one can easily play with and directly plug into an existing project.</div>
<div>
<a href="https://www.blogger.com/goog_927574502"><br /></a></div>
<div>
An interesting example of applying this idea to a practical problem at a large and established corporation (Intel) is presented in the paper called "Symbolic execution for BIOS security", which tries to utilize symbolic execution with KLEE for attacking firmware - the SMM handler. You can find more interesting details in the blog of one of the authors of the paper, Vincent Zimmer (<a href="https://twitter.com/vincentzimmer">https://twitter.com/vincentzimmer</a> and <a href="http://vzimmer.blogspot.com/">http://vzimmer.blogspot.com/</a>).</div>
</div>
<div>
<br /></div>
<div>
Also, not directly related, but an interesting presentation about bug hunting on Android.</div>
<div>
<a href="http://www.slideshare.net/jiahongfang5/qualcomm2015-jfang-nforest">http://www.slideshare.net/jiahongfang5/qualcomm2015-jfang-nforest</a><br />
<br />
I guess now I will have to focus on studying recent exploitation techniques used for hacking XNU, Linux and Chromium to get a clearer picture of what needs to be achieved :)</div>
<h4>
Further developing clang analyzer.</h4>
<div>
One feature missing from clang is the ability to perform analysis across translation units. A Translation Unit (or TU) in clang/LLVM roughly represents a single file on disk being parsed. The clang analyzer is implemented as a pass which traverses the AST and is limited to one translation unit - strictly speaking, it also analyzes the headers included by that unit recursively. But if you have two separate C files which do not include one another, and a function from one file calls a function from the other, the analysis will not work.</div>
<div>
<br /></div>
<div>
Implementing a checker that works across multiple source files is not trivial, as pointed out by the analyzer authors on the mailing list, though the topic is brought up regularly by different people.</div>
<div>
<br /></div>
<div>
<a href="http://lists.llvm.org/pipermail/cfe-dev/2013-January/027234.html">http://lists.llvm.org/pipermail/cfe-dev/2013-January/027234.html</a></div>
<div>
<a href="http://lists.llvm.org/pipermail/cfe-dev/2016-January/046905.html">http://lists.llvm.org/pipermail/cfe-dev/2016-January/046905.html</a></div>
<h4>
<div style="font-weight: normal;">
Often, it is possible to come up with a workaround, especially if one aims at implementing ad-hoc checks for their own project. The simplest one would be creating a top-level "umbrella" C file which includes all the files implementing the functions; I have seen some projects do exactly this. The most obvious shortcomings of this approach are that it requires redesigning the whole structure of the project and that it will not work if some of the translation units need custom C compiler options.</div>
<div style="font-weight: normal;">
<br /></div>
<div style="font-weight: normal;">
Another option would be to dump/serialize the AST and any additional information during the compilation of each TU and process it after the whole project is built. It looks like this approach has been proposed multiple times on the mailing list, and there exists at least one paper which claims to do exactly that.</div>
<div style="font-weight: normal;">
<br /></div>
<span style="font-weight: normal;">"Towards Vulnerability Discovery Using Extended
Compile-time Analysis" - <a href="http://arxiv.org/pdf/1508.04627.pdf">http://arxiv.org/pdf/1508.04627.pdf</a></span><br />
<br />
<span style="font-weight: normal;">In fact, the analyzer part itself might very well be independent of the actual parser, and could be reused, for example, to analyze the data obtained by the disassembly, but it's a topic for another research area.</span></h4>
<h4>
Developing ad-hoc analyzers for certain projects.</h4>
<div>
While it is very difficult to statically rule out many kinds of errors in an arbitrary project, either due to the state explosion problem or the dynamic nature of the code, we can quite often design tools to verify contracts specific to a certain project.</div>
<div>
<br /></div>
<div>
An example of such a tool is "sparse" from the Linux kernel. It effectively works as a parser for C code and can be made to run on every C file compiled by GCC while building the kernel. It allows attaching annotations to declarations in the C code, working much like the attributes that were later implemented in GCC and Clang.</div>
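<div>
A small illustration of the kind of contract sparse verifies - the __user address-space annotation used throughout the kernel (the function below is just an example I made up, not actual kernel code):</div>
<pre>
#include <linux/uaccess.h>
#include <linux/errno.h>

/* "uptr" lives in the user address space; sparse tracks this attribute. */
long read_user_value(int __user *uptr)
{
    int val;

    /* correct: data is moved through the checked accessor */
    if (copy_from_user(&val, uptr, sizeof(val)))
        return -EFAULT;

    /* val = *uptr;  <- a direct dereference here would make sparse warn
     * about mixing __user and kernel pointers */
    return val;
}
</pre>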
<div>
<a href="https://en.wikipedia.org/wiki/Sparse.">https://en.wikipedia.org/wiki/Sparse.</a></div>
<div>
<a href="http://linuxwireless.org/en/developers/Documentation/using-sparse/">http://linuxwireless.org/en/developers/Documentation/using-sparse/</a></div>
<div>
<br /></div>
<div>
One notable example of code in the Linux kernel which deals with passing void pointers around and relies on pointer trickery via the "container_of" macro is the workqueue library.</div>
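<div>
For illustration, this is the kind of pattern involved (a made-up example, not actual workqueue code): the callback receives a pointer to the embedded member and recovers the enclosing object, and nothing checks that the member really is embedded in the type you name.</div>
<pre>
#include <linux/workqueue.h>

struct my_device {
    int irq;
    struct work_struct work; /* embedded member */
};

static void my_work_fn(struct work_struct *work)
{
    /* recover the enclosing structure from the embedded member */
    struct my_device *dev = container_of(work, struct my_device, work);

    /* if "work" was actually embedded in a different structure, this
     * computes a bogus pointer and the compiler stays silent */
    (void)dev->irq;
}
</pre>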
<div>
<br /></div>
<div>
While <a href="http://allsoftwaresucks.blogspot.ru/2014/09/gsoc-results-or-how-i-nearly-wasted.html">working on the FreeBSD kernel during the GSoC in 2014</a>, I faced a similar problem while developing device drivers - at certain places pointers were cast to void, and casted back to typed ones where needed. Certainly it's easy to make a mistake while coding these things.</div>
<div>
<br /></div>
<div>
Now, if we dump enough information during compilation, we can implement advanced checks. For example, when facing a function pointer which is initialized dynamically, we can do two things. First, find all places where it can potentially be initialized. Second, find all functions with a matching prototype. While checking all of them might be time-consuming and generate false positives, it would also allow more code to be checked statically at compilation time.</div>
<div>
<br /></div>
<div>
A notable source of problems when working with C code is that the linking stage is traditionally separated from the compilation stage. The linker manipulates abstract "symbols" which are essentially untyped addresses. Even though it would be possible to store enough type information in a section of the ELF (in fact, DWARF debugging data contains type information) and use it to type-check symbols when linking, this is not usually done.</div>
<div>
<br /></div>
<div>
This leads to certain funky problems. For example, aliases (weak aliases) are a link-time feature. If one defines an alias to some function where the type of the alias does not match the type of the original function, the compiler cannot check it (well, it could if someone wrote a checker, but it does not), and you get silent memory corruption at runtime. I once ran into this issue when porting the RUMP library, which ships a custom libc, and our C library had a different size for "off_t".</div>
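<div>
A sketch of that pitfall (made-up function names; depending on the compiler version you may get a warning at best): the alias is declared with a prototype whose offset argument has a different width than the real definition, so callers of the alias silently use the wrong ABI.</div>
<pre>
/* The real definition takes a 64-bit offset. */
long long my_lseek64(int fd, long long offset, int whence)
{
    (void)fd; (void)whence;
    return offset;
}

/* The alias is declared with a 32-bit offset. The linker resolves both
 * names to the same code; callers of my_lseek32 pass arguments according
 * to the wrong prototype, which corrupts data at runtime. */
long my_lseek32(int fd, long offset, int whence)
    __attribute__((alias("my_lseek64")));
</pre>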
<h4>
Refactoring</h4>
<div>
There are two ideas I had floating around for quite a while.</div>
<div>
<br /></div>
<div>
Microkernelizing Linux.</div>
<div>
<br /></div>
<div>
An interesting research problem would be coming up with a tool to automatically convert/rewrite linux-kernel code into a microkernel with device drivers and other complex pieces of code staying in separate protection domains (processes).<br />
<br />
It is interesting for several reasons. One is that a lot of microkernels, such as L4, rely on DDE-Kits to run pieces of Linux and other systems (such as NetBSD in the case of RUMP) to provide drivers. Another is that there's a lot of tested, feature-rich code which could possibly be made more secure by minimizing the impact of memory corruption.<br />
<br />
Besides obvious performance concerns there are a lot of research questions to this.<br />
<br />
<ul>
<li>Converting access to global variables to IPC accesses. Most challenging part would be dealing with function pointers and callbacks.</li>
<li>Parsing KConfig files and "#ifdef" statements to ensure all conditional compilation cases are covered when refactoring. This in itself is a huge problem for every C codebase - if you change something in one branch of an "#ifdef" statement, you cannot guarantee you didn't break it for another branch. To get full coverage, it would be useful to come up with some automated way of ensuring all configurations of the "#ifdef"s are built.</li>
<li>Deciding which pieces of code should be linked statically and reside in the same process. For example, one might want the sound subsystem, DRM and VFS to run as separate processes, but going as far as converting each TU into a process would be overkill and a performance disaster.</li>
</ul>
<br />
Translating C code to another language. I am not really sure whether it would be useful. It is likely that any complex code involving pointers and arrays would require a target language with similar capabilities (if we're speaking of generating useful, readable, idiomatic code, and not just a translator such as emscripten). Therefore, the target language might very well have the same areas of unspecified behavior. Some people have proposed a stricter, more well-defined dialect of C.<br />
<br />
One may note that it is not necessary to use clang for any of these tasks. In fact, one can get away with writing a custom parser or hacking GCC. These options are perfectly valid. I have had some of these ideas floating around for at least three years, but back then I didn't have the skills to build and hack on GCC, and now I've mostly been using clang, but it's not a big deal.</div>
<h4>
Conclusion</h4>
<div>
Overall, neither improving the compiler nor switching to another language would alone save us from errors. What's often overlooked is the process of continuous integration: regression tests that show what became broken, gathering coverage, and running integration tests.</div>
<div>
There are a lot of non-obvious parameters which are difficult to measure - for example, how to set up a regression test that would detect software breakage caused by a new compiler optimization.</div>
<div>
<br /></div>
<div>
Besides, we could borrow a lot of practices from the HDL/digital design industry. Apart from coverage, an interesting idea is implementing the same design in different languages with different semantics, in the hope that mistakes made at the implementation stage will not be in the same places, so testing will show where the outputs of the two systems diverge.</div>
<h4>
P.S.</h4>
Ping me if you're interested in working on the topics mentioned above :)I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-56264797547634221442016-01-25T04:42:00.000+03:002016-01-25T04:42:06.370+03:00On GPU ISAs and hackingRecently I've been reading several interesting posts on hacking Windows OpenGL drivers to exploit the "<span style="white-space: pre-wrap;">GL_ARB_get_program_binary" extension to access raw GPU instructions.</span><br />
<span style="white-space: pre-wrap;"><br /></span>
<br />
<ul>
<li><span style="white-space: pre-wrap;">"Hacking GCN via OpenGL" by Tomasz Stachowiak (<a href="https://twitter.com/h3r2tic">@h3r2tic</a>) - </span><a href="https://onedrive.live.com/view.aspx?resid=EBE7DEDA70D06DA0!107&app=PowerPoint&authkey=!AD-O3oq3Ung7pzk" style="white-space: pre-wrap;">Presentation on OneDrive</a></li>
<li><span style="white-space: pre-wrap;"><span style="font-family: inherit;">"</span></span><span style="background-color: white; font-family: inherit; line-height: 1.2;">You Compiled This, Driver. Trust Me…." about hacking Intel Haswell GPU - </span><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><a href="http://www.joshbarczak.com/blog/?p=1028">The blog</a> by </span></span><span style="background: rgb(245, 248, 250); color: black; font-family: inherit; line-height: 19.25px; text-decoration: none;"><a class="account-group js-account-group js-action-profile js-user-profile-link js-nav" data-user-id="2435632202" href="https://twitter.com/JoshuaBarczak" style="background: rgb(245, 248, 250); line-height: 19.25px; text-decoration: none;"><span class="fullname js-action-profile-name show-popup-with-id" data-aria-label-part="">Joshua Barczak</span> <span class="username js-action-profile-name" data-aria-label-part="" style="direction: ltr; unicode-bidi: embed;">@JoshuaBarczak</span> </a></span></li>
</ul>
<br />
<span style="white-space: pre-wrap;">Reverse engineering is always cool and interesting. However, a lot of these efforts have been previously done not once because people directed their efforts at making open-source drivers for linux, and I believe these projects are a good way to jump-start into GPU architecture without reading huge vendor datasheets.</span><br />
<span style="white-space: pre-wrap;"><br /></span>
<span style="white-space: pre-wrap;">Since I'm interested in the GPU drivers, I decided to put up links to several open-source projects that could be interesting to fellow hackers and serve as a reference, potentially more useful than vendor documentation (which is not available for some GPUs. you know, the green ones).</span><br />
<span style="white-space: pre-wrap;"><br /></span>
<span style="white-space: pre-wrap;">Please also take a look at this interesting blog post for the detailed list of GPU datasheets - <a href="http://renderingpipeline.com/graphics-literature/low-level-gpu-documentation/">http://renderingpipeline.com/graphics-literature/low-level-gpu-documentation/</a></span><br />
<br />
* <span style="background-color: yellow;">AMD R600 GPU Instruction Set Manual</span>. It is an uber-useful artifact because many modern GPUs (older AMD desktop GPUs, Qualcomm Adreno GPUs and Xbox 360 GPU) bear a lot of similarities with AMD R300/R600 GPU series.<br />
<a href="http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf">http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf</a><br />
<br />
* <span style="background-color: yellow;">Xenia Xbox 360 Emulator</span>. It features a Dynamic Binary Translation module that translates the AMD instruction stream into GLSL shaders at runtime. Xbox 360 has an AMD Radeon GPU, similar to the one in R600 series and Adreno<br />
https://github.com/benvanik/xenia<br />
<br />
Let's look at "src/xenia/gpu". Warning: the code at this project seems to be changing very often and the links may quickly become stale, so take a look yourself.<br />
One interesting file is "register_table.inc" which has register definitions for MMIO space allowing you to figure out a lot about how the GPU is configured.<br />
The most interesting file is "<a href="https://github.com/benvanik/xenia/blob/master/src/xenia/gpu/ucode.h">src/xenia/gpu/ucode.h</a>", which contains the structures describing the ISA.<br />
<a href="https://github.com/benvanik/xenia/blob/master/src/xenia/gpu/packet_disassembler.cc">src/xenia/gpu/packet_disassembler.cc</a> contains the logic to disassemble the actual GPU packet stream at binary level<br />
<a href="https://github.com/benvanik/xenia/blob/master/src/xenia/gpu/shader_translator.cc">src/xenia/gpu/shader_translator.cc</a> contains the logic to translate the GPU bytecode to the IR<br />
<a href="https://github.com/benvanik/xenia/blob/master/src/xenia/gpu/glsl_shader_translator.cc">src/xenia/gpu/glsl_shader_translator.cc</a> the code to translate Xenia IR to GLSL<br />
Part of the Xbox->GLSL logic is also contained in "<a href="https://github.com/benvanik/xenia/blob/master/src/xenia/gpu/gl4/gl4_command_processor.cc">src/xenia/gpu/gl4/gl4_command_processor.cc</a>" file.<br />
<br />
* <span style="background-color: yellow;">FreeDreno - the Open-Source Driver for the Qualcomm/ATI Adreno GPUs</span>. These GPUs are based on the same architecture that the ATI R300 and consequently AMD R600 series, and both ISAs and register spaces are quite similar. This driver was done by Rob Clark, who is a GPU driver engineer with a lot of experience who (unlike most FOSS GPU hackers) previously worked on another GPU at a vendor and thus had enough experience to successfully reverse-engineer a GPU.<br />
<br />
<a href="https://github.com/freedreno/mesa">https://github.com/freedreno/mesa</a><br />
<br />
Take a look at "<a href="https://github.com/freedreno/mesa/blob/master/src/gallium/drivers/freedreno/a3xx/fd3_program.c#L87">src/gallium/drivers/freedreno/a3xx/fd3_program.c</a>" to get an idea of how the GPU command stream packet is formed on the A3XX series GPUs used in most Android smartphones today. Well, the overall scheme is the same for most GPUs - there is a ring buffer to which the CPU submits commands and from which the GPU reads them, but the actual format of the buffer is vendor- and model-specific.<br />
<br />
I really love this work and this driver because it was the first open-source driver for mobile ARM SoCs providing support for both OpenGL ES and desktop OpenGL that was complete enough to run GNOME Shell and other window managers with GPU acceleration, and it also supports most Qualcomm GPUs.<br />
<br />
* <span style="background-color: yellow;">Mesa. Mesa is the open-source stack for several compute and graphic APIs for Linux: OpenGL, OpenCL, OpenVG, VAAPI and OpenMAX</span>.<br />
<br />Contrary to popular belief, <span style="background-color: lime;">it is not a "slow" and "software-rendering" thing</span>. Mesa has several backends: "softpipe" is indeed a slow, CPU-side implementation of OpenGL, while "llvmpipe" is a faster one using LLVM for vectorization. But the most interesting part is the drivers for the actual hardware. Currently, Mesa supports most popular GPUs (with the notable exception of ARM Mali). You can peek into the Mesa source code to get insight into how to control both the 2D/ROP engines of the GPUs and how to upload the actual instructions (the ISA bytecode) to the GPU. Example links are below. It's a holy grail for GPU hacking.<br />
<br />
"<a href="http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i915/i915_program.c">http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i915/i915_program.c</a>" contains the code to emit I915 instructions.<br />
The code for the latest GPUs - Broadwell - is in the "i965" driver.<br />
<br />
<ul>
<li>The code to set up the compute unit - the number of threads/groups - <a href="http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i965/brw_compute.c#L76">http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i965/brw_compute.c#L76</a></li>
<li>The dreaded CURBE described in the Joshua's post - <a href="http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i965/brw_curbe.c#L295">http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i965/brw_curbe.c#L295</a></li>
<li>The code to encode the instructions for the Broadwell EU (Execution Unit) - <a href="http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i965/brw_eu_emit.c">http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i965/brw_eu_emit.c</a></li>
<li>Broadwell Instruction Disassembler - <a href="http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i965/brw_disasm.c">http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i965/brw_disasm.c</a></li>
</ul>
<br />
<br />
<br />
<div>
While browsing Mesa code, one will notice the mysterious "bo" name. It stands for "Buffer Object" and is an abstraction for a GPU memory region. It is handled by the userspace library called "libdrm" - <a href="http://cgit.freedesktop.org/mesa/drm/tree/">http://cgit.freedesktop.org/mesa/drm/tree/</a></div>
<br />
* <span style="background-color: yellow;">Linux Kernel "DRM/KMS" drivers</span>. KMS/DRM is a facility of Linux Kernel to handle GPU buffer management and basic GPU initialization and power management in kernel space. The interesting thing about these drivers is that you can peek at the places where the instructions are actually transferred to the GPU - the registers containing physical addresses of the code.<br />
There are other artifacts. For example, I like this code in the Intel GPU driver (i915) which does runtime patching of GPU instruction stream to relax security requirements.<br />
<br />
<a href="https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_cmd_parser.c">https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_cmd_parser.c</a><br />
<br />
* <span style="background-color: yellow;">Beignet - an OpenCL implementation for Intel Ivy-Bridge and later GPUs</span>. Well, it actually works.<br />
<a href="http://cgit.freedesktop.org/beignet/">http://cgit.freedesktop.org/beignet/</a><br />
The interesting code to emit the actual command stream is at <a href="http://cgit.freedesktop.org/beignet/tree/src/intel/intel_gpgpu.c">http://cgit.freedesktop.org/beignet/tree/src/intel/intel_gpgpu.c</a><br />
<br />
* LibVA Intel H.264 Encoder driver. Note that the proprietary MFX SDK also uses the LibVA internally, albeit a patched version with newer code which typically gets released and merged into upstream LibVA after it goes through Intel internal testing.<br />
<a href="https://www.blogger.com/goog_1959131072"><br /></a>
<a href="http://cgit.freedesktop.org/vaapi/intel-driver/tree/">http://cgit.freedesktop.org/vaapi/intel-driver/tree/</a><br />
Interesting artifacts are the sources with the encoder pseudo-assembly.<br />
<br />
<a href="http://cgit.freedesktop.org/vaapi/intel-driver/tree/src/shaders/h264/mc/interpolate_Y_8x8.asm">http://cgit.freedesktop.org/vaapi/intel-driver/tree/src/shaders/h264/mc/interpolate_Y_8x8.asm</a><br />
<a href="http://cgit.freedesktop.org/vaapi/intel-driver/tree/src/shaders/h264/mc/AVCMCInter.asm">http://cgit.freedesktop.org/vaapi/intel-driver/tree/src/shaders/h264/mc/AVCMCInter.asm</a><br />
<br />
Btw, If you want zero-copy texture sharing to pass OpenGL rendering directly to the H.264 encoder with either OpenGL ES or non-ES, please refer to my earlier post - <a href="http://allsoftwaresucks.blogspot.ru/2014/10/abusing-mesa-by-hooking-elfs-and-ioctl.html">http://allsoftwaresucks.blogspot.ru/2014/10/abusing-mesa-by-hooking-elfs-and-ioctl.html</a> and to <a href="http://cgit.freedesktop.org/wayland/weston/tree/src/vaapi-recorder.c">Weston VAAPI recorder</a>.<br />
<br />
* <span style="background-color: yellow;">Intel KVM-GT and Xen-GT. These are GPU virtualization solutions from Intel</span>. They emulate the PCI GPU in the VM. They expose the MMIO register space fully compatible with the real GPU so that the guest OS uses the vanilla or barely modified driver. The host contains the code to reinterpret the GPU instruction stream and also to set up the host GPU memory space in such a way that it's partitioned equally between the clients (guest VMs).<br />
<br />
There are useful papers, and interesting pieces of code.<br />
<br />
<ul>
<li><a href="http://www.linux-kvm.org/images/f/f3/01x08b-KVMGT-a.pdf">http://www.linux-kvm.org/images/f/f3/01x08b-KVMGT-a.pdf</a></li>
<li><a href="https://github.com/01org/KVMGT-qemu/blob/master/hw/display/vga-vgt.c">https://github.com/01org/KVMGT-qemu/blob/master/hw/display/vga-vgt.c</a> - the QEMU PCI device code.</li>
<li>Linux Kernel Patch - <a href="https://github.com/01org/KVMGT-kernel/commit/a17514b9696754f21d37c112139064a91a34a73e">https://github.com/01org/KVMGT-kernel/commit/a17514b9696754f21d37c112139064a91a34a73e</a></li>
<li>The LKML message with the detailed announcement - <a href="https://lwn.net/Articles/624516/">https://lwn.net/Articles/624516/</a></li>
</ul>
<div>
* <span style="background-color: yellow;">Intel GPU Tools</span></div>
<div>
A lot of useful tools. The most useful one IMHO is "intel_perf_counters", which shows compute unit, shader and video encoder utilization. There are other interesting tools like the register dumper.</div>
<div>
<br /></div>
<div>
<a href="http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/tree/tools">http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/tree/tools</a></div>
<div>
<br /></div>
<div>
It also has the Broadwell GPU assembler</div>
<div>
<a href="http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/tree/assembler">http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/tree/assembler</a></div>
<div>
<br /></div>
<div>
And the shaders in ugly assembly, similar to those in LibVA.</div>
<div>
<a href="http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/tree/shaders/gpgpu/gpgpu_fill.gxa">http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/tree/shaders/gpgpu/gpgpu_fill.gxa</a></div>
<div>
<br /></div>
<div>
I once wrote a tool complementary to the register dumper - it uploads a register dump back to the GPU registers. I used it to debug some issues with the eDP panel initialization in my laptop.</div>
<div>
<br /></div>
<div>
<a href="https://github.com/astarasikov/intel-gpu-dump-upload">https://github.com/astarasikov/intel-gpu-dump-upload</a></div>
<div>
<br /></div>
* <span style="background-color: yellow;">LunarGLASS is a collection of experimental GPU middleware</span>. As far as I understand it, it is an ongoing initiative to build prototype implementations of upcoming Khronos standards before trying to get them merged to mesa. And also to allow the code to be used with any proprietary GPU SW stack, not tied to Mesa. Apparently it also has the "GLSL" backend which should allow you to try out SPIR-V on most GPU stacks.<br />
<br />
The main site - <a href="http://lunarg.com/lunar-glass/">http://lunarg.com/lunar-glass/</a><br />
The SPIR-V frontend source code - <a href="https://github.com/LunarG/LunarGLASS/tree/master/Frontends/SPIRV">https://github.com/LunarG/LunarGLASS/tree/master/Frontends/SPIRV</a><br />
The code to convert LunarG IR to Mesa IR - <a href="https://github.com/LunarG/LunarGLASS/blob/master/mesa/src/mesa/program/ir_to_mesa.cpp">https://github.com/LunarG/LunarGLASS/blob/master/mesa/src/mesa/program/ir_to_mesa.cpp</a><br />
<br />
* <span style="background-color: yellow;">Lima driver project</span>. It was a project for reverse-engineering the ARM Mali driver. The authors (<a href="http://libv.livejournal.com/">Luc Verhaegen aka "libv"</a> and Connor Abbot) have decoded some part of the initialization routine and ISA. Unfortunately, the project seems to have stopped due to legal issues. However, it has significant impact on the open-source GPU drivers, because Connor Abbot went on to implement some of the ideas (namely, the NIR - "new IR") in the Mesa driver stack.<br />
<br />
Some interesting artifacts:<br />
The place where the Geometry Processor data (attributes and uniforms) are written to the GPU physical memory:<br />
<a href="https://github.com/limadriver/lima/blob/master/limare/lib/gp.c#L333">https://github.com/limadriver/lima/blob/master/limare/lib/gp.c#L333</a><br />
"Render State" - a small structure describing the GPU memory, including the shader program address.<br />
<a href="https://github.com/limadriver/lima/blob/master/limare/lib/render_state.h">https://github.com/limadriver/lima/blob/master/limare/lib/render_state.h</a><br />
<br />
* <span style="background-color: yellow;">Etnaviv driver project</span>. It was a project to reverse-engineer the Vivante GPU used by Freescale I.MX SoC series. It was done by <a href="https://github.com/laanwj">Wladimir J. van der Laan</a>. The driver has reached the mature state, successfully running games like Quake. Now, there is a version of Mesa with this driver built-in.<br />
<br />
<a href="https://github.com/etnaviv/etna_viv">https://github.com/etnaviv/etna_viv</a><br />
<a href="https://github.com/laanwj/etna_viv">https://github.com/laanwj/etna_viv</a><br />
<a href="https://github.com/etnaviv/mesa">https://github.com/etnaviv/mesa</a><br />
<a href="http://wiki.wandboard.org/index.php/Etna_viv">http://wiki.wandboard.org/index.php/Etna_viv</a><br />
<br />I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-65614307448030281652016-01-18T22:16:00.000+03:002016-01-18T22:30:14.503+03:00Abusing QEMU and mmap() or Poor Man's PCI Pass-Through<h4>
The problem.</h4>
Another tale from the endless firmware endeavours.<br />
The other day I ended up in a weird situation. I was locked in a room with three firmwares - one of them was built by myself from source, while the other two came in binary form from the vendor. Naturally the one I built was not fully working, and one of the binaries worked fine. So I set out on a quest to trace all PCI MMIO accesses.<br />
<br />
When one thinks about it, several solutions come to mind. The first idea is to run a QEMU/Xen VM and pass through the PCI device. Unfortunately, the board we had did not have an IOMMU (or VT-d in Intel terminology). The board had an Intel BayTrail SoC, which is basically a CPU for phones.<br />
Another option would be to binary-patch the prebuilt firmware, introducing a set of wrapper functions around the MMIO access routines with corresponding printf() calls for debugging, but this approach is very tedious and error-prone, and I try to avoid patching binaries because I'm extremely lazy.<br />
<br />
So I figured the peripherals of interest (namely, the I2C controller and a NIC card) were simple devices configured by simple register writes. The host did not pass any shared memory to the device, so I used a simple technique. I patched QEMU and introduced "fake" PCI devices with the VID and PID of the devices I wanted to "pass through". By default, MMIO reads return zero (or some other default value) while writes are ignored. Then I went on and added the real device access. Since I was running QEMU on Linux, I just blacklisted and rmmod'ed the corresponding drivers. Then I used "<span style="background-color: yellow;">lspci</span>" to find the addresses of the PCI BARs and the <span style="background-color: yellow;">mmap() </span>syscall on <span style="background-color: yellow;">"/dev/mem"</span> to access the device memory.<br />
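<div>
A minimal sketch of the forwarding part (the physical address and size are placeholders taken from lspci output; error handling trimmed):</div>
<pre>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define BAR_PHYS 0xd0000000UL /* hypothetical BAR address from lspci */
#define BAR_SIZE 0x10000UL

static volatile uint32_t *bar;

static int bar_map(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;
    bar = mmap(NULL, BAR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, BAR_PHYS);
    return bar == MAP_FAILED ? -1 : 0;
}

/* called from the fake QEMU device's MMIO read/write callbacks */
static uint32_t bar_read32(uint64_t offset)
{
    uint32_t val = bar[offset / 4];
    fprintf(stderr, "MMIO read  [%08lx] = %08x\n", (unsigned long)offset, val);
    return val;
}

static void bar_write32(uint64_t offset, uint32_t val)
{
    fprintf(stderr, "MMIO write [%08lx] = %08x\n", (unsigned long)offset, val);
    bar[offset / 4] = val;
}
</pre>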
<br />
<h4>
A picture is worth a thousand words</h4>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrZdDk6WyA6dNHGG9VgqAs_eYCf4F68cFAuomQSMMNtYc3hV75oavq9uCaQANSqonacXB8ZX5aXwsphlItwKa4n_QeSO9Pisb0I-i3NwUSxJtvpOZFz1SdWOWH6neqDuFBFBesUgWiA18/s1600/DSC_0694.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrZdDk6WyA6dNHGG9VgqAs_eYCf4F68cFAuomQSMMNtYc3hV75oavq9uCaQANSqonacXB8ZX5aXwsphlItwKa4n_QeSO9Pisb0I-i3NwUSxJtvpOZFz1SdWOWH6neqDuFBFBesUgWiA18/s400/DSC_0694.JPG" width="400" /></a></div>
<br />
<h4>
The code.</h4>
The code is quite funky, but thanks to the combination of many factors (X86 being a strongly ordered architecture and the code using only 4-byte transfers) the naive implementation worked fine.<br />
1. I can now trace all register reads and writes<br />
2. I have verified that the "good" firmware still boots and everything is working fine.<br />
So, after that I simply ran all three firmwares and compared the register traces, only to find out that the problems might be unrelated to driver code (which is a good result because it allowed me to narrow down the amount of code to explore).<br />
<br />
The branch with my customized QEMU is here. Also, I figured QEMU now has the ultra-useful option called "<span style="background-color: yellow;">readconfig</span>" which allows you to supply the board configuration in a text file and avoid messing with cryptic command line switches for enabling PCI devices or HDDs.<br />
<a href="https://github.com/astarasikov/qemu/tree/our_board">https://github.com/astarasikov/qemu/tree/our_board</a><br />
<a href="https://github.com/astarasikov/qemu/commit/2e60cad67c22dca750852346916d6da3eb6674e7">https://github.com/astarasikov/qemu/commit/2e60cad67c22dca750852346916d6da3eb6674e7</a><br />
<br />
The code for mmapping the PCI memory is here. It is a marvel of YOLO-based design, but hey, X11 does the same, so it is now industry standard and unix-way :)<br />
<a href="https://github.com/astarasikov/qemu/commit/2e60cad67c22dca750852346916d6da3eb6674e7#diff-febf7a335ad7cd658784899e875b5cc7R31">https://github.com/astarasikov/qemu/commit/2e60cad67c22dca750852346916d6da3eb6674e7#diff-febf7a335ad7cd658784899e875b5cc7R31</a><br />
<br />
<h4>
Limitations.</h4>
There is one obvious limitation, which is also the reason QEMU generally does not support pass-through of MMIO devices on systems without an IOMMU (like VT-d or SMMU).<br />
<br />
Imagine a case where the device is not only configured via simple register transfers (such as setting certain bits to enable IRQ masking or clock gating), but system memory is passed to the device via storing a pointer in a register.<br />
<br />
The problem is that the address coming from the guest VM (GPA, guest physical address) will not usually match the corresponding HPA (host physical address). The device, however, operates on bus addresses and is not aware of the MMU, let alone VMs. An IOMMU solves this problem by translating the addresses coming from the device, and the hypervisor can supply a custom page table so that the GPAs programmed by the guest end up pointing at the correct host memory.<br />
<br />
So if one would want to provide a pass-through solution for sharing memory in systems without IOMMU, one would either need to ensure that the VM and guest have 1:1 GPA/HPA mappings for the addresses used by the device or come up with an elaborate scheme which would detect register writes containing memory buffers and set up a corresponding buffer in the host memory space, but then one will need to deal with keeping the host and guest buffers consistent. Besides, that would require the knowledge of the device registers, so it would be non-trivial to build a completely generic, plug-and-play solution.I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-18523445250069432542015-11-08T06:32:00.001+03:002015-11-10T21:19:45.483+03:00Porting legacy RTOS software to a custom OS.So lately I've been working on porting a large-sized application from a well-known RTOS to a custom kernel. The RTOS is VxWorks and the application is a router firmware. Unfortunately I will not be able to share any code of the porting layer. I'll just briefly go over the problems I've endeavoured and some thoughts I have come up with in the process.<br />
<br />
The VxWorks compiler (called "diab") is like many other proprietary C compilers based on the Edison frontend. Unlike GCC or clang, by default it is quite lax when it comes to following the standards. Besides, the original firmware project uses the VxWorks Workbench IDE (which is of course a fork of Eclipse) and runs on Windows.<br />
<br />
The first thing I had to do was to convert the project build system to Linux. The desirable way would be to write a parser that would go over the original build system and produce a Makefile. The advantage of this approach would be the ability to instantly obtain a new Makefile when someone changes the original project code. However, during the prototyping stage it makes a lot of sense to take shortcuts wherever possible, so I ended up writing a tiny wrapper program to replace the compiler and linker binaries. It would intercept its arguments, write them out to a log file and act as a proxy launching the original program. After that it was just a matter of making minor changes to the log file to convert it to a bash script and later on to a Makefile.<br />
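<div>
The wrapper itself is trivial - a sketch along these lines (REAL_CC and the log path are placeholders, not the actual paths I used):</div>
<pre>
#include <stdio.h>
#include <unistd.h>

#define REAL_CC "/opt/vxworks/bin/dcc.real" /* hypothetical real compiler */

int main(int argc, char **argv)
{
    /* append the full command line to a log, then run the real compiler */
    FILE *log = fopen("/tmp/build.log", "a");
    if (log) {
        for (int i = 0; i < argc; i++)
            fprintf(log, "%s ", argv[i]);
        fprintf(log, "\n");
        fclose(log);
    }
    argv[0] = REAL_CC;
    execv(REAL_CC, argv);
    return 127; /* exec failed */
}
</pre>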
<br />
As a consequence of the development being done on Windows, the majority of files have the same set of defects that prevent them from being compiled by a Linux-based compiler: paths in "#include" directives often have incorrect case, backslashes and so on. Worse still, some of the "#include" directives included multiple files in one set of brackets. Luckily, this was easy to fix with a script that parsed the compilation log for "file not found" errors, looked for the corresponding file ignoring case and fixed up the source code. In the end I was left with about a dozen places that had to be fixed manually.<br />
<h3>
Implementing the compatibility layer.</h3>
I did some quick research into the available options and saw that the only up-to-date solution implementing the VxWorks API on other OSes is "Xenomai". However, it is quite intrusive because it relies on a loadable kernel module and some patches to the Linux kernel. Since we were not interested in realtime behaviour but wanted to run on both our OS and Linux, entirely in userspace, I decided to write yet another VxWorks emulation layer.<br />
The original firmware comes as a single ELF file which is reasonable because in VxWorks all processes are implemented as threads in a shared address space. Besides, VxWorks provides a POSIX-compatible API for developers. So in order to identify which functions needed implementation it was enough to try linking the compiled code into a single executable.<br />
<br />
"One weird trick" useful for creating porting layers and DDEKits is the GCC/clang option "include" which allows you to prepend an include to absolutely all files compiled. This is super useful. You can use such an uber-header to place the definitions of the data types and function prototypes for the target platform. Besides, you can use it to hook library calls at compile-time.<br />
<br />
One of the problems that took a great amount of time was implementing synchronization primitives, namely mutexes and semaphores. The tricky semantic difference between semaphores and mutexes in VxWorks is that the latter are recursive: once a thread has acquired a mutex, it is allowed to lock it any number of times as long as the lock/unlock count is balanced.<br />
Before I realized this semantic difference, I couldn't figure out why the software would always lock up, and disabling locking altogether led to totally random crashes.<br />
<br />
Ultimately I became frustrated and ended up with a simple implementation of a recursive mutex that allowed me to move much further (<a href="https://gist.github.com/astarasikov/fd75fa7e706cb6337e60">Simple Recursive Mutex Implementation @ github</a>). Later, for debugging purposes, I also added options to print a backtrace indicating the previous lock owner when entering the critical section or when the spinlock took too many attempts.<br />
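<div>
The gist above has the details; the core idea is roughly the following sketch built on pthreads (pthreads also offers PTHREAD_MUTEX_RECURSIVE, which is the simpler route on plain Linux):</div>
<pre>
#include <pthread.h>

struct rec_mutex {
    pthread_mutex_t lock;   /* initialize with PTHREAD_MUTEX_INITIALIZER */
    pthread_cond_t cond;    /* initialize with PTHREAD_COND_INITIALIZER */
    pthread_t owner;
    int depth;              /* 0 means unowned */
};

void rec_mutex_lock(struct rec_mutex *m)
{
    pthread_mutex_lock(&m->lock);
    if (m->depth && pthread_equal(m->owner, pthread_self())) {
        m->depth++;                 /* re-entry by the current owner */
    } else {
        while (m->depth)            /* wait until the mutex is free */
            pthread_cond_wait(&m->cond, &m->lock);
        m->owner = pthread_self();
        m->depth = 1;
    }
    pthread_mutex_unlock(&m->lock);
}

void rec_mutex_unlock(struct rec_mutex *m)
{
    pthread_mutex_lock(&m->lock);
    if (--m->depth == 0)
        pthread_cond_broadcast(&m->cond);
    pthread_mutex_unlock(&m->lock);
}
</pre>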
<h3>
Hunting for code defects</h3>
<h4>
Uninitialized variables and what can we do about them</h4>
<div>
One problem I came across was that the code had a lot of uninitialized variables, hundreds of them. On the one hand, the proper solution is to inspect each case manually and write the correct initializer. On the other hand, the code works when compiled with the diab compiler so it must zero-initialize them.</div>
<div>
So I went ahead and wrote a clang rewriter plugin to add the initializers: zero for the primitive types and a pair of curly braces for structs (<a href="https://github.com/astarasikov/my-llvm-plugins/tree/master/clang_add_initializers">clang rewriter to add initializers</a>). However, I realized that the biggest problem is that some functions use the convention of returning zero to indicate failure while others return a non-zero value. This means we cannot have a generic safe initializer that would make the code take the "fault" path when it reaches the rewritten code. An alternative to manual inspection could be writing sophisticated rules for the rewriter to detect the convention used.</div>
<div>
<br /></div>
<div>
I ended up using valgrind and manually patching some warnings. AddressSanitizer was also useful. However, fixing each warning and creating a blacklist is too tiresome, so I ended up setting a breakpoint on the "__asan_report_error" function and writing a script that would make gdb print a backtrace, return and continue execution.</div>
<h4>
Duplicate structures</h4>
<div>
One problem I suspected could be present in the code (due to the deep hierarchy of #ifdefs) is structures with the same name but different contents. I made up an example C program to demonstrate the effect: the compiler does not warn about the type mismatch, but at runtime the code silently corrupts memory.</div>
<div>
<br /></div>
<div>
<a href="https://gist.github.com/astarasikov/d57f6b942da9a304ff05">An example of duplicate struct definition</a></div>
<div>
<br /></div>
<div>
I figured out an easy way of dealing with the problem: I used clang to emit LLVM bitcode for each file instead of object files with machine code, then linked them together into a single bitcode file and disassembled it with llvm-dis.</div>
<div>
<br /></div>
<div>
The nice thing about LLVM is that when it finds two structures having the same name but declared differently, it appends a different numeric suffix to each struct name. Then one can just remove the suffixes and look for unique lines with different structure declarations.</div>
<div>
<br /></div>
<div>
Luckily for me, there was only one place where I suspected an incorrect definition, and it was not in the part of the code I was executing, so I ruled out this option as a source of incorrect behavior.</div>
<h3>
Further work</h3>
<h4>
Improving tooling for the 32-bit world</h4>
It is quite unfortunate, but the project I have been working on is 32-bit, and one cannot simply convert it into a 64-bit one with a compiler flag. The problem is that the code has quite a lot of places where pointers are for some reason stored in packed structures, and some other structures implicitly rely on the structure layout, so it is very difficult to modify this fragile mess.<br />
It is sad that two great tools, MemorySanitizer and ThreadSanitizer, are not available for 32-bit applications. It is understandable, because the 32-bit address space is too tiny to fit the shadow memory. I am thinking of ways to make them available in the 32-bit world. So far I see two ways of solving the problem.<br />
First, we can use the fragile and non-portable mmap flag (currently only supported on Linux) to force allocations below the 4 GiB limit. Then one could write an ELF loader that loads all the code below 4 GiB and uses the upper memory range for shadows. Besides being non-portable, the other disadvantages of this approach include having to deal with 64-bit identifiers such as file handles.<br />
Alternatively, we could store the shadow in a separate process and use shared memory or sockets for communication. That would likely be at least an order of magnitude slower than using the corresponding sanitizer in a 64-bit world, but probably still faster than valgrind, and besides, it is compile-time instrumentation with more internal data from the compiler.<br />
<h4>
Verifying the porting was done correctly</h4>
Now I am left with one challenging task: verify the ported SW is identical to what could be built using the original build system.<br />
<br />
One may notice that simply intercepting the calls to the compiler may not be enough, because the build system may copy files or export some shell variables during the build process. Besides, different compilers have different ways of handling "inline" and some other directives. It would be good to verify that the call graphs of the original binary and the one produced by our Makefile are similar (of course, we will need to mark some library functions as leaf nodes and not analyze them). For a start I could try inspecting some of the unresolved symbols manually, but I'm thinking of automating the process. I think for this task I'll need a decompiler that can identify basic blocks. Probably the Capstone engine should do the job. Any ideas on that?<br />
<br />
P.S. Oh, and I once tried visualizing the dependency graph of separate ".o" files (before I realized I could just link them all together and get the list of missing symbols), and trust me, those graphs grow really fast. I have found out that <a href="http://gephi.github.io/">a tool called "gephi"</a> does a decent job at visualizing really huge graphs and supports Graphviz's dot as the input format.<br />
<br />
EDIT 2015-11-10<br />
The challenge is complicated by the fact that there some subprojects have multiple copies (with various changes) and one should also ensure that the headers and sources are picked up from the correct copy. However, I've found an easy and acceptable solution. I just wrote a tool that parses the call graph generated by IDA and for every edge in the graph it looks up the function names of the corresponding vertices. Then it just prints a list of pairs "A -> B" for every function A calling function B. After that, we can sort the file alphabetically and remove the nodes that are uninteresting to us (the OS and library functions). Next, we can compare the files side-by-side with a tool like kdiff3 (or we can do it automatically). Whenever there is a significant difference (for example, 5 callees are different for the same caller), we can inspect manually and verify we're compiling the correct file with the correct options. Using this method I have identified several places where we chose the wrong object file for linking and now we're only concerned with the porting layer and OS kernel, without having to worry about the application itself.I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-38283972283318307832015-05-20T02:24:00.001+03:002015-05-20T02:24:08.801+03:00on making a DDE kit, kindaSo I have this task of porting a huge piece of software running on a proprietary OS to another OS.
And I don't even have a clue how to compile it (well I do but it builds on windows so it's almost irrelevant).<br />
<br />
But luckily all code is linked into a single ELF file and the compilation produces intermediate object files.
The first thought I had was to visualize the dependency graph of object files to find out who calls what. You can find the script below; it recursively walks the supplied directory and tries to parse the import/export table with objdump. There are some areas for improvement (for example, also parsing the dynsym table with -T or parsing .a archives), but it did the job for me.<br />
<br />
Unfortunately, I realized that visualizing a graph with 30K edges was not even remotely a smart idea.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgb53Ue1CBbwLpOvJE5mJxM8xBn6RSII3DrHaR8Zl0LN0mFVYA2X-xy2PJbkE56zb-ME3NQYH0V9FFE3-9WJ67yVOb6XFXy2K8Ngf8pfmIdKKMO_v3UIG2lp-D6tCpu6atZTonMH7LL1VU/s1600/star.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgb53Ue1CBbwLpOvJE5mJxM8xBn6RSII3DrHaR8Zl0LN0mFVYA2X-xy2PJbkE56zb-ME3NQYH0V9FFE3-9WJ67yVOb6XFXy2K8Ngf8pfmIdKKMO_v3UIG2lp-D6tCpu6atZTonMH7LL1VU/s400/star.png" /></a></div>
https://gist.github.com/astarasikov/355bf825f130fe4b5633<br />
<br />
What I also found out was that the OS-specific code and objects were stored in a separate location (since they were part of an SDK). Even if they were not, we could just remove the object files that were present both in our project and in the SDK. After that, all the functions that the application required from the OS unsurprisingly ended up in the "UNDEFINED" node, and there were only 200 of them, which gives me some hope.<br />
<br />
This approach can also be used for other use-cases. For example, porting drivers from Linux/FreeBSD to exotic platforms - first build the binaries, then pick as many of them as you can to minimize the required functions list. I find dealing with compiled code easier because build systems, C macros and ifdefs just drive me insane.I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-44513926889935095802015-05-15T08:27:00.003+03:002015-05-15T08:27:57.537+03:00GCOV is amazing yet undocumentedOne useful technique for maintaining software quality is code coverage. While routinely used by high-level developers it is completely forgotten by many C hackers, especially when it comes to kernel. In fact, Linux is the only kernel which supports being compiled with the GCOV coverage tool.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDjFA-Hw00lpStrlaHdqqAcgHJMaMtzb7hhVzOlM95gB_MASzCVUJoUFF5f72fCwvaMuMgmEYXQ2D-Wna8OuRGmB6xp_JqbEb8CpQ-Zk0CxaJeqxfX5MSTWheYBi7D8sC2RtU6-lXW7Yo/s1600/kcov.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="355" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDjFA-Hw00lpStrlaHdqqAcgHJMaMtzb7hhVzOlM95gB_MASzCVUJoUFF5f72fCwvaMuMgmEYXQ2D-Wna8OuRGmB6xp_JqbEb8CpQ-Zk0CxaJeqxfX5MSTWheYBi7D8sC2RtU6-lXW7Yo/s1600/kcov.png" width="400" /></a></div>
<br />
<br />
GCOV works by instrumenting your code. It inserts some code to increment the stats counters around each basic block. These counters reside in a special section of your ELFs. During compilation, GCC generates a ".gcno" file for each ".c" file. These files allow the "gcov" tool to look up function names and other info using the integer IDs (which are specific to each file).<br />
<br />
At runtime, an executable built with GCOV produces ".gcda" files which contain the values of the counters. During executable initialization, constructors (which are function pointers in the ".ctors" section) are called. One of them is "__gcov_init", which registers a certain ".o" file inside libgcov.<br />
<br />
Since we're running in a kernel or on "bare metal", we have neither libgcov nor a file system to dump the ".gcda" files to. But one should not fear GCOV! Adding it to your kernel is just a matter of adding one C file (which is mostly shamelessly copy-pasted from Linux kernel and GCC sources) and a couple of CFLAGS. In my example I'm using the LK kernel by Travis Geiselbrecht (https://github.com/travisg/lk). I decided to just print out the contents of the ".gcda" files to the serial console (UART) as a hex dump and then use an AWK script and the "xxd" tool to convert them to binaries on the host. This works completely fine since these files are typically below 2KB in size.<br />
<br />
An important thing to notice: if your kernel does not contain the ".ctors" section and does not call the constructors, be sure to add them to the ld script and add some code to invoke them. For example, here's how LK does that: https://github.com/travisg/lk/blob/master/top/main.c#L42<br />
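<div>
For reference, a minimal sketch of invoking the constructors from a bare-metal kernel, assuming the linker script exports __ctor_list/__ctor_end symbols around the .ctors section (symbol names vary between projects):</div>
<pre>
/* Symbols assumed to be defined by the linker script around .ctors */
extern void (*__ctor_list[])(void);
extern void (*__ctor_end[])(void);

static void call_constructors(void)
{
    /* this invokes __gcov_init for every instrumented object file */
    for (void (**ctor)(void) = __ctor_list; ctor != __ctor_end; ctor++)
        (*ctor)();
}
</pre>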
<br />
You can see the whole patch below.<br />
https://github.com/astarasikov/lk/commit/2a4af09a894194dfaff3e05f6fd505241d54d074<br />
<br />
After running the "gcovr" tool you can get a nice HTML with summary and see which lines were executed and which were not and add the tests for the latter. Woot!<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlpnjS0Wo6a_jHXqxL0eWztAhvJRjCIKeZUj9Q3si3kUMUtMurE-bn0aYk_ZTVduBWSbk1AdYZGX3arfO7ijb9XzgW5n-9MSZKExiO7Hc42_zgdgEaPDSXOlQkCv5LK-gPgLjvs4PbGYk/s1600/Screen+Shot+2015-05-15+at+08.06.06.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlpnjS0Wo6a_jHXqxL0eWztAhvJRjCIKeZUj9Q3si3kUMUtMurE-bn0aYk_ZTVduBWSbk1AdYZGX3arfO7ijb9XzgW5n-9MSZKExiO7Hc42_zgdgEaPDSXOlQkCv5LK-gPgLjvs4PbGYk/s1600/Screen+Shot+2015-05-15+at+08.06.06.png" width="576" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-78045601397381462102014-10-24T23:16:00.001+04:002014-10-24T23:16:41.998+04:00abusing Mesa by hooking ELFs and ioctlAt work I had several occasions when I needed to hook a function on a Linux system. I will not go deep into technical details but the examples provided in the links contain the code which can be easily plugged into a project.<br />
<br />
<b><i>A simple case.</i></b><br />
First let us consider a case when we only need to replace a function in our application without replacing it in dynamic libraries loaded by the application.<br />
<br />
In Linux, functions in dynamic libraries are resolved using two tables: the PLT (procedure linkage table) and the GOT (global offset table). When you call a library function, you actually call a stub in the PLT. What the PLT stub essentially does is load the function address from the GOT. On x86_64, it does that with the "jmpq *" instruction (<span style="white-space: pre-wrap;">ff 25)</span>. Initially, the GOT contains a pointer back to the PLT, which invokes the resolver, which in turn writes the correct entry to the GOT.<br />
<br />
So, to replace a library function, we must patch the entry in the GOT. How do we find it? We can decode the offset to the GOT entry from the JMPQ instruction (the offset is relative to the end of the JMPQ instruction, i.e. the PLT stub address plus the instruction length). How do we find the PLT stub then? It's simply the function pointer - that is, in C, use the function name as a pointer.<br />
<br />
We should call "dlsym" to forcibly resolve the entry in GOT before replacing it. For example, if we want to swap two dynamic functions. A complete example is available as a github gist:<br />
<a href="https://gist.github.com/astarasikov/9547918">https://gist.github.com/astarasikov/9547918</a><br />
<br />
<b><i>A more advanced example</i></b><br />
The approach described above will not work when you want to replace a function that is used by some dynamically loaded binary since the library will have its own set of GOT and PLT tables.<br />
<br />
On X86_64, we can find the PLT table by examining the contents of the ELF file. The section we care about is called ".rela.plt".<br />
<br />
We need to examine the ELF header (Ehdr), find the section headers (Shdr) and locate the "dynsym" and ".rela.plt" tables. The dynsym table unsurprisingly contains the entries for dynamically resolved functions and can be used to find a function by name. You can find the link to the code below. I mmap'ed the library file to get a pointer to its contents and used it to read the data; some examples floating around the internet use "malloc", "read()" and "memcpy" calls for the same purpose. To get the pointer to the library in the application's virtual memory space, you can call "dlopen" and dereference the pointer returned by the function. You will need this address to convert the PLT offset into an address. Technically you can get away without reading/mmap'ing the library from disk, but since some sections are not mapped for reading, you will need to use "mprotect" to access them without receiving SIGSEGV.<br />
<br />
<a href="https://github.com/astarasikov/sxge/blob/vaapi_recorder/apps/src/sxge/apps/demo1_cube/hook-elf.c#L126">https://github.com/astarasikov/sxge/blob/vaapi_recorder/apps/src/sxge/apps/demo1_cube/hook-elf.c#L126</a><br />
<br />
<b><i>Alternatives</i></b><br />
The most straightforward alternative for such runtime patching is using the "LD_PRELOAD" variable and specify a library with the implementation of the hooked routine. The linker will then redirect all calls to that function in all libraries into the preloaded library. However, the obvious limitation of this approach is that it may be hard to get it working if you preload multiple libraries overriding the same symbol. It breaks some tools like "apitrace" (which is a tool to trace and replay OpenGL calls).<br />
<br />
<b><i>Practical example with Mesa</i></b><br />
Many GPUs nowadays contain encoders which will encode video to H.264 or other popular formats. Intel GPUs (namely HD4000 and Iris Graphics series) have an encoder for H.264. Some solutions like NVIDIA Grid utilize hardware capabilities to stream games over the network or provide the remote desktop facilities with low CPU load.<br />
<br />
Unfortunately, both the proprietary and the open-source drivers for the Intel hardware lack the ability to export an OpenGL resource into the encoder library, which is a desirable option for implementing a screen recorder (actually, the proprietary driver from the SDK implements the advanced encoding algorithm as a binary library but uses pre-compiled versions of vaapi and libdrm for managing resources).<br />
<br />
If we don't share a GL texture with the encoder, we cannot use the zero-copy path. The texture contents will need to be read with "glReadPixels()", converted to the NV12 or YUV420 format (though this can be done by a GLSL shader before reading the data) and re-uploaded to the encoder. This is not an acceptable solution since each frame will take more than 30ms and cause 80% CPU load, thus leaving no processing power for other parts of the application. On the other hand, using a zero-copy approach will allow us to have smooth 60FPS performance - the OpenGL side will not be blocked by "glReadPixels()" and CPU load will never exceed 10 percent.<br />
<br />
To be more precise, resource sharing support is present if one uses the EGL windowing system and one of the Mesa extensions. Technically, all GPU/encoder memory is managed by the DRM framework in the Linux kernel. There is an extension which allows obtaining a DRM handle (which is a uint32) from an OpenGL texture; it is used by Wayland in the Weston display server. An example of using VAAPI to encode a DRM handle can be found in the Weston source code:<br />
<a href="http://cgit.freedesktop.org/wayland/weston/tree/src/vaapi-recorder.c">http://cgit.freedesktop.org/wayland/weston/tree/src/vaapi-recorder.c</a><br />
<br />
Now, the problem is to find a handle of an OpenGL texture. Under EGL that can be done by creating an EGLImage from the texture handle and subsequently calling eglExportDRMImageMESA.<br />
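In code, that EGL path looks roughly like the sketch below (a hypothetical helper, assuming the EGL_KHR_gl_texture_2D_image and EGL_MESA_drm_image extensions are available; error handling omitted):<br />
<pre style="background-color: black; color: lime;">
#include <stdint.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GL/gl.h>

/* Hypothetical helper: wraps a GL texture into an EGLImage and asks Mesa
 * to export a DRM buffer name for it. */
static EGLint texture_to_drm_name(EGLDisplay dpy, EGLContext ctx, GLuint tex)
{
    PFNEGLCREATEIMAGEKHRPROC create_image =
        (PFNEGLCREATEIMAGEKHRPROC)eglGetProcAddress("eglCreateImageKHR");
    PFNEGLEXPORTDRMIMAGEMESAPROC export_image =
        (PFNEGLEXPORTDRMIMAGEMESAPROC)eglGetProcAddress("eglExportDRMImageMESA");

    EGLImageKHR img = create_image(dpy, ctx, EGL_GL_TEXTURE_2D_KHR,
                                   (EGLClientBuffer)(uintptr_t)tex, NULL);

    EGLint name = 0, handle = 0, stride = 0;
    /* "name" is a global (flink) buffer name that can be passed to another
     * DRM client such as the VAAPI encoder. */
    export_image(dpy, img, &name, &handle, &stride);
    return name;
}
</pre>
<br />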
<br />
In my case the problem was that I didn't want to use EGL because it is quite difficult to port a lot of legacy GLX codebase to EGL. Besides, GLX support is more mature and stable with the NVIDIA binary driver. So the problem is to get a DRM buffer out of a GL texture in GLX.<br />
<br />
Unfortunately GLX does not provide an equivalent extension and implementing one for Mesa and X11 is rather complicated due to the complexity of the code. One option would be to link against Mesa and use its internal headers to reach for the needed resources as done by Beignet, the open-source OpenCL driver for Intel (<a href="http://cgit.freedesktop.org/beignet/tree/src/intel/intel_dri_resource_sharing.c">http://cgit.freedesktop.org/beignet/tree/src/intel/intel_dri_resource_sharing.c</a>) which is error-prone. Besides, setting up such a build environment is complicated.<br />
<br />
Luckily, as we've seen above, the memory is managed by DRM, which means every memory allocation is actually an ioctl call into the kernel. So we can hook ioctl(), grab the DRM handle and later use it as the input to VAAPI. We need to look for the DRM_I915_GEM_SET_TILING ioctl code.<br />
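A rough sketch of such a hook is shown below (assuming it is installed via LD_PRELOAD or the PLT patching described earlier; the i915_drm.h header comes from libdrm and the global variable is just for illustration):<br />
<pre style="background-color: black; color: lime;">
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdarg.h>
#include <stdint.h>
#include <i915_drm.h> /* from libdrm, e.g. -I/usr/include/libdrm */

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static uint32_t g_last_bo_handle; /* later fed to the VAAPI side */

int ioctl(int fd, unsigned long request, ...)
{
    static int (*real_ioctl)(int, unsigned long, ...);
    if (!real_ioctl)
        real_ioctl = dlsym(RTLD_NEXT, "ioctl");

    va_list ap;
    va_start(ap, request);
    void *arg = va_arg(ap, void *);
    va_end(ap);

    /* Mesa issues this ioctl for the buffers backing GL textures,
     * so the GEM handle can be captured here. */
    if (request == DRM_IOCTL_I915_GEM_SET_TILING) {
        struct drm_i915_gem_set_tiling *tiling = arg;
        pthread_mutex_lock(&g_lock);
        g_last_bo_handle = tiling->handle;
        pthread_mutex_unlock(&g_lock);
    }

    return real_ioctl(fd, request, arg);
}
</pre>
<br />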
<br />
The obvious limitation is that you will need to store a global reference written by the ioctl hook, which makes the code thread-unsafe. Luckily most OpenGL apps are single-threaded, and even when they're not, there are fewer than a dozen threads which can access OpenGL resources, so lock contention is not an issue and a pthread mutex can be used.<br />
<br />
Another limitation is that to avoid memory corruption or errors we need to carefully track the resources. One solution would be to allocate the OpenGL texture prior to using VAAPI and deallocate it after the encoder is destroyed. The other is to hook more DRM calls to get a pointer to the "drm_intel_bo" structure and increase its reference count.<br />
<br />
<a href="https://github.com/astarasikov/sxge/blob/vaapi_recorder/apps/src/sxge/apps/demo1_cube/demo1_cube.cc#L119">https://github.com/astarasikov/sxge/blob/vaapi_recorder/apps/src/sxge/apps/demo1_cube/demo1_cube.cc#L119</a>I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-91988639997153508032014-09-07T21:57:00.000+04:002014-09-08T00:34:19.965+04:00GSoC results or how I nearly wasted summer<span style="font-family: Arial, Helvetica, sans-serif;">This year I got a chance to participate in the Google Summer of Code with the FreeBSD project.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">I have long wanted to learn about the FreeBSD kernel internals and port it to some ARM board but was getting sidetracked whenever I tried to do it. So it looked like a perfect chance to stop procrastinating.</span><br />
<span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif;">Feel free to scroll down to the "other stuff" paragraph if you don't feel like reading thre paragraphs of a typical whinig associated with debugging C code.</span><br />
<span style="color: #222222; font-family: Arial, Helvetica, sans-serif;"><br /></span><span style="font-family: Arial, Helvetica, sans-serif;"><i><span style="background-color: purple; color: yellow;">Initially</span></i> I wanted to bring up some real piece of hardware. My choice was the Nexus 5 phone which has Qualcomm SoC. However, some of the FreeBSD developers suggested that instead I port the kernel to Android Emulator. It sounded like a sane idea since Android Emulator is available for the major OSes and is easy to set up which will allow to expose more people to the project. Besides, unlike QEMU, it can emulate some peripherals specific to a mobile device such as sensors.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /><i><span style="background-color: purple; color: yellow;">What is Android Emulator?</span></i> In fact, it is an emulator based on QEMU which emulates a virtual board called "Goldfish" which includes a set of devices such as interrupt controller, timer, input devices. It is worth noting that Android Emulator is based on an ancient version of QEMU though starting with Android L one can obtain the binaries of the emulator from git which is based on a more recent build.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br />I have started from implementing the <span style="background-color: purple;"><span style="color: yellow;">basics: the interrupt controller, the timer and the UART</span></span> driver. That has allowed me to see the boot console but then I got stuck for literally months with kernel crashing at various points in what seemed like a totally random fashion. I have spent nights single-stepping the kernel with my sight slowly turning from barely red eyes to emitting hard radiation. It was definitely a problem with caching but ultimately I gave up trying to fix it. Since I knew that FreeBSD kernel worked fine on ARM in a recent version of QEMU, it was clearly a bug in Android Emulator, even though it has not manifested itself in Linux. Since Android Emulator is going to eventually get updated to a new QEMU base and I was running out of time, I decided to just use a nightly build of emulator instead of the one coming with the Android SDK and that fixed this particular issue. </span><span style="font-family: Arial, Helvetica, sans-serif;">But hey! At least I've learnt about the FreeBSD virtual memory management and the UMA allocator the hard way.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br />Next up was fighting the timer and random freezes while initializing the random number generator (sic!). That was easy - turns out I just forgot to call the function. Anyway it would make sense to move that call out of the board files and call it for every board that does not define the "NO_EVENTTIMERS" option in the kernel config.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">Fast forward to the start of August when there are only a couple days left and I still don't have a working block device driver to boot rootfs! Writing a MMC driver turned out to be very easy and I started celebrating when I got the SD card image from Raspberry PI booting in Android Emulator and showing the login prompt.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">It worked but with a <span style="background-color: purple; color: yellow;">huge kludge</span> in the OpenFirmware bindings. Somehow one of the functions which should've returned the driver bus name returned (seemingly) random crap so instead of a NULL pointer check I had to add a check that the address points to the kernel VM range. I have then tracked down the real source of the problem.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">I was basing my MMC driver on the LPC SoC driver. LPC driver itself is suspicious - for example, the place where DMA memory is allocated says "allocating Framebuffer memory" which may be an indicator that it was written in a hurry and possibly only barely tested. At the MMC bus attach routine, it was calling the "<span style="line-height: 16.799999237060547px; white-space: pre;">device_set_ivars" function and setting the ivars pointer to the mmc host structure. I have observed the similiar pattern in some other drivers. In the OpenFirmware code, the ivars structure was dereferenced as a completely other structure and a string was taken from a member of this structure.</span></span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><span style="line-height: 16.799999237060547px; white-space: pre;"><br /></span></span>
<span style="font-family: Arial, Helvetica, sans-serif;"><span style="line-height: 16.799999237060547px; white-space: pre;"><span style="background-color: purple; color: yellow;">How the hell did it happen</span> in a world where people are using the recent Clang compiler and compile with "-Wall" and "-Werror"? Uhh, well, if you're dealing with some kind of OOP in plain C, casting to void pointer and back and other ill-typed wicked tricks are inevitable. Just look at that prototype and cry with me:</span></span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><span style="line-height: 16.799999237060547px; white-space: pre;"><br /></span></span>
<div style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: x-small;">>> device_set_ivars<span style="background-color: white;">(</span>device_t<span style="background-color: white;"> </span>dev<span style="background-color: white;">, </span>void<span style="background-color: white;"> </span>*ivar<span style="background-color: white;">);</span></span></div>
<span style="font-family: Arial, Helvetica, sans-serif;"><span style="background-color: white;"><br /></span></span>
<span style="font-family: Arial, Helvetica, sans-serif;">For now, I'm pushing the changes to the following branch. Please note that I'm rebasing against master and force-pushing in process if you're willing to test.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">https://github.com/astarasikov/freebsd/tree/android_goldfish_arm_master</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">
So, what has been achieved and what am I still going to do? Well, debugging the MMU issues and the OpenFirmware bug has delayed me substantially, so I've not done everything I planned. Still:</span><br />
<ul>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Basic hardware works in Goldfish: Timer, IRQs, Ethernet, MMC, UART</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">It is using NEWCONS for Framebuffer Console (which is cool!)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">I've verified that Android Emulator works in FreeBSD with linuxulator</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">I have fixed Android Emulator to compile natively on FreeBSD and it mostly works (at least, console output) except graphics (which I think should be an easy fix. there are a lot of places where "ifdef __linux__" is completely unjustified and should rather be enabled for all unices)</span></li>
</ul>
<span style="background-color: purple; color: yellow; font-family: Arial, Helvetica, sans-serif;">How to test and contribute?</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">You can use the linux build of Android Emulator using Linuxulator. Unfortunately, linuxulator only supports 32-bit binaries so we cannot run nightly binaries which have the MMU issues fixed.</span><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span><span style="font-family: Arial, Helvetica, sans-serif;">Therefore, I have fixed the emulator to compile natively on FreeBSD! I've pushed the branch named "l-preview-freebsd" to github. You will also need the gtest. I cloned the repo from CyanogenMod and pushed a tag named "l-preview-freebsd", which is actually "cm-11.0".</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="background-color: purple; color: yellow; font-family: Arial, Helvetica, sans-serif;">Compiling the emulator:</span><br />
<span style="font-size: x-small;">>> git clone https://github.com/astarasikov/qemu.git <br />>> https://github.com/astarasikov/android_external_gtest <br />>> cd cd android_external_gtest <br />>> git checkout cm-11.0 <br />>> cd ../qemu.git <br />>> git checkout l-preview-freebsd <br />>> bash android-build-freebsd.sh</span><br />
<span style="font-size: x-small;"><br /></span>
<div style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif;">Please refer to <a href="https://github.com/astarasikov/freebsd/blob/android_goldfish_arm_master/sys/arm/goldfish/README">README file in git</a> for the detailed instructions on building kernel.</span></div>
<br />
<div>
<span style="color: #222222; font-family: Arial, Helvetica, sans-serif;"><span style="background-color: white;">The only thing that really bothers me about low-level and embedded stuff is that it is extremely difficult to estimate how long it may take to implement a certain feature because every time you end up debugging obscure stuff and the "useful stuff you learn or do" / "time wasted" ratio is very small. On a bright side, while *BSD lag a bit behind Linux in terms of device drivers and performance optimizations, reading and debugging *BSD code is much easier than GNU projects like EGLIBC, GCC and GLib.</span></span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br style="background-color: white; color: #222222;" /></span>
<span style="background-color: purple; color: yellow; font-family: Arial, Helvetica, sans-serif;">Other stuff.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><span style="background-color: white; color: #222222;">Besides GSoC at the start of summer I had a chance to work with Chris Wade (who is an amazing hacker, by the way). We started working on an ARM virtualization project and have spent nice days debugging caching issues and instruction decoding while trying to emulate a particular cortex A8 SoC on an A15 chip. Unfortunately working on GSoC, going to a daily job and doing this project simultaneously turned out to be surprisingly difficult and I had to quit at least one of the activities. Still I wish Chris good luck with this project and if you're interested in virtualization and iPhones, sign up for early demo at </span><a href="http://virtu.al/" style="background-color: white; color: #1155cc;" target="_blank">virtu.al</a></span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br style="background-color: white; color: #222222;" /></span>
<span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif;">Meanwhile I'm planning to learn ARMv8 ISA. It's a pity there is no hackable hardware available for reasonable prices yet. QEMU works fine with VirtIO peripherals though. But I'm getting increasingly worried about devkits costing more and more essentially making it more difficult for a novice to become an embedded software engineer.</span></div>
I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-36062275392431316382014-06-06T08:01:00.000+04:002014-06-06T08:01:18.624+04:00On Free Software and if proprietary software should be considered harmfulI communicate with a lot of people on the internet and they have various opinions on FOSS ranging from "only proprietary software written by a renowned company can deliver quality" to "if you're using binary blobs, you're a dick". Since these issues arise very often in discussions, I think I need to write it up so I can just shove the link next time.<br />
<br />
On the one hand, I am a strong proponent of free and open-source software and I have contributed to several projects related to porting Linux to embedded hardware, especially mobile phones (htc-linux.org, Replicant and I also consulted some of the FreeSmartPhone.org people). Here are the reasons I like free software (as well as free hardware and free knowledge):<br />
<br />
<ul>
<li>The most important for me is that you can learn a lot. I have mostly learnt C and subsequently other languages by hacking on the linux kernel and following interesting projects done by fellow developers</li>
<li>Practically speaking, it is just easier to maintain and develop software when you have the source code</li>
<li>When you want to share a piece of data or an app with someone, if you deal with closed software, you force them into buying a paid app which may compromise their security</li>
<li>You can freely distribute your contributions, your cool hacks and research results without being afraid of pursuit by some patent troll</li>
</ul>
But you know, some people are quite radical in their FOSS love. They insist that using anything non-free is a sin. While I highly respect them for their attitude, I have a different point of view and I want to comment on some common arguments against using or developing non-free software:<br />
<div>
<ul>
<li>"Oh, it may threaten your privacy, how can you run untrusted code"? My opinion here is that running untrusted firmwares or drivers for devices is not a big deal because unless you manufacture the SoC and all the peripherals yourself, you can not be sure of what code is running in your system. For example, most X86 CPUs have SMM mode with a proprietary hypervisor, most ARMs have TrustZone mode. If your peripheral does not require you to load the firmware, it just means that the firmware is stored in some non-volatile memory chip in hardware, and you don't have the chance to disable the device by providing it with a null or fake binary. On the other hand, if your device uses some bus like I2C or USB which does not have DMA capabilities or uses IOMMU to restrict DMA access, you are relatively safe even if it runs malicious code.</li>
<li>"You can examine open source software and find backdoors in it". Unfortunately this is a huge fallacy. First of all, some minor errors which can lead to huge vulnerabilities can go unnoticed for years. Recent discoveries in OpenSSL and GnuTLS came as a surprise to many of us. And then again, have you ever tried to develop a piece of legacy code with dozens of nested ifdefs when you have no clue which of them get enabled in the final build? In this case, analyzing disassembled binaries may even be easier.</li>
<li>"By developing or using non-free software you support it". In the long run it would be better for humanity to have all knowledge to be freely accessible. However, if we consider a single person, their priorities may differ. For example, until basic needs (which are food, housing, safety) are satisfied, one may resort to developing close-sourced software for money. I don't think it's bad. For me, the motivation is knowledge. In the end, even if you develop stuff under an NDA, you understand how it works and can find a way to implement a free analog. This is actually the same reason I think using proprietary software is not bad in itself. For example, how could you expect someone to write a good piece of free software - an OpenGL driver, an OS kernel, a CAD until they get deeply familiar with existing solutions and their limitations?</li>
</ul>
Regarding privacy, I am more concerned about the government and "security" agencies like NSA which have enough power to change laws and fake documents. Officials change, policy changes and even people who considered themselves safe and loyal patriots may be suddenly labeled traitors.</div>
<div>
<br /></div>
<div>
In the end, it boils down to individual people and communities. Even proprietary platforms like Palm, Windows Mobile or Apple iOS had huge communities of helpful people who were ready to develop software, reverse-engineer hardware and of course help novices. And there are quite some douchebags around free software. Ultimately, just find the people you feel comfortable around, it is all about trust.</div>
I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-83352849656340830092014-06-06T07:56:00.003+04:002014-06-06T21:07:17.985+04:00Minor notes about recent stuffI've been quite busy with work and university recently so I did not have much time to work on my projects or write rants, so I decided to roll out a short post discussing several unrelated ideas.<br />
<br />
<i><b style="background-color: lime;">On deprecating Linux (not really)</b></i><br />
<span style="background-color: white;">Recently several Russian IT bloggers have been writing down their ideas about what's broken with Linux and how it should be improved. Basically, one guy started by saying we need to throw away POSIX and write everything from scratch and the other two are trying to find a compromise. You may fire up Google Translate and follow the discussion at </span><a href="http://lionet.livejournal.com/131533.html">http://lionet.livejournal.com/131533.html</a><br />
<br />
I personally think that these discussions are a bit biased because all these guys are writing from the point of view of an engineer working on distributed web systems. At the same time, there are domains like embedded software, computer graphics, high-performance computations which have other challenges. What's common is that people are realizing <span style="background-color: cyan;">the obvious idea</span>: generic solutions designed to work for everyone (like POSIX) limit the performance and flexibility of your system, while domain-specific solutions make your job easier but they may not fit well into what other people are doing.<br />
<br />
Generally both in high-performance networking and graphics people are trying to do the same: reduce the number of context switches (it is true that on modern X86 a context switch is relatively cheap and we may use a segmented memory model like in Windows CE 5.0 instead of the common user/kernel split, but the problem is not the CPU execution mode switch but that system calls like "flush", "ioctl" and library calls like "glFlush()" are used as a point of synchronization where a lot of additional work is done besides the call itself) and move all execution into userspace. Examples include asynchronous network servers using epoll and cooperative scheduling, Intel's networking stack in user land (Netmap, DPDK), modern Graphics APIs (Microsoft DirectX 12, AMD Mantle, Apple Metal). The <span style="background-color: cyan;">cost of maintaining abstractions - managing buffers and contexts in drivers</span> - has risen so high that hardware vendors neither want to write complex drivers nor can they deliver performance. So, we'll have to step back and learn to use hardware wisely once again.<br />
<br />
Actually I like the idea of using <span style="background-color: cyan;">minimalistic runtimes on top of hypervisors</span> like Erlang on Xen from the point of simplicity. Still, security and access control are open questions. For example, capability-based security as in L4, has always looked interesting, but whether cheap system calls and dramatically reduced "trusted code base" promises have been fulfilled is very arguable. Then again, despite the complexity of linux, its security is improving because of control groups which are heavily utilized by docker and systemd distros. Another issue is that lightweight specific solutions are rarely reusable. Well, from the economic point of view a cheap solution that's done overnight and does its job is just perfect, but generally the amount of NIH and engineers basically writing the same stuff - drivers, applications, libraries and servers in an absolutely identical way but for a dozen identical OSs is amazing and rather uncreative.<br />
<br />
Anyway, it looks like we're there again: <a href="http://dreamsongs.com/RiseOfWorseIsBetter.html">rise of Worse is Better</a><br />
<br />
<i><b style="background-color: lime;">Work (vaapi, nix).</b></i><br />
At work, among other things, I was asked to figure out how to use the Intel GPU for H.264 video encoding. Turns out there are two alternatives: the open source VAAPI library and the proprietary Intel Media SDK. Actually, the latter still uses a modified version of VAAPI, but I feel that fitting it into our usual deployment routine is going to be hard, because even basic critical components of the driver stack, such as the kernel KMS module and libdrm.so, are provided in binary form.<br />
<br />
Actually VAAPI is very low-level. One thing that puzzled me initially is that it does not generate the H.264 stream headers for you - you have to pack them yourself and feed them to the encoder via a special buffer type. Besides, you have to manually take the reconstructed picture and feed it as a reference for subsequent frames. There are several implementations using this API for encoding: gstreamer, ffmpeg, vlc. I spent quite some time until I got it to encode a sample chunk of YUV data. Ultimately my code started looking identical to the "avcenc.c" sample from libva, except that the encoder data is stored in a "context" structure instead of global variables.<br />
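For illustration, submitting a hand-built SPS as a packed header looks roughly like this (a sketch assuming an already initialized VADisplay/VAContextID and a bitstream already packed into sps_data/sps_bits, similar to what avcenc.c does; the helper name is made up and return codes are ignored):<br />
<pre style="background-color: black; color: lime;">
#include <va/va.h>

static void submit_packed_sps(VADisplay dpy, VAContextID ctx,
                              void *sps_data, unsigned int sps_bits)
{
    VAEncPackedHeaderParameterBuffer params = {
        .type = VAEncPackedHeaderSequence,
        .bit_length = sps_bits,
        .has_emulation_bytes = 0,
    };
    VABufferID bufs[2];

    /* The driver does not build the H.264 headers: the application packs
     * the SPS/PPS/slice headers itself and passes them as "packed" buffers. */
    vaCreateBuffer(dpy, ctx, VAEncPackedHeaderParameterBufferType,
                   sizeof(params), 1, &params, &bufs[0]);
    vaCreateBuffer(dpy, ctx, VAEncPackedHeaderDataBufferType,
                   (sps_bits + 7) / 8, 1, sps_data, &bufs[1]);
    vaRenderPicture(dpy, ctx, bufs, 2);
}
</pre>
<br />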
<br />
My advice is that if you want to learn about video decoding APIs on linux and Android, take a look at Chromium source code (well, you may expect to find a solution for any relevant computer science or engineering problem given how much code it contains). And also take a look at GST, FFMPEG and VLC. Notice how each of them has its own way of managing buffers (and poor error handling btw).<br />
<ul>
<li><a href="http://src.chromium.org/svn/trunk/src/content/common/gpu/media">Chromium</a></li>
<li><a href="https://gitorious.org/vaapi/gstreamer-vaapi/source/e670e3600709602ddc6a9db651cc18e91dba4a09:gst/vaapi/gstvaapiencode_h264.c">GSTreamer</a></li>
<li><a href="https://gitorious.org/vaapi/gstreamer-vaapi/source/e670e3600709602ddc6a9db651cc18e91dba4a09:gst-libs/gst/vaapi/gstvaapiencoder_h264.c">GSTreamer (continued)</a></li>
<li><a href="https://github.com/BtbN/vlc-vaapi-enc/blob/master/vlc-h264-vaapi-enc.c">VLC (VideoLAN)</a></li>
<li><a href="https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/vaapi_h264.c">FFMpeg (libavcodec)</a></li>
</ul>
<div>
Another thing we're using at work is the Nix package manager. I had always wanted to try it but did not really do so until coming to work at this place. Nix is a fantastic concept (even <a class="g-profile" href="https://plus.google.com/114072895780946523032" target="_blank">+Norman Feske</a> got inspired by it for their Genode OS). However, we've noticed a slowdown when compiling software under Nix. Here are some of my observations:</div>
<div>
<ul>
<li>Compiling a simple C file with just "main" function takes 30ms in linux but >300ms in Nix chroot. Nix installs each library to a separate prefix and uses LDFLAGS and CFLAGS to direct the compiler to use them. Gcc then iterates over each of these directories trying to find each library which ends up being slow. <span style="background-color: cyan;">Anyone knows a workaround?</span></li>
<li>Running gcc under the "perf" profiler shows that it spends most of its time in multibyte string collation functions. I find it ridiculous that exporting "LC_ALL=C" magically makes the compilation time fall down to 100ms.</li>
</ul>
<div>
<b><i style="background-color: lime;">Hacking on FreeBSD</i></b></div>
</div>
<div>
As a part of my GSoC project I got the FreeBSD kernel booting on the Android emulator, and I just have to write the virtual ethernet and MMC drivers. Unfortunately my university has distracted me a lot from this project, but now that I have time I hope to finish it by midterm. And then I'll probably port FreeBSD to the Qualcomm MSM8974 SoC. Yay, the red Nexus 5 is an awesome phone!</div>
<div>
<br /></div>
<div>
<i><b style="background-color: lime;">My little project (hacking is magic)</b></i></div>
<div>
Long time ago I decided to give Windows Mobile emulation a shot and got the kernel to start booting in QEMU. Since Microsoft's Device Emulator was open-source and emulated a real Samsung S3C2410 SoC, it was easy. I still plan to get it fully working one day.</div>
<div>
<br /></div>
<div>
But QEMU is a boring target actually. What's more interesting is developing a bare-metal hypervisor for A15 and A7 CPUs. Given the progress made by Rob Clark on reverse-engineering the Qualcomm Adreno GPU driver, I think it would be doable with reasonable effort to emulate the GPU and consequently enough hardware to run Windows Mobile and Windows RT. A very interesting thing is that the ARM virtualization extensions can trap accesses to almost all coprocessor registers from the guest (privilege levels 0 and 1), meaning you can fake any CPU ID or change memory access behavior by modifying caching and buffering settings.</div>
<div>
<br /></div>
<div>
What is really interesting is that there are a lot of Chinese phones which copy Nokia smartphones, iPhones, Android phones. Recent revisions of Mediatek MTK SoCs are Arm A7 meaning they support virtualization. Ultimately it means someone could make a high quality replica of any phone and virtualize any SoC without a trace which has interesting security implications.</div>
<div>
<br /></div>
<div>
<b><i style="background-color: lime;">Software sucks!</i></b></div>
<div>
The other day some <span style="background-color: yellow;">systemd</span> update came out which totally broke my debian bootup. Now, there's a race condition between the encrypted disk password entry and the root password entry for "recovery". Then, the latest kernel (3.15-rc3) OOPSes and crashes randomly on ACPI events, which subsequently breaks networking and spawning new processes.</div>
<div>
<br /></div>
<div>
Ultimately, after a day of struggle my rootfs broke, and after an fsck everything was gone. So now I'm deciding what distro I should try instead. Maybe ArchLinux? Either way, I have to build a lot of software from source and install it to /usr/local or custom prefixes for development.</div>
<div>
<br /></div>
<div>
The easiest way would be to install into a VM in OS X. But then, I want to fix issues with drivers, especially GPU switching in the macbook. On the other hand, I spent ages fixing switchable graphics on an older Sony VAIO laptop and resorted to an ugly hack to force Intel GPU. Since GPU switching is not working properly in Linux, maybe I should write a graphics multiplexor driver for FreeBSD and ditch Linux altogether? FreeBSD looks much more attractive these days with clang and no systemd and no FreeDesktop.org and no Lennart Poettering.</div>
I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-80791223734045603912014-03-18T02:53:00.001+04:002014-03-18T02:53:19.039+04:00A story about clang toolsI've always wanted to try writing a toy compiler, but have not made myself actually learn the theory of parsing (I plan to do it and post some notes into the blog soon though). However, recently I've been playing with Clang and LLVM. I've not yet used it for compiling, but I want to share my experience of using and extending Clang's error detection tools.<br />
<br />
LLVM is a specification of platform-independent bytecode and a set of tools aimed to make the development of JITed interpreters and portable native binaries easier. A significant portion of work for LLVM was done by Apple and nowadays it is widely used in industry. For example, NVIDIA uses it for compiling CUDA code, AMD uses it to generate shaders in its open-source driver. Clang is a parser and a compiler for a set of C-like languages, which includes C, C++ and Objective-C. This compiler has several properties that may be interesting for developers:<br />
<ul>
<li>It allows you to traverse the parsed AST and transform it. For example, add or remove the curly brackets around if-else conditionals to troll your colleagues.</li>
<li>It allows you to define custom annotations via the __attribute__ extension which again can be intercepted after the AST is generated but is not yet compiled.</li>
<li>It supports nearly all the features of all revisions of C and C++ and is compatible with the majority of GCC compiler options which allows to use it as a drop-in replacement. By the way, FreeBSD has switched to Clang, and on Apple OS X gcc is actually a symlink to clang!</li>
</ul>
So, why should you care? Imagine how many times you wanted to add some cool feature but realized macros were not enough. Or you wanted to enforce some code convention? With clang one could easily write a static code analyzer that would catch the notorious Apple "goto fail" bug (the duplicated "goto fail;" line). Which makes me wonder why they did not use their own technology :)<br />
<br />
LLVM provides several frameworks for finding bugs at runtime. For example, AddressSanitizer and MemorySanitizer to catch access to uninitialized or unallocated memory.<br />
<br />
I was given the following interesting problem at work: build some solution that would allow to detect where the application is leaking memory. Sounds like a common problem, with no satisfying answer.<br />
<ul>
<li>Using Valgrind is prohibitively slow - a program running under it can be easily 50 times slower than without it.</li>
<li>Using named SLAB areas (like the linux kernel does) is not an option. First off, in the worst case using SLAB means only half of the memory is available for allocation. Secondly, such an approach tells you objects of what class are occupying the memory, but not where and why they were allocated</li>
<li>Using TCMalloc which hooks malloc/free calls also turned out to be slow enough to cause different behaviour in release and debugging environment, so some lightweight solution had to be designed.</li>
</ul>
<div>
Anyway, while thinking of a good way to do it, I found out that Clang 3.4 has something called LeakSanitizer (also lsan and liblsan) which is already ported to GCC 4.9. In short, it is a lightweight version of tcmalloc used in Google Perftools. It collects the information about memory allocations and prints leak locations when the application exits. It can use the LLVM symbolizer or GCC libbacktrace to print human-readable locations instead of addresses. However, it has some issues:</div>
<div>
<ul>
<li>It has an explicit check in the __lsan::DoLeakCheck() function which disallows it to be called twice. Therefore, we cannot use it to print leaks at runtime without shutting down the process</li>
<li>Leak detection cannot be turned off when it is not needed. Hooked malloc/memalign functions are always used, and the __lsan_enable/disable function pair only controls whether statistics should be ignored or not.</li>
</ul>
</div>
<div>
The first idea was to patch the PLT/GOT tables in ELF to dynamically choose between the functions from libc and lsan. It is a very dirty approach, though it will work. You can find a code example at https://gist.github.com/astarasikov/9547918.</div>
<div>
<br /></div>
<div>
However, by patching the GOT we only divert the functions for a single binary, and we'd have to patch the GOT of each loaded shared library, which is, well, boring. So, I decided to patch liblsan instead. I had to patch it either way, to remove the dreaded limitation in DoLeakCheck, and I figured it should be safe to do. Though there is a potential deadlock while parsing the ELF header (as indicated by a comment in the lsan source), you can work around it by disabling leak checking in global variables.</div>
<div>
<br /></div>
<div>
What I did was to set up a number of function pointers for the hooked functions, initialized with the lsan wrappers (to avoid false positives for memory allocations during libc constructors), and add two functions, __lsan_enable_interceptors and __lsan_disable_interceptors, to switch between the libc and lsan implementations. This should allow using leak detection for both our code and third-party loadable shared libraries. Since lsan does not have extra dependencies on clang/gcc, it was enough to add a new CMakeLists.txt and it can now be built standalone. So now one can load the library with LD_PRELOAD and query the new functions with "dlsym". If they're present, it is possible to selectively enable/disable leak detection; if not, the application is probably using vanilla lsan from clang.</div>
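<div>
The dispatch pattern is roughly the following (a simplified sketch, not the actual liblsan patch; "lsan_malloc_wrapper" is a made-up name standing in for the sanitizer's interceptor):
<pre style="background-color: black; color: lime;">
#include <stddef.h>

void *__libc_malloc(size_t size);        /* exported by glibc */
void *lsan_malloc_wrapper(size_t size);  /* hypothetical lsan interceptor */

/* Every hooked entry point goes through a function pointer, so leak
 * tracking can be switched on and off at runtime. */
static void *(*malloc_impl)(size_t) = lsan_malloc_wrapper;

void *malloc(size_t size)
{
    return malloc_impl(size);
}

void __lsan_enable_interceptors(void)
{
    malloc_impl = lsan_malloc_wrapper;
}

void __lsan_disable_interceptors(void)
{
    malloc_impl = __libc_malloc;
}
</pre>
</div>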
<div>
<br /></div>
<div>
There are some issues, though</div>
<div>
<ul>
<li>LSAN may have some considerable memory overhead. It looks like it doesn't make much sense to disable leak checking since the memory consumed by LSAN won't be reclaimed until the process exits. On the other hand, we can disable leak detection at application startup and only enable it when we need to trace a leak (for example, an app has been running continuously for a long time, and we don't want to stop it to relaunch in a debug configuration).</li>
<li>We need to ensure that calling a non-hooked free() on a hooked malloc() and vice-versa does not lead to memory corruption. This needs to be looked into, but it seems that both lsan and libc just print a warning in that case, and corruption does not happen (but a memory leak does, therefore it is impractical to repeatedly turn leak detection on and off)</li>
</ul>
</div>
<div>
We plan to release the patched library once we perform some evaluation and understand whether it is a viable approach.</div>
<br />
Some ideas worth looking into may be:<br />
<ul>
<li>Add a module to annotate and type-check inline assembly. Would be good for Linux kernel</li>
<li>Add a module to trace all pointer puns. For example, in the Linux kernel and many other pieces of C code, casting to a void pointer and using the container_of macro is often used to emulate OOP (see the sketch after this list). Now, using clang, one could check the types when some data is registered during initialization, cast to void, then used in some other function and cast back, or even generate the intermediate code programmatically.</li>
<li>Automatically replace shared variables/function pointer calls with IPC messages. That is interesting if one would like to experiment with porting Linux code to other systems or turning Linux into a microkernel</li>
</ul>
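For reference, the pointer-punning pattern in question looks like this (a generic C sketch of the container_of idiom, not code from any of the projects above) - the casts to and from void/char pointers are exactly what erases the type information a clang plugin could track:<br />
<pre style="background-color: black; color: lime;">
#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct list_node { struct list_node *next; };

struct device {
    int id;
    struct list_node node; /* embedded "base class" */
};

/* A callback that only receives the embedded node... */
static void on_node(struct list_node *n)
{
    /* ...and recovers the containing object - the compiler cannot
     * verify that n really lives inside a struct device. */
    struct device *dev = container_of(n, struct device, node);
    printf("device %d\n", dev->id);
}

int main(void)
{
    struct device d = { .id = 42 };
    on_node(&d.node);
    return 0;
}
</pre>
<br />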
I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-12942818340898378942014-01-27T02:56:00.002+04:002014-01-27T02:56:48.902+04:00porting XNU to ARM (OMAP5 CPU)Two weeks ago I have taken some time to look at the XNU port to the ARM architecture done by the developer who goes by the handle "winocm". Today I've decided to summarize my experience.<br />
<br />
Here is a brief checklist if you want to start porting XNU to your board:<br />
<ul>
<li>Start from reading Wiki https://github.com/darwin-on-arm/wiki/wiki/Building-a-bootable-system</li>
<li>Clone the DeviceTrees repository: https://github.com/darwin-on-arm/DeviceTrees . Basically, you can use slightly modified DTS files from Linux, but due to the fact that DTC compiler is unfinished, you'll have to rename and comment out some entries. For example, macros in included C headers are not expanded, so you'll have to hardcode values for stuff like A15 VGIC IRQs</li>
<li>Get image3maker which is a tool to make images supported both by GenericBooter and Apple's iBoot bootloaders https://github.com/darwin-on-arm/image3maker</li>
<li>Use the DTC from https://github.com/winocm/dtc-AppleDeviceTree to compile the abovementioned DTS files</li>
<li>Take a look at init/main.c . You may need to add a hackish entry the way it's done for the "HD2" board to limit the amount of RAM available.</li>
<li>I have built all the tools on OS X, but then found out that it's easier to use the prebuilt linux chroot image available at: https://github.com/stqism/xnu-chroot-x86_64</li>
</ul>
<br />
The most undocumented step is actually using the image3maker tool and building bootable images. You have to put them into the "images" directory in the GenericBooter-next source. As for the ramdisk, you may find some on github or unpack the iphone firmware, but I simply created an empty file, which is OK since I've not got that far in booting.<br />
<br />
Building GenericBooter-next is straightforward, but you need to export the path to the toolchain, and possibly edit the Makefile to point to the correct CROSS_COMPILE prefix<br />
<br />
<span style="background-color: black; color: lime;">rm images/Mach.*</span><br />
<span style="background-color: black; color: lime;">../image3maker/image3maker -f ../xnu/BUILD/obj/DEBUG_ARM_OMAP5432_UEVM/mach_kernel -t krnl -o images/Mach.img3</span><br />
<span style="background-color: black; color: lime;">make clean</span><br />
<span style="background-color: black; color: lime;">make</span><br />
<div>
<br /></div>
For the ramdisk, you should use the "rdsk" type, and "dtre" for the Device Tree (apparently there's also xmdt for xml device tree, but I did not try that);<br />
<br />
Booting on the omap5 via fastboot (without u-boot):<br />
<span style="background-color: black; color: lime;">./usbboot -f &</span><br />
<span style="background-color: black; color: lime;">fastboot -c "-no-cache serial=0 -v -x cpus=1 maxmem=256 mem=256M console=ttyO2" -b 0x83ff8040 boot vmlinux.raw</span><br />
<br />
Some notes about the current organization of the XNU kernel and what can be improved:<br />
<ul>
<li>All makefiles containing board identifiers should be sorted alphabetically, just for convenience when adding newer boards.</li>
<li>Look into memory size limit. I had to limit the RAM to 256Mb in order to boot. If one enables the whole 2Gb available on the OMAP5432 uEVM board, the kernel fails to "steal pages" during the initialization (as can be seen in the screenshot).</li>
<li>I have encountered an issue </li>
</ul>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgY1S72oHOyYbWe-oBkdJVdf_6ptSbSDYwmuWOjil7CEHtGI2Y8f2fDI9l0YIdO7iEim4SK7-XInwaISRjyArFl9W_r8eKF_-ZJsonGZCo259zhQ7ZKkGPi3ajMjl3s5fwiw04EfHIS3RU/s1600/xnu_screenshot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgY1S72oHOyYbWe-oBkdJVdf_6ptSbSDYwmuWOjil7CEHtGI2Y8f2fDI9l0YIdO7iEim4SK7-XInwaISRjyArFl9W_r8eKF_-ZJsonGZCo259zhQ7ZKkGPi3ajMjl3s5fwiw04EfHIS3RU/s1600/xnu_screenshot.jpg" height="424" width="640" /></a></div>
OMAP5 is an ARM Cortex-A15 core. Currently the XNU port only has support for the A9 CPU, but if we're booting without SMP and L2 cache, the differences between these architectures are not a big problem. OMAP5 has a generic ARM GIC (Generic Interrupt Controller), which is compatible with the GIC in the MSM8xxx CPUs, namely with the APQ8060 in the HP TouchPad, support for which was added by winocm. The UART is a generic 16550, compatible with the one in the OMAP3 chip. Given all this, I have managed to get the kernel basically booting and printing some messages to the UART.<br />
<br />
Unfortunately, I have not managed to bring up the timer yet. The reason is that I was booting the kernel via USB directly after OMAP5 internal loader, skipping u-boot and hardware initialization. Somehow the internal eMMC chip in my board does not allow me to overwrite the u-boot partition (although I did manage to do it once when I received the board)<br />
<br />
I plan to look into it once again, now with hardware pre-initialized by u-boot, and write a detailed post. Looks like the TODO is the following:<br />
<br />
<ul>
<li>Mirror linux rootfs (relying on 3rd party github repos is dangerous)</li>
<li>Bringing up OMAP5 Timer</li>
<li>Building my own RAMDisk</li>
<li>Enabling eMMC support</li>
<li>Playing with IOKit</li>
</ul>
I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-92191746819551485142013-12-27T03:43:00.000+04:002013-12-27T03:43:07.186+04:00I don't evenLook, to some extent I like Mac OS X. It's a UNIX, it has some software (though, very little compared to linux). I like the objective-c language, and developing for iOS is a huge buzz with high salaries. Oh, and it has DTrace. Other than that I don't really have a reason to like it.<br />
<br />
Some things about this OS are undocumented and badly broken. Take file system management for example. Tonight I looked at the free disk space and found out that some entity named "Backups" occupied 40GB. Turns out it's Time Machine's local snapshots. The proper way to get rid of them would be to disable automatic backups in Time Machine. One can also disable local snapshots from command line like:<br /><br />
<span style="background-color: black; color: lime;">tmutil disablelocal</span><br />
<span style="background-color: white;">And this is where I've effed up. This did not remove any space. So, I went ahead and removed the ".MobileBackups" and "Backups.backupdb" folders. NEVER EVER FUCKING DO IT. The thing is that Time Machine, according to some reports, creates hard links to directories (sic!), and now I just lost those 40 gigs - they ain't showing up in "du -s", but they show up as "Others" in the disk space info. Sooo. next time, use the "tmutil delete" to delete those directories.</span><br />
<span style="background-color: white;"><br /></span>
<span style="background-color: white;">Ok, I've re-enabled the snapshots with "tmutil enablellocal" and disabled them with the GUI. After that, I opened the Disk Utility and clicked "Verify Disk". It reported that the root FS was corrupted, I had to reboot to the recovery image and run the "Disk Repair". It's really confusing that OS X can perform live fsck (it just freezes all IO operations until fsck is done) but can't repair a live FS.</span><br />
<span style="background-color: white;"><br /></span>
<span style="background-color: white;">And a bonus picture for you. In the process I've unplugged my USB disk several times without unmounting and the FS got corrupted. This is what I got for trying to repair it.</span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmquCX1U3x2o8Zape8xt2kzjOd4YS_pX1E2xhBrrFEpEPRC94kwEgRd1N2lWZNikid6OMNz-qNAMiNojY85GlU_pwY1D-rx-j4RDMKaMT81KzKLNiSakXeJR4wDrKJBLEXDH-jg_BqPPo/s1600/Screen+Shot+2013-12-27+at+3.31.00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="118" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmquCX1U3x2o8Zape8xt2kzjOd4YS_pX1E2xhBrrFEpEPRC94kwEgRd1N2lWZNikid6OMNz-qNAMiNojY85GlU_pwY1D-rx-j4RDMKaMT81KzKLNiSakXeJR4wDrKJBLEXDH-jg_BqPPo/s320/Screen+Shot+2013-12-27+at+3.31.00.png" width="320" /></a></div>
<span style="background-color: white;"><br /></span>I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-51845635543139576272013-11-26T01:04:00.000+04:002013-11-26T01:04:17.481+04:00KVM on ARM Cortex A15 (OMAP5432 UEVM)Hi! In this post I'll summarize the steps I needed to do in order to get KVM working on the OMAP5 ARM board using the virtualization extensions.<br />
<br />
<h3>
ARM A15 HYP mode.</h3>
In the Cortex-A15, ARM has introduced a new operating mode called HYP (hypervisor). It has lower permissions than TrustZone. In fact, HYP splits the "insecure" world into two parts, one for the hypervisor and the other for the guests. By default on most boards the system boots into the insecure non-HYP mode. To enter HYP mode, one needs to use platform-specific ways. For OMAP5 this involves making a call to TrustZone which will restart the insecure-mode cores.<br />
<br />
A good overview of how virtualization support for ARM was added to Linux is available at LWN.<br />
<a href="http://lwn.net/Articles/557132/">http://lwn.net/Articles/557132/</a><br />
<br />
<h3>
Ingo Molnar HYP patch</h3>
There was a patch for u-boot to enable entering HYP mode on OMAP5 by Ingo Molnar. Too bad, it was either written for an early revision of omap5 or poorly tested. It did not work for my board, so I had to learn about OMAP5 TrustZone SCM commands from various sources and put up a patch (which is integrated to my u-boot branch).<br />
If you're interested, you can take a look at the corresponding mailing list entry.<br />
<br />
<a href="http://u-boot.10912.n7.nabble.com/RFD-OMAP5-Working-HYP-mode-td163302.html">http://u-boot.10912.n7.nabble.com/RFD-OMAP5-Working-HYP-mode-td163302.html</a><br />
<h3>
Preparing u-boot SD card</h3>
Get the android images from TI or build them yourself. You can use the usbboot tool to boot images from the PC. Or even better, you can build u-boot (this is the preferred way) and then you won't need android images. But you may need the TI GLSDK for the x-loader (MLO). Creating an SD card with u-boot is the same as for omap3 and omap4, so I'll leave this out. There is some magic with creating a proper partition table, so I advise that you get some prebuilt image (like ubuntu for pandaboard) and then replace the files in the FAT partition.<br />
<br />
<a href="http://software-dl.ti.com/omap/omap5/omap5_public_sw/OMAP5432-EVM/5AJ_1_5_Release/index_FDS.html">http://software-dl.ti.com/omap/omap5/omap5_public_sw/OMAP5432-EVM/5AJ_1_5_Release/index_FDS.html</a><br />
<a href="http://www.omappedia.com/wiki/6AJ.1.1_Release_Notes">http://www.omappedia.com/wiki/6AJ.1.1_Release_Notes</a><br />
<br />
Please consult the OMAP5432 manual on how to set up the DIP switches to boot from SD card.<br />
<h3>
Source code</h3>
For u-boot:<br />
<a href="https://github.com/astarasikov/uboot-tegra/tree/omap5_hyp_test">https://github.com/astarasikov/uboot-tegra/tree/omap5_hyp_test</a><br />
<br />
For linux kernel:<br />
<a href="https://github.com/astarasikov/linux/tree/omap5_kvm_hacks">https://github.com/astarasikov/linux/tree/omap5_kvm_hacks</a><br />
<br />
Linux kernel is based on the TI omapzoom 3.8-y branch. I fixed a null pointer in the DWC USB3 driver and some issues with the 64-bit DMA bitmasks (I hacked the drivers to work with ARM LPAE, but this probably broke them for anything else. The upstream has not yet decided on how this should be handled).<br />
<br />
<h3>
Compiling stuff</h3>
First, let's build the u-boot<br />
<span style="background-color: black; color: lime;">#!/bin/bash</span><br />
<span style="background-color: black; color: lime;">export PATH=/home/alexander/handhelds/armv6/codesourcery/bin:$PATH</span><br />
<span style="background-color: black; color: lime;">export ARCH=arm</span><br />
<span style="background-color: black; color: lime;">export CROSS_COMPILE=arm-none-eabi-</span><br />
<span style="background-color: black; color: lime;">U_BOARD=omap5_uevm</span><br />
<span style="background-color: black; color: lime;">make clean</span><br />
<span style="background-color: black; color: lime;">make distclean</span><br />
<span style="background-color: black; color: lime;">make ${U_BOARD}_config</span><br />
<span style="background-color: black; color: lime;">make -j8</span><br />
<br />
you'll get the u-boot.bin and the u-boot.img (which can be put to the SD card). Besides, that will build the mkimage tool that we'll need later.<br />
<br />
Now, we need to create the boot script for u-boot that will load the kernel and the device tree file to RAM.<br />
<br />
<span style="background-color: black; color: lime;">i2c mw 0x48 0xd9 0x15</span><br />
<span style="background-color: black; color: lime;">i2c mw 0x48 0xd4 0x05</span><br />
<span style="background-color: black; color: lime;">setenv fdt_high 0xffffffff</span><br />
<span style="background-color: black; color: lime;">fdt addr 0x80F80000</span><br />
<span style="background-color: black; color: lime;">mmc rescan</span><br />
<span style="background-color: black; color: lime;">mmc part</span><br />
<span style="background-color: black; color: lime;">fatload mmc 0:1 0x80300000 uImage</span><br />
<span style="background-color: black; color: lime;">fatload mmc 0:1 ${fdtaddr} omap5-uevm.dtb</span><br />
<span style="background-color: black; color: lime;">setenv mmcargs setenv bootargs console=ttyO2,115200n8 root=/dev/sda1 rw rootdelay=5 earlyprintk nosmp</span><br />
<span style="background-color: black; color: lime;">printenv</span><br />
<span style="background-color: black; color: lime;">run mmcargs</span><br />
<span style="background-color: black; color: lime;">bootm 0x80300000 - ${fdtaddr} </span><br />
<br />
Now, compile it to the u-boot binary format:<br />
<span style="background-color: black; color: lime;">./tools/mkimage -A arm -T script -C none -n "omap5 boot.scr" -d boot.txt boot.scr</span><br />
<br />
<br />
<h3>
Building linux:</h3>
<span style="background-color: black; color: lime;">export PATH=/home/alexander/handhelds/armv6/linaro-2012q2/bin:$PATH</span><br />
<span style="background-color: black; color: lime;">export ARCH=arm</span><br />
<span style="background-color: black; color: lime;">export CROSS_COMPILE=/home/alexander/handhelds/armv6/linaro-2012q2/bin/arm-none-eabi-</span><br />
<span style="background-color: black; color: lime;">export OMAP_ROOT=/home/alexander/handhelds/omap</span><br />
<span style="background-color: black; color: lime;">export MAKE_OPTS="-j4 ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE"</span><br />
<span style="background-color: black; color: lime;"><br /></span>
<span style="background-color: black; color: lime;">pushd .</span><br />
<span style="background-color: black; color: lime;">cd ${OMAP_ROOT}/kernel_omapzoom</span><br />
<span style="background-color: black; color: lime;">make $MAKE_OPTS omap5uevm_defconfig</span><br />
<span style="background-color: black; color: lime;">make $MAKE_OPTS zImage</span><br />
<span style="background-color: black; color: lime;">popd</span><br />
<span style="background-color: black; color: lime;"><br /></span>
<span style="background-color: white;">Now, we need to compile the DTS (device tree source code) using the dtc tool. If you choose to use the usbboot instead of u-boot, you can enable the config option in kernel and simply append the DTB blob to the end of zImage.</span><br />
<span style="background-color: white;">(Boot Options -> Use appended device tree blob to zImage)</span><br />
<br />
<span style="background-color: purple; font-size: 12px;"><span style="color: yellow;">./scripts/dtc/dtc arch/arm/boot/dts/omap5-uevm.dts -o omap5-uevm.dtb -O dtb</span></span><br />
<span style="background-color: purple; color: yellow;">cat kernel_omapzoom/arch/arm/boot/zImage omap5-uevm.dtb > kernel</span><br />
<span style="background-color: purple; color: yellow;">./usbboot -f; fastboot -c "console=ttyO2 console=tty0 rootwait root=/dev/sda1" -b 0x83000000 boot kernel</span><br />
<div>
<br /></div>
<h3>
<span style="font-size: 12px;">VirtualOpen</span></h3>
For userspace part, I've followed the manual from VirtualOpenSystems for versatile express. The only tricky part was building qemu for the ArchLinux ARM host, and the guest binaries are available for download.<br />
<a href="http://www.virtualopensystems.com/media/kvm-resources/kvm-arm-guide.pdf">http://www.virtualopensystems.com/media/kvm-resources/kvm-arm-guide.pdf</a><br />
<br />
<br />
P.S. please share your stories on how you're using or plan to use virtualization on ARMI hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-13058365139191947812013-11-25T02:43:00.000+04:002013-11-25T02:47:51.473+04:00An update on OSX Touchscreen driverAfter playing with the HID interface in OS X, I have found out there exists an API for simulating input events from user space, so I've implemented the touchscreen driver using it.<br />
<br />
One unexpected caveat was that you need to set an increasing event number to be able to click the menu bar. Even worse, when launched from Xcode, the app would hang if you clicked the Xcode menu bar. If you click any other app or launch it outside Xcode, everything's fine. Caused some pain while debugging.<br />
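<br />
For reference, posting a synthetic click with the Quartz event services API looks roughly like this (a small sketch of the userspace approach; the event number handling is the part needed to make menu bar clicks work):<br />
<pre style="background-color: black; color: lime;">
#include <ApplicationServices/ApplicationServices.h>

static int64_t g_event_number;

static void post_click(CGPoint where)
{
    CGEventRef down = CGEventCreateMouseEvent(NULL, kCGEventLeftMouseDown,
                                              where, kCGMouseButtonLeft);
    CGEventRef up = CGEventCreateMouseEvent(NULL, kCGEventLeftMouseUp,
                                            where, kCGMouseButtonLeft);

    /* An increasing event number is required for menu bar clicks. */
    CGEventSetIntegerValueField(down, kCGMouseEventNumber, ++g_event_number);
    CGEventSetIntegerValueField(up, kCGMouseEventNumber, g_event_number);

    CGEventPost(kCGHIDEventTap, down);
    CGEventPost(kCGHIDEventTap, up);

    CFRelease(down);
    CFRelease(up);
}
</pre>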
<br />
Now I still want to implement the driver as a kernel module. On the other hand, the userspace version is also fine. I want to add the support for multitouch gestures, though I'm not sure it is possible to provide multiple pointers to applications.<br />
<br />
<iframe width="560" height="315" src="//www.youtube.com/embed/ccMOPNel-2k" frameborder="0" allowfullscreen></iframe><br />
<br />
The code is at github https://github.com/astarasikov/osxhidtouch<br />
<br />
You can get the binary OSX touchscreen driver for the Dell S2340T screen from the link below. Basically, you can replace the HidTouch binary with your own, and if you put it to "HidTouch.app/Contents/MacOS/HidTouch", you will get an app bundle that can be installed to Applications and added to the auto-start.<br />
https://drive.google.com/file/d/0B7wcN-tOkdeRRmdYQmhJSWsta1U/edit?usp=sharingI hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0tag:blogger.com,1999:blog-1416969656167659289.post-83666336337666036772013-11-22T03:55:00.003+04:002013-11-22T03:59:42.445+04:00Multitouch touchscreen support in OS XHi there!<br />
<br />
I happen to have a Dell S2340T multitouch monitor (quite an expensive toy, btw) which has a touch controller from 3M. It works fine in Windows (which I have no plans to use), sort of works in linux (which is my primary work environment), and does not work at all in OS X (not a big deal, but it would be cool to have it).<br />
<br />
So I set out on the search for a driver and have figured out the following:<br />
<br />
<ul>
<li>There is an old driver from TouchBase that sucks. It uses a stinky installer that pollutes the OS X root file system and has a UI from the 90s. On top of that, it's not signed and does not support my touchscreen; even adding the USB Vendor ID to its Info.plist did not fix a thing</li>
<li>There is some new trial driver from TouchBase that should work for most devices but is limited to 100 touches (a demo version). Well, I didn't want to grovel before them and sign up at their stupid site. And still, even if I patched the driver to remove the trial limitations, I would have to sign it with my certificate and there would be no legal way for me to distribute it on the internetz</li>
<li>There is a full version of the TouchBase driver that costs $100. Are they fucking crazy? On the other hand, I would do the same. See, home users don't care, multitouch displays are very rare, and people are used to pirating software. But one could raise quite some money selling the driver to the workshops that build customized Apple computers (like ModBook) or car PCs.</li>
<li>There are some drivers for other touchscreens, but they're for the old single-touch devices</li>
</ul>
<br />
Doesn't look promising. Now, one may wonder "WTF ain't it working out of the box? It's a HID device, should work everywhere". Well, there are two problems:<br />
<br />
<ul>
<li>Multitouch devices use a new HID event type (HID usages) for the Digitizer class, which OS X does not support in IOKit. Actually, since such a device uses a new HID class (different from single-touch monitors), it is not even classified as a touchscreen by the OS (rather as a multitouch trackpad)</li>
<li>The touchscreen gets recognized by the generic HID driver as a mouse. And here comes the problem - when a finger is released, the device reports the pointer to be at the origin (point <0, 0>) which causes the cursor to be stuck at the top left corner</li>
</ul>
<div>
I decided to learn about the HID system in OS X and write a driver. I started with the sample code from the HID documentation and added some event debugging. I found out that the device reports the events both as a digitizer and as a pointing device. So for now I prepared a demo app in Qt5 that can recognize multiple touches and draw points in different colors when the screen is touched. Have a look at it in action:<br />
<br />
<iframe width="560" height="315" src="//www.youtube.com/embed/pXpa6xL5J0w" frameborder="0" allowfullscreen></iframe></div>
<div>
<br /></div>
<div>
The source code is at https://github.com/astarasikov/osx-multitouch-demo .</div>
<br />
<br />
Well, looks like I should figure out how to turn this into a device driver. I will probably need to figure out how to enumerate HID devices based on their class, make a kext file (Kernel EXTension) for XNU (the kernel used in OS X), and then... sell my driver for some $20, right?
<h3>
Using a hybrid graphics laptop on linux (2013-11-12)</h3>
<h3>
Introduction.</h3>
I have a laptop with so-called hybrid graphics. That is, it has two GPUs - one of them is part of the SoC (the Intel HD3000 GPU), the other one is a "discrete" PCIe Radeon HD 6630M from AMD. Typically, older models of dual-GPU laptops (and the new Apple Macbook Pro machines) have a multiplexer that switches the output from one of the GPUs to the display. Like the majority of modern laptops, mine features a "muxless" combination of GPUs - that is, the more powerful GPU does not have any physical connection to the display panel. Instead, it can write to the framebuffer memory of the less powerful GPU (or it can perform only the computational part of OpenGL/DirectX and let the primary GPU handle 2D drawing and blitting).<br />
<br />
Linux support for hybrid graphics is far from perfect. To be honest, it just sucks, even now that this kind of laptop has been dominating the market for some 4-5 years. With the recent advent of the "DRI-PRIME" interface, the linux kernel and userspace now support using the secondary GPU for offloading intensive graphics workloads. Right now, to utilize the separate GPU, an application has to be launched with a special environment variable (namely, "DRI_PRIME=1"), and the system is supposed to dynamically power up the GPU when it is needed and turn it off at other times to reduce power consumption and heat. Unfortunately, the power management support is still deeply broken. In this post I summarize my findings and scripts which allow turning off the external GPU permanently to save power and increase the battery life of the laptop. I am not using the secondary GPU because the OpenGL and X11 drivers for the Intel GPU are more stable, and the open-source Radeon driver/mesa does not yet support OpenCL (the computational API for GPUs).<br />
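<br />
For example, a quick way to check which GPU an offloaded application actually lands on (assuming a mesa build with PRIME support and the glxinfo utility installed):<br />
<span style="background-color: black; color: lime;">xrandr --listproviders</span><br />
<span style="background-color: black; color: lime;">DRI_PRIME=1 glxinfo | grep "OpenGL renderer"</span><br />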
<br />
<h3>
vga_switcheroo</h3>
The kernel interface for controlling the power of the external GPU is called "vga_switcheroo", and it allows powering down the GPU. Unfortunately, I have found that my laptop (and most others) re-enables the GPU after a suspend-resume cycle. I think this behaviour is intentional, because ACPI calls should be used to control the power of the GPU. However, it confuses vga_switcheroo, which thinks the card is still powered down. Meanwhile, the card drains some 10-30 watts of power, effectively reducing the laptop battery life from 7 hours to 2 hours or so.<br />
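<br />
For reference, the interface lives in debugfs (you need root and debugfs mounted at /sys/kernel/debug):<br />
<span style="background-color: black; color: lime;"># list the GPUs and their power state</span><br />
<span style="background-color: black; color: lime;">cat /sys/kernel/debug/vgaswitcheroo/switch</span><br />
<span style="background-color: black; color: lime;"># power down the unused discrete GPU</span><br />
<span style="background-color: black; color: lime;">echo OFF > /sys/kernel/debug/vgaswitcheroo/switch</span><br />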
<br />
My first stab at this problem was a patch that forced vga_switcheroo to disable, at resume, all the cards that were not held by userspace. That did solve the problem for a while, but it was hackish and never made it into the mainline kernel. However, it is still useful for linux kernels up to 3.9.<br />
<a href="http://lkml.indiana.edu/hypermail/linux/kernel/1204.3/02530.html">http://lkml.indiana.edu/hypermail/linux/kernel/1204.3/02530.html</a><br />
<br />
<h3>
kernel 3.10+</h3>
As of linux version 3.10, several changes were made with regard to hybrid graphics power management. The first is the introduction of dynamic power management for Radeon chips (in the form of clock control code and parsing of the predefined power states provided by the OEM BIOS). The second is a change to vga_switcheroo which <span style="background-color: purple; color: yellow;">allowed it to power down the card when unused</span> without user interaction. It does work to some extent, but the problem with the card being powered on after a sleep-resume cycle remains.<br />
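<br />
If I recall correctly, the new Radeon power management was not enabled by default on those kernels and had to be requested explicitly (a sketch - check your kernel version, later releases turn it on by default):<br />
<span style="background-color: black; color: lime;"># on the kernel command line</span><br />
<span style="background-color: black; color: lime;">radeon.dpm=1</span><br />
<span style="background-color: black; color: lime;"># or, if radeon is built as a module</span><br />
<span style="background-color: black; color: lime;">modprobe radeon dpm=1</span><br />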
<br />
The problem is that now, <span style="background-color: purple; color: yellow;">when I manually disable the card via vga_switcheroo, the PCI device is gone - it is removed and never appears again. The same behaviour could be observed on pre-3.10 kernels if one issued the "remove" command to the DRM node</span> (/sys/class/drm/card1/). Besides, my hack to vga_switcheroo stopped working after these upgrades. This did not make me happy, and I set out to figure out a solution.<br />
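<br />
For the curious, the "remove" trick is just the standard PCI hot-unplug interface (card1 is the radeon on my laptop - adjust the index; note that a powered-down GPU will not answer a plain rescan):<br />
<span style="background-color: black; color: lime;"># drop the discrete GPU from the PCI bus</span><br />
<span style="background-color: black; color: lime;">echo 1 > /sys/class/drm/card1/device/remove</span><br />
<span style="background-color: black; color: lime;"># normally a rescan re-enumerates removed devices</span><br />
<span style="background-color: black; color: lime;">echo 1 > /sys/bus/pci/rescan</span><br />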
<br />
<h3>
acpi_call</h3>
<div>
Linux powers off the radeon GPU using the PCIe bridge and GPU PCIe registers. A "portable" and "official" way to control the power of a GPU is an ACPI call. ACPI is an interface for Intel x86-based computers (although it is now also being implemented for ARM CPUs in an attempt to support UEFI, Windows RT and a unified linux kernel binary capable of running on any ARM SoC) intended to abstract hardware enumeration and power management. It contains tables with lists of PCI and other PNP (plug-and-play) peripherals which can be used by the OS kernel instead of the unsafe "probing" mechanism. Moreover, it contains a specification for an interpreted bytecode language. Some methods are implemented by the OEM inside the BIOS ACPI tables to perform certain functions - querying the battery status, powering devices up and down, etc.</div>
<div>
<br /></div>
<div>
Folks have long since figured out how to use ACPI calls in linux to control the power of the discrete GPU. Although this could interfere with the vga_switcheroo interface, as long as we either completely disable the external GPU or power it off via vga_switcheroo first, we're safe to use it. Moreover, <span style="background-color: purple;"><span style="color: yellow;">I have found out that I can use an ACPI call to power on the device and make it visible to the system again after it has been removed as described in the previous paragraph</span></span>!</div>
<div>
<br /></div>
<div>
There exists a module for the linux kernel that allows performing arbitrary ACPI calls from userspace. It turns out that a direct ACPI call can even work around the new vga_switcheroo. There's a good guide on installing the acpi_call module into the DKMS subsystem so that it is automatically packaged and built every time you upgrade the linux kernel on your machine.</div>
<div>
<a href="http://garyservin.wordpress.com/2012/01/06/disabling-discrete-gpu-in-debian-gnulinux-wheezy/">http://garyservin.wordpress.com/2012/01/06/disabling-discrete-gpu-in-debian-gnulinux-wheezy/</a></div>
<br />
The module contains the turn_off_gpu.sh script in the examples folder, which can be used to power down the GPU. I ran it and took note of the method that worked for my laptop - it was <span style="background-color: purple; color: yellow;">"\_SB.PCI0.PEG0.PEGP._OFF"</span><span style="background-color: white;"> (that is, the ACPI System Bus scope -> PCI root bridge 0 -> PCI Express Graphics port 0 -> the device behind that port).</span><br />
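So powering the card off by hand amounts to the following (the result of the last call should then be readable back from the same file):<br />
<span style="background-color: black; color: lime;">echo "\_SB.PCI0.PEG0.PEGP._OFF" > /proc/acpi/call</span><br />
<span style="background-color: black; color: lime;">cat /proc/acpi/call</span><br />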
<span style="font-size: 12px;"><span style="background-color: white;"><br /></span></span>
<span style="font-size: 12px;"><span style="background-color: white;">Now, I did the following magical trick:</span></span><br />
<span style="background-color: black; color: lime;">echo "\_SB.PCI0.PEG0.PEGP._ON" > /proc/acpi/call</span><br />
And BANG! after being gone when turned off via the vga_switcheroo, the GPU was identified by linux and enabled again. Neato.<br />
<br />
<span style="background-color: purple; color: yellow;">Putting it all together.</span><br />
Now the real showstopper is that the card still drains power after resume. I figured I had to write a script that would power down the card. It <span style="background-color: purple; color: yellow;">turned out that sometimes, if the laptop was woken up and then immediately went back to sleep due to some race condition, systemd did not execute the resume hook.</span><br />
<br />
<span style="background-color: purple; color: yellow;">Instead, I used the udev</span>. Udev is a daemon that listens for the events that the kernel sends when hardware state changes. Each time the laptop wakes up, the power supply and battery drivers are reinitialized. Besides, the primary GPU also sends wakeup events. Another two observations are that I'm not using the discrete GPU and it is safe to call the ACPI "OFF" method multiple times. So, it is not a problem if we call the script too many times. I decided to just call it on any event from the two aforementioned subsystems. Here is how it can be done (note that you need to have root permissions to edit the files in /etc. If you are a newcomer to linux, you can become root by issuing a "sudo su" command):<br />
<br />
Create the file "/etc/udev/rules.d/10-battery-hook.rules" with the following contents:<br />
<span style="background-color: black; color: lime;">ACTION=="change", SUBSYSTEM=="power_supply", RUN+="/etc/gpu_poweroff.sh"</span><br />
<span style="background-color: black; color: lime;">ACTION=="change", SUBSYSTEM=="drm", RUN+="/etc/gpu_poweroff.sh"</span><br />
<div>
<br /></div>
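After saving the rules file, tell udev to re-read its rules; you can also watch the incoming events to verify that the rule fires on resume or when (un)plugging the charger:<br />
<span style="background-color: black; color: lime;">udevadm control --reload-rules</span><br />
<span style="background-color: black; color: lime;">udevadm monitor --environment --udev</span><br />
<br />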
Now, create the /etc/gpu_poweroff.sh script with the contents below. You can uncomment the "echo" calls to debug the script and verify that it is getting called. By the way, the power_profile part is not strictly necessary, but it is an example of how to put a radeon GPU into a low-power state without disabling it.<br />
<br />
<h3>
gpu_poweroff.sh</h3>
<span style="background-color: #f1c232;">#!/bin/bash</span><br />
<span style="background-color: #f1c232;"><br /></span>
<span style="background-color: #f1c232;">#this is a script for lowering the power consumption</span><br />
<span style="background-color: #f1c232;">#of a radeon GPUs in laptops. comment out whatever portion</span><br />
<span style="background-color: #f1c232;">#of it you don't need</span><br />
<span style="background-color: #f1c232;"><br /></span>
<span style="background-color: #f1c232;">#echo "gpu script @`date`" >> /tmp/foo</span><br />
<span style="background-color: #f1c232;"><br /></span>
<span style="background-color: #f1c232;">if [ -e /sys/class/drm/card0 ] ;then</span><br />
<span style="background-color: #f1c232;"><span class="Apple-tab-span" style="white-space: pre;"> </span>for i in /sys/class/drm/card*; do</span><br />
<span style="background-color: #f1c232;"><span class="Apple-tab-span" style="white-space: pre;"> </span>if [ -e $i/device/power_method ]; then</span><br />
<span style="background-color: #f1c232;"><span class="Apple-tab-span" style="white-space: pre;"> </span>echo profile > $i/device/power_method</span><br />
<span style="background-color: #f1c232;"><span class="Apple-tab-span" style="white-space: pre;"> </span>fi</span><br />
<span style="background-color: #f1c232;"><span class="Apple-tab-span" style="white-space: pre;"> </span>if [ -e $i/device/power_profile ]; then</span><br />
<span style="background-color: #f1c232;"><span class="Apple-tab-span" style="white-space: pre;"> </span>echo low > $i/device/power_profile</span><br />
<span style="background-color: #f1c232;"><span class="Apple-tab-span" style="white-space: pre;"> </span>fi</span><br />
<span style="background-color: #f1c232;"><span class="Apple-tab-span" style="white-space: pre;"> </span>done</span><br />
<span style="background-color: #f1c232;">fi</span><br />
<span style="background-color: #f1c232;"><br /></span>
<span style="background-color: #f1c232;">if [ -d /sys/kernel/debug/vgaswitcheroo ]; then</span><br />
<span style="background-color: #f1c232;"><span class="Apple-tab-span" style="white-space: pre;"> </span>echo OFF > /sys/kernel/debug/vgaswitcheroo/switch </span><br />
<span style="background-color: #f1c232;">fi</span><br />
<span style="background-color: #f1c232;"><br /></span>
<span style="background-color: #f1c232;">acpi_methods="</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.P0P1.VGA._OFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.P0P2.VGA._OFF</span><br />
<span style="background-color: #f1c232;">\_SB_.PCI0.OVGA.ATPX</span><br />
<span style="background-color: #f1c232;">\_SB_.PCI0.OVGA.XTPX</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.P0P3.PEGP._OFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.P0P2.PEGP._OFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.P0P1.PEGP._OFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.MXR0.MXM0._OFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.PEG1.GFX0._OFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.PEG0.GFX0.DOFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.PEG1.GFX0.DOFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.PEG0.PEGP._OFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.XVR0.Z01I.DGOF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.PEGR.GFX0._OFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.PEG.VID._OFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.PEG0.VID._OFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.P0P2.DGPU._OFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.P0P4.DGPU.DOFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.IXVE.IGPU.DGOF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.RP00.VGA._PS3</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.RP00.VGA.P3MO</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.GFX0.DSM._T_0</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.LPC.EC.PUBS._OFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.P0P2.NVID._OFF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.P0P2.VGA.PX02</span><br />
<span style="background-color: #f1c232;">\_SB_.PCI0.PEGP.DGFX._OFF</span><br />
<span style="background-color: #f1c232;">\_SB_.PCI0.VGA.PX02</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.PEG0.PEGP.SGOF</span><br />
<span style="background-color: #f1c232;">\_SB.PCI0.AGP.VGA.PX02</span><br />
<span style="background-color: #f1c232;">"</span><br />
<span style="background-color: #f1c232;"><br /></span>
<span style="background-color: #f1c232;"># turn off the dGPU via an ACPI call</span><br />
<span style="background-color: #f1c232;">if [ -e /proc/acpi/call ]; then</span><br />
<span style="background-color: #f1c232;"><span class="Apple-tab-span" style="white-space: pre;"> </span>for i in $acpi_methods; do</span><br />
<span style="background-color: #f1c232;"><span class="Apple-tab-span" style="white-space: pre;"> </span>echo $i > /proc/acpi/call</span><br />
<span style="background-color: #f1c232;"><span class="Apple-tab-span" style="white-space: pre;"> </span>done</span><br />
<span style="background-color: #f1c232;"><span class="Apple-tab-span" style="white-space: pre;"> </span>#echo "turned gpu off @`date`" >> /tmp/foo</span><br />
<span style="background-color: #f1c232;">fi</span><br />
<span style="background-color: #f1c232;"><br /></span>
<span style="background-color: #f1c232;">exit 0</span><br />
<br />
Make it executable by issuing a <span style="background-color: black;"><span style="color: lime;">"chmod +x /etc/gpu_poweroff.sh"</span></span> command.<br />
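To verify that the script does its job, you can run it once by hand and check the switcheroo state (or simply watch the battery discharge rate):<br />
<span style="background-color: black; color: lime;">bash /etc/gpu_poweroff.sh</span><br />
<span style="background-color: black; color: lime;">cat /sys/kernel/debug/vgaswitcheroo/switch</span><br />
<br />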
Also, take a look at the "/etc/rc.local" script. If it does not exist, simply create it with the following contents and make it executable:<br />
<span style="background-color: #f1c232;"><br /></span>
<span style="background-color: #f1c232;">#!/bin/sh -e</span><br />
<div>
<div>
<span style="background-color: #f1c232;">bash /etc/gpu_poweroff.sh</span></div>
</div>
<div>
<div>
<span style="background-color: #f1c232;">exit 0</span></div>
</div>
<div>
<br /></div>
If it does exist, insert the call to the "/etc/gpu_poweroff.sh" before the "exit" line.<br />
<h3>
Bonus</h3>
To silence the fans on a Sony Vaio laptop, you can add the following to your rc.local script (I advise against putting it in the gpu_poweroff script because this command has a noticeable delay):<br />
<br />
<span style="background-color: #f1c232;">if [ -e /sys/devices/platform/sony-laptop/thermal_control ]; then</span><br />
<span style="background-color: #f1c232;"><span class="Apple-tab-span" style="white-space: pre;"> </span>echo silent > /sys/devices/platform/sony-laptop/thermal_control</span><br />
<span style="background-color: #f1c232;">fi</span>I hate softwarehttp://www.blogger.com/profile/10541307407645458323noreply@blogger.com0