Friday, August 13, 2021

building CLVK OpenCL support for Android phones + OpenCV notes

Compiling CLVK for Android.

Many Android devices, especially Google Pixel, ship without the OpenCL library.
At some point I needed OpenCL for my OpenCV prototyping, and I was also interested in using either CU2CL or a similar project to run CUDA code.
Needless to say, as soon as I saw a project which promised to implement OpenCL on top of Vulkan, I decided to see if I can run it on Android.

It worked fine, although my approach was kind of nasty: integrating the project along with its LLVM library into the app code. That was good enough for prototyping, although the debug binaries took a few hundred megabytes.
I've added instructions to cross-compile this project for Android.
Additionally, I wrote a simple Android app to demonstrate how to integrate the pre-built
OpenCL library and how to deploy the OpenCL compiler ("clspv" binary) to the device.

Two example OpenCL apps are compiled: "clinfo" which prints some basic information and "BitonicSort" from Intel OpenCL demos.
FWIW both of them work so it's a good start.
https://github.com/astarasikov/clvk/tree/android_test

Also, it might be curious to compare the run times for this app on the phone with CLVK and on the desktop with native OpenCL drivers and with CLVK + RADV.
It seems that on desktop CLVK with RADV is 10 times slower than the native driver.

However, since RADV or any other Vulkan driver uses most of the same LLVM-based codegen as the native OpenCL drivers, this difference is very likely caused by some hardcoded allocation size or another similar parameter rather than some major architectural issue.
I have not looked into it yet for lack of time though.

  • ARM Mali G72 MP18: 96.818924 ms
  • Qualcomm Adreno 630: 33.18 ms
  • AMD RX480 with CLVK and RADV: 7.491112 ms.
  • AMD RX480 with ROCm OpenCL driver: 0.651121 ms.

https://github.com/astarasikov/clvk/blob/android_test/results.txt

OpenCL with CLVK on top of Android GL driver on a device with no OpenCL

OpenCL with CLVK on top of Android GL driver on a device with no OpenCL


Not a real fix, but that's enough to make most OpenCL samples run.


Building OpenCV with OpenCL support for Android.

For my personal project in 2016 I needed to check if it is possible to use the GPGPU accelerated version of OpenCV that is implemented using OpenCL on Android phones.

OpenCL SDK for Android.

Where to get the SDK? Welp. Build one yourself!
To make OpenCV recognize our SDK and build successfully, we need the following things:
  • OpenCV.mk - can be empty, but the OpenCV build system needs it to be present
  • Khronos OpenCL headers - can be gathered using OpenCL headers and CL-CPP SDK.
  • The loadable dynamic libraries - can be pulled from the device. Generally you can use one from any other device with the same architecture because the ABI and API is the same as it is defined by the OpenCL standard.
Here is how out "SDK" tree should look like. For the time being I only used the "armeabi-v7a" architecture, but one can also add the 64-bit binaries to the "arm64-v8a" directory. I've put up the "SDK" to GitHub, but only the header part (which are publicly available from Khronos). You will need to find the "libOpenCL.so" yourself. (If you're using Google Pixel with MSM8996, you can take the proprietary binaries from Xiaomi Mi5 or Zuk Z2 Pro).

.
├── OpenCV.mk
├── include
│   └── CL
│       ├── cl.h
│       ├── cl.hpp
│       ├── cl_d3d10.h
│       ├── cl_d3d11.h
│       ├── cl_dx9_media_sharing.h
│       ├── cl_dx9_media_sharing_intel.h
│       ├── cl_egl.h
│       ├── cl_ext.h
│       ├── cl_ext_intel.h
│       ├── cl_gl.h
│       ├── cl_gl_ext.h
│       ├── cl_platform.h
│       ├── cl_va_api_media_sharing_intel.h
│       └── opencl.h
└── lib
    └── armeabi-v7a

        └── libOpenCL.so

Building OpenCV with OpenCL support.

https://gist.github.com/astarasikov/9088745a49401fced5f1a3503b07e593

The most important change is to actually enable the OpenCL on Android in CMakeLists.txt:
OCV_OPTION(WITH_OPENCL "Include OpenCL Runtime support" NOT ANDROID IF (NOT IOS AND NOT WINRT) )
+OCV_OPTION(WITH_OPENCL "Include OpenCL Runtime support" (NOT IOS AND NOT WINRT) )
I also had to disable some warning flags and unsupported compiler options in "OpenCVCompilerOptions.cmake"

Compiler Options.
Extra debugging.
CMake options.

Please also see the following blog which describes building OpenCV with OpenCL, although not the case when you don't have a ready-made SDK: http://www.ysagade.nl/2014/11/02/opencv-android-setup/

OpenCL without root.

Many vendors who ship the phones with the official Android (that passes the CTS, compatibility test suite) do not ship the OpenCL drivers. Some vendors (most Chinese ones like such as Xiaomi and Zuk, and also Sony), however, do ship the drivers. If you have root on your phone or are building a custom firmware, you can just take the binaries from the other device's ROM and that's it.

Dynamic linking banned by Google?

Currently it seems that one can still use mmap() and mprotect() to write their own dynamic library loader, but this might get patched in future because Google is looking into both security and control.

What could we do then?
In principle, we could develop an application that would take a bunch of ".so" libraries and produce a single object file (.o) containing the data and code from all the libraries, with all dynamic symbols resolved. In other words, write a static compile-time linker.

The ultimate obstacle to this approach would be if device vendors re-worked the driver architecture in such a way that the OpenCL frontend would not be loaded to the application address space but a separate server. If that happens, the only way forward would be using a device that ships the OpenCL driver or building a custom firmware.

Package all the relevant ".so" objects directly to your application APK.
Paths. For this prototype, I manually edited the paths to the libraries using a hex editor. This part can be automated, but it was good enough for the proof of concept stage.
Essentially packaging a good half of the other phone's firmware into your app :)

Tweaking the application makefiles.

I edited Android.mk to specify the STL version and disable exceptions.

The fail.

OpenCL library loaded fine, but now we had two independent EGL/OpenGL contexts - one from GL and one from CL with no standard way of sharing textures between them.
(2020 edit) In retrospect, I could have hooked "open" routine to steal KGSL file descriptor from GL for CL, but still there might have been some globals shared between libraries.

Zero-copy memory sharing.

It should be observed that all modern SoCs have the shared RAM for the CPU, GPU and other units (such as the camera frontend). Moreover, on more recent models, there is guaranteed cache coherency between the units (typically maintained by the AXI bus). Android used the kernel-side API called ION to manage memory buffers shared between the devices.

It should be possible to map the GPU texture into the CPU memory using either the OpenGL extension or the underlying GraphicBuffer API which is a higher-level interface for ION. I may be looking into this option in future. For now, I've decided to use the phone that ships with the OpenCL driver.

One other direction to explore might be using the Java API. I was able to create a GraphicBuffer from Java using reflection https://gist.github.com/astarasikov/2ebd216fcafa389174b58c6d9913e397 .
Apparently John Carmack has managed to do it the other way round, using SurfaceTexture API ("SurfaceTexture -> Surface, pass the surface in an intent to the other process, turn that into an ANativeWindow -> EGLSurface") https://twitter.com/ID_AA_Carmack/status/776099691586998272 .

EDIT 2021:
You really should just use ImageReader class, that's it.

Further thoughts.

Sadly, it looks like there's not much we (the users/independent developers) can do to change the situation. Some projects like OpenCV indeed contain a lot of code that would be difficult to port to other languages, especially if you don't want to introduce additional bugs in the process.

Perhaps as a developer the best plan is to limit your app to the phones that support OpenCL out of the box or build a custom firmware for the phone to add the drivers. While this limits your app to a very narrow set of devices, it will allow to re-use existing OpenCL code and build a working prototype quickly which is crucial for many projects at the early stage.

P.S.
This line is from the blog draft in 2017 before CLVK came out :)

I think an interesting direction to explore would be to create the custom OpenCL/CUDA driver and runtime that would generate code in GLSL (for OpenCL ES with Compute Shaders) or Metal.

Wednesday, August 5, 2020

SVE-2019-15230: A bug collision

Researchers from Team T5 recently published their write-up on exploiting a bug in S-Boot and obtaining code execution in the Samsung Secure Bootloader (S-Boot).
This week, they're going to present it at the BlackHat 2020 conference.
Their write-up contains a lot of technical details and I recommend you to read it.

In their report, they say "Another security research team found this vulnerability at the same time and report it to Samsung. ID: SVE-2019-15230".
This one-man security research team was me.
As I described in the previous two posts, in 2019 I got myself a Samsung Galaxy S10 phone with an Exynos SoC and decided to hunt for security bugs.
After finding the first issue (which is also my first SVE and my first report rated "Critical"), "SVE-2019-14371", I decided to carefully review the code around the location where I found the first bug.

I found an integer overflow which could potentially lead to memory corruption, overriding the entirety of S-Boot code and data.


Funnily enough, I got the SVE even though I have not submitted the POC which achieves code execution (well, to be fair, I reported it almost two months earlier).
I have submitted the one which demonstrates that the device handles non-underflowed values correctly whereas a "huge" buffer size causes it to freeze.
I came to the conclusion it's hard to exploit because I could not find a device with the good memory layout.



I have downloaded a ton of images: for S10, S9, S7, A50, J series phones.
Team T5's trick was to find a condition where the error handler code will make the memory layout good.
I have unfortunately overlooked that in S8 the download buffer is right before the S-Boot code AND it was not using the newer ("compressed" or "smp") download modes.
Since I never realized how to make that S-Boot falls back to the "legacy" buffer at 0xc0000000, I was focusing on the first underflow here, and came to the conclusion that there's not much I could do about it.

I wrote in my report to Samsung that USB transfer is done with DMA and I have not seen S-Boot initialize SMMU so it's surely exploitable.
If we could make the buffer point before S-Boot, of course (which I have not found out how to do).
I like how Team T5 discusses that normally it would be hard to exploit the bug as most code and data could be cached, but as they found a bunch of pointers in an uncached area, overwriting them works, even if it's done by the USB controller (which is not necessarily cache-coherent) and the CPU is unaware.

Although it was already the third Critical issue I found in S-Boot by that time and I was completely burnt out on trying to develop yet another POC.

Is the security of Samsung phones bad? NO.

I think, it's quite on par with the competitors. While the Android Security Updates page (https://security.samsungmobile.com/securityUpdate.smsb) regularly lists "High" and "Critical" issues, very few of them happen in the S-Boot bootloader (which is one of the earliest pieces of code which execute on the device). What it means is that the attacker need to first unlock the phone. So keeping your phone locked and BT/WLAN off should give a reasonable level of protection in the settings when you can't keep an eye on your phone.

There are of course things that could be improved, like adding some mitigations to the bootloader (stack cookies, heap guard pages). However, in this case this would not help at all, because the memory copy (and thus overwriting code/data) is not done by the S-Boot code, but by the USB DMA controller.

Bootloaders almost never implement ASLR, and even kernels which do only implement it for the virtual memory, the physical address remains constant or predictable. In fact, as it's overwriting uncached code (exception handlers), it could even work if the CPU supported ARM MTE. So this is in some sense the mother of all S-boot bugs.

In some aspect, the root cause here is very similar to the IROM bug found by Frederic Basse: an integer overflow and copying data from USB, although in case of IROM it seems that IROM's code is copying USB data by small chunks.

Of course it would not be fair to judge the design decisions now that we know about these bugs. It would be nice to add some checks to ensure DMA regions don't intersect with code/data. Enabling SysMMU would not hurt. This is becoming somewhat worrisome now that USB4 has been announced with Thunderbolt-like DMA capabilities. It's unfortunate that most bootloaders do not focus on this.

Now, the problem is that adding mitigations is quite hard as well as reasoning about their effectiveness. Without memory safety you can never be sure that the code is not exploitable. And as we've just seen before, it's unlikely that even a hardware safety oriented at memory safety are likely to be bypassed. Maybe a combination of MTE/KASAN and instrumenting all DMA memory management would work, but again it relies on the individual developers thinking of all corner cases. At some point bootloaders/firmwares could become as complex as the Linux kernel itself.
To this end, an interesting approach is moving firmware update, including USB download, to user-space and indeed running it from Linux, as Google started doing recently.

Why are the bugs present then?

It often happens that bugs appear in two areas: the just-written code with new functionality (which not so many people had a chance to review) or the old code (people got tired of trying to find bugs there and gave up).

I think we can draw two conclusions from this.
  • As a vendor, one should not assume that security or code review is a one-off effort and they need to re-review their stuff once in a while, especially bringing in the perspective of new team members.
  • As the user, if you expect maximum security, perhaps don't switch to the new tech immediately, give it 2-3 months for the most obvious and annoying issues (not only security bugs) to get ironed out.

What can I do to protect myself?

Enable the "Find My Mobile" feature of your Samsung phone.

Before I and TeamT5 reported a few issues to Samsung, S-Boot implemented one mechanism of preventing the phone from being tampered with: when MDM (device administration by a workspace or school) is enabled, certain partitions were not allowed to be flashed.
This, unfortunately, did not protect the user in two scenarios: when they were not using MDM (which is the majoriy of users) or when the vulnerability is in the USB stack before the flashing code (in which case even enabling MDM would buy you nothing).
In mid-2019, Samsung changed the "Find My Mobile" in such way that it would disallow any USB operations (such as ODIN mode) when the device is not unlocked and FMM is enabled.
This should provide a reasonable level of protection for most users.
It is now much harder for an attacker to hack your phone in a few seconds when you're not looking at it (such as at a conference or in the hotel room). Of course, if is still possible to re-program the storage chip by desoldering it, which could expose other potential vulnerabilities, but it's hard to do it quickly and without tamper evidence.

Will you blog about other findings?

Not sure. It's a tricky question.
In general, Samsung is asking to not disclose the issues, at least until the patching and reward process is done, which is understandable. I am grateful that Samsung allowed me to participate in their security reporting program, even though I work for another mobile SoC company, so I'm not particularly interested in making this relationship go sour.

I do, however, see the immense value in describing the bug contents, because personally for me write-ups on Phrack or later Google Project Zero have been useful for understanding how attackers think, which came handy when both writing code and later working as a security engineer.

I guess vendors can have their own reasons to not like when bugs are disclosed. Before working as a security engineer myself, I thought negative PR/news was the major reason. Turns out, it's impossible to predict this factor, and often the minor bugs are over-hyped but serious ones go unnoticed. So maybe this factor is not that important after all.

Another thing I've noticed is that there is a significant interest in some hacking and "mobile repair" forums in bugs even for older firmware revisions, which means there are regions where people rarely update (expensive internet) or... a source of phones which for some reason remain unused and therefore not updated for a while.

Given that mobile phones are supported (security updates and carrier contracts), maybe half of this term (1.5-2 years from the time the bug is patched) is a reasonable delay for holding off disclosure.

Tuesday, May 5, 2020

On Samsung and Exynos hacking, again

Introduction.

Last year I published a post (http://allsoftwaresucks.blogspot.com/2019/05/reverse-engineering-samsung-exynos-9820.html) about reverse-engineering TEEGRIS and S-Boot
on Samsung Exynos Galaxy S10. This is kind of a follow-up to that post
which has received a lot of attention and led to interesting conversations
with fellow security researchers.

Funnily enough, this very blog with its distinctive URL got into academic papers and
conference talks.
I guess that counts as a success because that's more citations than all my
previous academic work combined. Slowly but steadily I'm progressing on track to receive my PhD from the Shitposting University.

Citations.

(All links have been retrieved on 2020-04-17).
https://gsec.hitb.org/materials/sg2019/D2%20-%20Launching%20Feedback-Driven%20Fuzzing%20on%20TrustZone%20TEE%20-%20Andrey%20Akimov.pdf
Andrey Akimov: Launching feedback-driven fuzzing on TrustZone TEE (HITB GSEC 2019 Singapore).

https://zeronights.ru/wp-content/themes/zeronights-2019/public/materials/5_ZN2019_andrej_akimovLaunching_feedbackdriven_fuzzing_on_TrustZone_TEE.pdf
Andrey Akimov : Launching Feedback-Driven Fuzzing on TrustZone TEE (ZeroNights 2019)

https://blog.quarkslab.com/a-deep-dive-into-samsungs-trustzone-part-1.html
Alexandre Adamski, Joffrey Guilbon, Maxime Peterlin of Quarkslab : A Deep Dive Into Samsung's TrustZone (Part 1)

https://www.usenix.org/system/files/sec20summer_harrison_prepub.pdf
Lee Harrison, Hayawardh Vijayakumar, Rohan Padhye , Koushik Sen , and Michael Grace: PARTEMU: Enabling Dynamic Analysis of Real-World TrustZone Software Using Emulation

https://www.ndss-symposium.org/wp-content/uploads/2020/04/bar2020-23014.pdf
Marcel Busch and Kalle Dirsch : Finding 1-Day Vulnerabilities in Trusted Applications using Selective Symbolic Execution

Follow-up on reverse-engineering and security research.

I also found a few bugs in TEEGRIS and S-Boot that got assigned CVEs by Samsung (check 2019/2020 here).
I'm somewhat happy about this achievement. Prior to that, I mostly worked on
the defense side both implementing mitigations/OS kernels and then debugging
security issues submitted by other researchers. So I was glad to receive this
external validation of my ability to find bugs on my own, although a little
bit surprised at how easy it was to find them with the code review of the
code decompiled with Ghidra.

I have not really found any bugs with fuzzing using the QEMU emulators for
S-Boot and TEEGRIS described in my previous blog post. However, these came
handy for debugging proof-of-concepts as I could use GDB and dump memory as if
it was just a regular Linux app on the PC.

I would also like to point your attention to this paper on Phrack
about emulating RKP (Samsung Hypervisor) with QEMU by Aris Thallas.
http://phrack.org/papers/emulating_hypervisors_samsung_rkp.html

I have used a similar approach with full-system QEMU emulation for debugging some RKP bugs.
However, after having spent so much effort on emulating S-Boot and TEEGRIS,
I was not in the mood to boot Linux in EL1 and put all the pieces together.
I used a different approach for testing Hypervisor Calls (HVCs). Instead
of having a proper EL1 client, I wrote a piece of C code that invoked the
EL2 exception handler directly. I then linked it to the address of some
uninteresting function in RKP and used GDB to overwrite the code in QEMU
memory and jump to my stub.

I especially like the part about using QEMU instrumentation to provide
coverage information to AFL.
I have also implemented a similar approach (based on the QEMU and Unicorn modes
from the AFL source tree) for my TEEGRIS QEMU emulator.
https://github.com/astarasikov/qemu/commits/teegris_usermode_persist_rewriteafl
https://twitter.com/astarasikov/status/1187902865710428160


Unfortunately, I have not found any bugs with fuzzing (although I have with code review).
I believe better results could be achieved with the CompareCoverage plugin which
would prevent the fuzzer from getting stuck on magic values/constants.
https://andreafioraldi.github.io/articles/2019/07/20/aflpp-qemu-compcov.html
Additionally, please check out this blog about implementing ASAN (Address Santiizer)
for binary-mode QEMU within the TCG interpreter/JIT.
https://andreafioraldi.github.io/articles/2019/12/20/sanitized-emulation-with-qasan.html

Finally, if you're interested in fuzzing at the source-code level and are
getting stuck with magic values/constants, please check out this
post from 2016 about a strategy for splitting comparisons (which is related
to CompareCoverage).
https://lafintel.wordpress.com/

This is already implemented in libFuzzer, but
if you have to use AFL, consider using AFL++ which maintains LLVM plugins
for these strategies. In any case, check out AFL++ because it attempts to unify
most of the forks developed in academia.
https://github.com/AFLplusplus/AFLplusplus

Other interesting news.

I9100 (Samsung Galaxy S2) upstream work.

I was surprised when I got a GitHub notification in 2020 about a project I have
not worked on before. Turns out, people have been resurrecting the work I've
done in back 2012 which was a nice surprise.

In 2012 I was doing some work on getting FOSS software to run on
the Samsung Galaxy S2 phone. It was a hobby project, I got this phone
after completing my work on porting Linux and Android to Sony Xperia X1 and
hoped that starting with a device which ran Linux out of the box would be
advantageous for this goal.

So the first problem that I solved was getting multi-boot working.
I solved it by porting the U-Boot bootloader.
This eventually related in a weird chain of events that landed me several interesting
jobs and gigs.

Anyway, the u-boot.


I then attempted porting the Galaxy S2 board support to the mainline kernel tree.
I was using the latest Linaro tree. I had some limited success in getting most
hardware working with upstream drivers (WIFI, Camera with V4L2) and by porting
some non-upstream ones (Sound, Modem).
https://github.com/astarasikov/i9100-proper-linux-kernel/commits/i9100_linaro_33

Eventually I had to resort to using the Android kernel with some changes
but I got dual-boot working with Ubuntu on the SD Card.

Native Ubuntu (with X11) on Samsung Galaxy S2 (2012)
https://www.youtube.com/watch?v=VHl8PytVt50

Back in 2012 I made a post to summarize my efforts related to S2.
https://www.mail-archive.com/smartphones-userland@linuxtogo.org/msg02865.html

Mainline linux port by Sekil

Fast-forward to 2020, I was surprised to learn that not only people are still
using the device, they are also using my U-Boot port and one developer even
went as far as resurrecting the attempts to run mainline linux tree.
They made great progress and independently authored patches for the mainline
tree which have a high chance of being accepted.

See this port by Evgeniy Stenkin.


This effort is acknowledged and is used by the PostmarketOS project.
https://wiki.postmarketos.org/wiki/Samsung_Galaxy_SII_(samsung-i9100)

FOSS RIL for Samsung Galaxy S2, Galaxy Nexus

Later, my focus switched to reverse-engineering the userspace libraries
in order to provide a fully open-source build of Android for Samsung Galaxy Nexus,
a device which shared the modem with Galaxy S2.

For the previous-generation phone (Galaxy S1, I9000) an open-source implementation
of the Radio Interface Layer (RIL) was provided by the engineers from the Replicant
and OpenMoko projects (Paul Kocialkowski, Simon Busch and morphis).

In 2012 I was asked by Ksys Labs to provide an open-source RIL for Samsung
Galaxy Nexus which happened to have the same modem as Galaxy S2.
So I have done the following:

  • Firmware loader for these modems (based on reverse-engineering and a C++ implementation by another engineer)
  • Fixing SMS character encoding so that we could receive SMS in Russian
  • Fixing some edge cases for USSD support
  • Providing some rudimentary socket callback protocol so that a proprietary GPS library could be used by those who really wanted to.

These changes have been fully integrated into the Replicant project
and served as the basis for supporting many more Samsung modems.
Some builds of LineageOS for Galaxy S3 also use these libraries from the
Replicant project to avoid the overhead of supporting the ABI for the
proprietary driver libraries from 2012.


I even saw the Replicant stand at the CCC last year so these phones
are living on.
And the dream of supporting it in a non-Android setting such as Ofono
seems to have never materialized. Oh well.

Summary

I am happy to see that my work on both U-Boot and RIL got reused by many projects.
Back in 2012 having your phone run upstream software was a very ambitious goal,
especially for a single developer. It usually took around a year and a half
to get familiar with all the hardware and reverse-engineer it to a decent level
in order to develop all the support by which time the device would get obsolete.
However, if you're more interested in using upstream SW than using the latest
HW, there is some hope.

Oh, and Pinephone looks like a nice alternative these days. The hardware is similar to Galaxy S2, but the CPU is 64-bit and it's FOSS out of the box.

U-Boot without the proprietary bootloader.

Here's another interesting development that happened in those years to another
related Exynos device (Galaxy S3 I9300).
Simon Shields ported U-Boot to Galaxy S3, but unlike my port this one
does not rely on the Samsung bootloader in any way and allows to boot the phone
with even fewer proprietary components.
https://blog.forkwhiletrue.me/posts/u-boot-on-galaxy-s3/

Back when I was porting U-Boot to S2, I flashed it into the Linux kernel
partition and made it so that it's loaded by the phone's original bootloader.
My motivation was to avoid bricking the device (back when it was not known
how to use Exynos USB recovery mode) and it was assumed that the bootloader
needed to be signed. As it turned out later around 2014, on these early
Exynos chips the initial bootloader shared the same signing key and device
ID with development boards and it was possible to work around the signing
requirement and replace the original bootloader by using the stage-1 bootloader
from a development board.

KVM on the phone.

Ever since working on the ARM para-virtualization with L4/Genode I wanted
to use real virtualization.
I was very enthusiastic about the first (32-bit) ARM boards with the HYP extension
when they arrived in 2013.
http://allsoftwaresucks.blogspot.com/2013/11/kvm-on-arm-cortex-a15-omap5432-uevm.html

Since then, I've always wanted to get virtualization working on a mobile phone
for the fun of running multiple operating systems.
Unfortunately, most of them enable "secure" booting and require that the EL2
hypervisor image is signed by the OEM.

Some early phones did not implement a hypervisor or left it writable by the OS
but I was wondering if I could do that on a fairly recent and powerful phone.

Here's some small showcase of an attempt to run Windows 10 in KVM on a Samsung
A50 phone with the Exynos9610 CPU.

The bug I found works only on the unlocked phone (with KNOX tripped/fuse blown) before Linux MMU
is on. In principle one might be able to find a variant that works with MMU on,
but even passing arbitrary arguments to RKP would require compromising (rooting)
Linux first. Therefore, this bug does not (IMHO) have a big security impact
(because on older generation Exynos RKP/EL2 was only used for the kernel
memory protection and ROPP/JOPP but not for IOMMU) but is interesting for research purposes.

This is in no way a statement on the security of Samsung devices. I think
their efforts are definitely above average for Android. However, given enough time
any system can be broken, even the ones previously regarded as unbreakable such
as PS4 or iPhone with PAC. Here, patching timely before the issues get disclosed
is important and looks like things have improved a lot in the Android world recently.
https://www.zdnet.com/article/android-oem-patch-rates-have-improved-with-nokia-and-google-leading-the-charge/

The bug has been patched in October 2019 anyway so users with the latest updates
should not be affected (SVE-2019-15221, SVE-2019-15143).

What I've also learnt from watching a lot of talks and following the discussions
by other researchers is that security issues often concentrate in two areas:
where no one has looked before, and where many people have looked and then gave
up because they decided that they found all the low-hanging fruits. So RKP seemed
like an interesting target given the previous research from Google Project Zero
in 2017 (https://googleprojectzero.blogspot.com/2017/02/lifting-hyper-visor-bypassing-samsungs.html).

I will not be providing additional details on that bug but here are some nice
screenshots and videos:

Ubuntu X11 running on Samsung Galaxy A50. KVM guest runs Windows 10.
Here, we can see that the colors are swapped as the framebuffer driver is confiruged
to output BGR instead of RGB by default in Android.

Video of UEFI booting Windows 10 installer in KVM.
https://twitter.com/astarasikov/status/1249904283098796033

A mysterious BSOD (yes it's actually supposed to be blue) in the USB controller
driver, possibly related to how the controller is emulated in QEMU.




Unfortunately for now I had to stop further work on this project because I accidentally
upgraded the phone to the latest firmware revision and now due to the rollback protection
I can no longer install the vulnerable RKP image.

If you're interested in this kind of stuff, there are good news.
Recently a few open-source phones have appeared which do not enforce secure boot/
signature verification and you can run KVM (or any other hypervisor) out of the box.

For example, multiple people have reported getting KVM and Windows 10 working
on the Pinephone and Pinebook.
Pinephone has a Cortex-A7 CPU with an old Mali GPU so in terms of hardware
it's almost an exact copy of the Galaxy S2 discussed above, but it's more
FOSS-friendly.

https://twitter.com/RealDanct12/status/1231607283412426756
https://twitter.com/Manawyrm/status/1197981073101271040


Wednesday, May 29, 2019

Reverse-engineering Samsung Exynos 9820 bootloader and TZ

Reverse-engineering Samsung S10 TEEGRIS TrustZone OS

It's been a while since my last post, huh?
Even though I have quite a lot of stuff I'm planning to write about, time is very limited.

Lately I've been working on reverse engineering and documenting
the S-Boot bootloader and TrustZone OS from the Exynos version
of Samsung Galaxy S10.
TLDR: I can now run S-Boot and TEEGRIS TrustZone TAs in QEMU but too lazy to find bugs.

It's been a while since I had a Samsung phone, my last was Galaxy S2.
It's also been a while since I last looked into bootloader binaries.

Last year I got an Exynos S9 model, mostly because I was impressed by its
CPU benchmark scores and wanted to run my own code to measure it.
This year I got some spare time but since S10 came out and a lot of people
have already looked at S9 software, I've decided to start reverse engineering
the software from S10.

S-Boot bootloader image layout.

github gist


  • 0x0: probably EPBL (early primitive bootloader) with some USB support
  • 0x13C00: ACPM (Access Control and Power Management?)
  • 0x27800: some PM-related code
  • 0x4CC00: some tables with PM parameters
  • ... -> either charger mode code or PMIC firmware
  • 0xA4000: BL2, the actual s-boot
  • 0x19E000: TEEGRIS SPKG (CSMC)
  • 0x19E02B: TEEGRIS SPKG ELF start (cut from here to load into the dissasembler). This probably stands for "Crypto SMC" or "Checkpoint SMC". This handles some SMC calls from the bootloader as part of Secure Boot for Linux.
  • 0x1ACE00: TEEGRIS SPKG (FP_CSMC)
  • 0x1ACE2B: TEEGRIS FP_CSMC (ELF header). My guess is that it's related to the Fingerprint sensor because all it does is set some registers in the GPIO block and USI block (whatever it is).
  • 0x264000: TEEGRIS kernel, relocate to 0xfffffffff0000000 to resolve relocations
  • 0x29e000: EL1 VBAR for TEEGRIS kernel. fffffffff0041630: syscall table, first entry is zero.
  • 0x2D4000: startup_loader package
  • 0x2D4028: startup_loader ELF start. This one's invoked by S-Boot to read the TEEGRIS kernel either from Linux kernel via shared memory or from the LZ4 archive compiled into S-Boot.

There's also one encrypted region containing ARM Trusted Firmware which is EL3 monitor code. It's right after the bunch of Rijndael substitution box constants.

Running S-Boot in QEMU.

I've long wanted to run S-Boot in QEMU for reverse engineering it.
I think I've mentioned this idea to my colleague Fred 2 years ago which kind of motivated him to write this great post about Exynos4210 early bootloader in SROM.
Check out his blog if you're interested in Samsung, btw.
https://fredericb.info/2018/03/emulating-exynos-4210-bootrom-in-qemu.html

Long story short, with a bit of hacks to emulate some MMIO peripherals I've prepared the patch for QEMU to run S-Boot from Exynos9820.
QEMU Support for Exynos9820 S-Boot

SCTLR_EL3 register

According to ARM ARM, top half of SCTLR is Undefined.
Samsung reused them to store the base address for the S-Boot bootloader.
When running in EL3, part of SCTLR is used when computing the value to write to VBAR registers which point to the Exception Table.
I initially attempted running S-Boot in EL3 but it checks EL at runtime and I believe it's actually running at EL1 but the binary supports EL1, EL2 and EL3.

Re-enabling debugging prints

Turns out, early in the boot process the bootloader disables most of the debugging logging.
I've prepared the GDB script to work around that.
gdbscript
set *(int*)0x8f16403c = 0

UART

https://github.com/astarasikov/qemu/blob/exynos9820/hw/arm/virt.c#L1900
As usual (WM5 blog [http://allsoftwaresucks.blogspot.com/2016/10/running-without-arm-and-leg-windows.html]), we can solve it by making the MMIO Read request return different data on subsequent reads.
We simply invert the value in cache on each invokation.
Using this trick we can bypass busy loops which wait for some bits to be set or cleared.

In fact, emulating two UART registers, status and TX, is enough to get debugging output from the bootloader.

Peripherals

We can identify some peripherals either by looking up their addresses in Linux Device Tree files
or by analysing what is done by the code that accesses them.
For example, we can easily identify Timer registers.

EL3 Monitor emulation.


S-Boot calls into the Monitor code (ARM Trusted Firmware) to do some crypto and EFUSE-related operations.
These calls have argument numbers starting with a lot of FFFFFF.
It was necessary to enable the "PSCI conduit" in QEMU which intercepts some SMC calls and add a simple
handler to allow S-Boot to properly start without crashing.
arm_is_psci_call
if ((param & 0xfffff000) == 0xfffff000) {
//Exynos SROM 
return true;

Putting all the pieces together: running it.

./aarch64-softmmu/qemu-system-aarch64 -m 2048 -M virt -serial stdio -bios ~/Downloads/sw/pda/s10/BL/sboot_bl2.bin   -s -S 2>/dev/null

At this point, we're not emulating most peripherals like I2C, PMIC, USB.
However, the bootloader gets to the point where memory allocator and printing subsystem is initialized which should be enough
to fuzz-test some parsers if we hook UFS/MMC access functions.


General approach to reverse-engineering

Samsung leaves a lot of debugging prints in their binaries.
Even in the RKP hypervisor, although most strings are obfuscated by getting replaced with their hashes,
some strings in the exception handler are not obfuscated at all.
With this knowledge, it's easy to identify the logging function, snprintf
and then strcpy, memcpy. Memcpy and strcpy are often near malloc and free.
Knowing this functions it's trivial to reverse-engineer the rest.

TEEGRIS intro

In the Exynos version of Galaxy S10, Samsung have replaced
the TrustZone OS from MobiCore with their solution called TEEGRIS.

As we've seen before, TEEGRIS kernel and loader are located inside
the BL image along with S-Boot.
Userspace portion - dynamic libraries and TAs (Trusted Applications)
reside in two locations:

  • System partition ("/system/tee"):
  • A TAR-like archive linked into the Linux Kernel
Here is what we can find:

  • 00000000-0000-0000-0000-4b45594d5354 (notice how 4b 45 49 4d 53 54 are ASCII codes for "KEYMST" (Key Master))
  • 00000000-0000-0000-0000-564c544b5052 VLTKPR (Vault Keeper)
  • 00000005-0005-0005-0505-050505050505 - TSS (TEE Shared Memory Server?)
  • 00000007-0007-0007-0707-070707070707 - ACSD (Access Control and Signing Driver?) basically the loader for TAs with a built-in X.509 parser 


I wrote a Python script to unpack the (uncompressed) TZAR files.
https://gist.github.com/astarasikov/f47cb7f46b5193872f376fa0ea842e4b#file-unpack_startup_tzar-py
After unpacking the file "startup.tzar" from S10 kernel tree (LINK)
we can see that it contains a bunch of libraries as well as two TEE applications
which can be identified by their file names resembling GUIDs.

Security mechanisms


  • Boot Time: TEEGRIS kernel and startup_loader reside in the same partition as S-Boot so their integrity should be checked by the early bootloader (in SROM).
  • Run Time: TrustZone applets (TAs) are authenticated using either built-in hashes or X.509 certificates.
  • Trustlets and TEEGRIS kernel has stack cookies and they are randomized.

All TAs are ELF files which export the symbol "TA_InvokeCommandEntryPoint" which
is where requests from Non-Secure EL1 (and other Secure EL0 TAs) are processed.
Additionally, some extra TZ applets can be found in the "system" partition.

Indentifying TEEGRIS syscalls

Attempt 1 (stupid)

Look for the syscall number and a compare instruction.
For example, for the "recv" syscall, let's search for 0x38, filter results by "cmp".
No Luck. Ok, it's probably using a jump table or a function pointer array instead.

Attempt 2

Let's locate AArch64 exception table and go from there.
We can find it by a bunch of NOPs (1f 20 03 d5) immediately after a block of zero-filled memory.
We can then find the actual exception handler for EL0 by knowing the offset from the ARM ARM.
https://developer.arm.com/docs/100933/latest/aarch64-exception-vector-table








P.S.
In fact, the code which launches "startup_loader" sets VBAR_EL1 to the same
address which we've identified before.

Syscalls

Luckily for us, Samsung put wrappers for each syscall into the library called "Libtzsl.so"
so we can easily recover the syscall names from the index in the table.

TEEGRIS IPC

Curiously, Samsung chose to implement two popular POSIX APIs to communicate
between TAs as well between TAs and REE (Linux): "epoll" and "sendmsg/recvmsg".

Peripherals such as I2C and RPMB are of course handled by file paths with magic
names, like on most UNIX-like kernels.

List of (most) TEEGRIS syscalls

https://github.com/astarasikov/qemu/blob/teegris_usermode/linux-user/syscall.c#L11590

TEEGRIS emulator

Since I'm better at reverse engineering than at exploitation
and I like writing emulators but hate code review, I decided to
find a way to run TAs on the Linux laptop instead of the actual
device.

Besides doing full-system emulation, QEMU supports the "user" target.
In this case it loads the target ELF binary into memory and translates
instructions to the host architecture, but instead of blindly passing
syscall arguments to real syscalls it can patch them and do any kind of
emulation.

Here are the changes that I needed to make in order to run TEEGRIS binaries instead of Linux ones:

  • ELF Entrypoint: setup AUXVALs in a specific order that "bin/libtzld.so" expects
  • Slightly different ABI: register X7 is used for the syscall number for both ARM32 and ARM64
  • https://github.com/astarasikov/qemu/blob/teegris_usermode/linux-user/syscall.c#L11785
  • TLS handling (QEMU bug?)

Current Status.


  • Boots TAs, both 32-bit and 64-bit
  • Currently does not support launching TAs from TAs (thread_create)
  • Currently only invalid command handler is reached. Need to improve
  • recvmsg or patch the library code as a workaround.
  • But overall it should be possible to build a fuzzer for TAs in less than a week of work now.



Here's one idea: now that we know we've emulated enough of syscalls
for a TA to boot and start message processing, we can just override the
return address and arguments for one of the syscalls which are invoked
in the message processing loop and redirect the execution directly
to TA_InvokeCommandEntryPoint.

For this proof of concept I've manually identified the entry point address
and adjusted it according to the ELF base load address and QEMU-specific load
offset. Of course it would be better to automate this part so that TA loader
is more generic but as every software engineer knows, those who write
good code get scooped by those who don't.



This kind of works in that we're getting messages from inside the TA: check the full log at https://pastebin.com/sVtWk5CD and search for "keymaster [ERR]".
However, it fails early when validating the message contents.
We need to generate the correct ASN.1 payload which should be doable
since ASN.1 grammar templates are compiled into the binary.

Ideas for future research


  • Hook malloc/free and some other functions and invoke native system C library calls.
  • Hook QEMU JIT (TCG) or interpreter to check memory accesses against ASAN shadow memory. This way we can enable Address Sanitizer for binary blobs, similarly to how Valgrind does memory debugging. Since QEMU Usermode runs TAs in the same address space as itself, we can use ASAN allocator or libdislocator to detect OOB memory access. Unicorn is kind of hard to use because for this because it does not allow to easily set up MMIO traps, it only allows to register chunks of normal memory.
  • Finish reverse-engineering ASN.1 format for Keymaster and fuzz this TA.
  • Run TEEGRIS kernel in QEMU as well to fuzz syscalls.
  • A Ghidra script to rename functions according to the debug strings passed to invokations of print callees
  • Look at the ring buffer implementation in the shared memory.

Running TEEGRIS Emulator

export TEE_CMD=777
qemu/teegris$ ../arm-linux-user/qemu-arm -z fuzz_keymaster/in/test0.bin -cpu max ./00000000-0000-0000-0000-4b45594d5354.elf

Debugging panics with GDB


Related Projects

Post from Daniel Komaromy on reverse-engineering Galaxy S8 which mostly focuses on the other part of the picture: getting from Linux into Secure EL0.


Blog from Blue Frost Security on reverse-engineering S9 TrustZone. The OS kernel is different but actual TAs are the same.

Monday, October 24, 2016

running without an ARM and a leg: Windows Mobile in QEMU

Introduction.

One project I had in mind long time ago was getting Windows Mobile to run in QEMU.
I think it's a lovely OS with a long history and the project seemed like a nice tecnhical challenge.

Initially I started working on it two years ago back in 2014 and the plan was to later run it in KVM on Cortex-A15 with Virtualization Extensions. However, I had to suspend it because I started working on two other challenges - running XNU on Xen (aka Virtu.al LLC) and later doing GSoC (running FreeBSD in ARM emulator).

Now since I've got some free time on my hands, I decided to finally get back to this project and cross it off my TODO list.

Choosing the emulation target.

In order to run Windows CE on QEMU (or any OS for that matter) it would be necessary to either develop a Board Support Package with all the drivers for a specific virtual machine or take the opposite approach and emulate some machine for which there already exists a ROM image.

For Windows, there is the emulator developed by Microsoft which is unsurprisingly called just Device Emulator. It emulates a real board - MINI2440 based on Samsung S3C2440 SoC which is an ancient ARMv4 CPU. Turns out, this is the same SoC that's used in OpenMoko so there is an old fork of QEMU with the support for most of the peripherals. So the choice of the platform seemed a no-brainer.

So I took the QEMU fork supporting MINI2440 and tried to adapt it to running the unmodified Windows Mobile images from Microsoft. Needless to say, I made sure the images are placed into memory at the correct addresses but the code seemed to crash spontaneously and never got past enabling MMU.

The first idea that comes to mind is of course to take the latest QEMU and see if it fixes anything. However, trying random changes until something works is actually quite a crappy approach.

So, I decided to single-step the execution and see what happens. QEMU provides the very useful GDB interface (which can be activated with the "-s -S" switches) for this purpose.

Caches are hard.

The first surprise came from the bootloader code. Before launching the kernel, Windows CE bootloader disables the MMU. At this moment QEMU crashes spectacularly. Initially I tried hacking around the issue by adding the code to translate the addresses into the "exec-all.h". However, it didn't solve the problem and the heuristic started looking too complex which suggested I'm on the wrong way.

I started thinking of and realized that disabling MMU is a tricky thing because on ARM Program Counter is usually ahead of the current instruction so the CPU has to fetch a couple instructions ahead. So we have a caching problem. In this Windows CE code, there is a NOP instruction between disabling the MMU and jumping to the new PA from a register. The hypothesis was that QEMU did not fetch the needed instructions correctly and it was necessary to add a translation cache for them. The reality was more funny.

As it turned out, QEMU contained a hack to cache a couple instructions because... because Linux kernel for PXA270 had the same problem in the opposite scenario - when the MMU was enabled. I decided to comment out the hack and it made the boot-up progress further. (PXA Cache Hack).

Stacks are not easy either.

Next thing I know is that soon after enabling the MMU the code crashes when trying to access the stack at a very peculiar virtual address of 0xffffce70. I examined the QEMU translation code and found out that it correctly locates the physical address but the permission bits are incorrect. I decided to force it to return RW access for the particular address range and voila - Windows Mobile boots to Home Screen successfully. (Hack to force RW stack).

Windows Mobile 5.0 Smartphone Edition on QEMU on Mac OS X.
Now that everything seemed to boot, I decided to take another look at the MMU issue and fix it properly. The first idea I had was to compare ARM920T (ARMv4) and ARM1136 (ARMv5) page table formats. It is worth noting that ARMv4 did not have TEX remap bits, and also last level page tables had different type (last two bits). It turned out that QEMU (probably as real ARM CPUs) supported all types of pages, even ARMv4 on ARMv6 target, and TEX/caching bits were simply ignored. After careful examination of the code and all the page table parsing I found a typo that was already fixed in the upstream QEMU (MMU Typo).

You can grab the cleaner tree without the ugly intermediate hacks: https://github.com/astarasikov/qemu/tree/wm5 . If you need to have a look at the older WIP stuff, it's at https://github.com/astarasikov/qemu/tree/wm5_with_hacks .

Running Pocket PC version.

Preparing PocketPC Image.

Windows Mobile Standard aka Smartphone comes shipped as the NOR flash image and it is XIP (Execute-in-Place). Pocket PC Image comes in a different form. It comes as an image already relocated to the RAM so we need to launch it directly via the "-kernel" parameter. We need to create the NOR memory image for it. We can use the smartphone image as a base, but an empty 32MB file should work as well.

  • Grab the "romtools" scripts at github: https://github.com/pinkavaj/romtools
  • python b000ff-to-bin.py ../ppc_50/_208PPC_USA_bin
  • dd if=_208PPC_USA_bin-80070000-81491ed0.bin of=SPHIDPI_BIN bs=$((0x30000)) seek=1 conv=notrunc
  • ./arm-softmmu/qemu-system-arm -show-cursor -m 256 -M mini2440 -serial stdio -kernel ../wm5/50_ppc/_208PPC_USA_bin-80070000-81491ed0.bin  -pflash ../wm5/50_ppc/SPHIDPI_BIN

This one did not boot and kept hanging at the boot splash screen. The host CPU usage was high and it indicated there was some busy activity inside the VM like an IRQ storm. After hooking up the debugger it turned out that the VM was hanging in two busy loops.

One of them was in the Audio driver - during the splash screen Windows plays a welcome sound. This was worked around by setting the "TX FIFO Ready" status in the sound codec. The second freeze was in the touchscreen driver but that looks like a MINI2440 emulation bug - once the touchscreen is disabled, the workqueue timer in qemu is permanently disabled and the status bits are not updated properly. I commented out the routine which disabled the touchscren controller since it's a VM anyway (not a real device where it would have a power-saving impact).
Obligatory screensot: Windows Mobile 5.0 Pocket PC

Retarded idea.

I was fantasizing about adding a logic to QEMU which would detect if emulation was stuck invoking MMIO device handler in a loop and fuzzing the returned register value until it was unstuck. While not very practical, one could reverse-engineer the register bits blindly this way.

Further ideas.

I don't think I'll invest more time into this project because there's little value but I'm considering developing an Android port of this QEMU fork just for the fun of it. Perhaps a better option would be to emulate a newer SoC such as Qualcomm 8250 and run an ARMv7 image from HTC HD2.

It's quite sad that most if not all Android phones ship with HYP mode (Hypervisor) disabled so KVM is a no-go. On the other hand, allowing to run custom hypervisors opens up a pandora box of security issues so it might be a good decision. Luckily for us who like to tinker with hardware, most TrustZone implementations also contain exploitable bugs so there are some possibilities to uncover the potential :)

Today, many companies working with embedded SoCs, seek GPU virtualization solutions. The demand is particularly high in the automotive sector where people are forced to use eclectic mixes of relatively old SoCs and retaining binary compatibility. So it would be interesting to prototype a solution similar to Intel's GVT.

I think it would be nice to use Qualcomm Adreno GPU as an emulation target. It is relatively well reverse-engineered - there exists Mesa FreeDreno driver for it, Qualcomm commits patches to the upstream KMS driver and most importantly the GPU ISA is similar to older AMD Radeon GPUs which is extensively documented by AMD. Besides, a similar GPU is used in Xbox 360 so one more place to learn is Xenia emulator which simulates the Xbox GPU.

Wednesday, September 28, 2016

Trying out and debugging Open-Source OpenCL on AMD Radeon RX480

Introduction.

I have decided to check out the state of OpenCL support for AMD GPUs in Linux Open-Source stack. In this post I will describe the experience of debugging issues across the whole stack trying to get a few basic apps up and running.

I have the AMD RX480 GPU which supports GCN 8.0.1 instruction set and has the code name "polaris10". At the time of writing, both Linux kernel and the Mesa in Debian were too old and did not support this GPU. Besides, as we will see later, the OpenCL in Mesa does not work out of the box and we would need to learn to build it from source anyway.

Building the software.

A page at FreeDesktop website has some relatively outdated instructions but the general approach is the same. https://dri.freedesktop.org/wiki/GalliumCompute/
You will need to install a relatively fresh kernel (I built Linux 4.7.0-rc7). I also installed the polaris10 firmware manually but now it seems to be shipped in Linux/Debian.

I'm posting the steps I went through to build LLVM and Mesa with OpenCL support. After writing it I realized that perhaps everything here is redundant I should write a repo manifest to clone everything with one command.

Getting the sources.

I'm also posting the folder tree and git hashes just in case.

Build CLC

CLC is the runtime library for the OpenCL. It contains code that is compiled to LLVM bitcode and linked against your apps. It provides intrinsics and functions for the functions defined by the OpenCL standard such as "get_global_id" and "mad". In case you're wondering, CUDA works exactly the same way and the binary SDK from NVIDIA ships the relevant bitcodes (and you can disassemble them with llvm-dis if you're interested).

Actually, some of the definitions failed to compile with latest LLVM and I needed to add the explicit type casting. You can use the patch (from this github gist https://gist.github.com/astarasikov/9f00dee718f217c6a9715510dc09d300) and apply it on top of 88b82a6f70012a903b10dfc1e2304d3ef2e76dbc to fix it.

git clone https://github.com/llvm-mirror/libclc.git
./configure.py -g ninja --with-llvm-config=/home/alexander/Documents/workspace/builds/llvm/build/bin/llvm-config
ninja
ninja install

Get LLVM Sources.

mkdir -p ~/Documents/workspace/builds/llvm/
I have a useful script to check out the latest llvm, clang and libcxx. https://gist.github.com/astarasikov/a2e9287a34381f680d58

Get LLVM ROC branch.

This is optional. I used the ROC branch initially because I thought it would fix the issue with codegen (FLAT instructions) but it did not and otherwise it seems to behave identical to vanilla LLVM.

git remote add roc https://github.com/RadeonOpenCompute/llvm.git
git checkout roc/amd-common
git fetch roc
cd tools/clang/
git remote add roc https://github.com/RadeonOpenCompute/clang.git
git fetch roc
git checkout roc/amd-common

List of git hashes:
  • llvm - 1819637 Merge branch amd-master into amd-common
  • llvm/tools/clang/tools/extra - 079dd6a [clang-rename] Add comment after namespace closing
  • llvm/tools/clang - f779a93 Merge branch amd-master into amd-common
  • llvm/projects/libcxx - d979eed Fix Bug 30240 - std::string: append(first, last) error when aliasing.
  • llvm/projects/compiler-rt - 5a27c81 asan: allow __asan_{before,after}_dynamic_init without registered globals
  • llvm/projects/libcxxabi - 9f08403 [lit] Replace print with lit_config.note().

Build LLVM.

#create the build directory
mkdir -p ~Documents/workspace/builds/llvm/build/
cd cd ~Documents/workspace/builds/llvm/build/

cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DLLVM_TARGETS_TO_BUILD="AMDGPU;X86" -DLLVM_INCLUDE_TESTS=OFF -DLLVM_VERSION_SUFFIX="" ../llvm/ -DBUILD_SHARED_LIBS=ON
ninja

#add the symlink to make LLVM pick up internal headers when building Mesa
cd ~Documents/workspace/builds/llvm/build/include
ln -s $(echo $PWD/../tools/clang/include/clang) clang

In principle, it is not necessary to install llvm, it's enough to add it to the PATH and clang will pick up the necessary libraries itself.

Build MESA

Before building Mesa, we need to prepend the path to the "bin" directory of our custom LLVM build to the PATH variable so that clang is picked up as the compiler. I also had to add a symlink to the source code in the build directory because some headers were not getting picked up but I think there's a cleaner way to add it to CFLAGS.

I was using Mesa git 0d7ec8b7d0554382d5af6c59a69ca9672d2583cd.

git clone git://anongit.freedesktop.org/mesa/mesa

The configure.ac seems to have the incorrect regex for getting LLVM version which causes compilation to fail with the latest LLVM (4.0.0). Here is a patch to fix it and also force the radeonsi chip class to VI (Volcanic Islands). The latter is not strictly necessary but I used it during debugging to ensure the corect code path is always hit. Grab the diff at https://gist.github.com/astarasikov/6146dbbd07d0dc3bea2ee6a8b979eaa8

export PATH=~/Documents/workspace/builds/llvm/build/bin:$PATH
cd ~/Documents/workspace/builds/mesa/mesa/
make clean

./autogen.sh --enable-texture-float  --enable-dri3 --enable-opencl --enable-opencl-icd --enable-sysfs --enable-gallium-llvm --with-gallium-drivers=radeonsi --prefix=/opt/my_mesa --with-egl-platforms=drm --enable-glx-tls
make install

Now, before running any OpenCL application, we'll need to override the library path to point to our custom Mesa.
export LD_LIBRARY_PATH=/opt/my_mesa/lib:/home/alexander/Documents/workspace/builds/llvm/build/lib

Useful Links

AMD Presentations about GCN ISA.

GCN ISA Manual

Intel OpenCL Samples

AMD App SDK

I've used the older version because I thought it was in the tarball and the latest one seemed to be an executable file (though actually it was a tarball with an executable script).

Trying it out.

Assertion in LLVM codegen.

So we can try running any OpenCL application now and we'll hit the assertion.
./BitonicSort -p Clover

After reading the source code and LLVM git log it turns out.
So what can we do? Let's try to see if we can force the LLVM to emit the abovementioned "FLAT Atomics".
Turns out the code is already there and there is a flag, which is enabled by default when the LLVM target is "AMD HSA".

Now, let's think what could the possible limitations of this approach be?

FLAT Instructions

Let's see the description from the GCN Manual. "Flat memory instructions let the kernel read or write data in memory, or perform atomic operations on data already in memory. These operations occur through the texture L2 cache.".

I have not fully understood the difference between these new FLAT instructions and older MUBUF/MTBUF. As it seems to me, before GCN3 it used to be the case that different address spaces (global, private etc) could only be accessed through different instructions and FLAT instructions allow accessing any piece of GPU (and host) memory by the virtual address (hence the name since virtual address space is flat). So it seems that as long as the kernel driver sets up GPU page tables correctly and we're only using the memory allocated through OpenCL API we should be fine.

FWIW, let's try running the code and see if it works.

LLVM Flat Instructions hack

As it has been mentioned before, we need to force LLVM to generate the "FLAT" instructions to access memory in GPU code.
A proper way to fix it would be to add the options to the Mesa source code into the location where it instantiates the LLVM compiler (clang).
To save us some time we can hack the relevant piece of the code generator in the LLVM directly (see the patch at the end of the article).

Trying it out again.

I have tried the following samples from the AMD App SDK and Intel Samples. I didn't want to write sample code myself and besides OpenCL claims to be portable so running the code from other IHVs or SDKs should be a good stress test of the toolchain.

AMD

  • BlackScholesDP
  • BinominalOption
  • BitonicSort
  • BinarySearch
  • BufferBandwidth

Intel

  • Bitonic Sort
  • Montecarlo
  • God Rays
All of them worked and the "verification" step which computes the data on the host and compares to the GPGPU result has passed! You can take a look at the screenshots and logs at the end of the post.

The "God Rays" demo even produces the convinceable picture.

Running Protonect (libfreenect2)

One of the apps I'm particularly interested in running is Protonect which is the demo app for libfreenect2, the open-source driver for Microsoft Kinect v2 RGBD ToF camera. Let's build it with OpenCL support and invoke it from the shell via "./Protonect cl".

And we're hitting an assertion!
'llvm::AsmPrinter::~AsmPrinter(): Assertion `!DD && Handlers.empty() && "Debug/EH info didn't get finalized"'.
Since it happens in a destructor for the purposes of testing we can simply comment it out because the worst thing that could happen is a memory leak.

Let's try running it again!
And we're hitting another error. This time it is "unsupported initializer for address space". Okay, let's debug it. First, let's grep the string verbatim.

Good, we're hitting in only one place. Now, debugging message is not particularly helpful because it does not give us the precise location or the name of the variable which caused this error, it only shows the prototype of the function. Let's try to just print the address space type and try to find out what might be causing it. (see the patch at the end of the article).

What's this? Good, good, we see that the address space enum value is "0". Looking it up reveals that it's a private address space. Okay, what could cause the private address space to be used? Function-local arrays! Let's look for one! Great, here it is, see the "const float gaussian[9]" array in the "filterPixelStage1" function? Let's try commenting it out and replacing the "gaussian[j]" access with some constant (let it be 1.0f since it's a weighed average, otherwise if we choose 0.0f we'll see nothing in the output buffer). Yay, it worked!

Since we could not longer use the private address space we would need to find a way to get rid of the locally-declared array.OpenCL 1.1 does not support the static storage class.
One option would be to add another kernel argument and just pass the array there.
It might be slower depending though because it would get placed into the slower region of cache-coherent memory.

Another option would be to compute the values in-place and since it's just a 3x3 convolution kernel for a gaussian blur it's easy to come up with a crude approximation formula which is what I've done (see the patch at the end of the post).

Sor far, Protonect works. Performance is subpar but not completely awful. It's around 4 times slower than the NVIDIA GTX970 with the binary driver (in most real-world scenarios GTX970 and RX480 are quite close). I think that with a bit of profiling it can be sped up drastically. In fact, one of the contributing factors might be that my display is connected to the Intel iGPU and PCIE bandwidth is saturated by the framebuffer blitting (it's 4K after all). I'll try with OpenGL running on Radeon next time.

RadeonOpenCompute Initiative.

Since OpenCL is crap and CUDA is all the hype, it makes now wonder many people want to run CUDA on all GPUs.
In principle, to run CUDA on non-NVIDIA hardware, one needs to implement the CUDA runtime and the lowering from the NVPTX intermediate language to the native ISA. The challenging part is actually building the source code because one would need the proprietary headers from the NVIDIA toolchain. One could create the fake headers to mimic the CUDA toolchain but it is a shady legal area especially when it comes down to predefined macros and constants.

In fact, there's not much need to emit NVPTX at all and you can lower straight to the ISA. What AMD have done to work around legal issues is they've come up with their own language called "HIP" which mimics most of CUDA design but keywords and predefined macros are named differently. Therefore, porting is straightforward even with a search/replace function, but there's an automated translator based on clang.

GCN old RX480 vs R390

It looks interesting that Polaris10 (RX480) seems to use an older version of the ISA (8.0.1) while the older R390 uses 8.0.3. Not sure if it's a bug in documentation. However, it's interesting that AMD GPUs consist of multiple units (such as video encoder/decoder) and they seem to be picked up in arbitrary order when designing a new chip.

HIP without Mesa.

HIP/ROC ships its own runtime, and since all memory access is done through DRM via ioctls to "/dev/drm/cardX", Mesa is tecnhically not needed to implement OpenCL or whatever compute API.

However, the open question is buffer sharing between different APIs. I have came across this issue before when dealing with an Intel GPU. The good news is that on Linux there's an EGL extension to export the DRM buffer object (BO) from the GLES context (but not from GLX). You can read the old article about it here: https://allsoftwaresucks.blogspot.com/2014/10/abusing-mesa-by-hooking-elfs-and-ioctl.html

Outstanding issues.

Slow Compilation time for Kernels.

While running OpenCL code samples and Protonect, I've noticed that they take several seconds to start up compared to immediate start when using Intel OpenCL driver (Beignet). I think it might get caused by LLVM. It would be a good idea to profile everything using the Linux "perf" tool.

Private Address Spaces.

As we've seen with libfreenect2 kernels, private address spaces do not work. It is necessary to figure out if they are not supported at all or it is just a simple bug.
Until this is resolved it effectively renders a lot of GPGPU applications unusable.

LLVM Assertions (Bugs).

As mentioned before, running Protonect and compiling libfreenect2 kernels yields the following assertion:
'llvm::AsmPrinter::~AsmPrinter(): Assertion `!DD && Handlers.empty() && "Debug/EH info didn't get finalized"'.

While trying to run one of the OpenCL samples I hit yet another assertion:
'clang::Sema::~Sema(): Assertion `DelayedTypos.empty() && "Uncorrected typos!"' failed.

Conslusions.

As we can see, the OpenCL support is still not mature with the Mesa/Clover stack.
However given the RadeonOpenCompute initiative I'm sure most of the missing features will be in place soon.
So far I'm glad that the just-releases GPU support is on par with older models and working around the issues was not too hard for me.
I've also satisfied part of my desire to understand the interaction between different components involved in the OpenCL/Compute pipeline.

I think for a start I will look at the LLVM assertions and see if I can debug them or prepare the test cases to submit upstream.
Next up I'll be trying out HIP to build some CUDA samples.

One idea I had in mind for quite some time was virtualizing a mobile GPU. I think Qualcomm Adreno is a good target because it's relatively well supported by the FreeDreno driver and the ISA is similar to other AMD chips. The plan is to add the ISA decoder and MMIO space emulation to QEMU so that it can be used both in KVM on ARM and in emulation mode on Intel. Of course, the most nerdy way to do it would be to make a translator from the guest ISA to the host ISA. But for a start we could reuse the Virgil driver as a target.
I think it would be a very useful thing for running legacy applications in a virtualized environment (such as Windows RT or automotive IVI systems) and could aid in security engineering.
Hopefully I will have enough motivation and time to do it before I'm bounded by an NDA :)

Latest update!

Also, check out the latest news! Looks like Mesa has now switched to using the HSA ABI by default which means that the hack for the FLAT instructions will not be needed with more recent versions and they will be enabled automagically! https://www.phoronix.com/scan.php?page=news_item&px=RadeonSI-HSA-Compute-Shaders

I started trying OpenCL on RX480 around 2 weeks ago, then I spent 1 week debugging and 1 week I was away. Meanwhile some changes seem to have landed upstream and some of the hacking described here may be redundant. I urge you to check with the latest source code but I decided to keep this post just to describe the process of debugging I went through.

Extra: Logs and Screenshots.

God Rays from Intel OpenCL samples.
Protonect running on AMD FOSS OpenCL stack.

LLVM Force FLAT instructions hack.

diff --git a/lib/Target/AMDGPU/AMDGPUSubtarget.cpp b/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
index 3c4b5e7..f6d500c 100644
--- a/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
+++ b/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
@@ -46,7 +46,7 @@ AMDGPUSubtarget::initializeSubtargetDependencies(const Triple &TT,
   // disable it.

   SmallString<256> FullFS("+promote-alloca,+fp64-denormals,+load-store-opt,");
-  if (isAmdHsaOS()) // Turn on FlatForGlobal for HSA.
+  if (1 || isAmdHsaOS()) // Turn on FlatForGlobal for HSA.
     FullFS += "+flat-for-global,+unaligned-buffer-access,";
   FullFS += FS;

Patch for Libfreenect2

@@ -102,8 +102,8 @@ void kernel processPixelStage1(global const short *lut11to16, global const float
 /*******************************************************************************
  * Filter pixel stage 1
  ******************************************************************************/
-void kernel filterPixelStage1(global const float3 *a, global const float3 *b, global const float3 *n,
-                              global float3 *a_out, global float3 *b_out, global uchar *max_edge_test)
+void kernel filterPixelStage1(__global const float3 *a, __global const float3 *b, __global const float3 *n,
+                              __global float3 *a_out, __global float3 *b_out, __global uchar *max_edge_test)
 {
   const uint i = get_global_id(0);

@@ -113,7 +113,7 @@ void kernel filterPixelStage1(global const float3 *a, global const float3 *b, gl
   const float3 self_a = a[i];

   const float3 self_b = b[i];

-  const float gaussian[9] = {GAUSSIAN_KERNEL_0, GAUSSIAN_KERNEL_1, GAUSSIAN_KERNEL_2, GAUSSIAN_KERNEL_3, GAUSSIAN_KERNEL_4, GAUSSIAN_KERNEL_5, GAUSSIAN_KERNEL_6, GAUSSIAN_KERNEL_7, GAUSSIAN_KERNEL_8};
+  //const float gaussian[9] = {GAUSSIAN_KERNEL_0, GAUSSIAN_KERNEL_1, GAUSSIAN_KERNEL_2, GAUSSIAN_KERNEL_3, GAUSSIAN_KERNEL_4, GAUSSIAN_KERNEL_5, GAUSSIAN_KERNEL_6, GAUSSIAN_KERNEL_7, GAUSSIAN_KERNEL_8};

   if(x < 1 || y < 1 || x > 510 || y > 422)
   {
@@ -155,7 +155,9 @@ void kernel filterPixelStage1(global const float3 *a, global const float3 *b, gl
         const int3 c1 = isless(other_norm * other_norm, threshold);

         const float3 dist = 0.5f * (1.0f - (self_normalized_a * other_normalized_a + self_normalized_b * other_normalized_b));
-        const float3 weight = select(gaussian[j] * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);
+        //const float3 weight = 1.0f;//select(gaussian[j] * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);
+        const float gj = exp(0.6 - (0.3 * (abs(yi) + abs(xi))));
+        const float3 weight = select(gj * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);

LLVM Patch for assertion in AsmPrinter destructor.

diff --git a/lib/CodeGen/AsmPrinter/AsmPrinter.cpp b/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
index 0fed4e9..0d63a2a 100644
--- a/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
+++ b/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
@@ -114,6 +114,7 @@ AsmPrinter::AsmPrinter(TargetMachine &tm, std::unique_ptr<MCStreamer> Streamer)
 }

 AsmPrinter::~AsmPrinter() {
+       return;
   assert(!DD && Handlers.empty() && "Debug/EH info didn't get finalized");

   if (GCMetadataPrinters) {

Patch for debugging Address Space issues.

diff --git a/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 682157b..d2a5c4a 100644
--- a/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -766,6 +766,8 @@ SDValue AMDGPUTargetLowering::LowerGlobalAddress(AMDGPUMachineFunction* MFI,
     unsigned Offset = MFI->allocateLDSGlobal(DL, *GV);
     return DAG.getConstant(Offset, SDLoc(Op), Op.getValueType());
   }
+  default:
+       printf("%s: address space type=%d\n", __func__, G->getAddressSpace());
   }

   const Function &Fn = *DAG.getMachineFunction().getFunction();

OpenCL Bandwidth Test (AMD App SDK)

Intel (Beignet GPGPU Driver)

Platform found : Intel

Device  0            Intel(R) HD Graphics Haswell GT2 Desktop
Build:               release
GPU work items:      32768
Buffer size:         33554432
CPU workers:         1
Timing loops:        20
Repeats:             1
Kernel loops:        20
inputBuffer:         CL_MEM_READ_ONLY 
outputBuffer:        CL_MEM_WRITE_ONLY 

Host baseline (naive):

Timer resolution     256.11  ns
Page fault           531.44  ns
CPU read             15.31 GB/s
memcpy()             15.54 GB/s
memset(,1,)          26.54 GB/s
memset(,0,)          27.06 GB/s


AVERAGES (over loops 2 - 19, use -l for complete log)
--------


1. Host mapped write to inputBuffer
 ---------------------------------------|---------------
 clEnqueueMapBuffer -- WRITE (GBPS)     | 9513.290
 ---------------------------------------|---------------
 memset() (GBPS)                        | 24.746
 ---------------------------------------|---------------
 clEnqueueUnmapMemObject() (GBPS)       | 6176.168


2. GPU kernel read of inputBuffer
 ---------------------------------------|---------------
 clEnqueueNDRangeKernel() (GBPS)        | 38.225

 Verification Passed!


3. GPU kernel write to outputBuffer
 ---------------------------------------|---------------
 clEnqueueNDRangeKernel() (GBPS)        | 26.198


4. Host mapped read of outputBuffer
 ---------------------------------------|---------------
 clEnqueueMapBuffer -- READ (GBPS)      | 9830.400
 ---------------------------------------|---------------
 CPU read (GBPS)                        | 15.431
 ---------------------------------------|---------------
 clEnqueueUnmapMemObject() (GBPS)       | 10485.760

 Verification Passed!


Passed!

AMD Radeon (OpenCL) 

Platform found : Mesa


Device  0            AMD POLARIS10 (DRM 3.2.0 / 4.7.0-rc7-meow+, LLVM 4.0.0)
Build:               release
GPU work items:      32768
Buffer size:         33554432
CPU workers:         1
Timing loops:        20
Repeats:             1
Kernel loops:        20
inputBuffer:         CL_MEM_READ_ONLY 
outputBuffer:        CL_MEM_WRITE_ONLY 

Host baseline (naive):

Timer resolution     256.12  ns
Page fault           538.31  ns
CPU read             12.19 GB/s
memcpy()             11.38 GB/s
memset(,1,)          20.93 GB/s
memset(,0,)          22.98 GB/s


AVERAGES (over loops 2 - 19, use -l for complete log)
--------


1. Host mapped write to inputBuffer
 ---------------------------------------|---------------
 clEnqueueMapBuffer -- WRITE (GBPS)     | 7586.161
 ---------------------------------------|---------------
 memset() (GBPS)                        | 6.369
 ---------------------------------------|---------------
 clEnqueueUnmapMemObject() (GBPS)       | 12822.261


2. GPU kernel read of inputBuffer
 ---------------------------------------|---------------
 clEnqueueNDRangeKernel() (GBPS)        | 113.481

 Verification Passed!


3. GPU kernel write to outputBuffer
 ---------------------------------------|---------------
 clEnqueueNDRangeKernel() (GBPS)        | 105.898


4. Host mapped read of outputBuffer
 ---------------------------------------|---------------
 clEnqueueMapBuffer -- READ (GBPS)      | 9.559
 ---------------------------------------|---------------
 CPU read (GBPS)                        | 17.179
 ---------------------------------------|---------------
 clEnqueueUnmapMemObject() (GBPS)       | 4060.750

 Verification Passed!


Passed!