Wednesday, September 28, 2016

Trying out and debugging Open-Source OpenCL on AMD Radeon RX480

Introduction.

I have decided to check out the state of OpenCL support for AMD GPUs in the Linux open-source stack. In this post I will describe my experience debugging issues across the whole stack while trying to get a few basic apps up and running.

I have the AMD RX480 GPU, which supports the GCN 8.0.1 instruction set and has the code name "polaris10". At the time of writing, both the Linux kernel and Mesa in Debian were too old and did not support this GPU. Besides, as we will see later, the OpenCL in Mesa does not work out of the box, and we would need to learn to build it from source anyway.

Building the software.

A page on the FreeDesktop website has some relatively outdated instructions, but the general approach is the same: https://dri.freedesktop.org/wiki/GalliumCompute/
You will need to install a relatively fresh kernel (I built Linux 4.7.0-rc7). I also installed the polaris10 firmware manually, but it now seems to be shipped in Linux/Debian.
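If you want to verify the firmware is in place, the files live under /lib/firmware/amdgpu/ in the linux-firmware tree; this quick check is my own suggestion, not a required step:

ls /lib/firmware/amdgpu/ | grep polaris10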

I'm posting the steps I went through to build LLVM and Mesa with OpenCL support. After writing it all down, I realized that perhaps everything here is redundant and I should write a repo manifest to clone everything with one command.

Getting the sources.

I'm also posting the folder tree and git hashes just in case.

Build CLC

CLC is the runtime library for OpenCL. It contains code that is compiled to LLVM bitcode and linked against your apps. It provides implementations of the built-in functions defined by the OpenCL standard, such as "get_global_id" and "mad". In case you're wondering, CUDA works exactly the same way, and the binary SDK from NVIDIA ships the relevant bitcode files (you can disassemble them with llvm-dis if you're interested).
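To make this concrete, here is a trivial kernel of my own that uses both of those built-ins; libclc is what provides their implementations when Clover compiles it:

kernel void fmad(global const float *a, global const float *b, global float *out)
{
    size_t i = get_global_id(0);    /* work-item index, implemented by libclc */
    out[i] = mad(a[i], b[i], 1.0f); /* fused multiply-add from the OpenCL standard */
}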

Actually, some of the definitions failed to compile with the latest LLVM, and I needed to add explicit type casts. You can take the patch from this GitHub gist https://gist.github.com/astarasikov/9f00dee718f217c6a9715510dc09d300 and apply it on top of commit 88b82a6f70012a903b10dfc1e2304d3ef2e76dbc to fix it.

git clone https://github.com/llvm-mirror/libclc.git
./configure.py -g ninja --with-llvm-config=/home/alexander/Documents/workspace/builds/llvm/build/bin/llvm-config
ninja
ninja install

Get LLVM Sources.

mkdir -p ~/Documents/workspace/builds/llvm/
I have a useful script to check out the latest llvm, clang and libcxx. https://gist.github.com/astarasikov/a2e9287a34381f680d58

Get LLVM ROC branch.

This is optional. I used the ROC branch initially because I thought it would fix the issue with codegen (FLAT instructions), but it did not, and otherwise it seems to behave identically to vanilla LLVM.

git remote add roc https://github.com/RadeonOpenCompute/llvm.git
git fetch roc
git checkout roc/amd-common
cd tools/clang/
git remote add roc https://github.com/RadeonOpenCompute/clang.git
git fetch roc
git checkout roc/amd-common

List of git hashes:
  • llvm - 1819637 Merge branch amd-master into amd-common
  • llvm/tools/clang/tools/extra - 079dd6a [clang-rename] Add comment after namespace closing
  • llvm/tools/clang - f779a93 Merge branch amd-master into amd-common
  • llvm/projects/libcxx - d979eed Fix Bug 30240 - std::string: append(first, last) error when aliasing.
  • llvm/projects/compiler-rt - 5a27c81 asan: allow __asan_{before,after}_dynamic_init without registered globals
  • llvm/projects/libcxxabi - 9f08403 [lit] Replace print with lit_config.note().

Build LLVM.

#create the build directory
mkdir -p ~/Documents/workspace/builds/llvm/build/
cd ~/Documents/workspace/builds/llvm/build/

cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DLLVM_TARGETS_TO_BUILD="AMDGPU;X86" -DLLVM_INCLUDE_TESTS=OFF -DLLVM_VERSION_SUFFIX="" ../llvm/ -DBUILD_SHARED_LIBS=ON
ninja

#add the symlink to make LLVM pick up internal headers when building Mesa
cd ~/Documents/workspace/builds/llvm/build/include
ln -s "$PWD/../tools/clang/include/clang" clang

In principle it is not necessary to install LLVM; it's enough to add the build's bin directory to PATH, and clang will pick up the necessary libraries itself.

Build Mesa

Before building Mesa, we need to prepend the path to the "bin" directory of our custom LLVM build to the PATH variable so that clang is picked up as the compiler. I also had to add a symlink to the source code in the build directory because some headers were not getting picked up, but I think there's a cleaner way to do it via CFLAGS.

I was using Mesa git 0d7ec8b7d0554382d5af6c59a69ca9672d2583cd.

git clone git://anongit.freedesktop.org/mesa/mesa

The configure.ac seems to have an incorrect regex for extracting the LLVM version, which causes compilation to fail with the latest LLVM (4.0.0). Here is a patch to fix it and also force the radeonsi chip class to VI (Volcanic Islands). The latter is not strictly necessary, but I used it during debugging to ensure the correct code path is always hit. Grab the diff at https://gist.github.com/astarasikov/6146dbbd07d0dc3bea2ee6a8b979eaa8

export PATH=~/Documents/workspace/builds/llvm/build/bin:$PATH
cd ~/Documents/workspace/builds/mesa/mesa/
make clean

./autogen.sh --enable-texture-float  --enable-dri3 --enable-opencl --enable-opencl-icd --enable-sysfs --enable-gallium-llvm --with-gallium-drivers=radeonsi --prefix=/opt/my_mesa --with-egl-platforms=drm --enable-glx-tls
make install

Now, before running any OpenCL application, we'll need to override the library path to point to our custom Mesa.
export LD_LIBRARY_PATH=/opt/my_mesa/lib:/home/alexander/Documents/workspace/builds/llvm/build/lib
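To check that the Clover/Mesa platform is actually the one being picked up, the clinfo tool (a separate package; my suggestion rather than part of the original setup) should now list it:

clinfo | grep -i platform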

Useful Links

AMD Presentations about GCN ISA.

GCN ISA Manual

Intel OpenCL Samples

AMD App SDK

I've used the older version because I thought only it shipped as a tarball, while the latest one seemed to be an executable file (though actually it was a tarball with an executable script).

Trying it out.

Assertion in LLVM codegen.

So, we can try running an OpenCL application now, and we hit an assertion:
./BitonicSort -p Clover

After reading the source code and the LLVM git log, it turns out that the backend only emits the FLAT instructions needed for these memory accesses when targeting HSA. So what can we do? Let's try to see if we can force LLVM to emit the abovementioned FLAT atomics.
It turns out the code is already there, behind a flag which is enabled by default when the LLVM target is "AMD HSA".

Now, let's think: what could the possible limitations of this approach be?

FLAT Instructions

Let's see the description from the GCN manual: "Flat memory instructions let the kernel read or write data in memory, or perform atomic operations on data already in memory. These operations occur through the texture L2 cache."

I have not fully understood the difference between these new FLAT instructions and the older MUBUF/MTBUF ones. As I understand it, before GCN3 different address spaces (global, private, etc.) could only be accessed through different instructions, while FLAT instructions allow accessing any piece of GPU (and host) memory by its virtual address (hence the name, since the virtual address space is flat). So it seems that as long as the kernel driver sets up the GPU page tables correctly and we only use memory allocated through the OpenCL API, we should be fine.

FWIW, let's try running the code and see if it works.

LLVM Flat Instructions hack

As mentioned before, we need to force LLVM to generate the "FLAT" instructions to access memory in GPU code.
A proper way to fix this would be to pass the relevant options in the Mesa source code at the location where it instantiates the LLVM compiler (clang).
To save time, we can instead hack the relevant piece of the code generator in LLVM directly (see the patch at the end of the article).
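For reference, the Mesa-side fix would roughly amount to passing the feature string when the target machine is created. This is only a sketch against the LLVM-C API; where exactly radeonsi builds its feature string is something I haven't tracked down:

#include <llvm-c/TargetMachine.h>

/* Sketch: enable FLAT-for-global via the target feature string.
 * "target" is an LLVMTargetRef looked up for the triple below. */
LLVMTargetMachineRef tm = LLVMCreateTargetMachine(
    target,
    "amdgcn--",             /* non-HSA triple used by radeonsi */
    "polaris10",            /* GPU processor name */
    "+flat-for-global",     /* the same feature the hack below forces on */
    LLVMCodeGenLevelDefault,
    LLVMRelocDefault,
    LLVMCodeModelDefault);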

Trying it out again.

I have tried the following samples from the AMD App SDK and the Intel samples. I didn't want to write sample code myself, and besides, OpenCL claims to be portable, so running code from other IHVs' SDKs should be a good stress test of the toolchain.

AMD

  • BlackScholesDP
  • BinomialOption
  • BitonicSort
  • BinarySearch
  • BufferBandwidth

Intel

  • Bitonic Sort
  • Montecarlo
  • God Rays
All of them worked, and the "verification" step, which computes the data on the host and compares it to the GPGPU result, passed! You can take a look at the screenshots and logs at the end of the post.

The "God Rays" demo even produces the convinceable picture.

Running Protonect (libfreenect2)

One of the apps I'm particularly interested in running is Protonect, the demo app for libfreenect2, the open-source driver for the Microsoft Kinect v2 RGBD ToF camera. Let's build it with OpenCL support and invoke it from the shell via "./Protonect cl".
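The build follows the usual libfreenect2 CMake flow; the ENABLE_OPENCL option name is from my memory of the project, so double-check it against the README:

cd libfreenect2
mkdir build && cd build
cmake .. -DENABLE_OPENCL=ON
make
./bin/Protonect cl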

And we're hitting an assertion!
'llvm::AsmPrinter::~AsmPrinter(): Assertion `!DD && Handlers.empty() && "Debug/EH info didn't get finalized"'.
Since it happens in a destructor, for the purposes of testing we can simply comment it out, because the worst thing that could happen is a memory leak.

Let's try running it again!
And we're hitting another error. This time it is "unsupported initializer for address space". Okay, let's debug it. First, let's grep for the string verbatim.

Good, we're hitting it in only one place. Now, the debug message is not particularly helpful because it does not give us the precise location or the name of the variable that caused this error; it only shows the prototype of the function. Let's just print the address space type and try to find out what might be causing it (see the patch at the end of the article).

What's this? Good, we see that the address space enum value is "0". Looking it up reveals that it's the private address space. Okay, what could cause the private address space to be used? Function-local arrays! Let's look for one! Great, here it is: see the "const float gaussian[9]" array in the "filterPixelStage1" function? Let's try commenting it out and replacing the "gaussian[j]" access with some constant (let it be 1.0f, since it's a weighted average; if we chose 0.0f we'd see nothing in the output buffer). Yay, it worked!

Since we can no longer use the private address space, we need to find a way to get rid of the locally-declared array. OpenCL 1.1 does not support the static storage class.
One option would be to add another kernel argument and just pass the array there (see the sketch below).
It might be slower, though, because the array would get placed into a slower region of cache-coherent memory.
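A minimal sketch of that option (my own example, not the actual libfreenect2 kernel): the taps live in the constant address space and are passed from the host through a small buffer, so no private array is needed:

kernel void weigh9(global const float *in, global float *out,
                   constant float *gaussian /* 9 taps from a host-side buffer */)
{
    const size_t i = get_global_id(0);
    float acc = 0.0f;
    for (int j = 0; j < 9; ++j)
        acc = mad(gaussian[j], in[i * 9 + j], acc); /* weighted sum, no private array */
    out[i] = acc;
}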

Another option would be to compute the values in place, and since it's just a 3x3 convolution kernel for a Gaussian blur, it's easy to come up with a crude approximation formula, which is what I've done (see the patch at the end of the post).

So far, Protonect works. Performance is subpar but not completely awful: it's around 4 times slower than the NVIDIA GTX970 with the binary driver (in most real-world scenarios the GTX970 and RX480 are quite close). I think that with a bit of profiling it can be sped up drastically. In fact, one of the contributing factors might be that my display is connected to the Intel iGPU and the PCIe bandwidth is saturated by the framebuffer blitting (it's 4K after all). I'll try with OpenGL running on the Radeon next time.

RadeonOpenCompute Initiative.

Since OpenCL is crap and CUDA is all the hype, it's no wonder many people want to run CUDA on all GPUs.
In principle, to run CUDA on non-NVIDIA hardware, one needs to implement the CUDA runtime and the lowering from the NVPTX intermediate language to the native ISA. The challenging part is actually building the source code because one would need the proprietary headers from the NVIDIA toolchain. One could create the fake headers to mimic the CUDA toolchain but it is a shady legal area especially when it comes down to predefined macros and constants.

In fact, there's not much need to emit NVPTX at all; you can lower straight to the ISA. What AMD have done to work around the legal issues is come up with their own language called "HIP", which mimics most of the CUDA design but names its keywords and predefined macros differently. Porting is therefore straightforward even with a search/replace, and there's also an automated translator based on clang.
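To give a flavor of how mechanical the renaming is, here is a hedged sketch of a complete HIP program (written against a recent HIP runtime; early HIP spelled the launch macro and thread indices differently):

#include <hip/hip_runtime.h>

__global__ void scale(float *buf, float k) /* __global__ survives as-is */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[i] *= k;
}

int main(void)
{
    float *buf;
    hipMalloc(&buf, 256 * sizeof(float));            /* was: cudaMalloc */
    /* was: scale<<<1, 256>>>(buf, 2.0f); */
    hipLaunchKernelGGL(scale, dim3(1), dim3(256), 0, 0, buf, 2.0f);
    hipDeviceSynchronize();                          /* was: cudaDeviceSynchronize */
    hipFree(buf);                                    /* was: cudaFree */
    return 0;
}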

GCN ISA: new RX480 vs. older R390

It is interesting that Polaris10 (RX480) seems to use an older version of the ISA (8.0.1) while the older R390 uses 8.0.3; I'm not sure if it's a bug in the documentation. However, AMD GPUs consist of multiple units (such as the video encoder/decoder), and their versions seem to be picked somewhat independently when designing a new chip.

HIP without Mesa.

HIP/ROC ships its own runtime, and since all memory access is done through DRM via ioctls on "/dev/dri/cardX", Mesa is technically not needed to implement OpenCL or any other compute API.
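As a minimal illustration of talking to the kernel driver directly, with no Mesa involved (my own snippet using libdrm; link with -ldrm):

#include <fcntl.h>
#include <stdio.h>
#include <xf86drm.h> /* libdrm */

int main(void)
{
    int fd = open("/dev/dri/card0", O_RDWR);
    drmVersionPtr v = drmGetVersion(fd); /* DRM_IOCTL_VERSION under the hood */
    printf("kernel driver: %s\n", v ? v->name : "(unknown)");
    drmFreeVersion(v);
    return 0;
}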

However, the open question is buffer sharing between different APIs. I had come across this issue before when dealing with an Intel GPU. The good news is that on Linux there's an EGL extension to export the DRM buffer object (BO) from a GLES context (but not from GLX). You can read my old article about it here: https://allsoftwaresucks.blogspot.com/2014/10/abusing-mesa-by-hooking-elfs-and-ioctl.html

Outstanding issues.

Slow Compilation time for Kernels.

While running the OpenCL code samples and Protonect, I've noticed that they take several seconds to start up, compared to the immediate start with the Intel OpenCL driver (Beignet). I think it might be caused by LLVM. It would be a good idea to profile everything using the Linux "perf" tool.
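For example, something along these lines should show whether the time goes into the LLVM libraries (my suggested invocation, assuming the binaries carry symbols):

perf record -g ./BitonicSort -p Clover
perf report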

Private Address Spaces.

As we've seen with the libfreenect2 kernels, private address spaces do not work. It is necessary to figure out whether they are not supported at all or whether it is just a simple bug.
Until this is resolved, it effectively renders a lot of GPGPU applications unusable.
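A minimal test case that should reproduce the error, reduced from the libfreenect2 pattern (my own reduction, untested):

kernel void repro(global float *out)
{
    /* an initialized function-local array lands in the private address space */
    const float lut[3] = { 0.25f, 0.5f, 0.25f };
    const size_t i = get_global_id(0);
    out[i] = lut[i % 3];
}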

LLVM Assertions (Bugs).

As mentioned before, running Protonect and compiling libfreenect2 kernels yields the following assertion:
'llvm::AsmPrinter::~AsmPrinter(): Assertion `!DD && Handlers.empty() && "Debug/EH info didn't get finalized"'.

While trying to run one of the OpenCL samples I hit yet another assertion:
'clang::Sema::~Sema(): Assertion `DelayedTypos.empty() && "Uncorrected typos!"' failed.

Conclusions.

As we can see, OpenCL support is still not mature in the Mesa/Clover stack.
However, given the RadeonOpenCompute initiative, I'm sure most of the missing features will be in place soon.
So far I'm glad that support for the just-released GPU is on par with older models, and working around the issues was not too hard for me.
I've also satisfied part of my desire to understand the interaction between different components involved in the OpenCL/Compute pipeline.

I think for a start I will look at the LLVM assertions and see if I can debug them or prepare the test cases to submit upstream.
Next up I'll be trying out HIP to build some CUDA samples.

One idea I have had in mind for quite some time is virtualizing a mobile GPU. I think the Qualcomm Adreno is a good target because it's relatively well supported by the FreeDreno driver and the ISA is similar to other AMD chips. The plan is to add an ISA decoder and MMIO space emulation to QEMU so that it can be used both with KVM on ARM and in emulation mode on Intel. Of course, the most nerdy way to do it would be to make a translator from the guest ISA to the host ISA, but for a start we could reuse the Virgil driver as a target.
I think it would be a very useful thing for running legacy applications in a virtualized environment (such as Windows RT or automotive IVI systems) and could aid in security engineering.
Hopefully I will have enough motivation and time to do it before I'm bound by an NDA :)

Latest update!

Also, check out the latest news! Looks like Mesa has now switched to using the HSA ABI by default which means that the hack for the FLAT instructions will not be needed with more recent versions and they will be enabled automagically! https://www.phoronix.com/scan.php?page=news_item&px=RadeonSI-HSA-Compute-Shaders

I started trying OpenCL on the RX480 around 2 weeks ago; then I spent 1 week debugging and 1 week I was away. Meanwhile, some changes seem to have landed upstream, and some of the hacking described here may be redundant. I urge you to check with the latest source code, but I decided to keep this post to describe the process of debugging I went through.

Extra: Logs and Screenshots.

God Rays from Intel OpenCL samples.
Protonect running on AMD FOSS OpenCL stack.

LLVM Force FLAT instructions hack.

diff --git a/lib/Target/AMDGPU/AMDGPUSubtarget.cpp b/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
index 3c4b5e7..f6d500c 100644
--- a/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
+++ b/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
@@ -46,7 +46,7 @@ AMDGPUSubtarget::initializeSubtargetDependencies(const Triple &TT,
   // disable it.

   SmallString<256> FullFS("+promote-alloca,+fp64-denormals,+load-store-opt,");
-  if (isAmdHsaOS()) // Turn on FlatForGlobal for HSA.
+  if (1 || isAmdHsaOS()) // Turn on FlatForGlobal for HSA.
     FullFS += "+flat-for-global,+unaligned-buffer-access,";
   FullFS += FS;

Patch for Libfreenect2

@@ -102,8 +102,8 @@ void kernel processPixelStage1(global const short *lut11to16, global const float
 /*******************************************************************************
  * Filter pixel stage 1
  ******************************************************************************/
-void kernel filterPixelStage1(global const float3 *a, global const float3 *b, global const float3 *n,
-                              global float3 *a_out, global float3 *b_out, global uchar *max_edge_test)
+void kernel filterPixelStage1(__global const float3 *a, __global const float3 *b, __global const float3 *n,
+                              __global float3 *a_out, __global float3 *b_out, __global uchar *max_edge_test)
 {
   const uint i = get_global_id(0);

@@ -113,7 +113,7 @@ void kernel filterPixelStage1(global const float3 *a, global const float3 *b, gl
   const float3 self_a = a[i];

   const float3 self_b = b[i];

-  const float gaussian[9] = {GAUSSIAN_KERNEL_0, GAUSSIAN_KERNEL_1, GAUSSIAN_KERNEL_2, GAUSSIAN_KERNEL_3, GAUSSIAN_KERNEL_4, GAUSSIAN_KERNEL_5, GAUSSIAN_KERNEL_6, GAUSSIAN_KERNEL_7, GAUSSIAN_KERNEL_8};
+  //const float gaussian[9] = {GAUSSIAN_KERNEL_0, GAUSSIAN_KERNEL_1, GAUSSIAN_KERNEL_2, GAUSSIAN_KERNEL_3, GAUSSIAN_KERNEL_4, GAUSSIAN_KERNEL_5, GAUSSIAN_KERNEL_6, GAUSSIAN_KERNEL_7, GAUSSIAN_KERNEL_8};

   if(x < 1 || y < 1 || x > 510 || y > 422)
   {
@@ -155,7 +155,9 @@ void kernel filterPixelStage1(global const float3 *a, global const float3 *b, gl
         const int3 c1 = isless(other_norm * other_norm, threshold);

         const float3 dist = 0.5f * (1.0f - (self_normalized_a * other_normalized_a + self_normalized_b * other_normalized_b));
-        const float3 weight = select(gaussian[j] * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);
+        //const float3 weight = 1.0f;//select(gaussian[j] * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);
+        const float gj = exp(0.6 - (0.3 * (abs(yi) + abs(xi))));
+        const float3 weight = select(gj * exp(-1.442695f * joint_bilateral_exp * dist), (float3)(0.0f), c1);

LLVM Patch for assertion in AsmPrinter destructor.

diff --git a/lib/CodeGen/AsmPrinter/AsmPrinter.cpp b/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
index 0fed4e9..0d63a2a 100644
--- a/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
+++ b/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
@@ -114,6 +114,7 @@ AsmPrinter::AsmPrinter(TargetMachine &tm, std::unique_ptr<MCStreamer> Streamer)
 }

 AsmPrinter::~AsmPrinter() {
+       return;
   assert(!DD && Handlers.empty() && "Debug/EH info didn't get finalized");

   if (GCMetadataPrinters) {

Patch for debugging Address Space issues.

diff --git a/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 682157b..d2a5c4a 100644
--- a/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -766,6 +766,8 @@ SDValue AMDGPUTargetLowering::LowerGlobalAddress(AMDGPUMachineFunction* MFI,
     unsigned Offset = MFI->allocateLDSGlobal(DL, *GV);
     return DAG.getConstant(Offset, SDLoc(Op), Op.getValueType());
   }
+  default:
+       printf("%s: address space type=%d\n", __func__, G->getAddressSpace());
   }

   const Function &Fn = *DAG.getMachineFunction().getFunction();

OpenCL Bandwidth Test (AMD App SDK)

Intel (Beignet GPGPU Driver)

Platform found : Intel

Device  0            Intel(R) HD Graphics Haswell GT2 Desktop
Build:               release
GPU work items:      32768
Buffer size:         33554432
CPU workers:         1
Timing loops:        20
Repeats:             1
Kernel loops:        20
inputBuffer:         CL_MEM_READ_ONLY 
outputBuffer:        CL_MEM_WRITE_ONLY 

Host baseline (naive):

Timer resolution     256.11  ns
Page fault           531.44  ns
CPU read             15.31 GB/s
memcpy()             15.54 GB/s
memset(,1,)          26.54 GB/s
memset(,0,)          27.06 GB/s


AVERAGES (over loops 2 - 19, use -l for complete log)
--------


1. Host mapped write to inputBuffer
 ---------------------------------------|---------------
 clEnqueueMapBuffer -- WRITE (GBPS)     | 9513.290
 ---------------------------------------|---------------
 memset() (GBPS)                        | 24.746
 ---------------------------------------|---------------
 clEnqueueUnmapMemObject() (GBPS)       | 6176.168


2. GPU kernel read of inputBuffer
 ---------------------------------------|---------------
 clEnqueueNDRangeKernel() (GBPS)        | 38.225

 Verification Passed!


3. GPU kernel write to outputBuffer
 ---------------------------------------|---------------
 clEnqueueNDRangeKernel() (GBPS)        | 26.198


4. Host mapped read of outputBuffer
 ---------------------------------------|---------------
 clEnqueueMapBuffer -- READ (GBPS)      | 9830.400
 ---------------------------------------|---------------
 CPU read (GBPS)                        | 15.431
 ---------------------------------------|---------------
 clEnqueueUnmapMemObject() (GBPS)       | 10485.760

 Verification Passed!


Passed!

AMD Radeon (OpenCL) 

Platform found : Mesa


Device  0            AMD POLARIS10 (DRM 3.2.0 / 4.7.0-rc7-meow+, LLVM 4.0.0)
Build:               release
GPU work items:      32768
Buffer size:         33554432
CPU workers:         1
Timing loops:        20
Repeats:             1
Kernel loops:        20
inputBuffer:         CL_MEM_READ_ONLY 
outputBuffer:        CL_MEM_WRITE_ONLY 

Host baseline (naive):

Timer resolution     256.12  ns
Page fault           538.31  ns
CPU read             12.19 GB/s
memcpy()             11.38 GB/s
memset(,1,)          20.93 GB/s
memset(,0,)          22.98 GB/s


AVERAGES (over loops 2 - 19, use -l for complete log)
--------


1. Host mapped write to inputBuffer
 ---------------------------------------|---------------
 clEnqueueMapBuffer -- WRITE (GBPS)     | 7586.161
 ---------------------------------------|---------------
 memset() (GBPS)                        | 6.369
 ---------------------------------------|---------------
 clEnqueueUnmapMemObject() (GBPS)       | 12822.261


2. GPU kernel read of inputBuffer
 ---------------------------------------|---------------
 clEnqueueNDRangeKernel() (GBPS)        | 113.481

 Verification Passed!


3. GPU kernel write to outputBuffer
 ---------------------------------------|---------------
 clEnqueueNDRangeKernel() (GBPS)        | 105.898


4. Host mapped read of outputBuffer
 ---------------------------------------|---------------
 clEnqueueMapBuffer -- READ (GBPS)      | 9.559
 ---------------------------------------|---------------
 CPU read (GBPS)                        | 17.179
 ---------------------------------------|---------------
 clEnqueueUnmapMemObject() (GBPS)       | 4060.750

 Verification Passed!


Passed!