Thursday, March 10, 2016

Fuzzing Vulkans, how do they work?

Introduction


Disclaimer: I have not yet fully read the specs on SPIR-V or Vulkan.

I decided to find out how hard it is to crash code working with SPIR-V.
Initially I wanted to crash the actual GPU drivers, but as a start I decided
to experiment with GLSLang.

What I got


I used the "afl-fuzz" fuzzer to generate test cases that could crash
the parser, and briefly examined the generated cases.
I have uploaded the results (which contain the SPIR-V binaries that cause
"spirv-remap" to crash) to the [following location](https://drive.google.com/file/d/0B7wcN-tOkdeRTGItSDhFM0JYUEk/view?usp=sharing).

Some of them trigger assertions in the code (which is not bad, but
returning an error code and shutting down cleanly would be better).
Some of them cause the code to hang for a long time or indefinitely (which is worse,
especially if someone intends to run the SPIR-V parser code at run time inside an app).

Perhaps some of the results marked as "hangs" simply take too long to compile
and could produce more interesting results if the timeout in "afl-fuzz"
were increased.

Two notable examples that cause long compilation times are:
"out/crashes/id:000000,sig:06,src:000000,op:flip1,pos:15"
"out/hangs/id:000011,src:000000,op:flip1,pos:538" - for this one I waited for
a minute, but it still did not finish compiling and kept the CPU at 100% load.
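
To tell real hangs from slow compiles, each case can be re-run under a longer time limit than the fuzzer used. Below is a minimal sketch, assuming the "timeout" tool from GNU coreutils is installed; the function name "classify" and the "LIMIT" variable are made up for illustration. It turns the exit status into a label: 124 is what "timeout" reports when the limit expires, while 134 and 139 are 128 plus the signal numbers of SIGABRT (6, matching the "sig:06" in the AFL file names) and SIGSEGV (11).

```shell
# classify CMD [ARGS...]: run a command under a time limit and name the outcome.
# LIMIT is a hypothetical knob (seconds, default 60), not part of any tool.
classify() {
    timeout "${LIMIT:-60}" "$@" >/dev/null 2>&1
    status=$?
    case "$status" in
        0)   echo ok ;;
        124) echo timeout ;;    # limit expired, process was killed by "timeout"
        134) echo abort ;;      # 128 + SIGABRT, e.g. a failed assertion
        139) echo segfault ;;   # 128 + SIGSEGV
        *)   echo "exit:$status" ;;
    esac
}

# Example use (paths as in the rest of this post):
# for f in out/hangs/*; do
#     echo "$(LIMIT=300 classify ./build/install/bin/spirv-remap -i "$f" -o /tmp/) $f"
# done
```

Cases that still report "timeout" with a limit of several minutes are more likely to be genuine infinite loops than slow compiles.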

The log output of "glslang" indicates that most of the error cases found are handled, but with "abort" instead of a graceful shutdown:
http://pastebin.com/BnZ63tKJ

NVIDIA

I have also tried using these shaders with the NVIDIA driver (since it was the only hardware I could run a real Vulkan driver on).

I used the "instancing" demo from [SaschaWillems Repository](https://github.com/SaschaWillems/Vulkan).
I patched it to accept the path to the binary shader via the command line.
Next, I fed it the generated
test cases. Some of them triggered segfaults inside the NVIDIA driver.
Curiously, when I used the "hangs" examples, they also caused
the NVIDIA driver to take an extra long time to compile and eventually crash
at random places.

I think this indicates either that there is some common code between the driver
and GLSLang (the reference implementation), or that the specification is missing
some sanity check somewhere and the compiler can get stuck optimizing certain
code.
Is there a place in the specification that mandates that all values are
checked to be within the allowed range, and that all complex structures (such as
function calls) are checked recursively?



Perhaps I should have a look at other drivers (Mali, anyone?).

Kernel log with the segfaults from the "instancing" demo:

```
[ 3672.137509] instancing[26631]: segfault at f ip 00007fb4624adebf sp 00007ffefd72e100 error 4 in libnvidia-glcore.so.355.00.29[7fb462169000+1303000]
[ 3914.294222] instancing[26894]: segfault at f ip 00007f00b28fcebf sp 00007ffdb9bab980 error 4 in libnvidia-glcore.so.355.00.29[7f00b25b8000+1303000]
[ 4032.430179] instancing[27017]: segfault at f ip 00007f7682747ebf sp 00007fff46679bf0 error 4 in libnvidia-glcore.so.355.00.29[7f7682403000+1303000]
[ 4032.915849] instancing[27022]: segfault at f ip 00007fb4e4099ebf sp 00007fff3c1ac0f0 error 4 in libnvidia-glcore.so.355.00.29[7fb4e3d55000+1303000]
[ 4033.011699] instancing[27023]: segfault at f ip 00007f7272900ebf sp 00007ffdb54261e0 error 4 in libnvidia-glcore.so.355.00.29[7f72725bc000+1303000]
[ 4033.107939] instancing[27025]: segfault at f ip 00007fbf0353debf sp 00007ffde4387750 error 4 in libnvidia-glcore.so.355.00.29[7fbf031f9000+1303000]
[ 4033.203924] instancing[27026]: segfault at f ip 00007f0f9a6f0ebf sp 00007ffff85a9dd0 error 4 in libnvidia-glcore.so.355.00.29[7f0f9a3ac000+1303000]
[ 4033.299138] instancing[27027]: segfault at 2967000 ip 00007fcb42cab720 sp 00007ffcbad45228 error 6 in libc-2.19.so[7fcb42c26000+19f000]
[ 4033.394667] instancing[27028]: segfault at 36d2000 ip 00007efc789eb720 sp 00007fff26c636d8 error 6 in libc-2.19.so[7efc78966000+19f000]
[ 4033.490918] instancing[27029]: segfault at 167b15e170 ip 00007f3b02095ec3 sp 00007ffd768cbf68 error 4 in libnvidia-glcore.so.355.00.29[7f3b01cbe000+1303000]
[ 4033.586699] instancing[27030]: segfault at 2ffc000 ip 00007fdebcc06720 sp 00007fff4fe59bd8 error 6 in libc-2.19.so[7fdebcb81000+19f000]
[ 4033.682939] instancing[27031]: segfault at 8 ip 00007fb80e7eed50 sp 00007ffe9cd21de0 error 4 in libnvidia-glcore.so.355.00.29[7fb80e410000+1303000]
[ 4374.480872] show_signal_msg: 27 callbacks suppressed
[ 4374.480876] instancing[27402]: segfault at f ip 00007fd1fc3cdebf sp 00007ffe483ff520 error 4 in libnvidia-glcore.so.355.00.29[7fd1fc089000+1303000]
[ 4374.809621] instancing[27417]: segfault at 2e0c3910 ip 00007f39af846e96 sp 00007ffe1c6d8f10 error 6 in libnvidia-glcore.so.355.00.29[7f39af46f000+1303000]
[ 4374.905112] instancing[27418]: segfault at 2dc46a68 ip 00007f7b9ff7af32 sp 00007fff290edf00 error 6 in libnvidia-glcore.so.355.00.29[7f7b9fba2000+1303000]
[ 4375.001019] instancing[27419]: segfault at f ip 00007f5a4e066ebf sp 00007ffe0b775d70 error 4 in libnvidia-glcore.so.355.00.29[7f5a4dd22000+1303000]
[ 4375.096894] instancing[27420]: segfault at f ip 00007f7274d49ebf sp 00007ffe96fdea10 error 4 in libnvidia-glcore.so.355.00.29[7f7274a05000+1303000]
[ 4375.193165] instancing[27421]: segfault at f ip 00007fa3bf3c8ebf sp 00007ffc4117e8d0 error 4 in libnvidia-glcore.so.355.00.29[7fa3bf084000+1303000]
[ 4375.288969] instancing[27423]: segfault at f ip 00007f50e0327ebf sp 00007ffc02aa1d50 error 4 in libnvidia-glcore.so.355.00.29[7f50dffe3000+1303000]
[ 4375.385530] instancing[27424]: segfault at f ip 00007f0d9a32eebf sp 00007ffd0298eb40 error 4 in libnvidia-glcore.so.355.00.29[7f0d99fea000+1303000]
[ 4375.481829] instancing[27425]: segfault at f ip 00007f8400bc5ebf sp 00007ffef0334240 error 4 in libnvidia-glcore.so.355.00.29[7f8400881000+1303000]
[ 4375.576983] instancing[27426]: segfault at 2dec2bc8 ip 00007f52260afec3 sp 00007fffd2bd1728 error 4 in libnvidia-glcore.so.355.00.29[7f5225cd8000+1303000]
```
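
Because the library is loaded at a different address on every run, the raw "ip" values are hard to compare. Subtracting the mapping base (the first number inside the square brackets) from "ip" gives an offset inside the library that stays the same across runs, which makes it easier to see how many distinct crash sites there are. A minimal sketch, assuming the kernel log format shown above; the function name "crash_offsets" is made up for illustration:

```shell
# crash_offsets: read kernel "segfault" lines on stdin and print one
# "library+0xOFFSET" line per crash, where OFFSET = ip - mapping base.
crash_offsets() {
    # Pull out: instruction pointer, library name, base address of the mapping.
    sed -n 's/.* ip \([0-9a-f]*\) .* in \([^[]*\)\[\([0-9a-f]*\)+.*/\1 \2 \3/p' |
    while read -r ip lib base; do
        printf '%s+0x%x\n' "$lib" "$(( 0x$ip - 0x$base ))"
    done
}

# Example use: count how often each crash site occurs.
# dmesg | crash_offsets | sort | uniq -c | sort -rn
```

The first two lines of the log above both map to "libnvidia-glcore.so.355.00.29+0x344ebf", which suggests they hit the same instruction in the driver.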

How to reproduce


Below are the steps I took to crash the "spirv-remap" tool.
I believe this issue is worth looking at because some vendors may
choose to build their driver internals on the reference implementation,
which may lead to its bugs creeping into their software as-is.

0. I used a Debian Linux box. I installed the "afl-fuzz" tool
and also manually copied the Vulkan headers to "/usr/include".

1. Cloned the GLSLang repository
```
git clone git@github.com:KhronosGroup/glslang.git
cd glslang
```

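2. Compiled it with the afl-fuzz compiler wrappers
```
mkdir build
cat SetupLinux.sh   # print the project's own build script for reference
cd build/
cmake -DCMAKE_C_COMPILER=afl-gcc -DCMAKE_CXX_COMPILER=afl-g++ ..
make                # assumed step: build with AFL instrumentation
make install        # assumed step: later commands expect build/install/bin
cd ..
```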

3. Compiled a sample shader from GLSL to the SPIR-V format (this produces "frag.spv") using
```
./build/install/bin/glslangValidator -V -i Test/spv.130.frag
```

4. Verified that the "spirv-remap" tool works on the binary
```
./build/install/bin/spirv-remap -v -i frag.spv -o /tmp/
```

5. Copied the SPIR-V binary into the input directory "in" and fed it to afl-fuzz
```
mkdir -p in && cp frag.spv in/
afl-fuzz -i in -o out ./build/install/bin/spirv-remap -v -i @@ -o /tmp
```

6. Quickly discovered several crashes. A screenshot of afl-fuzz at work
is attached.


7. Examined them.

First, I made a hex diff of the good and bad files. The command to generate
the diff is the following:
```
hexdump in/frag.spv > in.hex
for i in out/crashes/*; do hexdump "$i" > out.hex && diff -Naur in.hex out.hex; done > hex.diff
```
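
Since every AFL case here is a small mutation of the same seed, it can also help to see which part of the file was hit. A SPIR-V module starts with five 32-bit words: the magic number (0x07230203), version, generator, ID bound and schema. The sketch below prints them, assuming the module is stored little-endian (as on x86); the function name "spv_header" is made up for illustration.

```shell
# spv_header FILE: print the five 32-bit header words of a SPIR-V module,
# assuming little-endian byte order in the file.
spv_header() {
    # Read the first 20 bytes as separate hex bytes into $1..$20.
    set -- $(od -An -v -tx1 -N20 "$1")
    for name in magic version generator bound schema; do
        [ $# -ge 4 ] || return 1          # file is shorter than the header
        echo "$name=0x$4$3$2$1"           # reverse the bytes of each word
        shift 4
    done
}

# Example use: compare the seed header with a mutated case.
# spv_header in/frag.spv
# spv_header "out/crashes/id:000000,sig:06,src:000000,op:flip1,pos:15"
```

If the "pos" field in an AFL file name is a byte offset (an assumption about AFL's naming), then "pos:15" falls inside the ID bound word, so comparing the "bound" line of the seed and of the mutated file is a reasonable first check.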

Next, I ran the tool on all the cases and generated a log of the crash messages.
```
for i in out/crashes/*; do echo "$i" && ./build/install/bin/spirv-remap -i "$i" -o /tmp/ 2>&1 ;done > abort.log
```
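
The resulting log alternates between file names and the messages printed by the tool, so a quick way to see how many distinct failure modes there are is to drop the file-name lines and count identical messages. A minimal sketch, assuming every file-name line starts with "out/" as in the loop above; "top_messages" is a made-up name:

```shell
# top_messages: read abort.log-style text on stdin and print
# "COUNT<TAB>MESSAGE" for each distinct message, most frequent first.
top_messages() {
    grep -v '^out/' |          # drop the file-name lines written by "echo"
    sort | uniq -c | sort -rn |
    awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0 }'
}

# Example use:
# top_messages < abort.log | head
```

A handful of messages covering most of the cases would suggest that many of the crashing inputs trip the same check.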

Conclusions

Well, there are two obvious conclusions:
1. Vulkan/SPIR-V is still a WIP and the drivers are not yet perfect.
2. GPU drivers have always been notorious for poor compilers - not only the codegen, but also the parsers and validators. Maybe part of the reason is that CPU compilers simply handle more complex code and have therefore already hit more edge cases.