OpenCL ROCm 2.0 vs. AMDGPU-PRO Linux

Radeon RX Vega 64: ROCm 2.0 OpenCL versus the PAL OpenCL driver in AMDGPU-PRO 18.50. Benchmarks by Michael Larabel for a future article on Phoronix.com.

HTML result view exported from: https://openbenchmarking.org/result/1901167-PTS-OPENCLRO89&sor&grs.

Configurations tested: ROCm 2.0, AMDGPU-PRO 18.50 PAL

Processor: AMD Ryzen Threadripper 2990WX 32-Core @ 3.00GHz (32 Cores / 64 Threads)
Motherboard: ASUS ROG ZENITH EXTREME (1601 BIOS)
Chipset: AMD Family 17h
Memory: 32768MB
Disk: 16GB Voyager 3.0 + Samsung SSD 970 EVO 500GB
Graphics: AMD Radeon RX Vega 8GB (1630/945MHz)
Audio: Realtek ALC1220
Monitor: ASUS VP28U
Network: Intel I211 + Qualcomm Atheros QCA6174 802.11ac + Wilocity Wil6200 802.11ad
OS: Ubuntu 18.04
Kernel: 4.15.0-43-generic (x86_64)
Desktop: GNOME Shell 3.28.3
Display Server: X Server 1.19.6
Display Driver: amdgpu 18.0.1 (ROCm 2.0); amdgpu 18.1.99 (AMDGPU-PRO 18.50 PAL)
OpenGL: 4.5 Mesa 18.0.5 (LLVM 6.0.0) (ROCm 2.0); 4.6.13542 (AMDGPU-PRO 18.50 PAL)
Compiler: GCC 7.3.0
File-System: ext4
Screen Resolution: 3840x2160

Compiler Details: --build=x86_64-linux-gnu --disable-vtable-verify --disable-werror --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-gnu-unique-object --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --enable-libmpx --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-multiarch --enable-multilib --enable-nls --enable-objc-gc=auto --enable-offload-targets=nvptx-none --enable-plugin --enable-shared --enable-threads=posix --host=x86_64-linux-gnu --program-prefix=x86_64-linux-gnu- --target=x86_64-linux-gnu --with-abi=m64 --with-arch-32=i686 --with-default-libstdcxx-abi=new --with-gcc-major-version-only --with-multilib-list=m32,m64,mx32 --with-target-system-zlib --with-tune=generic --without-cuda-driver -v
Processor Details: Scaling Governor: acpi-cpufreq ondemand
Graphics Details: GLAMOR
Python Details: Python 2.7.15rc1 + Python 3.6.7
Security Details: __user pointer sanitization + Full AMD retpoline IBPB + SSB disabled via prctl and seccomp

Results summary

Test                                                  ROCm 2.0     AMDGPU-PRO 18.50 PAL
cl-mem: Read                                          160          398
clpeak: Transfer Bandwidth enqueueWriteBuffer         45.44        22.49
darktable: Server Rack - OpenCL                       0.23         0.13
cl-mem: Copy                                          221          364
juliagpu: GPU                                         166858816    240293469
darktable: Server Room - OpenCL                       1.97         2.57
shoc: OpenCL - FFT SP                                 1075         863
plaidml: No - Inference - Inception V3 - OpenCL       113.88       136.38
shoc: OpenCL - Triad                                  6.69         6.07
shoc: OpenCL - MD5 Hash                               16.46        17.63
rodinia: OpenCL Heartwall                             4.03         3.80
plaidml: No - Inference - IMDB LSTM - OpenCL          252          263
darktable: Masskrug - OpenCL                          5.70         5.47
clpeak: Single-Precision Float                        13053        12528
shoc: OpenCL - Texture Read Bandwidth                 441          424
plaidml: No - Inference - ResNet 50 - OpenCL          225          233
cl-mem: Write                                         384          379
clpeak: Double-Precision Double                       833          828
clpeak: Integer Compute INT                           2497         2486
clpeak: Global Memory Bandwidth                       362          361
lczero: OpenCL                                        304          (no result)
plaidml: No - Inference - Mobilenet - OpenCL          479          479
shoc: OpenCL - Bus Speed Readback                     7.16         7.16
shoc: OpenCL - Bus Speed Download                     7.14         7.14
clpeak: Transfer Bandwidth enqueueReadBuffer          17.06        10.92
clpeak: Kernel Latency                                10.56        41.75
darktable: Boat - OpenCL                              4.48         8.50
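To compare the two drivers on a common scale, the raw numbers in the results summary above can be turned into ratios. The following is a minimal sketch, not part of the original export: the function name `rocm_advantage` and the choice of four tests are illustrative assumptions, and the values are copied from the summary. Because some tests are "more is better" and others "fewer is better", the ratio is inverted for the latter so that a value above 1 always favors ROCm 2.0.

```python
# Values copied from the results summary:
# (ROCm 2.0, AMDGPU-PRO 18.50 PAL, higher_is_better)
results = {
    "cl-mem: Read (GB/s)": (160, 398, True),
    "clpeak: Transfer Bandwidth enqueueWriteBuffer (GBPS)": (45.44, 22.49, True),
    "clpeak: Kernel Latency (us)": (10.56, 41.75, False),
    "darktable: Boat - OpenCL (seconds)": (4.48, 8.50, False),
}


def rocm_advantage(rocm, pro, higher_is_better):
    """Return ROCm's result relative to AMDGPU-PRO; above 1 favors ROCm.

    rocm_advantage is a hypothetical helper written for this illustration.
    """
    return rocm / pro if higher_is_better else pro / rocm


for name, (rocm, pro, higher_is_better) in results.items():
    ratio = rocm_advantage(rocm, pro, higher_is_better)
    print(f"{name}: {ratio:.2f}x")
```

A ratio below 1 (as for cl-mem Read) means AMDGPU-PRO 18.50 PAL was faster in that test.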

cl-mem

Benchmark: Read

cl-mem 2017-01-13 (GB/s, more is better)
AMDGPU-PRO 18.50 PAL: 398 (SE +/- 0.15, N = 3)
ROCm 2.0: 160 (SE +/- 0.06, N = 3)
1. (CC) gcc options: -O2 -flto -lOpenCL

clpeak

OpenCL Test: Transfer Bandwidth enqueueWriteBuffer

clpeak (GBPS, more is better)
ROCm 2.0: 45.44 (SE +/- 0.01, N = 3)
AMDGPU-PRO 18.50 PAL: 22.49 (SE +/- 0.21, N = 3)
1. (CXX) g++ options: -O3 -rdynamic -lOpenCL

Darktable

Test: Server Rack - Acceleration: OpenCL

Darktable 2.4.2 (seconds, fewer is better)
AMDGPU-PRO 18.50 PAL: 0.13 (SE +/- 0.00, N = 3)
ROCm 2.0: 0.23 (SE +/- 0.00, N = 3)

cl-mem

Benchmark: Copy

cl-mem 2017-01-13 (GB/s, more is better)
AMDGPU-PRO 18.50 PAL: 364 (SE +/- 0.00, N = 3)
ROCm 2.0: 221 (SE +/- 0.06, N = 3)
1. (CC) gcc options: -O2 -flto -lOpenCL

JuliaGPU

OpenCL Device: GPU

JuliaGPU 1.2pts1 (samples/sec, more is better)
AMDGPU-PRO 18.50 PAL: 240293469 (SE +/- 1514546.33, N = 3)
ROCm 2.0: 166858816 (SE +/- 1617179.19, N = 3)
1. (CC) gcc options: -O3 -march=native -ftree-vectorize -funroll-loops -lglut -lOpenCL -lGL -lm

Darktable

Test: Server Room - Acceleration: OpenCL

Darktable 2.4.2 (seconds, fewer is better)
ROCm 2.0: 1.97 (SE +/- 0.01, N = 3)
AMDGPU-PRO 18.50 PAL: 2.57 (SE +/- 0.04, N = 4)

SHOC Scalable HeterOgeneous Computing

Target: OpenCL - Benchmark: FFT SP

SHOC Scalable HeterOgeneous Computing 2015-11-10 (GFLOPS, more is better)
ROCm 2.0: 1075 (SE +/- 1.63, N = 3)
AMDGPU-PRO 18.50 PAL: 863 (SE +/- 11.99, N = 6)
1. (CXX) g++ options: -O2 -lSHOCCommonOpenCL -lSHOCCommon -lOpenCL -lrt

PlaidML

FP16: No - Mode: Inference - Network: Inception V3 - Device: OpenCL

PlaidML (examples per second, more is better)
AMDGPU-PRO 18.50 PAL: 136.38 (SE +/- 0.61, N = 3)
ROCm 2.0: 113.88 (SE +/- 1.83, N = 12)

SHOC Scalable HeterOgeneous Computing

Target: OpenCL - Benchmark: Triad

SHOC Scalable HeterOgeneous Computing 2015-11-10 (GB/s, more is better)
ROCm 2.0: 6.69 (SE +/- 0.01, N = 3)
AMDGPU-PRO 18.50 PAL: 6.07 (SE +/- 0.02, N = 3)
1. (CXX) g++ options: -O2 -lSHOCCommonOpenCL -lSHOCCommon -lOpenCL -lrt

SHOC Scalable HeterOgeneous Computing

Target: OpenCL - Benchmark: MD5 Hash

SHOC Scalable HeterOgeneous Computing 2015-11-10 (GHash/s, more is better)
AMDGPU-PRO 18.50 PAL: 17.63 (SE +/- 0.03, N = 3)
ROCm 2.0: 16.46 (SE +/- 0.05, N = 3)
1. (CXX) g++ options: -O2 -lSHOCCommonOpenCL -lSHOCCommon -lOpenCL -lrt

Rodinia

Test: OpenCL Heartwall

Rodinia 2.4 (seconds, fewer is better)
AMDGPU-PRO 18.50 PAL: 3.80 (SE +/- 0.01, N = 3)
ROCm 2.0: 4.03 (SE +/- 0.00, N = 3)
1. (CXX) g++ options: -O2 -lOpenCL

PlaidML

FP16: No - Mode: Inference - Network: IMDB LSTM - Device: OpenCL

PlaidML (examples per second, more is better)
AMDGPU-PRO 18.50 PAL: 263 (SE +/- 0.88, N = 3)
ROCm 2.0: 252 (SE +/- 0.35, N = 3)

Darktable

Test: Masskrug - Acceleration: OpenCL

Darktable 2.4.2 (seconds, fewer is better)
AMDGPU-PRO 18.50 PAL: 5.47 (SE +/- 0.04, N = 3)
ROCm 2.0: 5.70 (SE +/- 0.02, N = 3)

clpeak

OpenCL Test: Single-Precision Float

clpeak (GFLOPS, more is better)
ROCm 2.0: 13053 (SE +/- 11.58, N = 3)
AMDGPU-PRO 18.50 PAL: 12528 (SE +/- 40.57, N = 3)
1. (CXX) g++ options: -O3 -rdynamic -lOpenCL

SHOC Scalable HeterOgeneous Computing

Target: OpenCL - Benchmark: Texture Read Bandwidth

SHOC Scalable HeterOgeneous Computing 2015-11-10 (GB/s, more is better)
ROCm 2.0: 441 (SE +/- 1.31, N = 3)
AMDGPU-PRO 18.50 PAL: 424 (SE +/- 1.67, N = 3)
1. (CXX) g++ options: -O2 -lSHOCCommonOpenCL -lSHOCCommon -lOpenCL -lrt

PlaidML

FP16: No - Mode: Inference - Network: ResNet 50 - Device: OpenCL

PlaidML (examples per second, more is better)
AMDGPU-PRO 18.50 PAL: 233 (SE +/- 1.16, N = 3)
ROCm 2.0: 225 (SE +/- 0.86, N = 3)

cl-mem

Benchmark: Write

cl-mem 2017-01-13 (GB/s, more is better)
ROCm 2.0: 384 (SE +/- 0.07, N = 3)
AMDGPU-PRO 18.50 PAL: 379 (SE +/- 0.82, N = 3)
1. (CC) gcc options: -O2 -flto -lOpenCL

clpeak

OpenCL Test: Double-Precision Double

clpeak (GFLOPS, more is better)
ROCm 2.0: 833 (SE +/- 0.95, N = 3)
AMDGPU-PRO 18.50 PAL: 828 (SE +/- 1.70, N = 3)
1. (CXX) g++ options: -O3 -rdynamic -lOpenCL

clpeak

OpenCL Test: Integer Compute INT

clpeak (GIOPS, more is better)
ROCm 2.0: 2497 (SE +/- 2.51, N = 3)
AMDGPU-PRO 18.50 PAL: 2486 (SE +/- 6.75, N = 3)
1. (CXX) g++ options: -O3 -rdynamic -lOpenCL

clpeak

OpenCL Test: Global Memory Bandwidth

clpeak (GBPS, more is better)
ROCm 2.0: 362 (SE +/- 0.12, N = 3)
AMDGPU-PRO 18.50 PAL: 361 (SE +/- 0.10, N = 3)
1. (CXX) g++ options: -O3 -rdynamic -lOpenCL

LeelaChessZero

Backend: OpenCL

LeelaChessZero 0.20.1 (nodes per second, more is better)
ROCm 2.0: 304 (SE +/- 1.46, N = 3)
1. (CXX) g++ options: -lpthread -lz

PlaidML

FP16: No - Mode: Inference - Network: Mobilenet - Device: OpenCL

PlaidML (examples per second, more is better)
AMDGPU-PRO 18.50 PAL: 479 (SE +/- 0.67, N = 3)
ROCm 2.0: 479 (SE +/- 0.09, N = 3)

SHOC Scalable HeterOgeneous Computing

Target: OpenCL - Benchmark: Bus Speed Readback

SHOC Scalable HeterOgeneous Computing 2015-11-10 (GB/s, more is better)
AMDGPU-PRO 18.50 PAL: 7.16 (SE +/- 0.00, N = 3)
ROCm 2.0: 7.16 (SE +/- 0.00, N = 3)
1. (CXX) g++ options: -O2 -lSHOCCommonOpenCL -lSHOCCommon -lOpenCL -lrt

SHOC Scalable HeterOgeneous Computing

Target: OpenCL - Benchmark: Bus Speed Download

SHOC Scalable HeterOgeneous Computing 2015-11-10 (GB/s, more is better)
AMDGPU-PRO 18.50 PAL: 7.14 (SE +/- 0.00, N = 3)
ROCm 2.0: 7.14 (SE +/- 0.00, N = 3)
1. (CXX) g++ options: -O2 -lSHOCCommonOpenCL -lSHOCCommon -lOpenCL -lrt

clpeak

OpenCL Test: Transfer Bandwidth enqueueReadBuffer

clpeak (GBPS, more is better)
ROCm 2.0: 17.06 (SE +/- 0.03, N = 3)
AMDGPU-PRO 18.50 PAL: 10.92 (SE +/- 0.88, N = 9)
1. (CXX) g++ options: -O3 -rdynamic -lOpenCL

clpeak

OpenCL Test: Kernel Latency

clpeak (us, fewer is better)
ROCm 2.0: 10.56 (SE +/- 0.14, N = 3)
AMDGPU-PRO 18.50 PAL: 41.75 (SE +/- 1.09, N = 12)
1. (CXX) g++ options: -O3 -rdynamic -lOpenCL

Darktable

Test: Boat - Acceleration: OpenCL

Darktable 2.4.2 (seconds, fewer is better)
ROCm 2.0: 4.48 (SE +/- 0.08, N = 12)
AMDGPU-PRO 18.50 PAL: 8.50 (SE +/- 0.08, N = 3)


Phoronix Test Suite v10.8.4