AMD Ryzen Threadripper GCC 10 PGO benchmarks by Michael Larabel for a future article.
GCC 10 Compiler Notes: --disable-multilib --enable-checking=release
Disk Notes: NONE / errors=remount-ro,relatime,rw
Processor Notes: Scaling Governor: acpi-cpufreq ondemand - CPU Microcode: 0x8301025
Security Notes: itlb_multihit: Not affected + l1tf: Not affected + mds: Not affected + meltdown: Not affected + spec_store_bypass: Mitigation of SSB disabled via prctl and seccomp + spectre_v1: Mitigation of usercopy/swapgs barriers and __user pointer sanitization + spectre_v2: Mitigation of Full AMD retpoline IBPB: conditional STIBP: conditional RSB filling + tsx_async_abort: Not affected
GCC 10 - PGO Processor: AMD Ryzen Threadripper 3960X 24-Core @ 3.80GHz (24 Cores / 48 Threads), Motherboard: MSI Creator TRX40 (MS-7C59) v1.0 (1.12N1 BIOS), Chipset: AMD Starship/Matisse, Memory: 32768MB, Disk: 1000GB Sabrent Rocket 4.0 1TB, Graphics: Gigabyte AMD Radeon 540/540X/550/550X / RX 540X/550/550X 2GB (1206/1750MHz), Audio: AMD Baffin HDMI/DP, Monitor: ASUS VP28U, Network: Aquantia AQC107 NBase-T/IEEE + Intel I211 + Intel Device 2723
OS: Ubuntu 19.10, Kernel: 5.4.0-nvme-hwmon (x86_64), Desktop: GNOME Shell 3.34.1, Display Server: X Server 1.20.5, Display Driver: modesetting 1.20.5, OpenGL: 4.5 Mesa 19.2.1 (LLVM 9.0.0), Compiler: GCC 10.0.0 20191208, File-System: ext4, Screen Resolution: 3840x2160
HPC Challenge HPC Challenge (HPCC) is a cluster-focused benchmark consisting of the HPL Linpack TPP benchmark, DGEMM, STREAM, PTRANS, RandomAccess, FFT, and communication bandwidth and latency. This HPC Challenge test profile attempts to ship with standard yet versatile configuration/input files though they can be modified. Learn more via the OpenBenchmarking.org test page.
HPC Challenge 1.5.0 results:
Test / Class: G-HPL (GFLOPS, More Is Better): GCC 10: 63.63 (SE +/- 0.23, N = 3); GCC 10 - PGO: 63.48 (SE +/- 0.02, N = 3)
Test / Class: G-Ffte (GFLOPS, More Is Better): GCC 10: 10.49 (SE +/- 0.05, N = 3); GCC 10 - PGO: 10.64 (SE +/- 0.11, N = 3)
Test / Class: EP-DGEMM (GFLOPS, More Is Better): GCC 10: 32.93 (SE +/- 0.38, N = 3); GCC 10 - PGO: 32.68 (SE +/- 0.06, N = 3)
Test / Class: G-Ptrans (GB/s, More Is Better): GCC 10: 5.47737 (SE +/- 0.00581, N = 3); GCC 10 - PGO: 5.52421 (SE +/- 0.03180, N = 3)
Test / Class: EP-STREAM Triad (GB/s, More Is Better): GCC 10: 1.79750 (SE +/- 0.00127, N = 3); GCC 10 - PGO: 1.79549 (SE +/- 0.00146, N = 3)
Test / Class: G-Random Access (GUP/s, More Is Better): GCC 10: 0.14278 (SE +/- 0.00039, N = 3); GCC 10 - PGO: 0.14161 (SE +/- 0.00060, N = 3)
Test / Class: Random Ring Latency (usecs, Fewer Is Better): GCC 10: 0.45863 (SE +/- 0.00067, N = 3); GCC 10 - PGO: 0.45680 (SE +/- 0.00047, N = 3)
Test / Class: Random Ring Bandwidth (GB/s, More Is Better): GCC 10: 3.40678 (SE +/- 0.01038, N = 3); GCC 10 - PGO: 3.36248 (SE +/- 0.04431, N = 3)
Test / Class: Max Ping Pong Bandwidth (MB/s, More Is Better): GCC 10: 22977.00 (SE +/- 313.29, N = 3); GCC 10 - PGO: 23248.09 (SE +/- 58.02, N = 3)
1. (CC) gcc options: -lblas -lm -pthread -lmpi -fomit-frame-pointer -funroll-loops
2. ATLAS + Open MPI 3.1.3
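Several of the HPC Challenge gaps between the two builds are about the same size as the reported standard errors. The sketch below is one rough way to flag that. It assumes a gap no larger than the sum of the two standard errors can be treated as run-to-run noise; that threshold and the function name `within_noise` are illustrative choices, not something the result page defines.

```python
def within_noise(a, se_a, b, se_b):
    """Return True when |a - b| does not exceed the sum of both standard errors.

    The sum-of-SEs threshold is an assumed rule of thumb for illustration only.
    """
    return abs(a - b) <= se_a + se_b


# G-HPL (GFLOPS): 63.63 +/- 0.23 versus 63.48 +/- 0.02
print(within_noise(63.63, 0.23, 63.48, 0.02))

# Random Ring Latency (usecs): 0.45863 +/- 0.00067 versus 0.45680 +/- 0.00047
print(within_noise(0.45863, 0.00067, 0.45680, 0.00047))
```

Under this rule of thumb the G-HPL gap counts as noise, while the Random Ring Latency gap does not.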
FFTW FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions. Learn more via the OpenBenchmarking.org test page.
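For readers unfamiliar with the transform being benchmarked, here is a minimal sketch of the one-dimensional DFT written directly from its definition. It is a naive O(n^2) illustration in Python, not FFTW's implementation (FFTW uses fast O(n log n) algorithms), and the function name `dft` is hypothetical.

```python
import cmath


def dft(x):
    """Naive forward DFT: X[j] = sum_k x[k] * exp(-2*pi*i*j*k/n)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


# A unit impulse transforms to a flat spectrum of ones.
print(dft([1, 0, 0, 0]))
```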
FFTW 3.3.6 results (Mflops, More Is Better):
Build: Stock - Size: 1D FFT Size 32: GCC 10: 10443 (SE +/- 16.77, N = 3)
Build: Stock - Size: 2D FFT Size 32: GCC 10: 10512 (SE +/- 11.02, N = 3)
Build: Stock - Size: 2D FFT Size 4096: GCC 10: 6687.3 (SE +/- 7.80, N = 3)
Build: Float + SSE - Size: 1D FFT Size 32: GCC 10: 15396 (SE +/- 15.37, N = 3)
Build: Float + SSE - Size: 2D FFT Size 32: GCC 10: 45404 (SE +/- 56.20, N = 3)
Build: Float + SSE - Size: 2D FFT Size 4096: GCC 10: 22667 (SE +/- 285.77, N = 3)
1. (CC) gcc options: -pthread -O3 -fomit-frame-pointer -mtune=native -malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math -lm
Timed MrBayes Analysis This test performs a Bayesian analysis of a set of primate genome sequences in order to estimate their phylogeny. Learn more via the OpenBenchmarking.org test page.
Timed MrBayes Analysis 3.2.7, Primate Phylogeny Analysis (Seconds, Fewer Is Better): GCC 10: 70.01 (SE +/- 0.25, N = 3)
1. (CC) gcc options: -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -msse4a -msha -maes -mavx -mfma -mavx2 -mrdrnd -mbmi -mbmi2 -madx -mabm -O3 -std=c99 -pedantic -lm
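Each result carries an "SE +/- x, N = n" annotation. The sketch below shows how such a figure could be derived, assuming SE means the standard error of the mean computed from the sample standard deviation of the N runs. The individual run times are not published on this page, so the `runs` values are hypothetical placeholders.

```python
import math
import statistics


def standard_error(samples):
    """Standard error of the mean: sample standard deviation / sqrt(N)."""
    return statistics.stdev(samples) / math.sqrt(len(samples))


# Hypothetical run times in seconds; NOT the actual MrBayes runs.
runs = [69.6, 70.0, 70.4]
print(f"mean = {statistics.mean(runs):.2f}, SE +/- {standard_error(runs):.2f}, N = {len(runs)}")
```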
QMCPACK QMCPACK is a modern high-performance open-source Quantum Monte Carlo (QMC) simulation code. This benchmark runs the H2O example code and makes use of MPI. Learn more via the OpenBenchmarking.org test page.
QMCPACK 3.8 (Total Execution Time - Seconds, Fewer Is Better): GCC 10: 1878; GCC 10 - PGO: 1896 (additional option: -fprofile-correction)
1. (CXX) g++ options: -fopenmp -fomit-frame-pointer -finline-limit=1000 -fstrict-aliasing -funroll-all-loops -march=native -O3 -ffast-math -lm
TSCP This is a performance test of TSCP, Tom Kerrigan's Simple Chess Program, which has a built-in performance benchmark. Learn more via the OpenBenchmarking.org test page.
TSCP 1.81, AI Chess Performance (Nodes Per Second, More Is Better): GCC 10: 1346651 (SE +/- 1472.68, N = 5); GCC 10 - PGO: 1533348 (SE +/- 852.40, N = 5; additional option: -fprofile-correction)
1. (CC) gcc options: -O3 -march=native
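TSCP shows the largest gap between the two builds in this result file. A small sketch of the arithmetic follows, taking the first value listed (1346651) as the GCC 10 baseline and the second (1533348) as GCC 10 - PGO. The helper name `percent_change` is hypothetical.

```python
def percent_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# TSCP nodes per second: GCC 10 versus GCC 10 - PGO
print(round(percent_change(1346651, 1533348), 2))
```

With these inputs the PGO build comes out roughly 13.9% ahead in nodes per second.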
MKL-DNN DNNL This is a test of the Intel MKL-DNN (DNNL / Deep Neural Network Library) as an Intel-optimized library for Deep Neural Networks and making use of its built-in benchdnn functionality. The result is the total perf time reported. Learn more via the OpenBenchmarking.org test page.
MKL-DNN DNNL 1.1 results (ms, Fewer Is Better):
Harness: Deconvolution Batch deconv_1d - Data Type: f32: GCC 10: 2.32419 (SE +/- 0.00388, N = 3, MIN: 2.26); GCC 10 - PGO: 2.31720 (SE +/- 0.00905, N = 3, MIN: 2.24)
Harness: Convolution Batch conv_alexnet - Data Type: f32: GCC 10: 123.99 (SE +/- 1.44, N = 3, MIN: 121.92); GCC 10 - PGO: 123.30 (SE +/- 0.70, N = 3, MIN: 121.28)
Harness: Recurrent Neural Network Training - Data Type: f32: GCC 10: 194.25 (SE +/- 0.35, N = 3, MIN: 192.53); GCC 10 - PGO: 195.32 (SE +/- 0.82, N = 3, MIN: 192.26)
Harness: Convolution Batch conv_googlenet_v3 - Data Type: f32: GCC 10: 52.33 (SE +/- 0.14, N = 3, MIN: 51.48); GCC 10 - PGO: 53.41 (SE +/- 0.66, N = 3, MIN: 51.6)
1. (CXX) g++ options: -O3 -march=native -std=c++11 -msse4.1 -fPIC -fopenmp -pie -lpthread -ldl (GCC 10 - PGO additionally lists -lm)
TTSIOD 3D Renderer A portable GPL 3D software renderer that supports OpenMP and Intel Threading Building Blocks with many different rendering modes. This version does not use OpenGL but is entirely CPU/software based. Learn more via the OpenBenchmarking.org test page.
TTSIOD 3D Renderer 2.3b, Phong Rendering With Soft-Shadow Mapping (FPS, More Is Better): GCC 10: 938.47 (SE +/- 1.21, N = 3)
1. (CXX) g++ options: -O3 -fomit-frame-pointer -ffast-math -mtune=native -flto -msse -mrecip -mfpmath=sse -msse2 -mssse3 -lSDL -fopenmp -fwhole-program -lstdc++
Stockfish This is a test of Stockfish, an advanced C++11 chess benchmark that can scale up to 128 CPU cores. Learn more via the OpenBenchmarking.org test page.
Stockfish 9, Total Time (Nodes Per Second, More Is Better): GCC 10: 79359613 (SE +/- 526550.53, N = 3); GCC 10 - PGO: 78501983 (SE +/- 536300.93, N = 3)
1. (CXX) g++ options: -m64 -lpthread -fno-exceptions -std=c++11 -pedantic -O3 -msse -msse3 -mpopcnt -flto
OpenSSL OpenSSL is an open-source toolkit that implements SSL (Secure Sockets Layer) and TLS (Transport Layer Security) protocols. This test measures the RSA 4096-bit performance of OpenSSL. Learn more via the OpenBenchmarking.org test page.
OpenSSL 1.1.1, RSA 4096-bit Performance (Signs Per Second, More Is Better): GCC 10: 7180.6 (SE +/- 21.70, N = 3); GCC 10 - PGO: 7072.4 (SE +/- 20.62, N = 3)
Additional options reported on the graph: -O3 -lssl
1. (CC) gcc options: -pthread -m64 -lcrypto -ldl
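To combine several "more is better" results into one figure, a geometric mean of the PGO-to-baseline ratios is a common choice. The sketch below applies it to only three results from this file (TSCP, Stockfish, OpenSSL), so the outcome is an illustration of the arithmetic and not an overall verdict on PGO. The names `geometric_mean_ratio` and `pairs` are hypothetical.

```python
import math


def geometric_mean_ratio(pairs):
    """Geometric mean of candidate/baseline ratios for (baseline, candidate) pairs."""
    logs = [math.log(candidate / baseline) for baseline, candidate in pairs]
    return math.exp(sum(logs) / len(logs))


# (GCC 10, GCC 10 - PGO) values, all "more is better"
pairs = [
    (1346651, 1533348),    # TSCP, nodes per second
    (79359613, 78501983),  # Stockfish, nodes per second
    (7180.6, 7072.4),      # OpenSSL RSA 4096-bit, signs per second
]
print(f"{geometric_mean_ratio(pairs):.3f}")
```

A value above 1.0 means the PGO build is ahead on average for the selected tests. Here the TSCP gain dominates, since the other two results are slightly lower with PGO.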
ASKAP 2018-11-10 results (Million Grid Points Per Second, More Is Better):
Test: tConvolve MT - Degridding: GCC 10: 3359.12 (SE +/- 3.58, N = 3); GCC 10 - PGO: 3359.70 (SE +/- 2.69, N = 3)
Test: tConvolve OpenMP - Gridding: GCC 10: 5433.80 (SE +/- 0.00, N = 3); GCC 10 - PGO: 5361.35 (SE +/- 36.23, N = 3)
Test: tConvolve OpenMP - Degridding: GCC 10: 4096.25 (SE +/- 0.00, N = 3); GCC 10 - PGO: 3995.25 (SE +/- 53.24, N = 3)
1. (CXX) g++ options: -lpthread
PostgreSQL pgbench 12.0, Scaling: Buffer Test - Test: Heavy Contention - Mode: Read Only (TPS, More Is Better): GCC 10: 676349.71 (SE +/- 4887.19, N = 3)
1. (CC) gcc options: -fno-strict-aliasing -fwrapv -O2 -lpgcommon -lpgport -lpq -lpthread -lrt -lcrypt -ldl -lm
Facebook RocksDB This is a benchmark of Facebook's RocksDB as an embeddable persistent key-value store for fast storage based on Google's LevelDB. Learn more via the OpenBenchmarking.org test page.
Facebook RocksDB 6.3.6 results (Op/s, More Is Better):
Test: Random Fill: GCC 10: 938039 (SE +/- 16043.44, N = 3); GCC 10 - PGO: 921081 (SE +/- 1253.44, N = 3)
Test: Random Read: GCC 10: 145207827 (SE +/- 1800355.11, N = 3); GCC 10 - PGO: 144768694 (SE +/- 727018.18, N = 3)
Test: Sequential Fill: GCC 10: 1019862 (SE +/- 3135.06, N = 3); GCC 10 - PGO: 1020657 (SE +/- 3754.87, N = 3)
Test: Random Fill Sync: GCC 10: 24588 (SE +/- 19.92, N = 3); GCC 10 - PGO: 24588 (SE +/- 28.62, N = 3)
Test: Read While Writing: GCC 10: 4889956 (SE +/- 20082.88, N = 3); GCC 10 - PGO: 4868440 (SE +/- 26586.51, N = 3)
1. (CXX) g++ options: -O3 -march=native -std=c++11 -fno-builtin-memcmp -fno-rtti -rdynamic -lpthread
GCC 10: Testing initiated at 21 December 2019 18:23 by user pts.
GCC 10 - PGO: Testing initiated at 22 December 2019 06:33 by user pts. The compiler, disk, processor, and security notes are identical to those listed for GCC 10.