onnx gh200 ARMv8 Neoverse-V2 testing with a Quanta Cloud QuantaGrid S74G-2U 1S7GZ9Z0000 S7G MB (CG1) (3A06 BIOS) and ASPEED on Ubuntu 23.10 via the Phoronix Test Suite.
HTML result view exported from: https://openbenchmarking.org/result/2402042-NE-ONNXGH20009&grr&sor .
onnx gh200 Processor Motherboard Memory Disk Graphics Network OS Kernel Compiler File-System Screen Resolution a b c d ARMv8 Neoverse-V2 @ 3.39GHz (72 Cores) Quanta Cloud QuantaGrid S74G-2U 1S7GZ9Z0000 S7G MB (CG1) (3A06 BIOS) 1 x 480GB DRAM-6400MT/s 960GB SAMSUNG MZ1L2960HCJR-00A07 + 1920GB SAMSUNG MZTL21T9 ASPEED 2 x Mellanox MT2910 + 2 x QLogic FastLinQ QL41000 10/25/40/50GbE Ubuntu 23.10 6.5.0-15-generic (aarch64) GCC 13.2.0 ext4 1920x1200 OpenBenchmarking.org Kernel Details - Transparent Huge Pages: madvise Compiler Details - --build=aarch64-linux-gnu --disable-libquadmath --disable-libquadmath-support --disable-werror --enable-bootstrap --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-fix-cortex-a53-843419 --enable-gnu-unique-object --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --enable-libphobos-checking=release --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-link-serialization=2 --enable-multiarch --enable-nls --enable-objc-gc=auto --enable-plugin --enable-shared --enable-threads=posix --host=aarch64-linux-gnu --program-prefix=aarch64-linux-gnu- --target=aarch64-linux-gnu --with-build-config=bootstrap-lto-lean --with-default-libstdcxx-abi=new --with-gcc-major-version-only --with-target-system-zlib=auto -v Processor Details - Scaling Governor: cppc_cpufreq ondemand (Boost: Disabled) Python Details - Python 3.11.6 Security Details - gather_data_sampling: Not affected + itlb_multihit: Not affected + l1tf: Not affected + mds: Not affected + meltdown: Not affected + mmio_stale_data: Not affected + retbleed: Not affected + spec_rstack_overflow: Not affected + spec_store_bypass: Mitigation of SSB disabled via prctl + spectre_v1: Mitigation of __user pointer sanitization + spectre_v2: Not affected + srbds: Not affected + tsx_async_abort: Not affected
onnx gh200 onnx: T5 Encoder - CPU - Standard onnx: T5 Encoder - CPU - Standard onnx: yolov4 - CPU - Parallel onnx: yolov4 - CPU - Parallel onnx: fcn-resnet101-11 - CPU - Parallel onnx: fcn-resnet101-11 - CPU - Parallel onnx: fcn-resnet101-11 - CPU - Standard onnx: fcn-resnet101-11 - CPU - Standard onnx: T5 Encoder - CPU - Parallel onnx: T5 Encoder - CPU - Parallel onnx: yolov4 - CPU - Standard onnx: yolov4 - CPU - Standard onnx: CaffeNet 12-int8 - CPU - Parallel onnx: CaffeNet 12-int8 - CPU - Parallel onnx: CaffeNet 12-int8 - CPU - Standard onnx: CaffeNet 12-int8 - CPU - Standard onnx: ResNet50 v1-12-int8 - CPU - Parallel onnx: ResNet50 v1-12-int8 - CPU - Parallel onnx: ResNet50 v1-12-int8 - CPU - Standard onnx: ResNet50 v1-12-int8 - CPU - Standard onnx: super-resolution-10 - CPU - Parallel onnx: super-resolution-10 - CPU - Parallel onnx: super-resolution-10 - CPU - Standard onnx: super-resolution-10 - CPU - Standard onnx: Faster R-CNN R-50-FPN-int8 - CPU - Parallel a b c d 2.71650 367.963 119.530 8.36597 677.751 1.47621 673.470 1.48485 13.2889 75.2417 90.3501 11.0687 1.82009 548.965 0.897753 1112.69 4.14119 241.428 3.66668 272.681 23.0604 43.3703 6.68539 149.545 2.67306 374.207 120.422 8.30820 676.874 1.47745 669.125 1.49458 13.4822 74.1639 89.4041 11.1851 1.87709 532.261 0.887622 1125.55 4.14668 241.101 3.60302 277.494 23.3815 42.7739 6.72220 148.714 2.66641 374.202 115.518 8.6564 655.567 1.52538 681.935 1.4664 13.1099 76.2605 87.9548 11.3685 1.85241 539.31 0.858087 1164.06 4.04811 246.944 3.44312 290.32 23.9848 41.6917 6.74173 148.299 2.89589 344.562 119.81 8.34625 671.335 1.48956 673.423 1.48493 12.7215 78.5892 87.628 11.4108 1.86068 536.915 0.865053 1154.69 4.07887 245.086 3.51969 284.01 23.3328 42.8567 6.67817 149.676 OpenBenchmarking.org
ONNX Runtime Model: T5 Encoder - Device: CPU - Executor: Standard OpenBenchmarking.org Inference Time Cost (ms), Fewer Is Better ONNX Runtime 1.17 Model: T5 Encoder - Device: CPU - Executor: Standard c b a d 0.6516 1.3032 1.9548 2.6064 3.258 SE +/- 0.03607, N = 15 SE +/- 0.03168, N = 15 2.66641 2.67306 2.71650 2.89589 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: T5 Encoder - Device: CPU - Executor: Standard OpenBenchmarking.org Inferences Per Second, More Is Better ONNX Runtime 1.17 Model: T5 Encoder - Device: CPU - Executor: Standard b c a d 80 160 240 320 400 SE +/- 4.76, N = 15 SE +/- 4.18, N = 15 374.21 374.20 367.96 344.56 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: yolov4 - Device: CPU - Executor: Parallel OpenBenchmarking.org Inference Time Cost (ms), Fewer Is Better ONNX Runtime 1.17 Model: yolov4 - Device: CPU - Executor: Parallel c a d b 30 60 90 120 150 SE +/- 0.38, N = 3 SE +/- 0.79, N = 13 115.52 119.53 119.81 120.42 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: yolov4 - Device: CPU - Executor: Parallel OpenBenchmarking.org Inferences Per Second, More Is Better ONNX Runtime 1.17 Model: yolov4 - Device: CPU - Executor: Parallel c a d b 2 4 6 8 10 SE +/- 0.02669, N = 3 SE +/- 0.05551, N = 13 8.65640 8.36597 8.34625 8.30820 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: fcn-resnet101-11 - Device: CPU - Executor: Parallel OpenBenchmarking.org Inference Time Cost (ms), Fewer Is Better ONNX Runtime 1.17 Model: fcn-resnet101-11 - Device: CPU - Executor: Parallel c d b a 150 300 450 600 750 SE +/- 3.51, N = 3 SE +/- 6.29, N = 7 655.57 671.34 676.87 677.75 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: fcn-resnet101-11 - Device: CPU - Executor: Parallel OpenBenchmarking.org Inferences Per Second, More Is Better ONNX Runtime 1.17 Model: fcn-resnet101-11 - Device: CPU - Executor: Parallel c d b a 0.3432 0.6864 1.0296 1.3728 1.716 SE +/- 0.00765, N = 3 SE +/- 0.01346, N = 7 1.52538 1.48956 1.47745 1.47621 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: fcn-resnet101-11 - Device: CPU - Executor: Standard OpenBenchmarking.org Inference Time Cost (ms), Fewer Is Better ONNX Runtime 1.17 Model: fcn-resnet101-11 - Device: CPU - Executor: Standard b d a c 150 300 450 600 750 SE +/- 4.04, N = 3 SE +/- 1.86, N = 3 669.13 673.42 673.47 681.94 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: fcn-resnet101-11 - Device: CPU - Executor: Standard OpenBenchmarking.org Inferences Per Second, More Is Better ONNX Runtime 1.17 Model: fcn-resnet101-11 - Device: CPU - Executor: Standard b d a c 0.3363 0.6726 1.0089 1.3452 1.6815 SE +/- 0.00907, N = 3 SE +/- 0.00409, N = 3 1.49458 1.48493 1.48485 1.46640 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: T5 Encoder - Device: CPU - Executor: Parallel OpenBenchmarking.org Inference Time Cost (ms), Fewer Is Better ONNX Runtime 1.17 Model: T5 Encoder - Device: CPU - Executor: Parallel d c a b 3 6 9 12 15 SE +/- 0.10, N = 3 SE +/- 0.10, N = 3 12.72 13.11 13.29 13.48 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: T5 Encoder - Device: CPU - Executor: Parallel OpenBenchmarking.org Inferences Per Second, More Is Better ONNX Runtime 1.17 Model: T5 Encoder - Device: CPU - Executor: Parallel d c a b 20 40 60 80 100 SE +/- 0.60, N = 3 SE +/- 0.58, N = 3 78.59 76.26 75.24 74.16 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: yolov4 - Device: CPU - Executor: Standard OpenBenchmarking.org Inference Time Cost (ms), Fewer Is Better ONNX Runtime 1.17 Model: yolov4 - Device: CPU - Executor: Standard d c b a 20 40 60 80 100 SE +/- 0.59, N = 3 SE +/- 0.78, N = 3 87.63 87.95 89.40 90.35 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: yolov4 - Device: CPU - Executor: Standard OpenBenchmarking.org Inferences Per Second, More Is Better ONNX Runtime 1.17 Model: yolov4 - Device: CPU - Executor: Standard d c b a 3 6 9 12 15 SE +/- 0.07, N = 3 SE +/- 0.10, N = 3 11.41 11.37 11.19 11.07 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: CaffeNet 12-int8 - Device: CPU - Executor: Parallel OpenBenchmarking.org Inference Time Cost (ms), Fewer Is Better ONNX Runtime 1.17 Model: CaffeNet 12-int8 - Device: CPU - Executor: Parallel a c d b 0.4223 0.8446 1.2669 1.6892 2.1115 SE +/- 0.01302, N = 3 SE +/- 0.01025, N = 3 1.82009 1.85241 1.86068 1.87709 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: CaffeNet 12-int8 - Device: CPU - Executor: Parallel OpenBenchmarking.org Inferences Per Second, More Is Better ONNX Runtime 1.17 Model: CaffeNet 12-int8 - Device: CPU - Executor: Parallel a c d b 120 240 360 480 600 SE +/- 3.89, N = 3 SE +/- 2.89, N = 3 548.97 539.31 536.92 532.26 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: CaffeNet 12-int8 - Device: CPU - Executor: Standard OpenBenchmarking.org Inference Time Cost (ms), Fewer Is Better ONNX Runtime 1.17 Model: CaffeNet 12-int8 - Device: CPU - Executor: Standard c d b a 0.202 0.404 0.606 0.808 1.01 SE +/- 0.008022, N = 3 SE +/- 0.005008, N = 3 0.858087 0.865053 0.887622 0.897753 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: CaffeNet 12-int8 - Device: CPU - Executor: Standard OpenBenchmarking.org Inferences Per Second, More Is Better ONNX Runtime 1.17 Model: CaffeNet 12-int8 - Device: CPU - Executor: Standard c d b a 300 600 900 1200 1500 SE +/- 10.31, N = 3 SE +/- 6.25, N = 3 1164.06 1154.69 1125.55 1112.69 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Parallel OpenBenchmarking.org Inference Time Cost (ms), Fewer Is Better ONNX Runtime 1.17 Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Parallel c d a b 0.933 1.866 2.799 3.732 4.665 SE +/- 0.03271, N = 3 SE +/- 0.03166, N = 3 4.04811 4.07887 4.14119 4.14668 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Parallel OpenBenchmarking.org Inferences Per Second, More Is Better ONNX Runtime 1.17 Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Parallel c d a b 50 100 150 200 250 SE +/- 1.89, N = 3 SE +/- 1.85, N = 3 246.94 245.09 241.43 241.10 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Standard OpenBenchmarking.org Inference Time Cost (ms), Fewer Is Better ONNX Runtime 1.17 Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Standard c d b a 0.825 1.65 2.475 3.3 4.125 SE +/- 0.03859, N = 3 SE +/- 0.03802, N = 3 3.44312 3.51969 3.60302 3.66668 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Standard OpenBenchmarking.org Inferences Per Second, More Is Better ONNX Runtime 1.17 Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Standard c d b a 60 120 180 240 300 SE +/- 3.00, N = 3 SE +/- 2.85, N = 3 290.32 284.01 277.49 272.68 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: super-resolution-10 - Device: CPU - Executor: Parallel OpenBenchmarking.org Inference Time Cost (ms), Fewer Is Better ONNX Runtime 1.17 Model: super-resolution-10 - Device: CPU - Executor: Parallel a d b c 6 12 18 24 30 SE +/- 0.21, N = 3 SE +/- 0.20, N = 3 23.06 23.33 23.38 23.98 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: super-resolution-10 - Device: CPU - Executor: Parallel OpenBenchmarking.org Inferences Per Second, More Is Better ONNX Runtime 1.17 Model: super-resolution-10 - Device: CPU - Executor: Parallel a d b c 10 20 30 40 50 SE +/- 0.40, N = 3 SE +/- 0.37, N = 3 43.37 42.86 42.77 41.69 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: super-resolution-10 - Device: CPU - Executor: Standard OpenBenchmarking.org Inference Time Cost (ms), Fewer Is Better ONNX Runtime 1.17 Model: super-resolution-10 - Device: CPU - Executor: Standard d a b c 2 4 6 8 10 SE +/- 0.00749, N = 3 SE +/- 0.02580, N = 3 6.67817 6.68539 6.72220 6.74173 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
ONNX Runtime Model: super-resolution-10 - Device: CPU - Executor: Standard OpenBenchmarking.org Inferences Per Second, More Is Better ONNX Runtime 1.17 Model: super-resolution-10 - Device: CPU - Executor: Standard d a b c 30 60 90 120 150 SE +/- 0.17, N = 3 SE +/- 0.57, N = 3 149.68 149.55 148.71 148.30 1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt
Phoronix Test Suite v10.8.5