onnx gh200

ARMv8 Neoverse-V2 testing with a Quanta Cloud QuantaGrid S74G-2U 1S7GZ9Z0000 S7G MB (CG1) (3A06 BIOS) and ASPEED on Ubuntu 23.10 via the Phoronix Test Suite.

HTML result view exported from: https://openbenchmarking.org/result/2402042-NE-ONNXGH20009&rdt.

onnx gh200ProcessorMotherboardMemoryDiskGraphicsNetworkOSKernelCompilerFile-SystemScreen ResolutionabcdARMv8 Neoverse-V2 @ 3.39GHz (72 Cores)Quanta Cloud QuantaGrid S74G-2U 1S7GZ9Z0000 S7G MB (CG1) (3A06 BIOS)1 x 480GB DRAM-6400MT/s960GB SAMSUNG MZ1L2960HCJR-00A07 + 1920GB SAMSUNG MZTL21T9ASPEED2 x Mellanox MT2910 + 2 x QLogic FastLinQ QL41000 10/25/40/50GbEUbuntu 23.106.5.0-15-generic (aarch64)GCC 13.2.0ext41920x1200OpenBenchmarking.orgKernel Details- Transparent Huge Pages: madviseCompiler Details- --build=aarch64-linux-gnu --disable-libquadmath --disable-libquadmath-support --disable-werror --enable-bootstrap --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-fix-cortex-a53-843419 --enable-gnu-unique-object --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --enable-libphobos-checking=release --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-link-serialization=2 --enable-multiarch --enable-nls --enable-objc-gc=auto --enable-plugin --enable-shared --enable-threads=posix --host=aarch64-linux-gnu --program-prefix=aarch64-linux-gnu- --target=aarch64-linux-gnu --with-build-config=bootstrap-lto-lean --with-default-libstdcxx-abi=new --with-gcc-major-version-only --with-target-system-zlib=auto -v Processor Details- Scaling Governor: cppc_cpufreq ondemand (Boost: Disabled)Python Details- Python 3.11.6Security Details- gather_data_sampling: Not affected + itlb_multihit: Not affected + l1tf: Not affected + mds: Not affected + meltdown: Not affected + mmio_stale_data: Not affected + retbleed: Not affected + spec_rstack_overflow: Not affected + spec_store_bypass: Mitigation of SSB disabled via prctl + spectre_v1: Mitigation of __user pointer sanitization + spectre_v2: Not affected + srbds: Not affected + tsx_async_abort: Not affected

onnx gh200onnx: yolov4 - CPU - Parallelonnx: yolov4 - CPU - Parallelonnx: yolov4 - CPU - Standardonnx: yolov4 - CPU - Standardonnx: T5 Encoder - CPU - Parallelonnx: T5 Encoder - CPU - Parallelonnx: T5 Encoder - CPU - Standardonnx: T5 Encoder - CPU - Standardonnx: CaffeNet 12-int8 - CPU - Parallelonnx: CaffeNet 12-int8 - CPU - Parallelonnx: CaffeNet 12-int8 - CPU - Standardonnx: CaffeNet 12-int8 - CPU - Standardonnx: fcn-resnet101-11 - CPU - Parallelonnx: fcn-resnet101-11 - CPU - Parallelonnx: fcn-resnet101-11 - CPU - Standardonnx: fcn-resnet101-11 - CPU - Standardonnx: ResNet50 v1-12-int8 - CPU - Parallelonnx: ResNet50 v1-12-int8 - CPU - Parallelonnx: ResNet50 v1-12-int8 - CPU - Standardonnx: ResNet50 v1-12-int8 - CPU - Standardonnx: super-resolution-10 - CPU - Parallelonnx: super-resolution-10 - CPU - Parallelonnx: super-resolution-10 - CPU - Standardonnx: super-resolution-10 - CPU - Standardonnx: Faster R-CNN R-50-FPN-int8 - CPU - Standardabcd8.36597119.53011.068790.350175.241713.2889367.9632.71650548.9651.820091112.690.8977531.47621677.7511.48485673.470241.4284.14119272.6813.6666843.370323.0604149.5456.685398.30820120.42211.185189.404174.163913.4822374.2072.67306532.2611.877091125.550.8876221.47745676.8741.49458669.125241.1014.14668277.4943.6030242.773923.3815148.7146.722208.6564115.51811.368587.954876.260513.1099374.2022.66641539.311.852411164.060.8580871.52538655.5671.4664681.935246.9444.04811290.323.4431241.691723.9848148.2996.741738.34625119.8111.410887.62878.589212.7215344.5622.89589536.9151.860681154.690.8650531.48956671.3351.48493673.423245.0864.07887284.013.5196942.856723.3328149.6766.67817OpenBenchmarking.org

ONNX Runtime

Model: yolov4 - Device: CPU - Executor: Parallel

OpenBenchmarking.orgInferences Per Second, More Is BetterONNX Runtime 1.17Model: yolov4 - Device: CPU - Executor: Parallelabcd246810SE +/- 0.02669, N = 3SE +/- 0.05551, N = 138.365978.308208.656408.346251. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: yolov4 - Device: CPU - Executor: Parallel

OpenBenchmarking.orgInference Time Cost (ms), Fewer Is BetterONNX Runtime 1.17Model: yolov4 - Device: CPU - Executor: Parallelabcd306090120150SE +/- 0.38, N = 3SE +/- 0.79, N = 13119.53120.42115.52119.811. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: yolov4 - Device: CPU - Executor: Standard

OpenBenchmarking.orgInferences Per Second, More Is BetterONNX Runtime 1.17Model: yolov4 - Device: CPU - Executor: Standardabcd3691215SE +/- 0.10, N = 3SE +/- 0.07, N = 311.0711.1911.3711.411. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: yolov4 - Device: CPU - Executor: Standard

OpenBenchmarking.orgInference Time Cost (ms), Fewer Is BetterONNX Runtime 1.17Model: yolov4 - Device: CPU - Executor: Standardabcd20406080100SE +/- 0.78, N = 3SE +/- 0.59, N = 390.3589.4087.9587.631. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: T5 Encoder - Device: CPU - Executor: Parallel

OpenBenchmarking.orgInferences Per Second, More Is BetterONNX Runtime 1.17Model: T5 Encoder - Device: CPU - Executor: Parallelabcd20406080100SE +/- 0.60, N = 3SE +/- 0.58, N = 375.2474.1676.2678.591. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: T5 Encoder - Device: CPU - Executor: Parallel

OpenBenchmarking.orgInference Time Cost (ms), Fewer Is BetterONNX Runtime 1.17Model: T5 Encoder - Device: CPU - Executor: Parallelabcd3691215SE +/- 0.10, N = 3SE +/- 0.10, N = 313.2913.4813.1112.721. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: T5 Encoder - Device: CPU - Executor: Standard

OpenBenchmarking.orgInferences Per Second, More Is BetterONNX Runtime 1.17Model: T5 Encoder - Device: CPU - Executor: Standardabcd80160240320400SE +/- 4.18, N = 15SE +/- 4.76, N = 15367.96374.21374.20344.561. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: T5 Encoder - Device: CPU - Executor: Standard

OpenBenchmarking.orgInference Time Cost (ms), Fewer Is BetterONNX Runtime 1.17Model: T5 Encoder - Device: CPU - Executor: Standardabcd0.65161.30321.95482.60643.258SE +/- 0.03168, N = 15SE +/- 0.03607, N = 152.716502.673062.666412.895891. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: CaffeNet 12-int8 - Device: CPU - Executor: Parallel

OpenBenchmarking.orgInferences Per Second, More Is BetterONNX Runtime 1.17Model: CaffeNet 12-int8 - Device: CPU - Executor: Parallelabcd120240360480600SE +/- 3.89, N = 3SE +/- 2.89, N = 3548.97532.26539.31536.921. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: CaffeNet 12-int8 - Device: CPU - Executor: Parallel

OpenBenchmarking.orgInference Time Cost (ms), Fewer Is BetterONNX Runtime 1.17Model: CaffeNet 12-int8 - Device: CPU - Executor: Parallelabcd0.42230.84461.26691.68922.1115SE +/- 0.01302, N = 3SE +/- 0.01025, N = 31.820091.877091.852411.860681. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: CaffeNet 12-int8 - Device: CPU - Executor: Standard

OpenBenchmarking.orgInferences Per Second, More Is BetterONNX Runtime 1.17Model: CaffeNet 12-int8 - Device: CPU - Executor: Standardabcd30060090012001500SE +/- 6.25, N = 3SE +/- 10.31, N = 31112.691125.551164.061154.691. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: CaffeNet 12-int8 - Device: CPU - Executor: Standard

OpenBenchmarking.orgInference Time Cost (ms), Fewer Is BetterONNX Runtime 1.17Model: CaffeNet 12-int8 - Device: CPU - Executor: Standardabcd0.2020.4040.6060.8081.01SE +/- 0.005008, N = 3SE +/- 0.008022, N = 30.8977530.8876220.8580870.8650531. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: fcn-resnet101-11 - Device: CPU - Executor: Parallel

OpenBenchmarking.orgInferences Per Second, More Is BetterONNX Runtime 1.17Model: fcn-resnet101-11 - Device: CPU - Executor: Parallelabcd0.34320.68641.02961.37281.716SE +/- 0.01346, N = 7SE +/- 0.00765, N = 31.476211.477451.525381.489561. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: fcn-resnet101-11 - Device: CPU - Executor: Parallel

OpenBenchmarking.orgInference Time Cost (ms), Fewer Is BetterONNX Runtime 1.17Model: fcn-resnet101-11 - Device: CPU - Executor: Parallelabcd150300450600750SE +/- 6.29, N = 7SE +/- 3.51, N = 3677.75676.87655.57671.341. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: fcn-resnet101-11 - Device: CPU - Executor: Standard

OpenBenchmarking.orgInferences Per Second, More Is BetterONNX Runtime 1.17Model: fcn-resnet101-11 - Device: CPU - Executor: Standardabcd0.33630.67261.00891.34521.6815SE +/- 0.00409, N = 3SE +/- 0.00907, N = 31.484851.494581.466401.484931. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: fcn-resnet101-11 - Device: CPU - Executor: Standard

OpenBenchmarking.orgInference Time Cost (ms), Fewer Is BetterONNX Runtime 1.17Model: fcn-resnet101-11 - Device: CPU - Executor: Standardabcd150300450600750SE +/- 1.86, N = 3SE +/- 4.04, N = 3673.47669.13681.94673.421. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Parallel

OpenBenchmarking.orgInferences Per Second, More Is BetterONNX Runtime 1.17Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Parallelabcd50100150200250SE +/- 1.89, N = 3SE +/- 1.85, N = 3241.43241.10246.94245.091. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Parallel

OpenBenchmarking.orgInference Time Cost (ms), Fewer Is BetterONNX Runtime 1.17Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Parallelabcd0.9331.8662.7993.7324.665SE +/- 0.03271, N = 3SE +/- 0.03166, N = 34.141194.146684.048114.078871. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Standard

OpenBenchmarking.orgInferences Per Second, More Is BetterONNX Runtime 1.17Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Standardabcd60120180240300SE +/- 2.85, N = 3SE +/- 3.00, N = 3272.68277.49290.32284.011. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Standard

OpenBenchmarking.orgInference Time Cost (ms), Fewer Is BetterONNX Runtime 1.17Model: ResNet50 v1-12-int8 - Device: CPU - Executor: Standardabcd0.8251.652.4753.34.125SE +/- 0.03802, N = 3SE +/- 0.03859, N = 33.666683.603023.443123.519691. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: super-resolution-10 - Device: CPU - Executor: Parallel

OpenBenchmarking.orgInferences Per Second, More Is BetterONNX Runtime 1.17Model: super-resolution-10 - Device: CPU - Executor: Parallelabcd1020304050SE +/- 0.40, N = 3SE +/- 0.37, N = 343.3742.7741.6942.861. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: super-resolution-10 - Device: CPU - Executor: Parallel

OpenBenchmarking.orgInference Time Cost (ms), Fewer Is BetterONNX Runtime 1.17Model: super-resolution-10 - Device: CPU - Executor: Parallelabcd612182430SE +/- 0.21, N = 3SE +/- 0.20, N = 323.0623.3823.9823.331. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: super-resolution-10 - Device: CPU - Executor: Standard

OpenBenchmarking.orgInferences Per Second, More Is BetterONNX Runtime 1.17Model: super-resolution-10 - Device: CPU - Executor: Standardabcd306090120150SE +/- 0.17, N = 3SE +/- 0.57, N = 3149.55148.71148.30149.681. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

ONNX Runtime

Model: super-resolution-10 - Device: CPU - Executor: Standard

OpenBenchmarking.orgInference Time Cost (ms), Fewer Is BetterONNX Runtime 1.17Model: super-resolution-10 - Device: CPU - Executor: Standardabcd246810SE +/- 0.00749, N = 3SE +/- 0.02580, N = 36.685396.722206.741736.678171. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt


Phoronix Test Suite v10.8.4