AMD EPYC Turin AI/ML Tuning Guide

Benchmarks of the AMD EPYC 9655P configured per AMD's tuning guide for AI/ML workloads (https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58467_amd-epyc-9005-tg-bios-and-workload.pdf), run by Michael Larabel for a future article.

Compare your own system(s) to this result file with the Phoronix Test Suite by running the command: phoronix-test-suite benchmark 2411286-NE-AMDEPYCTU24

Run Management

Run | Date | Test Duration
Stock | November 28 2024 | 5 Hours, 44 Minutes
AI/ML Tuning Recommendations | November 28 2024 | 6 Hours, 7 Minutes
Average | | 5 Hours, 55 Minutes


AMD EPYC Turin AI/ML Tuning Guide Benchmarks (OpenBenchmarking.org, Phoronix Test Suite)

Processor: AMD EPYC 9655P 96-Core @ 2.60GHz (96 Cores / 192 Threads)
Motherboard: Supermicro Super Server H13SSL-N v1.01 (3.0 BIOS)
Chipset: AMD 1Ah
Memory: 12 x 64GB DDR5-6000MT/s Micron MTC40F2046S1RC64BDY QSFF
Disk: 3201GB Micron_7450_MTFDKCB3T2TFS
Graphics: ASPEED
Network: 2 x Broadcom NetXtreme BCM5720 PCIe
OS: Ubuntu 24.10
Kernel: 6.12.0-rc7-linux-pm-next-phx (x86_64)
Desktop: GNOME Shell 47.0
Display Server: X Server
Compiler: GCC 14.2.0
File-System: ext4
Screen Resolution: 1024x768

System Logs
- Transparent Huge Pages: madvise
- GCC configure options: --build=x86_64-linux-gnu --disable-vtable-verify --disable-werror --enable-bootstrap --enable-cet --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-gnu-unique-object --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2,rust --enable-libphobos-checking=release --enable-libstdcxx-backtrace --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-link-serialization=2 --enable-multiarch --enable-multilib --enable-nls --enable-objc-gc=auto --enable-offload-defaulted --enable-offload-targets=nvptx-none=/build/gcc-14-zdkDXv/gcc-14-14.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-14-zdkDXv/gcc-14-14.2.0/debian/tmp-gcn/usr --enable-plugin --enable-shared --enable-threads=posix --host=x86_64-linux-gnu --program-prefix=x86_64-linux-gnu- --target=x86_64-linux-gnu --with-abi=m64 --with-arch-32=i686 --with-build-config=bootstrap-lto-lean --with-default-libstdcxx-abi=new --with-gcc-major-version-only --with-multilib-list=m32,m64,mx32 --with-target-system-zlib=auto --with-tune=generic --without-cuda-driver -v
- Scaling Governor: acpi-cpufreq performance (Boost: Enabled)
- CPU Microcode: 0xb002116
- Python 3.12.7
- Security: gather_data_sampling: Not affected + itlb_multihit: Not affected + l1tf: Not affected + mds: Not affected + meltdown: Not affected + mmio_stale_data: Not affected + reg_file_data_sampling: Not affected + retbleed: Not affected + spec_rstack_overflow: Not affected + spec_store_bypass: Mitigation of SSB disabled via prctl + spectre_v1: Mitigation of usercopy/swapgs barriers and __user pointer sanitization + spectre_v2: Mitigation of Enhanced / Automatic IBRS; IBPB: conditional; STIBP: always-on; RSB filling; PBRSB-eIBRS: Not affected; BHI: Not affected + srbds: Not affected + tsx_async_abort: Not affected
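When comparing your own system against this result file, it can help to confirm that it reports the same Transparent Huge Pages mode and scaling governor as the system logs. The following is a minimal sketch, assuming the standard Linux sysfs paths and using cpu0 as a representative core; the helper names are hypothetical and not part of the Phoronix Test Suite.

```python
from pathlib import Path

# Standard Linux sysfs locations. Using cpu0 as a representative core is an
# assumption; governors can in principle differ per core.
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")
GOVERNOR_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")


def active_thp_mode(text):
    """Return the bracketed (active) option from a line like 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_text(path):
    """Read a sysfs file, returning None if it is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def report():
    """Print the local settings next to the values listed in the result file."""
    thp_raw = read_text(THP_PATH)
    governor = read_text(GOVERNOR_PATH)
    thp = active_thp_mode(thp_raw) if thp_raw else None
    print(f"Transparent Huge Pages: {thp} (result file: madvise)")
    print(f"Scaling governor: {governor} (result file: performance)")


if __name__ == "__main__":
    report()
```

On a system without these sysfs entries the script prints None for the missing value rather than failing.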

[Chart: Stock vs. AI/ML Tuning Recommendations Comparison (Phoronix Test Suite). Per-test percentage difference between the two runs relative to the baseline. The differences shown range from 2% to 10.3%; the two largest, 10.3% and 10%, are for W.P.D.F.I - CPU (OpenVINO).]
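The percentages in the comparison chart can be reproduced from the raw results. The sketch below assumes the chart reports the ratio of the better value to the baseline, inverted for "Fewer Is Better" metrics; that convention is inferred from the numbers, not stated on the page. It uses the OpenVINO Weld Porosity Detection FP16-INT8 results from this file, and the function name is hypothetical.

```python
def percent_gain(stock, tuned, higher_is_better=True):
    """Percentage by which the tuned run improves on the stock run.

    For 'Fewer Is Better' metrics (e.g. latency in ms) the ratio is inverted
    so that a positive number always means the tuned run is better.
    """
    ratio = tuned / stock if higher_is_better else stock / tuned
    return (ratio - 1.0) * 100.0


# OpenVINO Weld Porosity Detection FP16-INT8 - CPU, values from this result file.
fps_gain = percent_gain(14035.76, 15437.79)                        # FPS, More Is Better
latency_gain = percent_gain(6.72, 6.09, higher_is_better=False)    # ms, Fewer Is Better

print(f"FPS: +{fps_gain:.1f}%")
print(f"Latency: +{latency_gain:.1f}%")
```

These two figures line up with the 10% and 10.3% entries at the top of the chart.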

AMD EPYC Turin AI/ML Tuning Guide (OpenBenchmarking.org)

Test | Stock | AI/ML Tuning Recommendations
openvino: Age Gender Recognition Retail 0013 FP16 - CPU (FPS) | 140497.34 | 146329.80
openvino: Age Gender Recognition Retail 0013 FP16 - CPU (ms) | 0.45 | 0.43
openvino: Person Detection FP16 - CPU (FPS) | 716.20 | 730.73
openvino: Person Detection FP16 - CPU (ms) | 66.89 | 65.56
openvino: Weld Porosity Detection FP16-INT8 - CPU (FPS) | 14035.76 | 15437.79
openvino: Weld Porosity Detection FP16-INT8 - CPU (ms) | 6.72 | 6.09
openvino: Vehicle Detection FP16-INT8 - CPU (FPS) | 8270.99 | 8691.00
openvino: Vehicle Detection FP16-INT8 - CPU (ms) | 5.77 | 5.49
openvino: Person Vehicle Bike Detection FP16 - CPU (FPS) | 6720.98 | 7144.84
openvino: Person Vehicle Bike Detection FP16 - CPU (ms) | 7.09 | 6.66
openvino: Machine Translation EN To DE FP16 - CPU (FPS) | 852.83 | 881.22
openvino: Machine Translation EN To DE FP16 - CPU (ms) | 56.24 | 54.43
openvino: Face Detection Retail FP16-INT8 - CPU (FPS) | 23587.02 | 25050.47
openvino: Face Detection Retail FP16-INT8 - CPU (ms) | 3.94 | 3.71
openvino: Handwritten English Recognition FP16-INT8 - CPU (FPS) | 3559.87 | 3649.63
openvino: Handwritten English Recognition FP16-INT8 - CPU (ms) | 26.94 | 26.28
openvino: Road Segmentation ADAS FP16-INT8 - CPU (FPS) | 2517.89 | 2630.42
openvino: Road Segmentation ADAS FP16-INT8 - CPU (ms) | 18.99 | 18.16
openvino: Person Re-Identification Retail FP16 - CPU (FPS) | 10525.86 | 10800.12
openvino: Person Re-Identification Retail FP16 - CPU (ms) | 4.53 | 4.41
openvino: Noise Suppression Poconet-Like FP16 - CPU (FPS) | 6747.48 | 6891.64
openvino: Noise Suppression Poconet-Like FP16 - CPU (ms) | 13.68 | 13.38
openvino-genai: Phi-3-mini-128k-instruct-int4-ov - CPU (tokens/s) | 55.63 | 56.46
openvino-genai: Phi-3-mini-128k-instruct-int4-ov - CPU - Time To First Token | 24.17 | 25.66
openvino-genai: Phi-3-mini-128k-instruct-int4-ov - CPU - Time Per Output Token | 17.98 | 17.71
openvino-genai: Falcon-7b-instruct-int4-ov - CPU (tokens/s) | 51.01 | 51.12
openvino-genai: Falcon-7b-instruct-int4-ov - CPU - Time To First Token | 29.07 | 30.70
openvino-genai: Falcon-7b-instruct-int4-ov - CPU - Time Per Output Token | 19.60 | 19.56
openvino-genai: Gemma-7b-int4-ov - CPU (tokens/s) | 37.84 | 38.00
openvino-genai: Gemma-7b-int4-ov - CPU - Time To First Token | 36.13 | 35.43
openvino-genai: Gemma-7b-int4-ov - CPU - Time Per Output Token | 26.43 | 26.31
llama-cpp: CPU BLAS - Llama-3.1-Tulu-3-8B-Q8_0 - Text Generation 128 | 45.84 | 46.45
llama-cpp: CPU BLAS - Llama-3.1-Tulu-3-8B-Q8_0 - Prompt Processing 512 | 72.99 | 76.41
llama-cpp: CPU BLAS - Llama-3.1-Tulu-3-8B-Q8_0 - Prompt Processing 1024 | 96.99 | 101.77
llama-cpp: CPU BLAS - Llama-3.1-Tulu-3-8B-Q8_0 - Prompt Processing 2048 | 144.53 | 152.92
llama-cpp: CPU BLAS - granite-3.0-3b-a800m-instruct-Q8_0 - Text Generation 128 | 92.82 | 95.42
llama-cpp: CPU BLAS - granite-3.0-3b-a800m-instruct-Q8_0 - Prompt Processing 512 | 154.60 | 155.73
llama-cpp: CPU BLAS - granite-3.0-3b-a800m-instruct-Q8_0 - Prompt Processing 2048 | 306.51 | 308.32
llama-cpp: CPU BLAS - Mistral-7B-Instruct-v0.3-Q8_0 - Text Generation 128 | 48.07 | 48.73
llama-cpp: CPU BLAS - Mistral-7B-Instruct-v0.3-Q8_0 - Prompt Processing 512 | 72.4 | 77.05
llama-cpp: CPU BLAS - Mistral-7B-Instruct-v0.3-Q8_0 - Prompt Processing 1024 | 97.19 | 101.71
llama-cpp: CPU BLAS - Mistral-7B-Instruct-v0.3-Q8_0 - Prompt Processing 2048 | 149.00 | 150.29
whisperfile: Tiny | 31.88560 | 31.43573
whisperfile: Small | 90.67175 | 88.03640
whisperfile: Medium | 200.48292 | 197.24095
whisper-cpp: ggml-small.en - 2016 State of the Union | 221.62918 | 214.62792
whisper-cpp: ggml-medium.en - 2016 State of the Union | 454.28794 | 449.70262
tensorflow: CPU - 256 - ResNet-50 | 204.33 | 207.38
tensorflow: CPU - 512 - ResNet-50 | 231.18 | 235.35
litert: Mobilenet Float | 4438.52 | 4335.67
litert: NASNet Mobile | 733737 | 689396
litert: SqueezeNet | 7035.74 | 6926.88
litert: Inception V4 | 43898.9 | 43824.9
pytorch: CPU - 256 - ResNet-50 | 51.61 | 53.13
pytorch: CPU - 256 - ResNet-152 | 20.78 | 21.79
pytorch: CPU - 512 - ResNet-50 | 51.48 | 53.34
pytorch: CPU - 512 - ResNet-152 | 20.60 | 21.71
onednn: Convolution Batch Shapes Auto - CPU | 0.341475 | 0.321833
onednn: Deconvolution Batch shapes_1d - CPU | 6.70897 | 6.65482
onednn: Deconvolution Batch shapes_3d - CPU | 0.718484 | 0.677050
onednn: IP Shapes 1D - CPU | 0.535874 | 0.507355
onednn: IP Shapes 3D - CPU | 0.265564 | 0.254123
onednn: Recurrent Neural Network Training - CPU | 425.718 | 406.453
onednn: Recurrent Neural Network Inference - CPU | 276.379 | 262.517
numpy | 885.50 | 887.75
onnx: ResNet50 v1-12-int8 - CPU - Standard (Inferences Per Second) | 276.076 | 285.104
onnx: ResNet50 v1-12-int8 - CPU - Standard | 3.62341 | 3.50697
onnx: ResNet101_DUC_HDC-12 - CPU - Standard (Inferences Per Second) | 6.09788 | 6.35626
onnx: ResNet101_DUC_HDC-12 - CPU - Standard | 165.094 | 158.528
xnnpack: FP32MobileNetV1 | 4539 | 4471
xnnpack: FP32MobileNetV2 | 9203 | 9062
xnnpack: FP32MobileNetV3Small | 10488 | 10387
xnnpack: FP16MobileNetV1 | 4634 | 4513
xnnpack: FP16MobileNetV2 | 9092 | 9028
xnnpack: FP16MobileNetV3Small | 10966 | 10323
xnnpack: QS8MobileNetV2 | 10042 | 9813

OpenVINO

OpenVINO 2024.5 (OpenBenchmarking.org). FPS: More Is Better. ms: Fewer Is Better.

Model: Age Gender Recognition Retail 0013 FP16 - Device: CPU (FPS)
  Stock: 140497.34 (SE +/- 313.81, N = 3)
  AI/ML Tuning Recommendations: 146329.80 (SE +/- 528.15, N = 3)

Model: Age Gender Recognition Retail 0013 FP16 - Device: CPU (ms)
  Stock: 0.45 (SE +/- 0.01, N = 3; MIN: 0.16 / MAX: 25.14)
  AI/ML Tuning Recommendations: 0.43 (SE +/- 0.00, N = 3; MIN: 0.15 / MAX: 23.94)

Model: Person Detection FP16 - Device: CPU (FPS)
  Stock: 716.20 (SE +/- 0.66, N = 3)
  AI/ML Tuning Recommendations: 730.73 (SE +/- 0.74, N = 3)

Model: Person Detection FP16 - Device: CPU (ms)
  Stock: 66.89 (SE +/- 0.06, N = 3; MIN: 34.58 / MAX: 130)
  AI/ML Tuning Recommendations: 65.56 (SE +/- 0.07, N = 3; MIN: 32.6 / MAX: 131.97)

Model: Weld Porosity Detection FP16-INT8 - Device: CPU (FPS)
  Stock: 14035.76 (SE +/- 9.10, N = 3)
  AI/ML Tuning Recommendations: 15437.79 (SE +/- 14.35, N = 3)

Model: Weld Porosity Detection FP16-INT8 - Device: CPU (ms)
  Stock: 6.72 (SE +/- 0.00, N = 3; MIN: 2.22 / MAX: 22.18)
  AI/ML Tuning Recommendations: 6.09 (SE +/- 0.00, N = 3; MIN: 2.21 / MAX: 23.86)

Model: Vehicle Detection FP16-INT8 - Device: CPU (FPS)
  Stock: 8270.99 (SE +/- 3.96, N = 3)
  AI/ML Tuning Recommendations: 8691.00 (SE +/- 7.22, N = 3)

Model: Vehicle Detection FP16-INT8 - Device: CPU (ms)
  Stock: 5.77 (SE +/- 0.00, N = 3; MIN: 2.28 / MAX: 21.36)
  AI/ML Tuning Recommendations: 5.49 (SE +/- 0.01, N = 3; MIN: 2.47 / MAX: 19.56)

Model: Person Vehicle Bike Detection FP16 - Device: CPU (FPS)
  Stock: 6720.98 (SE +/- 13.33, N = 3)
  AI/ML Tuning Recommendations: 7144.84 (SE +/- 8.11, N = 3)

Model: Person Vehicle Bike Detection FP16 - Device: CPU (ms)
  Stock: 7.09 (SE +/- 0.01, N = 3; MIN: 4.15 / MAX: 20.16)
  AI/ML Tuning Recommendations: 6.66 (SE +/- 0.01, N = 3; MIN: 3.65 / MAX: 22.09)

Model: Machine Translation EN To DE FP16 - Device: CPU (FPS)
  Stock: 852.83 (SE +/- 0.22, N = 3)
  AI/ML Tuning Recommendations: 881.22 (SE +/- 0.40, N = 3)

Model: Machine Translation EN To DE FP16 - Device: CPU (ms)
  Stock: 56.24 (SE +/- 0.02, N = 3; MIN: 29.3 / MAX: 94.69)
  AI/ML Tuning Recommendations: 54.43 (SE +/- 0.03, N = 3; MIN: 28.33 / MAX: 92.29)

Model: Face Detection Retail FP16-INT8 - Device: CPU (FPS)
  Stock: 23587.02 (SE +/- 17.66, N = 3)
  AI/ML Tuning Recommendations: 25050.47 (SE +/- 17.84, N = 3)

Model: Face Detection Retail FP16-INT8 - Device: CPU (ms)
  Stock: 3.94 (SE +/- 0.01, N = 3; MIN: 1.76 / MAX: 17.65)
  AI/ML Tuning Recommendations: 3.71 (SE +/- 0.00, N = 3; MIN: 1.71 / MAX: 19.33)

Model: Handwritten English Recognition FP16-INT8 - Device: CPU (FPS)
  Stock: 3559.87 (SE +/- 2.96, N = 3)
  AI/ML Tuning Recommendations: 3649.63 (SE +/- 3.35, N = 3)

Model: Handwritten English Recognition FP16-INT8 - Device: CPU (ms)
  Stock: 26.94 (SE +/- 0.02, N = 3; MIN: 15.65 / MAX: 45.15)
  AI/ML Tuning Recommendations: 26.28 (SE +/- 0.02, N = 3; MIN: 15.86 / MAX: 40.89)

Model: Road Segmentation ADAS FP16-INT8 - Device: CPU (FPS)
  Stock: 2517.89 (SE +/- 7.45, N = 3)
  AI/ML Tuning Recommendations: 2630.42 (SE +/- 2.42, N = 3)

Model: Road Segmentation ADAS FP16-INT8 - Device: CPU (ms)
  Stock: 18.99 (SE +/- 0.06, N = 3; MIN: 9.19 / MAX: 39)
  AI/ML Tuning Recommendations: 18.16 (SE +/- 0.02, N = 3; MIN: 7.68 / MAX: 40.38)

Model: Person Re-Identification Retail FP16 - Device: CPU (FPS)
  Stock: 10525.86 (SE +/- 12.77, N = 3)
  AI/ML Tuning Recommendations: 10800.12 (SE +/- 6.08, N = 3)

Model: Person Re-Identification Retail FP16 - Device: CPU (ms)
  Stock: 4.53 (SE +/- 0.01, N = 3; MIN: 1.95 / MAX: 23.94)
  AI/ML Tuning Recommendations: 4.41 (SE +/- 0.00, N = 3; MIN: 2.45 / MAX: 17.46)

Model: Noise Suppression Poconet-Like FP16 - Device: CPU (FPS)
  Stock: 6747.48 (SE +/- 11.12, N = 3)
  AI/ML Tuning Recommendations: 6891.64 (SE +/- 9.90, N = 3)

Model: Noise Suppression Poconet-Like FP16 - Device: CPU (ms)
  Stock: 13.68 (SE +/- 0.02, N = 3; MIN: 7.09 / MAX: 36.01)
  AI/ML Tuning Recommendations: 13.38 (SE +/- 0.02, N = 3; MIN: 6.98 / MAX: 34.62)

1. (CXX) g++ options: -fPIC -fsigned-char -ffunction-sections -fdata-sections -O3 -fno-strict-overflow -fwrapv -shared -ldl -lstdc++fs

OpenVINO GenAI

OpenVINO GenAI 2024.5 (OpenBenchmarking.org). tokens/s, More Is Better.

Model: Phi-3-mini-128k-instruct-int4-ov - Device: CPU
  Stock: 55.63 (SE +/- 0.16, N = 4)
  AI/ML Tuning Recommendations: 56.46 (SE +/- 0.14, N = 4)

Model: Falcon-7b-instruct-int4-ov - Device: CPU
  Stock: 51.01 (SE +/- 0.09, N = 3)
  AI/ML Tuning Recommendations: 51.12 (SE +/- 0.19, N = 3)

Model: Gemma-7b-int4-ov - Device: CPU
  Stock: 37.84 (SE +/- 0.13, N = 3)
  AI/ML Tuning Recommendations: 38.00 (SE +/- 0.09, N = 3)
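The results table for this file also lists a Time Per Output Token figure for each OpenVINO GenAI model. A quick sketch of how that figure relates to the tokens/s results, assuming Time Per Output Token is expressed in milliseconds (the table does not state the unit); the Stock values are taken from this result file and the function name is hypothetical.

```python
def tokens_per_second(ms_per_token):
    """Convert a per-token latency in milliseconds into a throughput in tokens/s."""
    return 1000.0 / ms_per_token


# (Time Per Output Token, reported tokens/s) for the Stock run, from this result file.
stock_results = {
    "Phi-3-mini-128k-instruct-int4-ov": (17.98, 55.63),
    "Falcon-7b-instruct-int4-ov": (19.60, 51.01),
    "Gemma-7b-int4-ov": (26.43, 37.84),
}

for model, (tpot_ms, reported) in stock_results.items():
    derived = tokens_per_second(tpot_ms)
    print(f"{model}: derived {derived:.2f} tokens/s, reported {reported:.2f} tokens/s")
```

The derived and reported throughputs agree to within rounding, which supports the millisecond assumption.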

Llama.cpp

Llama.cpp b4154, Backend: CPU BLAS (OpenBenchmarking.org). Tokens Per Second, More Is Better.

Model: Llama-3.1-Tulu-3-8B-Q8_0 - Test: Text Generation 128
  Stock: 45.84 (SE +/- 0.05, N = 4)
  AI/ML Tuning Recommendations: 46.45 (SE +/- 0.09, N = 4)

Model: Llama-3.1-Tulu-3-8B-Q8_0 - Test: Prompt Processing 512
  Stock: 72.99 (SE +/- 0.98, N = 3)
  AI/ML Tuning Recommendations: 76.41 (SE +/- 0.99, N = 3)

Model: Llama-3.1-Tulu-3-8B-Q8_0 - Test: Prompt Processing 1024
  Stock: 96.99 (SE +/- 0.34, N = 3)
  AI/ML Tuning Recommendations: 101.77 (SE +/- 1.16, N = 3)

Model: Llama-3.1-Tulu-3-8B-Q8_0 - Test: Prompt Processing 2048
  Stock: 144.53 (SE +/- 0.97, N = 3)
  AI/ML Tuning Recommendations: 152.92 (SE +/- 1.38, N = 3)

Model: granite-3.0-3b-a800m-instruct-Q8_0 - Test: Text Generation 128
  Stock: 92.82 (SE +/- 0.49, N = 6)
  AI/ML Tuning Recommendations: 95.42 (SE +/- 0.43, N = 6)

Model: granite-3.0-3b-a800m-instruct-Q8_0 - Test: Prompt Processing 512
  Stock: 154.60 (SE +/- 2.60, N = 12)
  AI/ML Tuning Recommendations: 155.73 (SE +/- 3.23, N = 12)

Model: granite-3.0-3b-a800m-instruct-Q8_0 - Test: Prompt Processing 2048
  Stock: 306.51 (SE +/- 2.62, N = 3)
  AI/ML Tuning Recommendations: 308.32 (SE +/- 2.23, N = 15)

Model: Mistral-7B-Instruct-v0.3-Q8_0 - Test: Text Generation 128
  Stock: 48.07 (SE +/- 0.07, N = 4)
  AI/ML Tuning Recommendations: 48.73 (SE +/- 0.05, N = 4)

Model: Mistral-7B-Instruct-v0.3-Q8_0 - Test: Prompt Processing 512
  Stock: 72.40 (SE +/- 0.83, N = 3)
  AI/ML Tuning Recommendations: 77.05 (SE +/- 0.77, N = 5)

Model: Mistral-7B-Instruct-v0.3-Q8_0 - Test: Prompt Processing 1024
  Stock: 97.19 (SE +/- 1.13, N = 4)
  AI/ML Tuning Recommendations: 101.71 (SE +/- 0.67, N = 15)

Model: Mistral-7B-Instruct-v0.3-Q8_0 - Test: Prompt Processing 2048
  Stock: 149.00 (SE +/- 2.06, N = 3)
  AI/ML Tuning Recommendations: 150.29 (SE +/- 1.36, N = 15)

1. (CXX) g++ options: -std=c++11 -fPIC -O3 -pthread -fopenmp -march=native -mtune=native -lopenblas
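Several of the Llama.cpp differences are small compared with the reported standard errors. One rough way to separate them from noise is to express the gap between the two means in units of their combined standard error. This is only a sketch: it assumes the first SE on each graph belongs to the Stock run and the second to the tuned run, treats the runs as independent, and uses a threshold of 2 as a common rule of thumb rather than anything the Phoronix Test Suite applies. The function name is hypothetical.

```python
import math


def gap_in_se_units(stock, stock_se, tuned, tuned_se):
    """Difference between two means divided by their combined standard error."""
    combined_se = math.sqrt(stock_se ** 2 + tuned_se ** 2)
    return (tuned - stock) / combined_se


# Values from this result file (Tokens Per Second, More Is Better).
llama_pp2048 = gap_in_se_units(144.53, 0.97, 152.92, 1.38)
granite_pp512 = gap_in_se_units(154.60, 2.60, 155.73, 3.23)

THRESHOLD = 2.0  # rule-of-thumb cutoff, an assumption for illustration

for name, gap in [("Llama-3.1-Tulu-3-8B PP 2048", llama_pp2048),
                  ("granite-3.0-3b-a800m PP 512", granite_pp512)]:
    verdict = "likely a real difference" if abs(gap) > THRESHOLD else "within run-to-run noise"
    print(f"{name}: {gap:.2f} combined SEs, {verdict}")
```

Under these assumptions the Llama-3.1-Tulu-3-8B Prompt Processing 2048 gain stands well clear of the noise, while the granite Prompt Processing 512 difference does not.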

Whisperfile

Whisperfile 20Aug24 (OpenBenchmarking.org). Seconds, Fewer Is Better.

Model Size: Tiny
  Stock: 31.89 (SE +/- 0.26, N = 3)
  AI/ML Tuning Recommendations: 31.44 (SE +/- 0.25, N = 3)

Model Size: Small
  Stock: 90.67 (SE +/- 0.71, N = 3)
  AI/ML Tuning Recommendations: 88.04 (SE +/- 0.57, N = 3)

Model Size: Medium
  Stock: 200.48 (SE +/- 0.88, N = 3)
  AI/ML Tuning Recommendations: 197.24 (SE +/- 0.71, N = 3)

Whisper.cpp

Whisper.cpp 1.6.2 (OpenBenchmarking.org). Seconds, Fewer Is Better.

Model: ggml-small.en - Input: 2016 State of the Union
  Stock: 221.63 (SE +/- 2.33, N = 3)
  AI/ML Tuning Recommendations: 214.63 (SE +/- 0.68, N = 3)

Model: ggml-medium.en - Input: 2016 State of the Union
  Stock: 454.29 (SE +/- 1.45, N = 3)
  AI/ML Tuning Recommendations: 449.70 (SE +/- 1.24, N = 3)

1. (CXX) g++ options: -O3 -std=c++11 -fPIC -pthread -msse3 -mssse3 -mavx -mf16c -mfma -mavx2 -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vbmi -mavx512vnni

TensorFlow

This is a benchmark of the TensorFlow deep learning framework using the TensorFlow reference benchmarks (tensorflow/benchmarks with tf_cnn_benchmarks.py). The Phoronix Test Suite also provides pts/tensorflow-lite for benchmarking the TensorFlow Lite binaries if complementary metrics are desired. Learn more via the OpenBenchmarking.org test page.

TensorFlow 2.16.1, Device: CPU (OpenBenchmarking.org). images/sec, More Is Better.

Batch Size: 256 - Model: ResNet-50
  Stock: 204.33 (SE +/- 0.57, N = 3)
  AI/ML Tuning Recommendations: 207.38 (SE +/- 0.72, N = 3)

Batch Size: 512 - Model: ResNet-50
  Stock: 231.18 (SE +/- 0.28, N = 3)
  AI/ML Tuning Recommendations: 235.35 (SE +/- 0.20, N = 3)

LiteRT

LiteRT 2024-10-15 (OpenBenchmarking.org). Microseconds, Fewer Is Better.

Model: Mobilenet Float
  Stock: 4438.52 (SE +/- 11.59, N = 3)
  AI/ML Tuning Recommendations: 4335.67 (SE +/- 7.66, N = 3)

Model: NASNet Mobile
  Stock: 733737 (SE +/- 17324.48, N = 15)
  AI/ML Tuning Recommendations: 689396 (SE +/- 22050.37, N = 12)

Model: SqueezeNet
  Stock: 7035.74 (SE +/- 31.31, N = 3)
  AI/ML Tuning Recommendations: 6926.88 (SE +/- 31.55, N = 3)

Model: Inception V4
  Stock: 43898.9 (SE +/- 47.06, N = 3)
  AI/ML Tuning Recommendations: 43824.9 (SE +/- 159.42, N = 3)

PyTorch

This is a benchmark of PyTorch making use of pytorch-benchmark [https://github.com/LukasHedegaard/pytorch-benchmark]. Learn more via the OpenBenchmarking.org test page.

PyTorch 2.2.1, Device: CPU (OpenBenchmarking.org). batches/sec, More Is Better.

Batch Size: 256 - Model: ResNet-50
  Stock: 51.61 (SE +/- 0.15, N = 3; MIN: 45.56 / MAX: 52.57)
  AI/ML Tuning Recommendations: 53.13 (SE +/- 0.28, N = 3; MIN: 46.57 / MAX: 54.35)

Batch Size: 256 - Model: ResNet-152
  Stock: 20.78 (SE +/- 0.04, N = 3; MIN: 19.72 / MAX: 21.04)
  AI/ML Tuning Recommendations: 21.79 (SE +/- 0.27, N = 3; MIN: 20.38 / MAX: 22.54)

Batch Size: 512 - Model: ResNet-50
  Stock: 51.48 (SE +/- 0.15, N = 3; MIN: 46.04 / MAX: 52.41)
  AI/ML Tuning Recommendations: 53.34 (SE +/- 0.26, N = 3; MIN: 49 / MAX: 54.59)

Batch Size: 512 - Model: ResNet-152
  Stock: 20.60 (SE +/- 0.10, N = 3; MIN: 19.3 / MAX: 21.02)
  AI/ML Tuning Recommendations: 21.71 (SE +/- 0.18, N = 3; MIN: 20.13 / MAX: 22.23)

oneDNN

oneDNN 3.6, Engine: CPU (OpenBenchmarking.org). ms, Fewer Is Better.

Harness: Convolution Batch Shapes Auto
  Stock: 0.341475 (SE +/- 0.000295, N = 7; MIN: 0.32)
  AI/ML Tuning Recommendations: 0.321833 (SE +/- 0.001024, N = 7; MIN: 0.31)

Harness: Deconvolution Batch shapes_1d
  Stock: 6.70897 (SE +/- 0.03430, N = 3; MIN: 6.07)
  AI/ML Tuning Recommendations: 6.65482 (SE +/- 0.01789, N = 3; MIN: 3.91)

Harness: Deconvolution Batch shapes_3d
  Stock: 0.718484 (SE +/- 0.001205, N = 9; MIN: 0.62)
  AI/ML Tuning Recommendations: 0.677050 (SE +/- 0.000482, N = 9; MIN: 0.58)

Harness: IP Shapes 1D
  Stock: 0.535874 (SE +/- 0.001151, N = 4; MIN: 0.49)
  AI/ML Tuning Recommendations: 0.507355 (SE +/- 0.001097, N = 4; MIN: 0.46)

Harness: IP Shapes 3D
  Stock: 0.265564 (SE +/- 0.000944, N = 5; MIN: 0.24)
  AI/ML Tuning Recommendations: 0.254123 (SE +/- 0.000491, N = 5; MIN: 0.24)

Harness: Recurrent Neural Network Training
  Stock: 425.72 (SE +/- 0.52, N = 3; MIN: 419.47)
  AI/ML Tuning Recommendations: 406.45 (SE +/- 0.37, N = 3; MIN: 400.13)

Harness: Recurrent Neural Network Inference
  Stock: 276.38 (SE +/- 0.68, N = 3; MIN: 269.7)
  AI/ML Tuning Recommendations: 262.52 (SE +/- 0.27, N = 3; MIN: 257.81)

1. (CXX) g++ options: -O3 -march=native -fopenmp -msse4.1 -fPIC -fcf-protection=full -pie -ldl

Numpy Benchmark

This is a test to obtain the general Numpy performance. Learn more via the OpenBenchmarking.org test page.

Numpy Benchmark (OpenBenchmarking.org). Score, More Is Better.
  Stock: 885.50 (SE +/- 1.94, N = 3)
  AI/ML Tuning Recommendations: 887.75 (SE +/- 0.72, N = 3)

ONNX Runtime

ONNX Runtime 1.19, Device: CPU - Executor: Standard (OpenBenchmarking.org). Inferences Per Second, More Is Better.

Model: ResNet50 v1-12-int8
  Stock: 276.08 (SE +/- 2.53, N = 7)
  AI/ML Tuning Recommendations: 285.10 (SE +/- 1.46, N = 3)

Model: ResNet101_DUC_HDC-12
  Stock: 6.09788 (SE +/- 0.13191, N = 15)
  AI/ML Tuning Recommendations: 6.35626 (SE +/- 0.14383, N = 15)

1. (CXX) g++ options: -O3 -march=native -ffunction-sections -fdata-sections -mtune=native -flto=auto -fno-fat-lto-objects -ldl -lrt

XNNPACK

XNNPACK b7b048 (OpenBenchmarking.org). us, Fewer Is Better.

Model: FP32MobileNetV1
  Stock: 4539 (SE +/- 42.62, N = 3)
  AI/ML Tuning Recommendations: 4471 (SE +/- 54.67, N = 3)

Model: FP32MobileNetV2
  Stock: 9203 (SE +/- 32.13, N = 3)
  AI/ML Tuning Recommendations: 9062 (SE +/- 25.31, N = 3)

Model: FP32MobileNetV3Small
  Stock: 10488 (SE +/- 11.35, N = 3)
  AI/ML Tuning Recommendations: 10387 (SE +/- 28.22, N = 3)

Model: FP16MobileNetV1
  Stock: 4634 (SE +/- 15.14, N = 3)
  AI/ML Tuning Recommendations: 4513 (SE +/- 10.73, N = 3)

Model: FP16MobileNetV2
  Stock: 9092 (SE +/- 89.20, N = 3)
  AI/ML Tuning Recommendations: 9028 (SE +/- 94.57, N = 3)

Model: FP16MobileNetV3Small
  Stock: 10966 (SE +/- 410.82, N = 3)
  AI/ML Tuning Recommendations: 10323 (SE +/- 21.33, N = 3)

Model: QS8MobileNetV2
  Stock: 10042 (SE +/- 154.48, N = 3)
  AI/ML Tuning Recommendations: 9813 (SE +/- 87.21, N = 3)

1. (CXX) g++ options: -O3 -lrt -lm

68 Results Shown

OpenVINO:
  Age Gender Recognition Retail 0013 FP16 - CPU:
    FPS
    ms
  Person Detection FP16 - CPU:
    FPS
    ms
  Weld Porosity Detection FP16-INT8 - CPU:
    FPS
    ms
  Vehicle Detection FP16-INT8 - CPU:
    FPS
    ms
  Person Vehicle Bike Detection FP16 - CPU:
    FPS
    ms
  Machine Translation EN To DE FP16 - CPU:
    FPS
    ms
  Face Detection Retail FP16-INT8 - CPU:
    FPS
    ms
  Handwritten English Recognition FP16-INT8 - CPU:
    FPS
    ms
  Road Segmentation ADAS FP16-INT8 - CPU:
    FPS
    ms
  Person Re-Identification Retail FP16 - CPU:
    FPS
    ms
  Noise Suppression Poconet-Like FP16 - CPU:
    FPS
    ms
OpenVINO GenAI:
  Phi-3-mini-128k-instruct-int4-ov - CPU
  Falcon-7b-instruct-int4-ov - CPU
  Gemma-7b-int4-ov - CPU
Llama.cpp:
  CPU BLAS - Llama-3.1-Tulu-3-8B-Q8_0 - Text Generation 128
  CPU BLAS - Llama-3.1-Tulu-3-8B-Q8_0 - Prompt Processing 512
  CPU BLAS - Llama-3.1-Tulu-3-8B-Q8_0 - Prompt Processing 1024
  CPU BLAS - Llama-3.1-Tulu-3-8B-Q8_0 - Prompt Processing 2048
  CPU BLAS - granite-3.0-3b-a800m-instruct-Q8_0 - Text Generation 128
  CPU BLAS - granite-3.0-3b-a800m-instruct-Q8_0 - Prompt Processing 512
  CPU BLAS - granite-3.0-3b-a800m-instruct-Q8_0 - Prompt Processing 2048
  CPU BLAS - Mistral-7B-Instruct-v0.3-Q8_0 - Text Generation 128
  CPU BLAS - Mistral-7B-Instruct-v0.3-Q8_0 - Prompt Processing 512
  CPU BLAS - Mistral-7B-Instruct-v0.3-Q8_0 - Prompt Processing 1024
  CPU BLAS - Mistral-7B-Instruct-v0.3-Q8_0 - Prompt Processing 2048
Whisperfile:
  Tiny
  Small
  Medium
Whisper.cpp:
  ggml-small.en - 2016 State of the Union
  ggml-medium.en - 2016 State of the Union
TensorFlow:
  CPU - 256 - ResNet-50
  CPU - 512 - ResNet-50
LiteRT:
  Mobilenet Float
  NASNet Mobile
  SqueezeNet
  Inception V4
PyTorch:
  CPU - 256 - ResNet-50
  CPU - 256 - ResNet-152
  CPU - 512 - ResNet-50
  CPU - 512 - ResNet-152
oneDNN:
  Convolution Batch Shapes Auto - CPU
  Deconvolution Batch shapes_1d - CPU
  Deconvolution Batch shapes_3d - CPU
  IP Shapes 1D - CPU
  IP Shapes 3D - CPU
  Recurrent Neural Network Training - CPU
  Recurrent Neural Network Inference - CPU
Numpy Benchmark
ONNX Runtime:
  ResNet50 v1-12-int8 - CPU - Standard
  ResNet101_DUC_HDC-12 - CPU - Standard
XNNPACK:
  FP32MobileNetV1
  FP32MobileNetV2
  FP32MobileNetV3Small
  FP16MobileNetV1
  FP16MobileNetV2
  FP16MobileNetV3Small
  QS8MobileNetV2