NVIDIA LLAMA.CPP
Tests for a future article. Intel Core Ultra 9 285K testing with an ASUS ROG MAXIMUS Z890 HERO (1203 BIOS) and ASUS NVIDIA GeForce RTX 5090 32GB on Ubuntu 24.10 via the Phoronix Test Suite.
HTML result view exported from: https://openbenchmarking.org/result/2501250-PTS-NVIDIALL68&gru .
NVIDIA LLAMA.CPP - RTX 5090

Processor: Intel Core Ultra 9 285K @ 5.10GHz (24 Cores)
Motherboard: ASUS ROG MAXIMUS Z890 HERO (1203 BIOS)
Chipset: Intel Device ae7f
Memory: 2 x 16GB DDR5-6400MT/s Micron CP16G64C38U5B.M8D1
Disk: 4001GB Western Digital WD_BLACK SN850X 4000GB + 1000GB Western Digital WDS100T1X0E-00AFY0
Graphics: ASUS NVIDIA GeForce RTX 5090 32GB
Audio: Intel Device 7f50
Monitor: ASUS VP28U
Network: Realtek Device 8126 + Intel I226-V + Intel Wi-Fi 7
OS: Ubuntu 24.10
Kernel: 6.11.0-13-generic (x86_64)
Desktop: GNOME Shell 47.0
Display Server: X Server 1.21.1.13
Display Driver: NVIDIA 570.86.10
OpenGL: 4.6.0
OpenCL: OpenCL 3.0 CUDA 12.8.51 + OpenCL 3.0
Compiler: GCC 14.2.0 + CUDA 12.8
File-System: ext4
Screen Resolution: 3840x2160

OpenBenchmarking.org
- nouveau.modeset=0
- Transparent Huge Pages: madvise
- --build=x86_64-linux-gnu --disable-vtable-verify --disable-werror --enable-bootstrap --enable-cet --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-gnu-unique-object --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2,rust --enable-libphobos-checking=release --enable-libstdcxx-backtrace --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-link-serialization=2 --enable-multiarch --enable-multilib --enable-nls --enable-objc-gc=auto --enable-offload-defaulted --enable-offload-targets=nvptx-none=/build/gcc-14-zdkDXv/gcc-14-14.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-14-zdkDXv/gcc-14-14.2.0/debian/tmp-gcn/usr --enable-plugin --enable-shared --enable-threads=posix --host=x86_64-linux-gnu --program-prefix=x86_64-linux-gnu- --target=x86_64-linux-gnu --with-abi=m64 --with-arch-32=i686 --with-build-config=bootstrap-lto-lean --with-default-libstdcxx-abi=new --with-gcc-major-version-only --with-multilib-list=m32,m64,mx32 --with-target-system-zlib=auto --with-tune=generic --without-cuda-driver -v
- Scaling Governor: intel_pstate powersave (EPP: balance_performance)
- CPU Microcode: 0x114
- Thermald 2.5.8
- gather_data_sampling: Not affected + itlb_multihit: Not affected + l1tf: Not affected + mds: Not affected + meltdown: Not affected + mmio_stale_data: Not affected + reg_file_data_sampling: Not affected + retbleed: Not affected + spec_rstack_overflow: Not affected + spec_store_bypass: Mitigation of SSB disabled via prctl + spectre_v1: Mitigation of usercopy/swapgs barriers and __user pointer sanitization + spectre_v2: Mitigation of Enhanced / Automatic IBRS; IBPB: conditional; RSB filling; PBRSB-eIBRS: Not affected; BHI: Not affected + srbds: Not affected + tsx_async_abort: Not affected
NVIDIA LLAMA.CPP
llama-cpp: NVIDIA CUDA - RTX 5090 (Tokens Per Second)

Model                                 Text Generation 128  Prompt Processing 512  Prompt Processing 1024  Prompt Processing 2048
Llama-3.1-Tulu-3-8B-Q8_0                           158.57               12730.90                12362.00                11198.06
Mistral-7B-Instruct-v0.3-Q8_0                      166.28               12859.57                12357.95                11234.62
granite-3.0-3b-a800m-instruct-Q8_0                 107.50                4166.56                 4150.99                 4099.51

OpenBenchmarking.org
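As a reading aid for the results above, the following Python sketch computes how much prompt processing throughput falls between the 512-token and 2048-token tests for each model. The figures are copied from the results; the dictionary layout and the function name `pp_drop_percent` are illustrative assumptions and not part of the Phoronix Test Suite or llama.cpp.

```python
# Prompt processing throughput (tokens per second) on the RTX 5090,
# copied from the result summary. Inner keys are prompt lengths in tokens.
PROMPT_PROCESSING = {
    "Llama-3.1-Tulu-3-8B-Q8_0": {512: 12730.90, 1024: 12362.00, 2048: 11198.06},
    "Mistral-7B-Instruct-v0.3-Q8_0": {512: 12859.57, 1024: 12357.95, 2048: 11234.62},
    "granite-3.0-3b-a800m-instruct-Q8_0": {512: 4166.56, 1024: 4150.99, 2048: 4099.51},
}


def pp_drop_percent(rates, short=512, long=2048):
    """Percentage of throughput lost going from the short to the long prompt."""
    return 100.0 * (rates[short] - rates[long]) / rates[short]


if __name__ == "__main__":
    for model, rates in PROMPT_PROCESSING.items():
        print(f"{model}: {pp_drop_percent(rates):.2f}% lower at 2048 than at 512 tokens")
```

The helper only compares two measured points per model; it says nothing about prompt lengths that were not tested.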
Llama.cpp b4397
Backend: NVIDIA CUDA - Model: Llama-3.1-Tulu-3-8B-Q8_0 - Test: Text Generation 128
OpenBenchmarking.org Tokens Per Second, More Is Better
RTX 5090: 158.57 (SE +/- 0.08, N = 3)
1. (CXX) g++ options: -O3

Llama.cpp b4397
Backend: NVIDIA CUDA - Model: Llama-3.1-Tulu-3-8B-Q8_0 - Test: Prompt Processing 512
OpenBenchmarking.org Tokens Per Second, More Is Better
RTX 5090: 12730.90 (SE +/- 14.78, N = 3)
1. (CXX) g++ options: -O3

Llama.cpp b4397
Backend: NVIDIA CUDA - Model: Llama-3.1-Tulu-3-8B-Q8_0 - Test: Prompt Processing 1024
OpenBenchmarking.org Tokens Per Second, More Is Better
RTX 5090: 12362.00 (SE +/- 20.16, N = 3)
1. (CXX) g++ options: -O3

Llama.cpp b4397
Backend: NVIDIA CUDA - Model: Llama-3.1-Tulu-3-8B-Q8_0 - Test: Prompt Processing 2048
OpenBenchmarking.org Tokens Per Second, More Is Better
RTX 5090: 11198.06 (SE +/- 3.12, N = 3)
1. (CXX) g++ options: -O3
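Each result is reported with "SE +/- x, N = 3". Assuming SE denotes the standard error of the mean over N runs (the export itself does not define it), the short Python sketch below expresses it as a percentage of the mean and recovers the implied sample standard deviation via SD = SE * sqrt(N). The function names and the `LLAMA_TULU` dictionary are hypothetical and introduced only for this illustration.

```python
import math


def relative_se_percent(mean, se):
    """Standard error expressed as a percentage of the mean."""
    return 100.0 * se / mean


def implied_sample_sd(se, n):
    """Sample standard deviation implied by SE = SD / sqrt(n) (assumed definition)."""
    return se * math.sqrt(n)


# Llama-3.1-Tulu-3-8B-Q8_0 results: test -> (mean tokens/s, SE, N)
LLAMA_TULU = {
    "Text Generation 128": (158.57, 0.08, 3),
    "Prompt Processing 512": (12730.90, 14.78, 3),
    "Prompt Processing 1024": (12362.00, 20.16, 3),
    "Prompt Processing 2048": (11198.06, 3.12, 3),
}

if __name__ == "__main__":
    for test, (mean, se, n) in LLAMA_TULU.items():
        print(f"{test}: SE is {relative_se_percent(mean, se):.3f}% of the mean, "
              f"implied SD {implied_sample_sd(se, n):.2f} tokens/s")
```

With only three runs per test, these numbers indicate run-to-run spread rather than a tight confidence bound.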
Llama.cpp b4397
Backend: NVIDIA CUDA - Model: Mistral-7B-Instruct-v0.3-Q8_0 - Test: Text Generation 128
OpenBenchmarking.org Tokens Per Second, More Is Better
RTX 5090: 166.28 (SE +/- 0.03, N = 3)
1. (CXX) g++ options: -O3

Llama.cpp b4397
Backend: NVIDIA CUDA - Model: Mistral-7B-Instruct-v0.3-Q8_0 - Test: Prompt Processing 512
OpenBenchmarking.org Tokens Per Second, More Is Better
RTX 5090: 12859.57 (SE +/- 12.97, N = 3)
1. (CXX) g++ options: -O3

Llama.cpp b4397
Backend: NVIDIA CUDA - Model: Mistral-7B-Instruct-v0.3-Q8_0 - Test: Prompt Processing 1024
OpenBenchmarking.org Tokens Per Second, More Is Better
RTX 5090: 12357.95 (SE +/- 22.52, N = 3)
1. (CXX) g++ options: -O3

Llama.cpp b4397
Backend: NVIDIA CUDA - Model: Mistral-7B-Instruct-v0.3-Q8_0 - Test: Prompt Processing 2048
OpenBenchmarking.org Tokens Per Second, More Is Better
RTX 5090: 11234.62 (SE +/- 4.66, N = 3)
1. (CXX) g++ options: -O3
Llama.cpp b4397
Backend: NVIDIA CUDA - Model: granite-3.0-3b-a800m-instruct-Q8_0 - Test: Text Generation 128
OpenBenchmarking.org Tokens Per Second, More Is Better
RTX 5090: 107.50 (SE +/- 0.11, N = 3)
1. (CXX) g++ options: -O3

Llama.cpp b4397
Backend: NVIDIA CUDA - Model: granite-3.0-3b-a800m-instruct-Q8_0 - Test: Prompt Processing 512
OpenBenchmarking.org Tokens Per Second, More Is Better
RTX 5090: 4166.56 (SE +/- 16.70, N = 3)
1. (CXX) g++ options: -O3

Llama.cpp b4397
Backend: NVIDIA CUDA - Model: granite-3.0-3b-a800m-instruct-Q8_0 - Test: Prompt Processing 1024
OpenBenchmarking.org Tokens Per Second, More Is Better
RTX 5090: 4150.99 (SE +/- 3.63, N = 3)
1. (CXX) g++ options: -O3

Llama.cpp b4397
Backend: NVIDIA CUDA - Model: granite-3.0-3b-a800m-instruct-Q8_0 - Test: Prompt Processing 2048
OpenBenchmarking.org Tokens Per Second, More Is Better
RTX 5090: 4099.51 (SE +/- 0.83, N = 3)
1. (CXX) g++ options: -O3
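The prompt processing and text generation rates can be combined into a rough per-request time estimate: prompt tokens divided by the prompt processing rate, plus generated tokens divided by the text generation rate. The Python sketch below does this for a 2048-token prompt followed by 128 generated tokens. It assumes the two phases simply add and ignores model loading, tokenization, and sampling overhead, none of which these benchmarks measure separately; `estimate_seconds` and `RATES` are hypothetical names used only here.

```python
def estimate_seconds(prompt_tokens, pp_rate, gen_tokens, tg_rate):
    """Rough request time: prompt phase plus generation phase, in seconds."""
    return prompt_tokens / pp_rate + gen_tokens / tg_rate


# model -> (Prompt Processing 2048 tokens/s, Text Generation 128 tokens/s),
# copied from the RTX 5090 results.
RATES = {
    "Llama-3.1-Tulu-3-8B-Q8_0": (11198.06, 158.57),
    "Mistral-7B-Instruct-v0.3-Q8_0": (11234.62, 166.28),
    "granite-3.0-3b-a800m-instruct-Q8_0": (4099.51, 107.50),
}

if __name__ == "__main__":
    for model, (pp_rate, tg_rate) in RATES.items():
        seconds = estimate_seconds(2048, pp_rate, 128, tg_rate)
        print(f"{model}: about {seconds:.2f} s for 2048 prompt + 128 generated tokens")
```

Treat the output as an order-of-magnitude guide; real request latency depends on factors outside the scope of these tests.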
Phoronix Test Suite v10.8.5