Llama.cpp

Llama.cpp is a C/C++ port of Facebook's LLaMA model developed by Georgi Gerganov. It allows inference of LLaMA and other supported models in plain C/C++. For CPU inference, Llama.cpp supports AVX2/AVX-512, ARM NEON, and other modern ISAs, along with features such as OpenBLAS support.
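
As a minimal sketch of CPU-only text generation with llama.cpp's llama-cli binary (the model filename, prompt, and thread count below are assumptions to adapt to your own setup):

    # Hypothetical invocation: generate 128 tokens on 16 CPU threads
    # (model path is an assumption; any GGUF model file works here)
    ./llama-cli -m Llama-3.1-Tulu-3-8B-Q8_0.gguf -p "Once upon a time" -n 128 -t 16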


Llama.cpp b4154

Backend: CPU BLAS - Model: Llama-3.1-Tulu-3-8B-Q8_0 - Test: Text Generation 128
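
This test measures throughput in tokens per second while generating 128 tokens, which llama.cpp's llama-bench tool reports as its tg128 result. A comparable standalone run might look like the following sketch (the exact flags used by the test profile are an assumption):

    # Benchmark 128-token text generation only,
    # skipping the default prompt-processing (pp) test
    ./llama-bench -m Llama-3.1-Tulu-3-8B-Q8_0.gguf -n 128 -p 0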

OpenBenchmarking.org metrics for this test profile configuration based on 67 public results since 23 November 2024 with the latest data as of 15 December 2024.

Below is an overview of the generalized performance for components where there is sufficient statistically significant data based upon user-uploaded results. Keep in mind that, particularly in the Linux/open-source space, OS configurations can vary vastly, so this overview is intended only as general guidance on performance expectations.

Component                             Percentile Rank   # Compatible Public Results   Tokens Per Second (Average)
Zen 5 [64 Cores / 128 Threads]        99th              4                             51.8 +/- 0.1
Zen 5 [96 Cores / 192 Threads]        93rd              5                             46.0 +/- 0.3
                                      81st              9                             20.3 +/- 0.9
Mid-Tier                              75th                                            < 19.0
Zen 4 [64 Cores / 128 Threads]        67th              4                             15.5
Zen 5 [10 Cores / 20 Threads]         60th              3                             10.3
Zen 5 [12 Cores / 24 Threads]         54th              5                             10.0 +/- 0.1
Zen 5 [16 Cores / 32 Threads]         51st              4                             9.3
Median                                50th                                            9.3
Lunar Lake [8 Cores / 8 Threads]      42nd              4                             8.7
Zen 5 [8 Cores / 16 Threads]          37th              4                             7.3
Zen 4 [8 Cores / 16 Threads]          31st              4                             7.3
Low-Tier                              25th                                            < 7.3
Meteor Lake [16 Cores / 22 Threads]   25th              3                             7.2 +/- 0.1
Zen 4 [12 Cores / 24 Threads]         25th              4                             7.1
Zen 4 [16 Cores / 32 Threads]         12th              5                             6.9