Llama.cpp

Llama.cpp is a port of Facebook's LLaMA model to C/C++ developed by Georgi Gerganov. It allows inference of LLaMA and other supported models in plain C/C++. For CPU inference, Llama.cpp supports AVX2/AVX-512, ARM NEON, and other modern instruction sets, along with features such as OpenBLAS support.
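To give a sense of what inference in C/C++ looks like, below is a rough sketch of loading a GGUF model and creating a context with the llama.cpp C API for CPU-only use. Function names and signatures change between llama.cpp releases, so treat this as an illustration of the API's shape rather than a definitive example for b4397.

```cpp
// Sketch: load a GGUF model and create a context with the llama.cpp C API.
// Exact signatures vary across llama.cpp releases; this is illustrative only.
#include "llama.h"

#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    // Default parameters run entirely on the CPU unless GPU offload is requested.
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (model == nullptr) {
        std::fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... tokenize a prompt and evaluate it with llama_decode() here ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```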


Llama.cpp b4397

Backend: CPU BLAS - Model: Mistral-7B-Instruct-v0.3-Q8_0 - Test: Prompt Processing 2048
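The "Prompt Processing 2048" number is a throughput figure: the time taken to evaluate a 2048-token prompt, reported as tokens per second. A minimal sketch of how such a figure is derived is below; process_prompt() is a hypothetical placeholder for the actual model evaluation, not a llama.cpp function.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical placeholder for the real prompt-evaluation step (in llama.cpp
// this would be a forward pass over the 2048 prompt tokens); here it just
// sleeps so the example is self-contained and runnable.
static void process_prompt(int /*n_tokens*/) {
    std::this_thread::sleep_for(std::chrono::milliseconds(250));
}

int main() {
    const int n_prompt_tokens = 2048;  // matches the "Prompt Processing 2048" test

    const auto t0 = std::chrono::steady_clock::now();
    process_prompt(n_prompt_tokens);
    const auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("prompt processing: %.1f tokens/s\n", n_prompt_tokens / seconds);
    return 0;
}
```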

OpenBenchmarking.org metrics for this test profile configuration based on 50 public results since 29 December 2024 with the latest data as of 7 January 2025.

Below is an overview of the generalized performance for components where there is sufficient statistically significant data based upon user-uploaded results. Keep in mind that, particularly in the Linux/open-source space, OS configurations can vary widely, so this overview is intended only as general guidance on performance expectations.

| Component | Percentile Rank | # Compatible Public Results | Tokens Per Second (Average) |
|---|---|---|---|
| Zen 5 [16 Cores / 32 Threads] | 91st | 4 | 90 +/- 1 |
| Zen 4 [192 Cores / 384 Threads] | 83rd | 4 | 78 +/- 1 |
| Mid-Tier | 75th | | < 77 |
| Zen 4 [192 Cores / 384 Threads] | 71st | 6 | 76 +/- 1 |
| Zen 4 [64 Cores / 128 Threads] | 67th | 4 | 70 |
| Median | 50th | | 47 |
| Arrow Lake [24 Cores / 24 Threads] | 49th | 3 | 47 |
| Zen 4 [8 Cores / 16 Threads] | 41st | 3 | 38 |
| Zen 5 [10 Cores / 20 Threads] | 35th | 4 | 33 +/- 1 |
| Low-Tier | 25th | | < 30 |
| Zen 5 [12 Cores / 24 Threads] | 25th | 3 | 30 |
| Lunar Lake [8 Cores / 8 Threads] | 19th | 4 | 27 |
| Meteor Lake [16 Cores / 22 Threads] | 9th | 3 | 20 |
| Alder Lake [14 Cores / 20 Threads] | 5th | 3 | 15 |
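The percentile ranks above place each component within the distribution of the public results. As a rough illustration (not necessarily the exact OpenBenchmarking.org methodology), a percentile rank can be computed as the share of results falling below a given tokens-per-second value:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Percentile rank of `value` within `results`: the share of results strictly
// below `value`, expressed as a percentage. This is an illustrative formula,
// not necessarily the exact calculation used by OpenBenchmarking.org.
static double percentile_rank(std::vector<double> results, double value) {
    std::sort(results.begin(), results.end());
    const auto below = std::lower_bound(results.begin(), results.end(), value)
                       - results.begin();
    return 100.0 * static_cast<double>(below) / static_cast<double>(results.size());
}

int main() {
    // Hypothetical sample of tokens-per-second results; the real table is
    // based on 50 public submissions.
    const std::vector<double> results = {15, 20, 27, 30, 33, 38, 47, 47, 70, 76, 78, 90};

    std::printf("47 tokens/s sits at roughly the %.0fth percentile of this sample\n",
                percentile_rank(results, 47.0));
    return 0;
}
```

Under this hypothetical sample, 47 tokens per second falls at the 50th percentile, consistent with the Median row above.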