Microsoft Azure HBv4 HPC Performance Benchmarks

Benchmarks for a future article on Phoronix looking at HBv4 Genoa-X Linux performance.

HBv4:

  Processor: 2 x AMD EPYC 9V33X 96-Core (176 Cores), Motherboard: Microsoft Virtual Machine (Hyper-V UEFI v4.1 BIOS), Memory: 1 GB + 59 GB + 116 GB + 176 GB + 176 GB + 176 GB, Disk: 2 x 1920GB Microsoft NVMe Direct Disk + 32GB Virtual Disk + 515GB Virtual Disk, Graphics: hyperv_fb

  OS: AlmaLinux 8.8, Kernel: 4.18.0-425.3.1.el8.x86_64 (x86_64), Compiler: GCC 13.1.0 + CUDA 12.1, File-System: nfs, Screen Resolution: 1024x768, System Layer: microsoft

HBv3:

  Processor: 2 x AMD EPYC 7V73X 64-Core (120 Cores), Motherboard: Microsoft Virtual Machine (Hyper-V UEFI v4.1 BIOS), Memory: 1 GB + 59 GB + 54 GB + 114 GB + 114 GB + 114 GB, Disk: 2 x 960GB Microsoft NVMe Direct Disk + 32GB Virtual Disk + 515GB Virtual Disk, Graphics: hyperv_fb

  OS: AlmaLinux 8.8, Kernel: 4.18.0-425.3.1.el8.x86_64 (x86_64), Compiler: GCC 13.1.0 + CUDA 12.1, File-System: nfs, Screen Resolution: 1024x768, System Layer: microsoft

HBv2:

  Processor: 2 x AMD EPYC 7V12 64-Core (120 Cores), Motherboard: Microsoft Virtual Machine (Hyper-V UEFI v4.1 BIOS), Memory: 1 GB + 59 GB + 54 GB + 114 GB + 114 GB + 114 GB, Disk: 960GB Microsoft NVMe Direct Disk + 32GB Virtual Disk + 515GB Virtual Disk, Graphics: hyperv_fb

  OS: AlmaLinux 8.8, Kernel: 4.18.0-425.3.1.el8.x86_64 (x86_64), Compiler: GCC 13.1.0 + CUDA 12.1, File-System: nfs, Screen Resolution: 1024x768, System Layer: microsoft

HC:

  Processor: 2 x Intel Xeon Platinum 8168 (44 Cores), Motherboard: Microsoft Virtual Machine (Hyper-V UEFI v4.1 BIOS), Memory: 1 GB + 60928 MB + 118272 MB + 176 GB, Disk: 32GB Virtual Disk + 752GB Virtual Disk, Graphics: hyperv_fb

  OS: AlmaLinux 8.8, Kernel: 4.18.0-425.3.1.el8.x86_64 (x86_64), Compiler: GCC 13.1.0 + CUDA 12.1, File-System: nfs, Screen Resolution: 1024x768, System Layer: microsoft

NAS Parallel Benchmarks 3.4
Test / Class: BT.C
Total Mop/s > Higher Is Better
HBv4 . 744413.90 |=============================================================
HBv3 . 313813.98 |==========================
HBv2 . 241509.88 |====================
HC ... 106230.52 |=========

Pennant 1.0.1
Test: sedovbig
Hydro Cycle Time - Seconds < Lower Is Better
HBv4 . 3.581391 |=========
HBv3 . 6.277107 |===============
HBv2 . 5.915805 |==============
HC ... 25.019560 |=============================================================

NAS Parallel Benchmarks 3.4
Test / Class: MG.C
Total Mop/s > Higher Is Better
HBv4 . 437417.16 |=============================================================
HBv3 . 131635.41 |==================
HBv2 . 108985.72 |===============
HC ... 63404.01 |=========

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: c2c - Backend: FFTW - Precision: float-long - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 355.51 |================================================================
HBv3 . 135.95 |========================
HBv2 . 96.49 |=================
HC ... 62.90 |===========

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: c2c - Backend: FFTW - Precision: float - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 355.86 |================================================================
HBv3 . 135.69 |========================
HBv2 . 95.88 |=================
HC ... 62.98 |===========

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: c2c - Backend: Stock - Precision: float - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 323.36 |================================================================
HBv3 . 123.24 |========================
HBv2 . 93.79 |===================
HC ... 57.76 |===========

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: c2c - Backend: Stock - Precision: float-long - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 323.70 |================================================================
HBv3 . 124.60 |=========================
HBv2 . 93.26 |==================
HC ... 57.92 |===========

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: r2c - Backend: FFTW - Precision: float-long - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 624.95 |================================================================
HBv3 . 257.42 |==========================
HBv2 . 191.14 |====================
HC ... 113.94 |============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: r2c - Backend: FFTW - Precision: float - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 622.58 |================================================================
HBv3 . 254.25 |==========================
HBv2 . 191.78 |====================
HC ... 114.03 |============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: r2c - Backend: Stock - Precision: float - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 596.23 |================================================================
HBv3 . 232.17 |=========================
HBv2 . 190.95 |====================
HC ... 110.05 |============

Blender 3.6
Blend File: Classroom - Compute: CPU-Only
Seconds < Lower Is Better
HBv4 . 25.61 |============
HBv3 . 50.71 |=======================
HBv2 . 50.95 |========================
HC ... 138.51 |================================================================

Blender 3.6
Blend File: Barbershop - Compute: CPU-Only
Seconds < Lower Is Better
HBv4 . 97.52 |============
HBv3 . 188.96 |=======================
HBv2 . 211.46 |==========================
HC ... 526.93 |================================================================

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: r2c - Backend: Stock - Precision: float-long - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 590.93 |================================================================
HBv3 . 233.80 |=========================
HBv2 . 189.21 |====================
HC ... 110.20 |============

Blender 3.6
Blend File: Pabellon Barcelona - Compute: CPU-Only
Seconds < Lower Is Better
HBv4 . 33.01 |============
HBv3 . 62.90 |=======================
HBv2 . 64.84 |========================
HC ... 175.07 |================================================================

Blender 3.6
Blend File: Fishy Cat - Compute: CPU-Only
Seconds < Lower Is Better
HBv4 . 13.74 |============
HBv3 . 25.59 |=======================
HBv2 . 26.43 |========================
HC ... 71.76 |=================================================================

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: r2c - Backend: Stock - Precision: double - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 311.80 |================================================================
HBv3 . 117.73 |========================
HBv2 . 94.53 |===================
HC ... 59.82 |============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: r2c - Backend: Stock - Precision: double-long - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 311.27 |================================================================
HBv3 . 118.24 |========================
HBv2 . 95.20 |====================
HC ... 59.90 |============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: r2c - Backend: FFTW - Precision: double-long - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 315.98 |================================================================
HBv3 . 120.96 |========================
HBv2 . 91.43 |===================
HC ... 60.82 |============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: r2c - Backend: FFTW - Precision: double - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 314.34 |================================================================
HBv3 . 121.28 |=========================
HBv2 . 91.48 |===================
HC ... 60.88 |============

Pennant 1.0.1
Test: leblancbig
Hydro Cycle Time - Seconds < Lower Is Better
HBv4 . 2.122074 |============
HBv3 . 3.649317 |=====================
HBv2 . 3.466885 |====================
HC ... 10.645480 |=============================================================

7-Zip Compression 22.01
Test: Compression Rating
MIPS > Higher Is Better
HBv4 . 1083523 |===============================================================
HBv3 . 566595 |=================================
HBv2 . 501534 |=============================
HC ... 216451 |=============

Blender 3.6
Blend File: BMW27 - Compute: CPU-Only
Seconds < Lower Is Better
HBv4 . 10.11 |=============
HBv3 . 19.43 |=========================
HBv2 . 19.58 |=========================
HC ... 49.95 |=================================================================

7-Zip Compression 22.01
Test: Decompression Rating
MIPS > Higher Is Better
HBv4 . 742859 |================================================================
HBv3 . 406516 |===================================
HBv2 . 388577 |=================================
HC ... 150841 |=============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: c2c - Backend: Stock - Precision: double - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 154.65 |================================================================
HBv3 . 56.22 |=======================
HBv2 . 46.98 |===================
HC ... 31.57 |=============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: c2c - Backend: Stock - Precision: double-long - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 154.57 |================================================================
HBv3 . 56.27 |=======================
HBv2 . 46.93 |===================
HC ... 31.58 |=============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: r2c - Backend: FFTW - Precision: double-long - X Y Z: 256
GFLOP/s > Higher Is Better
HBv4 . 273.12 |================================================================
HBv3 . 106.63 |=========================
HBv2 . 88.61 |=====================
HC ... 57.13 |=============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: c2c - Backend: FFTW - Precision: double - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 159.18 |================================================================
HBv3 . 57.33 |=======================
HBv2 . 47.61 |===================
HC ... 33.52 |=============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: c2c - Backend: FFTW - Precision: double-long - X Y Z: 512
GFLOP/s > Higher Is Better
HBv4 . 159.26 |================================================================
HBv3 . 57.23 |=======================
HBv2 . 47.37 |===================
HC ... 33.55 |=============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: c2c - Backend: FFTW - Precision: float - X Y Z: 256
GFLOP/s > Higher Is Better
HBv4 . 256.35 |================================================================
HBv3 . 103.51 |==========================
HBv2 . 91.54 |=======================
HC ... 58.36 |===============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: r2c - Backend: Stock - Precision: double - X Y Z: 256
GFLOP/s > Higher Is Better
HBv4 . 264.95 |================================================================
HBv3 . 102.70 |=========================
HBv2 . 93.31 |=======================
HC ... 60.57 |===============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: c2c - Backend: FFTW - Precision: float-long - X Y Z: 256
GFLOP/s > Higher Is Better
HBv4 . 255.97 |================================================================
HBv3 . 105.09 |==========================
HBv2 . 90.79 |=======================
HC ... 58.55 |===============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: r2c - Backend: Stock - Precision: double-long - X Y Z: 256
GFLOP/s > Higher Is Better
HBv4 . 258.72 |================================================================
HBv3 . 105.50 |==========================
HBv2 . 92.39 |=======================
HC ... 60.89 |===============

Liquid-DSP 1.6
Threads: 176 - Buffer Length: 256 - Filter Length: 57
samples/s > Higher Is Better
HBv4 . 7095033333 |============================================================
HBv3 . 4281533333 |====================================
HBv2 . 4350100000 |=====================================
HC ... 1683033333 |==============

NAS Parallel Benchmarks 3.4
Test / Class: FT.C
Total Mop/s > Higher Is Better
HBv4 . 230164.79 |=============================================================
HBv3 . 102122.36 |===========================
HBv2 . 98485.23 |==========================
HC ... 55288.19 |===============

OSPRay 2.12
Benchmark: particle_volume/scivis/real_time
Items Per Second > Higher Is Better
HBv4 . 36.54460 |==============================================================
HBv3 . 24.21970 |=========================================
HBv2 . 22.17470 |======================================
HC ... 8.87831 |===============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: c2c - Backend: Stock - Precision: float - X Y Z: 256
GFLOP/s > Higher Is Better
HBv4 . 244.34 |================================================================
HBv3 . 103.41 |===========================
HBv2 . 91.26 |========================
HC ... 59.73 |================

Liquid-DSP 1.6
Threads: 176 - Buffer Length: 256 - Filter Length: 512
samples/s > Higher Is Better
HBv4 . 2221966667 |============================================================
HBv3 . 814950000 |======================
HBv2 . 924243333 |=========================
HC ... 544626667 |===============

OSPRay 2.12
Benchmark: particle_volume/ao/real_time
Items Per Second > Higher Is Better
HBv4 . 36.65480 |==============================================================
HBv3 . 24.47100 |=========================================
HBv2 . 22.36680 |======================================
HC ... 8.99618 |===============

Liquid-DSP 1.6
Threads: 176 - Buffer Length: 256 - Filter Length: 32
samples/s > Higher Is Better
HBv4 . 6181766667 |============================================================
HBv3 . 3864000000 |======================================
HBv2 . 4275533333 |=========================================
HC ... 1536633333 |===============

NAMD 2.14
ATPase Simulation - 327,506 Atoms
days/ns < Lower Is Better
HBv4 . 0.14380 |=================
HBv3 . 0.27111 |================================
HBv2 . 0.26505 |================================
HC ... 0.52697 |===============================================================

Liquid-DSP 1.6
Threads: 128 - Buffer Length: 256 - Filter Length: 57
samples/s > Higher Is Better
HBv4 . 5412900000 |============================================================
HBv3 . 4216966667 |===============================================
HBv2 . 4309133333 |================================================
HC ... 1570633333 |=================

High Performance Conjugate Gradient 3.1
X Y Z: 160 160 160 - RT: 60
GFLOP/s > Higher Is Better
HBv4 . 87.90 |=================================================================
HBv3 . 39.11 |=============================
HBv2 . 36.02 |===========================
HC ... 25.56 |===================

High Performance Conjugate Gradient 3.1
X Y Z: 104 104 104 - RT: 60
GFLOP/s > Higher Is Better
HBv4 . 89.38 |=================================================================
HBv3 . 39.61 |=============================
HBv2 . 37.04 |===========================
HC ... 26.00 |===================

High Performance Conjugate Gradient 3.1
X Y Z: 144 144 144 - RT: 60
GFLOP/s > Higher Is Better
HBv4 . 88.52 |=================================================================
HBv3 . 38.97 |=============================
HBv2 . 36.09 |===========================
HC ... 25.87 |===================

OSPRay 2.12
Benchmark: gravity_spheres_volume/dim_512/pathtracer/real_time
Items Per Second > Higher Is Better
HBv4 . 32.58 |=================================================================
HBv3 . 14.61 |=============================
HBv2 . 13.94 |============================
HC ... 10.06 |====================

PostgreSQL 15
Scaling Factor: 1 - Clients: 800 - Mode: Read Only - Average Latency
ms < Lower Is Better
HBv4 . 0.254 |========================
HBv3 . 0.323 |==============================
HBv2 . 0.323 |==============================
HC ... 0.690 |=================================================================

PostgreSQL 15
Scaling Factor: 1 - Clients: 800 - Mode: Read Only
TPS > Higher Is Better
HBv4 . 3146173 |===============================================================
HBv3 . 2478917 |==================================================
HBv2 . 2481320 |==================================================
HC ... 1159492 |=======================

oneDNN 3.1
Harness: Recurrent Neural Network Training - Data Type: bf16bf16bf16 - Engine: CPU
ms < Lower Is Better
HBv4 . 533.49 |=========================
HBv3 . 886.81 |=========================================
HBv2 . 1367.73 |===============================================================
HC ... 707.32 |=================================

PostgreSQL 15
Scaling Factor: 1 - Clients: 500 - Mode: Read Only
TPS > Higher Is Better
HBv4 . 3161848 |===============================================================
HBv3 . 2434749 |=================================================
HBv2 . 2467328 |=================================================
HC ... 1353510 |===========================

PostgreSQL 15
Scaling Factor: 1 - Clients: 500 - Mode: Read Only - Average Latency
ms < Lower Is Better
HBv4 . 0.158 |============================
HBv3 . 0.206 |====================================
HBv2 . 0.203 |====================================
HC ... 0.369 |=================================================================

oneDNN 3.1
Harness: Recurrent Neural Network Inference - Data Type: bf16bf16bf16 - Engine: CPU
ms < Lower Is Better
HBv4 . 411.23 |=============================
HBv3 . 529.97 |=====================================
HBv2 . 910.94 |================================================================
HC ... 442.47 |===============================

Timed Node.js Compilation 19.8.1
Time To Compile
Seconds < Lower Is Better
HBv4 . 150.56 |=============================
HBv3 . 185.57 |====================================
HBv2 . 194.37 |======================================
HC ... 330.61 |================================================================

libxsmm 2-1.17-3645
M N K: 64
GFLOPS/s > Higher Is Better
HBv4 . 5898.2 |================================================================
HBv3 . 2413.7 |==========================
HBv2 . 331.4 |====
HC ... 748.1 |========

NAS Parallel Benchmarks 3.4
Test / Class: SP.C
Total Mop/s > Higher Is Better
HBv4 . 427298.99 |=============================================================
HBv3 . 205795.59 |=============================
HBv2 . 104771.90 |===============
HC ... 41543.94 |======

Intel Open Image Denoise 2.0
Run: RT.ldr_alb_nrm.3840x2160 - Device: CPU-Only
Images / Sec > Higher Is Better
HBv4 . 3.08 |==================================================================
HBv3 . 1.69 |====================================
HBv2 . 2.01 |===========================================
HC ... 1.85 |========================================

Intel Open Image Denoise 2.0
Run: RT.hdr_alb_nrm.3840x2160 - Device: CPU-Only
Images / Sec > Higher Is Better
HBv4 . 3.11 |==================================================================
HBv3 . 1.72 |=====================================
HBv2 . 2.03 |===========================================
HC ... 1.85 |=======================================

Intel Open Image Denoise 2.0
Run: RTLightmap.hdr.4096x4096 - Device: CPU-Only
Images / Sec > Higher Is Better
HBv4 . 1.32 |==================================================================
HBv3 . 0.80 |========================================
HBv2 . 0.96 |================================================
HC ... 0.87 |============================================

Laghos 3.1
Test: Sedov Blast Wave, ube_922_hex.mesh
Major Kernels Total Rate > Higher Is Better
HBv4 . 402.94 |================================================================
HBv3 . 361.81 |=========================================================
HBv2 . 345.14 |=======================================================
HC ... 247.49 |=======================================

Laghos 3.1
Test: Triple Point Problem
Major Kernels Total Rate > Higher Is Better
HBv4 . 228.15 |================================================================
HBv3 . 192.74 |======================================================
HBv2 . 183.82 |====================================================
HC ... 156.52 |============================================

PETSc 3.19
Test: Streams
MB/s > Higher Is Better
HBv4 . 598417.70 |=============================================================
HBv3 . 284001.92 |=============================
HBv2 . 197895.47 |====================
HC ... 151286.25 |===============

OSPRay 2.12
Benchmark: gravity_spheres_volume/dim_512/scivis/real_time
Items Per Second > Higher Is Better
HBv4 . 37.06240 |==============================================================
HBv3 . 11.17230 |===================
HBv2 . 8.32323 |==============
HC ... 9.02689 |===============

OSPRay 2.12
Benchmark: gravity_spheres_volume/dim_512/ao/real_time
Items Per Second > Higher Is Better
HBv4 . 38.07690 |==============================================================
HBv3 . 11.75010 |===================
HBv2 . 8.66888 |==============
HC ... 9.52293 |================

OSPRay 2.12
Benchmark: particle_volume/pathtracer/real_time
Items Per Second > Higher Is Better
HBv4 . 208.05 |================================================================
HBv3 . 167.50 |====================================================
HBv2 . 162.45 |==================================================
HC ... 96.76 |==============================

ACES DGEMM 1.0
Sustained Floating-Point Rate
GFLOP/s > Higher Is Better
HBv4 . 52.802440 |=============================================================
HBv3 . 25.048352 |=============================
HBv2 . 6.395415 |=======
HC ... 14.072027 |================

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: c2c - Backend: Stock - Precision: float-long - X Y Z: 256
GFLOP/s > Higher Is Better
HBv4 . 247.73 |================================================================
HBv3 . 105.36 |===========================
HBv2 . 92.13 |========================
HC ... 59.55 |===============

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: r2c - Backend: FFTW - Precision: float-long - X Y Z: 256
GFLOP/s > Higher Is Better
HBv4 . 427.10 |================================================================
HBv3 . 221.86 |=================================
HBv2 . 200.04 |==============================
HC ... 122.77 |==================

HeFFTe - Highly Efficient FFT for Exascale 2.3
Test: r2c - Backend: FFTW - Precision: double - X Y Z: 256
GFLOP/s > Higher Is Better
HBv4 . 261.90 |================================================================
HBv3 . 103.25 |=========================
HBv2 . 91.92 |======================
HC ... 57.31 |==============

libxsmm 2-1.17-3645
M N K: 32
GFLOPS/s > Higher Is Better
HBv4 . 6163.0 |================================================================
HBv3 . 1438.1 |===============
HBv2 . 164.8 |==
HC ... 384.9 |====

libxsmm 2-1.17-3645
M N K: 256
GFLOPS/s > Higher Is Better
HBv4 . 6908.6 |================================================================
HBv3 . 2045.7 |===================
HBv2 . 1128.3 |==========
HC ... 904.1 |========

libxsmm 2-1.17-3645
M N K: 128
GFLOPS/s > Higher Is Better
HBv4 . 6655.2 |================================================================
HBv3 . 2273.5 |======================
HBv2 . 1011.4 |==========
HC ... 1284.8 |============

NAS Parallel Benchmarks 3.4
Test / Class: IS.D
Total Mop/s > Higher Is Better
HBv4 . 12967.37 |==============================================================
HBv3 . 5730.01 |===========================
HBv2 . 3977.02 |===================
HC ... 1864.68 |=========

NAS Parallel Benchmarks 3.4
Test / Class: CG.C
Total Mop/s > Higher Is Better
HBv4 . 74101.94 |==============================================================
HBv3 . 36681.43 |===============================
HBv2 . 36367.35 |==============================
HC ... 27619.05 |=======================
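
The results above are easier to compare as ratios against a single baseline system. The following is a minimal sketch rather than part of the Phoronix Test Suite: it assumes each result row has the form `NAME <dots> VALUE |<bar>`, and `parse_rows` and `speedup` are hypothetical helper names introduced here for illustration. The sample values are copied from the NAS BT.C and Pennant sedovbig results.

```python
import re

# Assumed row shape: system name, one or more dots, a numeric value, then "|" and the bar.
ROW = re.compile(r"^\s*(\S+)\s+\.+\s+([0-9.]+)\s+\|=*\s*$")


def parse_rows(block):
    """Return {system: value} for every row in a result block (hypothetical helper)."""
    results = {}
    for line in block.splitlines():
        match = ROW.match(line)
        if match:
            results[match.group(1)] = float(match.group(2))
    return results


def speedup(results, baseline, higher_is_better):
    """Relative performance against the baseline; values above 1.0 mean faster."""
    base = results[baseline]
    if higher_is_better:
        return {name: value / base for name, value in results.items()}
    # For "Lower Is Better" metrics (seconds, ms, days/ns) the ratio is inverted.
    return {name: base / value for name, value in results.items()}


# Rows copied from the NAS BT.C result (Total Mop/s, higher is better).
bt_c = parse_rows(
    "HBv4 . 744413.90 |====\n"
    "HBv3 . 313813.98 |===\n"
    "HBv2 . 241509.88 |==\n"
    "HC ... 106230.52 |=\n"
)

# Rows copied from the Pennant sedovbig result (seconds, lower is better).
sedovbig = parse_rows(
    "HBv4 . 3.581391 |=\n"
    "HBv3 . 6.277107 |==\n"
    "HBv2 . 5.915805 |==\n"
    "HC ... 25.019560 |====\n"
)

bt_c_vs_hc = speedup(bt_c, "HC", higher_is_better=True)
sedovbig_vs_hc = speedup(sedovbig, "HC", higher_is_better=False)

for name in ("HBv4", "HBv3", "HBv2", "HC"):
    print(f"{name}: BT.C {bt_c_vs_hc[name]:.2f}x, sedovbig {sedovbig_vs_hc[name]:.2f}x")
```

Inverting the ratio for the time-based metrics keeps "larger is faster" true for every test, so the per-test ratios can be compared with each other or combined.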