Microsoft Azure HBv4 HPC Comparison Benchmarks

Benchmarks for a future article on Phoronix looking at HBv4 Genoa-X Linux performance.

HTML result view exported from: https://openbenchmarking.org/result/2307288-NE-2307274NE45&sor&grr.

System configurations

Systems tested: HC, HBv2, HBv3, HBv4, HBv4 + Optimizations, HBv3 + Optimizations, HBv2 + Optimizations, HC + Optimizations

Common to all systems: Motherboard: Microsoft Virtual Machine (Hyper-V UEFI v4.1 BIOS); Graphics: hyperv_fb; Kernel: 4.18.0-425.3.1.el8.x86_64 (x86_64); File-System: nfs; Screen Resolution: 1024x768; System Layer: microsoft

HC and HC + Optimizations: Processor: 2 x Intel Xeon Platinum 8168 (44 Cores); Memory: 1 GB + 60928 MB + 118272 MB + 176 GB; Disk: 32GB Virtual Disk + 752GB Virtual Disk; OS: AlmaLinux 8.7
HBv2 and HBv2 + Optimizations: Processor: 2 x AMD EPYC 7V12 64-Core (120 Cores); Memory: 1 GB + 59 GB + 54 GB + 114 GB + 114 GB + 114 GB; Disk: 960GB Microsoft NVMe Direct Disk + 32GB Virtual Disk + 515GB Virtual Disk; OS: AlmaLinux 8.7
HBv3 and HBv3 + Optimizations: Processor: 2 x AMD EPYC 7V73X 64-Core (120 Cores); Memory: 1 GB + 59 GB + 54 GB + 114 GB + 114 GB + 114 GB; Disk: 2 x 960GB Microsoft NVMe Direct Disk + 32GB Virtual Disk + 515GB Virtual Disk; OS: AlmaLinux 8.7
HBv4 and HBv4 + Optimizations: Processor: 2 x AMD EPYC 9V33X 96-Core (176 Cores); Memory: 1 GB + 59 GB + 116 GB + 176 GB + 176 GB + 176 GB; Disk: 2 x 1920GB Microsoft NVMe Direct Disk + 32GB Virtual Disk + 515GB Virtual Disk; OS: AlmaLinux 8.8

Compiler: GCC 8.5.0 20210514 + CUDA 12.1 (HC, HBv2, HBv3, HBv4); GCC 13.1.0 + CUDA 12.1 (HBv4 + Optimizations, HBv3 + Optimizations, HBv2 + Optimizations, HC + Optimizations)

Kernel Details
- HBv4 + Optimizations, HBv3 + Optimizations, HBv2 + Optimizations, HC + Optimizations: Transparent Huge Pages: always

Environment Details
- HBv4 + Optimizations, HBv3 + Optimizations, HBv2 + Optimizations, HC + Optimizations: CFLAGS="-O3 -march=native" CXXFLAGS="-O3 -march=native"

Compiler Details
- HBv4 + Optimizations, HBv3 + Optimizations, HBv2 + Optimizations, HC + Optimizations: --disable-multilib --enable-checking=release

Processor Details
- HBv4 + Optimizations, HBv3 + Optimizations, HBv2 + Optimizations, HC + Optimizations: CPU Microcode: 0xffffffff

Python Details
- HBv4 + Optimizations, HBv3 + Optimizations, HBv2 + Optimizations, HC + Optimizations: Python 3.6.8

Security Details
- HBv4 + Optimizations: itlb_multihit: Not affected + l1tf: Not affected + mds: Not affected + meltdown: Not affected + mmio_stale_data: Not affected + retbleed: Not affected + spec_store_bypass: Vulnerable + spectre_v1: Mitigation of usercopy/swapgs barriers and __user pointer sanitization + spectre_v2: Mitigation of Retpolines STIBP: disabled RSB filling PBRSB-eIBRS: Not affected + srbds: Not affected + tsx_async_abort: Not affected
- HBv3 + Optimizations: itlb_multihit: Not affected + l1tf: Not affected + mds: Not affected + meltdown: Not affected + mmio_stale_data: Not affected + retbleed: Not affected + spec_store_bypass: Vulnerable + spectre_v1: Mitigation of usercopy/swapgs barriers and __user pointer sanitization + spectre_v2: Mitigation of Retpolines STIBP: disabled RSB filling PBRSB-eIBRS: Not affected + srbds: Not affected + tsx_async_abort: Not affected
- HBv2 + Optimizations: itlb_multihit: Not affected + l1tf: Not affected + mds: Not affected + meltdown: Not affected + mmio_stale_data: Not affected + retbleed: Mitigation of untrained return thunk; SMT disabled + spec_store_bypass: Mitigation of SSB disabled via prctl + spectre_v1: Mitigation of usercopy/swapgs barriers and __user pointer sanitization + spectre_v2: Mitigation of Retpolines STIBP: disabled RSB filling PBRSB-eIBRS: Not affected + srbds: Not affected + tsx_async_abort: Not affected
- HC + Optimizations: itlb_multihit: Not affected + l1tf: Mitigation of PTE Inversion + mds: Vulnerable: Clear buffers attempted no microcode; SMT Host state unknown + meltdown: Mitigation of PTI + mmio_stale_data: Vulnerable: Clear buffers attempted no microcode; SMT Host state unknown + retbleed: Vulnerable + spec_store_bypass: Vulnerable + spectre_v1: Mitigation of usercopy/swapgs barriers and __user pointer sanitization + spectre_v2: Mitigation of Retpolines STIBP: disabled RSB filling PBRSB-eIBRS: Not affected + srbds: Not affected + tsx_async_abort: Vulnerable: Clear buffers attempted no microcode; SMT Host state unknown

[Result overview table: all tests for HC, HBv2, HBv3, HBv4, HBv4 + Optimizations, HBv3 + Optimizations, HBv2 + Optimizations and HC + Optimizations; the per-system values are not recoverable from this export. Individual results follow.]

Timed Linux Kernel Compilation

Build: allmodconfig

Seconds, Fewer Is Better (Timed Linux Kernel Compilation 6.1)
HBv4: 1681.26 (SE +/- 32.03, N = 9)
HBv2: 1782.93 (SE +/- 22.46, N = 3)
HBv3: 1889.46 (SE +/- 22.02, N = 3)
HC: 1950.63 (SE +/- 7.59, N = 3)
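Each result is reported as a mean with "SE +/- x, N = n", where N is the number of runs. A minimal Python sketch of how a standard error of the mean can be computed from per-run samples, assuming the sample standard deviation (n - 1 denominator); the exact formula used by the Phoronix Test Suite is not shown in this export, and the run times below are hypothetical.

```python
import math
import statistics

def standard_error(samples):
    """Standard error of the mean: sample standard deviation / sqrt(N)."""
    if len(samples) < 2:
        raise ValueError("need at least two runs")
    return statistics.stdev(samples) / math.sqrt(len(samples))

# Hypothetical per-run compile times in seconds (not taken from the result file).
runs = [1650.0, 1690.0, 1705.0]
mean = statistics.mean(runs)
se = standard_error(runs)
print(f"{mean:.2f} (SE +/- {se:.2f}, N = {len(runs)})")
```

A small SE relative to the mean (for example 32.03 s on HBv4's 1681.26 s, about 1.9%) indicates the runs were consistent.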

libxsmm

M N K: 128

GFLOPS/s, More Is Better (libxsmm 2-1.17-3645)
HBv4 + Optimizations: 6655.2 (SE +/- 59.23, N = 3)
HBv4: 6585.6 (SE +/- 59.85, N = 3)
HBv3: 2284.6 (SE +/- 29.40, N = 3)
HBv3 + Optimizations: 2273.5 (SE +/- 20.51, N = 9)
HBv2: 1519.5 (SE +/- 153.42, N = 6)
HC: 1328.4 (SE +/- 11.02, N = 3)
HC + Optimizations: 1284.8 (SE +/- 13.64, N = 15)
HBv2 + Optimizations: 1011.4 (SE +/- 169.50, N = 9)
1. (CXX) g++ options: -dynamic -Bstatic -static-libgcc -lgomp -lpthread -lm -lrt -ldl -lquadmath -lstdc++ -pthread -fPIC -std=c++14 -pedantic -O2 -fopenmp -fopenmp-simd -funroll-loops -ftree-vectorize -fdata-sections -ffunction-sections -fvisibility=hidden -march=core-avx2

libxsmm

M N K: 256

GFLOPS/s, More Is Better (libxsmm 2-1.17-3645)
HBv4: 6983.2 (SE +/- 63.60, N = 3)
HBv4 + Optimizations: 6908.6 (SE +/- 57.85, N = 9)
HBv3 + Optimizations: 2045.7 (SE +/- 25.11, N = 4)
HBv3: 2032.1 (SE +/- 23.34, N = 3)
HBv2: 1444.2 (SE +/- 51.69, N = 9)
HBv2 + Optimizations: 1128.3 (SE +/- 17.53, N = 9)
HC + Optimizations: 904.1 (SE +/- 23.39, N = 9)
HC: 898.8 (SE +/- 13.41, N = 12)
1. (CXX) g++ options: -dynamic -Bstatic -static-libgcc -lgomp -lpthread -lm -lrt -ldl -lquadmath -lstdc++ -pthread -fPIC -std=c++14 -pedantic -O2 -fopenmp -fopenmp-simd -funroll-loops -ftree-vectorize -fdata-sections -ffunction-sections -fvisibility=hidden -march=core-avx2

PETSc

Test: Streams

MB/s, More Is Better (PETSc 3.19)
HBv4: 598417.70 (SE +/- 46271.80, N = 9)
HBv3: 284001.92 (SE +/- 2674.31, N = 7)
HBv2: 197895.47 (SE +/- 12025.83, N = 6)
HC: 151286.25 (SE +/- 256.75, N = 3)
1. (CXX) g++ options: -O3 -std=gnu++17 -fPIC -include -m64

OSPRay

Benchmark: particle_volume/pathtracer/real_time

Items Per Second, More Is Better (OSPRay 2.12)
HBv4: 208.34 (SE +/- 0.07, N = 3)
HBv4 + Optimizations: 208.05 (SE +/- 0.81, N = 3)
HBv3: 168.24 (SE +/- 0.23, N = 3)
HBv3 + Optimizations: 167.50 (SE +/- 1.50, N = 7)
HBv2 + Optimizations: 162.45 (SE +/- 0.83, N = 3)
HBv2: 157.13 (SE +/- 3.07, N = 12)
HC + Optimizations: 96.76 (SE +/- 7.22, N = 9)
HC: 86.57 (SE +/- 8.14, N = 12)

High Performance Conjugate Gradient

X Y Z: 160 160 160 - RT: 60

GFLOP/s, More Is Better (High Performance Conjugate Gradient 3.1)
HBv4: 87.90 (SE +/- 0.12, N = 3)
HBv3: 39.11 (SE +/- 0.02, N = 3)
HBv2: 36.02 (SE +/- 0.01, N = 3)
HC: 25.56 (SE +/- 0.06, N = 3)
1. (CXX) g++ options: -O3 -ffast-math -ftree-vectorize -pthread -lmpi

Blender

Blend File: Barbershop - Compute: CPU-Only

Seconds, Fewer Is Better (Blender 3.6)
HBv4: 96.77 (SE +/- 0.12, N = 3)
HBv4 + Optimizations: 97.52 (SE +/- 0.47, N = 3)
HBv3 + Optimizations: 188.96 (SE +/- 0.38, N = 3)
HBv3: 189.30 (SE +/- 0.45, N = 3)
HBv2: 210.18 (SE +/- 0.01, N = 3)
HBv2 + Optimizations: 211.46 (SE +/- 0.22, N = 3)
HC: 524.86 (SE +/- 2.13, N = 3)
HC + Optimizations: 526.93 (SE +/- 1.15, N = 3)
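Because these results mix "Fewer Is Better" metrics (render and compile times) with "More Is Better" metrics (GFLOP/s, MB/s), a fair comparison has to flip the ratio depending on the metric. A short Python sketch using the Barbershop times and the PETSc Streams bandwidth; the helper name `relative_performance` is illustrative and not part of OpenBenchmarking.

```python
def relative_performance(baseline, candidate, higher_is_better):
    """How many times better `candidate` is than `baseline`.

    For "Fewer Is Better" metrics (seconds, ms) the ratio is baseline / candidate;
    for "More Is Better" metrics (GFLOP/s, MB/s) it is candidate / baseline.
    """
    if baseline <= 0 or candidate <= 0:
        raise ValueError("results must be positive")
    return candidate / baseline if higher_is_better else baseline / candidate

# Blender 3.6 Barbershop render time in seconds (fewer is better): HC vs HBv4.
barbershop = relative_performance(524.86, 96.77, higher_is_better=False)
# PETSc 3.19 Streams bandwidth in MB/s (more is better): HC vs HBv4.
petsc = relative_performance(151286.25, 598417.70, higher_is_better=True)
print(f"Barbershop: {barbershop:.2f}x, PETSc Streams: {petsc:.2f}x")
# prints: Barbershop: 5.42x, PETSc Streams: 3.96x
```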

High Performance Conjugate Gradient

X Y Z: 144 144 144 - RT: 60

GFLOP/s, More Is Better (High Performance Conjugate Gradient 3.1)
HBv4: 88.52 (SE +/- 0.11, N = 3)
HBv3: 38.97 (SE +/- 0.02, N = 3)
HBv2: 36.09 (SE +/- 0.02, N = 3)
HC: 25.87 (SE +/- 0.05, N = 3)
1. (CXX) g++ options: -O3 -ffast-math -ftree-vectorize -pthread -lmpi

Timed Node.js Compilation

Time To Compile

Seconds, Fewer Is Better (Timed Node.js Compilation 19.8.1)
HBv4: 150.56 (SE +/- 2.23, N = 12)
HBv3: 185.57 (SE +/- 1.46, N = 3)
HBv2: 194.37 (SE +/- 1.32, N = 3)
HC: 330.61 (SE +/- 2.37, N = 3)

OSPRay

Benchmark: particle_volume/scivis/real_time

Items Per Second, More Is Better (OSPRay 2.12)
HBv4: 36.56710 (SE +/- 0.03598, N = 3)
HBv4 + Optimizations: 36.54460 (SE +/- 0.05762, N = 3)
HBv3 + Optimizations: 24.21970 (SE +/- 0.00564, N = 3)
HBv3: 24.17360 (SE +/- 0.01956, N = 3)
HBv2 + Optimizations: 22.17470 (SE +/- 0.02944, N = 3)
HBv2: 22.15330 (SE +/- 0.01671, N = 3)
HC: 8.97020 (SE +/- 0.00763, N = 3)
HC + Optimizations: 8.87831 (SE +/- 0.05412, N = 3)

PostgreSQL

Scaling Factor: 1 - Clients: 500 - Mode: Read Only - Average Latency

ms, Fewer Is Better (PostgreSQL 15)
HBv4 + Optimizations: 0.158 (SE +/- 0.000, N = 3) [-O3 -march=native]
HBv4: 0.159 (SE +/- 0.000, N = 3) [-O2]
HBv2: 0.203 (SE +/- 0.001, N = 3) [-O2]
HBv2 + Optimizations: 0.203 (SE +/- 0.000, N = 3) [-O3 -march=native]
HBv3 + Optimizations: 0.206 (SE +/- 0.002, N = 4) [-O3 -march=native]
HBv3: 0.210 (SE +/- 0.000, N = 3) [-O2]
HC: 0.369 (SE +/- 0.001, N = 3) [-O2]
HC + Optimizations: 0.369 (SE +/- 0.001, N = 3) [-O3 -march=native]
1. (CC) gcc options: -fno-strict-aliasing -fwrapv -lpgcommon -lpgport -lpq -lpthread -lrt -ldl -lm

PostgreSQL

Scaling Factor: 1 - Clients: 500 - Mode: Read Only

TPS, More Is Better (PostgreSQL 15)
HBv4 + Optimizations: 3161848 (SE +/- 3042.04, N = 3) [-O3 -march=native]
HBv4: 3139846 (SE +/- 4762.10, N = 3) [-O2]
HBv2 + Optimizations: 2467328 (SE +/- 4710.42, N = 3) [-O3 -march=native]
HBv2: 2466249 (SE +/- 8486.11, N = 3) [-O2]
HBv3 + Optimizations: 2434749 (SE +/- 28428.57, N = 4) [-O3 -march=native]
HBv3: 2375005 (SE +/- 4803.91, N = 3) [-O2]
HC: 1354877 (SE +/- 3475.53, N = 3) [-O2]
HC + Optimizations: 1353510 (SE +/- 2849.38, N = 3) [-O3 -march=native]
1. (CC) gcc options: -fno-strict-aliasing -fwrapv -lpgcommon -lpgport -lpq -lpthread -lrt -ldl -lm
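The average-latency and TPS results are two views of the same runs: with a fixed number of clients, each waiting on one transaction at a time, average latency is approximately clients / TPS (Little's law). A quick Python check against the 500-client HBv4 + Optimizations and HC + Optimizations results, assuming that relationship holds for these read-only runs:

```python
def avg_latency_ms(clients, tps):
    """Approximate mean transaction latency in milliseconds via Little's law."""
    return clients / tps * 1000.0

# (clients, TPS, reported average latency in ms) from the PostgreSQL 15 results.
cases = [
    (500, 3161848, 0.158),  # HBv4 + Optimizations
    (500, 1353510, 0.369),  # HC + Optimizations
]
for clients, tps, reported in cases:
    estimate = avg_latency_ms(clients, tps)
    print(f"{clients} clients: {estimate:.3f} ms estimated, {reported} ms reported")
```

Both estimates round to the reported latencies, so the latency charts add little information beyond the TPS charts for these runs.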

PostgreSQL

Scaling Factor: 1 - Clients: 800 - Mode: Read Only - Average Latency

ms, Fewer Is Better (PostgreSQL 15)
HBv4 + Optimizations: 0.254 (SE +/- 0.000, N = 3) [-O3 -march=native]
HBv4: 0.256 (SE +/- 0.001, N = 3) [-O2]
HBv3 + Optimizations: 0.323 (SE +/- 0.002, N = 3) [-O3 -march=native]
HBv2 + Optimizations: 0.323 (SE +/- 0.001, N = 3) [-O3 -march=native]
HBv2: 0.328 (SE +/- 0.001, N = 3) [-O2]
HBv3: 0.332 (SE +/- 0.001, N = 3) [-O2]
HC: 0.688 (SE +/- 0.003, N = 3) [-O2]
HC + Optimizations: 0.690 (SE +/- 0.002, N = 3) [-O3 -march=native]
1. (CC) gcc options: -fno-strict-aliasing -fwrapv -lpgcommon -lpgport -lpq -lpthread -lrt -ldl -lm

PostgreSQL

Scaling Factor: 1 - Clients: 800 - Mode: Read Only

TPS, More Is Better (PostgreSQL 15)
HBv4 + Optimizations: 3146173 (SE +/- 2972.36, N = 3) [-O3 -march=native]
HBv4: 3123042 (SE +/- 20304.79, N = 3) [-O2]
HBv2 + Optimizations: 2481320 (SE +/- 9212.17, N = 3) [-O3 -march=native]
HBv3 + Optimizations: 2478917 (SE +/- 13675.06, N = 3) [-O3 -march=native]
HBv2: 2439650 (SE +/- 4115.38, N = 3) [-O2]
HBv3: 2407602 (SE +/- 11149.78, N = 3) [-O2]
HC: 1161800 (SE +/- 4936.18, N = 3) [-O2]
HC + Optimizations: 1159492 (SE +/- 2818.34, N = 3) [-O3 -march=native]
1. (CC) gcc options: -fno-strict-aliasing -fwrapv -lpgcommon -lpgport -lpq -lpthread -lrt -ldl -lm

OSPRay

Benchmark: gravity_spheres_volume/dim_512/scivis/real_time

Items Per Second, More Is Better (OSPRay 2.12)
HBv4: 37.09180 (SE +/- 0.11164, N = 3)
HBv4 + Optimizations: 37.06240 (SE +/- 0.12574, N = 3)
HBv3: 11.18450 (SE +/- 0.01165, N = 3)
HBv3 + Optimizations: 11.17230 (SE +/- 0.02977, N = 3)
HC + Optimizations: 9.02689 (SE +/- 0.01641, N = 3)
HC: 8.98723 (SE +/- 0.03491, N = 3)
HBv2 + Optimizations: 8.32323 (SE +/- 0.13284, N = 15)
HBv2: 8.12356 (SE +/- 0.12026, N = 15)

OSPRay

Benchmark: gravity_spheres_volume/dim_512/ao/real_time

Items Per Second, More Is Better (OSPRay 2.12)
HBv4 + Optimizations: 38.07690 (SE +/- 0.02835, N = 3)
HBv4: 38.07640 (SE +/- 0.03610, N = 3)
HBv3 + Optimizations: 11.75010 (SE +/- 0.01464, N = 3)
HBv3: 11.74850 (SE +/- 0.03837, N = 3)
HC + Optimizations: 9.52293 (SE +/- 0.03191, N = 3)
HC: 9.49421 (SE +/- 0.02906, N = 3)
HBv2: 8.67327 (SE +/- 0.13915, N = 12)
HBv2 + Optimizations: 8.66888 (SE +/- 0.15055, N = 15)

OSPRay

Benchmark: particle_volume/ao/real_time

Items Per Second, More Is Better (OSPRay 2.12)
HBv4 + Optimizations: 36.65480 (SE +/- 0.04011, N = 3)
HBv4: 36.61210 (SE +/- 0.04053, N = 3)
HBv3 + Optimizations: 24.47100 (SE +/- 0.00987, N = 3)
HBv3: 24.45860 (SE +/- 0.01755, N = 3)
HBv2 + Optimizations: 22.36680 (SE +/- 0.00858, N = 3)
HBv2: 22.33360 (SE +/- 0.00495, N = 3)
HC + Optimizations: 8.99618 (SE +/- 0.01510, N = 3)
HC: 8.97547 (SE +/- 0.01225, N = 3)

High Performance Conjugate Gradient

X Y Z: 104 104 104 - RT: 60

GFLOP/s, More Is Better (High Performance Conjugate Gradient 3.1)
HBv4: 89.38 (SE +/- 0.26, N = 3)
HBv3: 39.61 (SE +/- 0.03, N = 3)
HBv2: 37.04 (SE +/- 0.03, N = 3)
HC: 26.00 (SE +/- 0.02, N = 3)
1. (CXX) g++ options: -O3 -ffast-math -ftree-vectorize -pthread -lmpi

Blender

Blend File: Pabellon Barcelona - Compute: CPU-Only

Seconds, Fewer Is Better (Blender 3.6)
HBv4 + Optimizations: 33.01 (SE +/- 0.12, N = 3)
HBv4: 33.40 (SE +/- 0.06, N = 3)
HBv3: 62.64 (SE +/- 0.24, N = 3)
HBv3 + Optimizations: 62.90 (SE +/- 0.45, N = 3)
HBv2: 64.14 (SE +/- 0.10, N = 3)
HBv2 + Optimizations: 64.84 (SE +/- 0.28, N = 3)
HC + Optimizations: 175.07 (SE +/- 0.33, N = 3)
HC: 176.21 (SE +/- 1.13, N = 3)

Blender

Blend File: Classroom - Compute: CPU-Only

Seconds, Fewer Is Better (Blender 3.6)
HBv4: 25.26 (SE +/- 0.11, N = 3)
HBv4 + Optimizations: 25.61 (SE +/- 0.11, N = 3)
HBv3 + Optimizations: 50.71 (SE +/- 0.06, N = 3)
HBv2: 50.86 (SE +/- 0.10, N = 3)
HBv2 + Optimizations: 50.95 (SE +/- 0.15, N = 3)
HBv3: 51.08 (SE +/- 0.04, N = 3)
HC + Optimizations: 138.51 (SE +/- 0.04, N = 3)
HC: 138.81 (SE +/- 0.49, N = 3)

ACES DGEMM

Sustained Floating-Point Rate

GFLOP/s, More Is Better (ACES DGEMM 1.0)
HBv4: 53.175691 (SE +/- 0.359007, N = 3)
HBv4 + Optimizations: 52.802440 (SE +/- 0.581762, N = 5)
HBv3: 25.104876 (SE +/- 0.132089, N = 3)
HBv3 + Optimizations: 25.048352 (SE +/- 0.146977, N = 3)
HC: 14.340830 (SE +/- 0.199669, N = 15)
HC + Optimizations: 14.072027 (SE +/- 0.474074, N = 12)
HBv2 + Optimizations: 6.395415 (SE +/- 0.275809, N = 12)
HBv2: 5.899903 (SE +/- 0.272351, N = 15)
1. (CC) gcc options: -O3 -march=native -fopenmp

NAS Parallel Benchmarks

Test / Class: SP.C

Total Mop/s, More Is Better (NAS Parallel Benchmarks 3.4)
HBv4 + Optimizations: 427298.99 (SE +/- 2970.97, N = 15)
HBv3 + Optimizations: 205795.59 (SE +/- 1576.20, N = 3)
HBv2 + Optimizations: 104771.90 (SE +/- 324.54, N = 3)
HBv4: 68819.34 (SE +/- 954.46, N = 12)
HC + Optimizations: 41543.94 (SE +/- 105.69, N = 3)
HBv2: 32495.89 (SE +/- 34.59, N = 3)
HBv3: 31024.76 (SE +/- 273.09, N = 8)
HC: 12907.54 (SE +/- 12.00, N = 3)
1. (F9X) gfortran options: -O3 -march=native -pthread -lmpi_usempif08 -lmpi_mpifh -lmpi

Laghos

Test: Sedov Blast Wave, ube_922_hex.mesh

Major Kernels Total Rate, More Is Better (Laghos 3.1)
HBv4: 402.94 (SE +/- 0.78, N = 3)
HBv3: 361.81 (SE +/- 0.15, N = 3)
HBv2: 345.14 (SE +/- 3.57, N = 5)
HC: 247.49 (SE +/- 1.35, N = 3)
1. (CXX) g++ options: -O3 -std=c++11 -lmfem -lHYPRE -lmetis -lrt -pthread -lmpi

OSPRay

Benchmark: gravity_spheres_volume/dim_512/pathtracer/real_time

Items Per Second, More Is Better (OSPRay 2.12)
HBv4: 32.79 (SE +/- 0.04, N = 3)
HBv4 + Optimizations: 32.58 (SE +/- 0.08, N = 3)
HBv3 + Optimizations: 14.61 (SE +/- 0.02, N = 3)
HBv3: 14.61 (SE +/- 0.03, N = 3)
HBv2 + Optimizations: 13.94 (SE +/- 0.05, N = 3)
HBv2: 13.92 (SE +/- 0.03, N = 3)
HC + Optimizations: 10.06 (SE +/- 0.02, N = 3)
HC: 10.05 (SE +/- 0.02, N = 3)

NAS Parallel Benchmarks

Test / Class: EP.D

Total Mop/s, More Is Better (NAS Parallel Benchmarks 3.4)
HBv4 + Optimizations: 9031.46 (SE +/- 17.93, N = 3)
HBv4: 5985.75 (SE +/- 37.41, N = 3)
HBv2 + Optimizations: 5542.08 (SE +/- 21.10, N = 3)
HBv3 + Optimizations: 4840.07 (SE +/- 2.73, N = 3)
HBv2: 3222.82 (SE +/- 32.15, N = 6)
HBv3: 2879.08 (SE +/- 80.22, N = 12)
HC + Optimizations: 1853.47 (SE +/- 11.14, N = 3)
HC: 1642.03 (SE +/- 1.76, N = 3)
1. (F9X) gfortran options: -O3 -march=native -pthread -lmpi_usempif08 -lmpi_mpifh -lmpi

Blender

Blend File: BMW27 - Compute: CPU-Only

Seconds, Fewer Is Better (Blender 3.6)
HBv4: 9.97 (SE +/- 0.06, N = 3)
HBv4 + Optimizations: 10.11 (SE +/- 0.08, N = 3)
HBv3 + Optimizations: 19.43 (SE +/- 0.10, N = 3)
HBv2: 19.46 (SE +/- 0.11, N = 3)
HBv3: 19.49 (SE +/- 0.02, N = 3)
HBv2 + Optimizations: 19.58 (SE +/- 0.16, N = 3)
HC + Optimizations: 49.95 (SE +/- 0.36, N = 3)
HC: 50.53 (SE +/- 0.65, N = 15)

libxsmm

M N K: 64

GFLOPS/s, More Is Better (libxsmm 2-1.17-3645)
HBv4 + Optimizations: 5898.2 (SE +/- 74.65, N = 3)
HBv4: 5719.0 (SE +/- 226.33, N = 12)
HBv3: 2435.6 (SE +/- 17.54, N = 12)
HBv3 + Optimizations: 2413.7 (SE +/- 8.24, N = 3)
HC + Optimizations: 748.1 (SE +/- 7.70, N = 3)
HC: 731.6 (SE +/- 5.15, N = 15)
HBv2: 411.7 (SE +/- 18.03, N = 13)
HBv2 + Optimizations: 331.4 (SE +/- 2.64, N = 15)
1. (CXX) g++ options: -dynamic -Bstatic -static-libgcc -lgomp -lpthread -lm -lrt -ldl -lquadmath -lstdc++ -pthread -fPIC -std=c++14 -pedantic -O2 -fopenmp -fopenmp-simd -funroll-loops -ftree-vectorize -fdata-sections -ffunction-sections -fvisibility=hidden -march=core-avx2

Liquid-DSP

Threads: 32 - Buffer Length: 256 - Filter Length: 57

samples/s, More Is Better (Liquid-DSP 1.6)
HBv4 + Optimizations: 1463200000 (SE +/- 11536463.93, N = 3)
HBv4: 1390540000 (SE +/- 14294460.47, N = 5)
HBv3 + Optimizations: 1347733333 (SE +/- 2931059.72, N = 3)
HBv2 + Optimizations: 1257833333 (SE +/- 33333.33, N = 3)
HBv2: 1193400000 (SE +/- 472581.56, N = 3)
HBv3: 1086000000 (SE +/- 550757.05, N = 3)
HC: 721290909 (SE +/- 5360840.75, N = 11)
HC + Optimizations: 719580000 (SE +/- 7305771.23, N = 6)
Additional per-run flag on four runs: -march=native
1. (CC) gcc options: -O3 -pthread -lm -lc -lliquid

7-Zip Compression

Test: Decompression Rating

MIPS, More Is Better (7-Zip Compression 22.01)
HBv4 + Optimizations: 742859 (SE +/- 8621.97, N = 3)
HBv4: 727995 (SE +/- 8360.33, N = 15)
HBv3 + Optimizations: 406516 (SE +/- 3365.82, N = 3)
HBv3: 397505 (SE +/- 19127.89, N = 3)
HBv2 + Optimizations: 388577 (SE +/- 10621.28, N = 3)
HBv2: 371044 (SE +/- 2438.40, N = 3)
HC + Optimizations: 150841 (SE +/- 300.63, N = 3)
HC: 148193 (SE +/- 256.58, N = 3)
1. (CXX) g++ options: -lpthread -ldl -O2 -fPIC

7-Zip Compression

Test: Compression Rating

MIPS, More Is Better (7-Zip Compression 22.01)
HBv4 + Optimizations: 1083523 (SE +/- 4158.65, N = 3)
HBv4: 1032267 (SE +/- 7680.08, N = 15)
HBv3 + Optimizations: 566595 (SE +/- 7198.45, N = 3)
HBv3: 558290 (SE +/- 6724.92, N = 3)
HBv2 + Optimizations: 501534 (SE +/- 3504.63, N = 3)
HBv2: 489456 (SE +/- 2650.49, N = 3)
HC + Optimizations: 216451 (SE +/- 672.17, N = 3)
HC: 210732 (SE +/- 748.55, N = 3)
1. (CXX) g++ options: -lpthread -ldl -O2 -fPIC

oneDNN

Harness: Recurrent Neural Network Training - Data Type: f32 - Engine: CPU

ms, Fewer Is Better (oneDNN 3.1)
HBv4: 535.85 (SE +/- 3.26, N = 3; MIN: 521.12)
HC: 707.35 (SE +/- 1.60, N = 3; MIN: 689.52)
HBv3: 860.98 (SE +/- 3.89, N = 3; MIN: 814.31)
HBv2: 1345.14 (SE +/- 13.31, N = 3; MIN: 1237.17)
1. (CXX) g++ options: -O3 -march=native -fopenmp -msse4.1 -fPIC -pie -lpthread -ldl

NAS Parallel Benchmarks

Test / Class: BT.C

Total Mop/s, More Is Better (NAS Parallel Benchmarks 3.4)
HBv4 + Optimizations: 744413.90 (SE +/- 6061.11, N = 3)
HBv3 + Optimizations: 313813.98 (SE +/- 2034.04, N = 3)
HBv2 + Optimizations: 241509.88 (SE +/- 108.10, N = 3)
HBv4: 151067.81 (SE +/- 760.56, N = 3)
HC + Optimizations: 106230.52 (SE +/- 62.47, N = 3)
HBv2: 66829.18 (SE +/- 32.07, N = 3)
HBv3: 62427.86 (SE +/- 36.56, N = 3)
HC: 28794.28 (SE +/- 15.19, N = 3)
1. (F9X) gfortran options: -O3 -march=native -pthread -lmpi_usempif08 -lmpi_mpifh -lmpi

Blender

Blend File: Fishy Cat - Compute: CPU-Only

Seconds, Fewer Is Better (Blender 3.6)
HBv4 + Optimizations: 13.74 (SE +/- 0.09, N = 3)
HBv4: 13.96 (SE +/- 0.14, N = 3)
HBv3: 25.47 (SE +/- 0.08, N = 3)
HBv3 + Optimizations: 25.59 (SE +/- 0.15, N = 3)
HBv2: 26.19 (SE +/- 0.10, N = 3)
HBv2 + Optimizations: 26.43 (SE +/- 0.04, N = 3)
HC + Optimizations: 71.76 (SE +/- 0.23, N = 3)
HC: 72.57 (SE +/- 0.48, N = 3)

oneDNN

Harness: Recurrent Neural Network Training - Data Type: bf16bf16bf16 - Engine: CPU

ms, Fewer Is Better (oneDNN 3.1)
HBv4: 533.49 (SE +/- 1.90, N = 3; MIN: 518.68)
HC: 707.32 (SE +/- 1.51, N = 3; MIN: 687.14)
HBv3: 886.81 (SE +/- 6.66, N = 3; MIN: 849.06)
HBv2: 1367.73 (SE +/- 13.52, N = 15)
1. (CXX) g++ options: -O3 -march=native -fopenmp -msse4.1 -fPIC -pie -lpthread -ldl

Laghos

Test: Triple Point Problem

Major Kernels Total Rate, More Is Better (Laghos 3.1)
HBv4: 228.15 (SE +/- 1.25, N = 3)
HBv3: 192.74 (SE +/- 0.38, N = 3)
HBv2: 183.82 (SE +/- 0.57, N = 3)
HC: 156.52 (SE +/- 0.08, N = 3)
1. (CXX) g++ options: -O3 -std=c++11 -lmfem -lHYPRE -lmetis -lrt -pthread -lmpi

Liquid-DSP

Threads: 176 - Buffer Length: 256 - Filter Length: 512

samples/s, More Is Better (Liquid-DSP 1.6)
HBv4 + Optimizations: 2221966667 (SE +/- 5336145.09, N = 3)
HBv4: 2058233333 (SE +/- 4603018.33, N = 3)
HBv2 + Optimizations: 924243333 (SE +/- 3265385.80, N = 3)
HBv2: 825653333 (SE +/- 3174614.59, N = 3)
HBv3 + Optimizations: 814950000 (SE +/- 1919487.78, N = 3)
HBv3: 735370000 (SE +/- 3040334.41, N = 3)
HC + Optimizations: 544626667 (SE +/- 2270626.44, N = 3)
HC: 529213333 (SE +/- 6341443.93, N = 3)
Additional per-run flag on four runs: -march=native
1. (CC) gcc options: -O3 -pthread -lm -lc -lliquid

Liquid-DSP

Threads: 176 - Buffer Length: 256 - Filter Length: 32

samples/s, More Is Better (Liquid-DSP 1.6)
HBv4 + Optimizations: 6181766667 (SE +/- 6999365.05, N = 3)
HBv4: 6122233333 (SE +/- 9214903.39, N = 3)
HBv2 + Optimizations: 4275533333 (SE +/- 25439885.57, N = 3)
HBv2: 4027100000 (SE +/- 44818002.34, N = 3)
HBv3 + Optimizations: 3864000000 (SE +/- 2858321.19, N = 3)
HBv3: 3419533333 (SE +/- 8912600.32, N = 3)
HC: 1566133333 (SE +/- 2852094.75, N = 3)
HC + Optimizations: 1536633333 (SE +/- 8873431.00, N = 3)
Additional per-run flag on four runs: -march=native
1. (CC) gcc options: -O3 -pthread -lm -lc -lliquid

Intel Open Image Denoise

Run: RTLightmap.hdr.4096x4096 - Device: CPU-Only

Images / Sec, More Is Better (Intel Open Image Denoise 2.0)
HBv4 + Optimizations: 1.32 (SE +/- 0.01, N = 3)
HBv4: 1.29 (SE +/- 0.00, N = 3)
HBv2: 1.04 (SE +/- 0.01, N = 3)
HBv2 + Optimizations: 0.96 (SE +/- 0.01, N = 15)
HC: 0.88 (SE +/- 0.00, N = 3)
HC + Optimizations: 0.87 (SE +/- 0.00, N = 3)
HBv3 + Optimizations: 0.80 (SE +/- 0.01, N = 3)
HBv3: 0.79 (SE +/- 0.01, N = 3)

Liquid-DSP

Threads: 176 - Buffer Length: 256 - Filter Length: 57

samples/s, More Is Better (Liquid-DSP 1.6)
HBv4 + Optimizations: 7095033333 (SE +/- 36788419.07, N = 3)
HBv4: 6758166667 (SE +/- 11394345.58, N = 3)
HBv2 + Optimizations: 4350100000 (SE +/- 8195730.60, N = 3)
HBv3 + Optimizations: 4281533333 (SE +/- 8996542.55, N = 3)
HBv2: 4106700000 (SE +/- 13588352.86, N = 3)
HBv3: 3563433333 (SE +/- 4247482.91, N = 3)
HC + Optimizations: 1683033333 (SE +/- 7033807.25, N = 3)
HC: 1664733333 (SE +/- 5446813.54, N = 3)
Additional per-run flag on four runs: -march=native
1. (CC) gcc options: -O3 -pthread -lm -lc -lliquid

Liquid-DSP

Threads: 128 - Buffer Length: 256 - Filter Length: 57

samples/s, More Is Better (Liquid-DSP 1.6)
HBv4 + Optimizations: 5412900000 (SE +/- 24008123.63, N = 3)
HBv4: 5168233333 (SE +/- 10401335.38, N = 3)
HBv2 + Optimizations: 4309133333 (SE +/- 14518991.39, N = 3)
HBv3 + Optimizations: 4216966667 (SE +/- 6263474.36, N = 3)
HBv2: 4045933333 (SE +/- 4421286.89, N = 3)
HBv3: 3516300000 (SE +/- 6947661.48, N = 3)
HC: 1572400000 (SE +/- 8373967.60, N = 3)
HC + Optimizations: 1570633333 (SE +/- 4733333.33, N = 3)
Additional per-run flag on four runs: -march=native
1. (CC) gcc options: -O3 -pthread -lm -lc -lliquid

Liquid-DSP

Threads: 128 - Buffer Length: 256 - Filter Length: 32

Liquid-DSP 1.6 (samples/s, More Is Better)
HBv4 + Optimizations: 4467266667 (SE +/- 6295324.54, N = 3)
HBv4: 4426300000 (SE +/- 3774034.09, N = 3)
HBv2 + Optimizations: 4196833333 (SE +/- 14782572.32, N = 3)
HBv2: 3925933333 (SE +/- 3602930.91, N = 3)
HBv3 + Optimizations: 3832800000 (SE +/- 8235492.29, N = 3)
HBv3: 3366733333 (SE +/- 5345506.94, N = 3)
HC: 1512600000 (SE +/- 8213606.60, N = 3)
HC + Optimizations: 1478433333 (SE +/- 12143905.65, N = 3)
+ Optimizations runs: -march=native
Note: (CC) gcc options: -O3 -pthread -lm -lc -lliquid

Liquid-DSP

Threads: 32 - Buffer Length: 256 - Filter Length: 32

Liquid-DSP 1.6 (samples/s, More Is Better)
HBv2 + Optimizations: 1136733333 (SE +/- 66666.67, N = 3)
HBv4 + Optimizations: 1122866667 (SE +/- 1354416.64, N = 3)
HBv4: 1113300000 (SE +/- 1950213.66, N = 3)
HBv2: 1061433333 (SE +/- 33333.33, N = 3)
HBv3 + Optimizations: 1045000000 (SE +/- 493288.29, N = 3)
HC: 964423333 (SE +/- 3947135.39, N = 3)
HC + Optimizations: 948450000 (SE +/- 1486169.57, N = 3)
HBv3: 917336667 (SE +/- 2475306.94, N = 3)
+ Optimizations runs: -march=native
Note: (CC) gcc options: -O3 -pthread -lm -lc -lliquid

Liquid-DSP

Threads: 1 - Buffer Length: 256 - Filter Length: 32

Liquid-DSP 1.6 (samples/s, More Is Better)
HBv3 + Optimizations: 37175000 (SE +/- 50767.44, N = 3)
HBv4 + Optimizations: 35693667 (SE +/- 1666.67, N = 3)
HBv4: 35362667 (SE +/- 20201.76, N = 3)
HBv2 + Optimizations: 35080667 (SE +/- 17676.10, N = 3)
HBv2: 33211667 (SE +/- 2185.81, N = 3)
HBv3: 32817333 (SE +/- 4096.07, N = 3)
HC: 31796333 (SE +/- 1333.33, N = 3)
HC + Optimizations: 31262000 (SE +/- 1000.00, N = 3)
+ Optimizations runs: -march=native
Note: (CC) gcc options: -O3 -pthread -lm -lc -lliquid

NAMD

ATPase Simulation - 327,506 Atoms

NAMD 2.14 (days/ns, Fewer Is Better)
HBv4: 0.14292 (SE +/- 0.00035, N = 3)
HBv4 + Optimizations: 0.14380 (SE +/- 0.00011, N = 3)
HBv2: 0.26385 (SE +/- 0.00045, N = 3)
HBv2 + Optimizations: 0.26505 (SE +/- 0.00069, N = 3)
HBv3 + Optimizations: 0.27111 (SE +/- 0.00015, N = 3)
HBv3: 0.27115 (SE +/- 0.00027, N = 3)
HC: 0.52650 (SE +/- 0.00096, N = 3)
HC + Optimizations: 0.52697 (SE +/- 0.00060, N = 3)
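
NAMD reports days/ns, so lower is better. The reciprocal gives simulated nanoseconds per day, which is often easier to compare across systems. The Python sketch below applies that conversion to the values above and expresses each result relative to the stock HC run; the dictionary and helper names are illustrative and not part of the benchmark output.

```python
# NAMD 2.14 ATPase (327,506 atoms) results copied from the chart above,
# in days/ns (fewer is better). Names here are illustrative only.
DAYS_PER_NS = {
    "HBv4": 0.14292,
    "HBv4 + Optimizations": 0.14380,
    "HBv2": 0.26385,
    "HBv2 + Optimizations": 0.26505,
    "HBv3 + Optimizations": 0.27111,
    "HBv3": 0.27115,
    "HC": 0.52650,
    "HC + Optimizations": 0.52697,
}


def ns_per_day(days_per_ns: float) -> float:
    """Convert days per simulated ns into simulated ns per day."""
    return 1.0 / days_per_ns


def speedup_vs(baseline: str, system: str) -> float:
    """Speedup of `system` over `baseline` for a fewer-is-better metric."""
    return DAYS_PER_NS[baseline] / DAYS_PER_NS[system]


if __name__ == "__main__":
    for name, value in DAYS_PER_NS.items():
        print(f"{name:22s} {ns_per_day(value):5.2f} ns/day  "
              f"{speedup_vs('HC', name):4.2f}x vs HC")
```

With these numbers HBv4 works out to roughly 7.0 ns/day, about 3.68x the stock HC result.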

oneDNN

Harness: Recurrent Neural Network Inference - Data Type: bf16bf16bf16 - Engine: CPU

oneDNN 3.1 (ms, Fewer Is Better)
HBv4: 411.23 (SE +/- 3.60, N = 8)
HC: 442.47 (SE +/- 1.89, N = 3)
HBv3: 529.97 (SE +/- 4.36, N = 3)
HBv2: 910.94 (SE +/- 9.54, N = 15)
MIN values reported: 429.93, 469.93
Note: (CXX) g++ options: -O3 -march=native -fopenmp -msse4.1 -fPIC -pie -lpthread -ldl

NAS Parallel Benchmarks

Test / Class: IS.D

NAS Parallel Benchmarks 3.4 (Total Mop/s, More Is Better)
HBv4 + Optimizations: 12967.37 (SE +/- 308.75, N = 15)
HBv4: 5870.00 (SE +/- 17.88, N = 3)
HBv3 + Optimizations: 5730.01 (SE +/- 67.99, N = 4)
HBv2 + Optimizations: 3977.02 (SE +/- 35.84, N = 7)
HBv3: 2793.55 (SE +/- 22.55, N = 3)
HBv2: 1884.22 (SE +/- 11.15, N = 3)
HC + Optimizations: 1864.68 (SE +/- 7.55, N = 3)
HC: 1181.48 (SE +/- 2.10, N = 3)
Note: (F9X) gfortran options: -O3 -march=native -pthread -lmpi_usempif08 -lmpi_mpifh -lmpi
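
Because some charts are "More Is Better" and others "Fewer Is Better", a relative comparison has to respect the metric's direction. The Python sketch below is a minimal helper for that, applied to the IS.D results above to compare each "+ Optimizations" run with its stock counterpart. The helper name and data layout are illustrative assumptions, not part of the Phoronix Test Suite.

```python
def speedup(candidate: float, baseline: float, higher_is_better: bool = True) -> float:
    """How many times better `candidate` is than `baseline`.

    "More Is Better" metrics (Mop/s, GFLOP/s) use candidate / baseline;
    "Fewer Is Better" metrics (seconds, ms, days/ns) use baseline / candidate.
    """
    return candidate / baseline if higher_is_better else baseline / candidate


# NAS Parallel Benchmarks 3.4, IS.D, Total Mop/s (More Is Better), from above.
IS_D = {
    "HBv4 + Optimizations": 12967.37,
    "HBv4": 5870.00,
    "HBv3 + Optimizations": 5730.01,
    "HBv2 + Optimizations": 3977.02,
    "HBv3": 2793.55,
    "HBv2": 1884.22,
    "HC + Optimizations": 1864.68,
    "HC": 1181.48,
}

if __name__ == "__main__":
    for instance in ("HBv4", "HBv3", "HBv2", "HC"):
        ratio = speedup(IS_D[f"{instance} + Optimizations"], IS_D[instance])
        print(f"{instance}: + Optimizations / stock = {ratio:.2f}x")
```

On these numbers the ratios are roughly 2.21x (HBv4), 2.05x (HBv3), 2.11x (HBv2) and 1.58x (HC). The ratio alone does not separate the individual changes bundled into "+ Optimizations".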

oneDNN

Harness: Recurrent Neural Network Inference - Data Type: f32 - Engine: CPU

oneDNN 3.1 (ms, Fewer Is Better)
HBv4: 401.86 (SE +/- 1.40, N = 3)
HC: 450.25 (SE +/- 4.72, N = 3)
HBv3: 533.50 (SE +/- 4.61, N = 15)
HBv2: 896.81 (SE +/- 9.52, N = 15)
MIN values reported: 388.53, 432.99
Note: (CXX) g++ options: -O3 -march=native -fopenmp -msse4.1 -fPIC -pie -lpthread -ldl

libxsmm

M N K: 32

libxsmm 2-1.17-3645 (GFLOPS/s, More Is Better)
HBv4 + Optimizations: 6163.0 (SE +/- 87.98, N = 3)
HBv4: 5006.8 (SE +/- 443.26, N = 12)
HBv3: 1506.3 (SE +/- 32.59, N = 14)
HBv3 + Optimizations: 1438.1 (SE +/- 38.99, N = 12)
HC + Optimizations: 384.9 (SE +/- 3.15, N = 9)
HC: 379.9 (SE +/- 2.82, N = 11)
HBv2: 195.1 (SE +/- 3.90, N = 12)
HBv2 + Optimizations: 164.8 (SE +/- 1.72, N = 3)
Note: (CXX) g++ options: -dynamic -Bstatic -static-libgcc -lgomp -lpthread -lm -lrt -ldl -lquadmath -lstdc++ -pthread -fPIC -std=c++14 -pedantic -O2 -fopenmp -fopenmp-simd -funroll-loops -ftree-vectorize -fdata-sections -ffunction-sections -fvisibility=hidden -march=core-avx2

Intel Open Image Denoise

Run: RT.ldr_alb_nrm.3840x2160 - Device: CPU-Only

Intel Open Image Denoise 2.0 (Images / Sec, More Is Better)
HBv4: 3.13 (SE +/- 0.01, N = 3)
HBv4 + Optimizations: 3.08 (SE +/- 0.02, N = 3)
HBv2: 2.03 (SE +/- 0.02, N = 9)
HBv2 + Optimizations: 2.01 (SE +/- 0.01, N = 3)
HC + Optimizations: 1.85 (SE +/- 0.00, N = 3)
HC: 1.84 (SE +/- 0.01, N = 3)
HBv3 + Optimizations: 1.69 (SE +/- 0.01, N = 15)
HBv3: 1.69 (SE +/- 0.01, N = 15)

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: Stock - Precision: double-long - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 154.57 (SE +/- 0.03, N = 3)
HBv3: 56.27 (SE +/- 0.03, N = 3)
HBv2: 46.93 (SE +/- 0.09, N = 3)
HC: 31.58 (SE +/- 0.02, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: Stock - Precision: double - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 154.65 (SE +/- 0.27, N = 3)
HBv3: 56.22 (SE +/- 0.04, N = 3)
HBv2: 46.98 (SE +/- 0.05, N = 3)
HC: 31.57 (SE +/- 0.02, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: FFTW - Precision: double-long - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 159.26 (SE +/- 0.05, N = 3)
HBv3: 57.23 (SE +/- 0.04, N = 3)
HBv2: 47.37 (SE +/- 0.02, N = 3)
HC: 33.55 (SE +/- 0.02, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: FFTW - Precision: double - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 159.18 (SE +/- 0.34, N = 3)
HBv3: 57.33 (SE +/- 0.07, N = 3)
HBv2: 47.61 (SE +/- 0.09, N = 3)
HC: 33.52 (SE +/- 0.03, N = 3)
Note: (CXX) g++ options: -O3 -pthread
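
Related results like the four double-precision c2c HeFFTe runs at X Y Z = 512 (Stock and FFTW backends, double and double-long) are often summarized with a geometric mean. As a sketch, assuming equal weighting of those four tests, the HBv4-over-HC ratio can be condensed as follows; the grouping and names are an illustrative choice, not one made in the original results.

```python
from statistics import geometric_mean

# HeFFTe 2.3 c2c, X Y Z = 512, GFLOP/s (More Is Better): (HBv4, HC) pairs
# copied from the four double-precision charts above.
C2C_512_DOUBLE = {
    "Stock, double-long": (154.57, 31.58),
    "Stock, double": (154.65, 31.57),
    "FFTW, double-long": (159.26, 33.55),
    "FFTW, double": (159.18, 33.52),
}


def geomean_ratio(pairs: dict) -> float:
    """Geometric mean of the per-test HBv4 / HC throughput ratios."""
    return geometric_mean([hbv4 / hc for hbv4, hc in pairs.values()])


if __name__ == "__main__":
    print(f"HBv4 vs HC, geometric mean: {geomean_ratio(C2C_512_DOUBLE):.2f}x")
```

With these values the geometric mean is roughly 4.8x. A geometric mean is used instead of an arithmetic mean because it treats a 2x gain and a 0.5x loss symmetrically.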

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: FFTW - Precision: float-long - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 427.10 (SE +/- 10.91, N = 15)
HBv3: 221.86 (SE +/- 3.45, N = 15)
HBv2: 200.04 (SE +/- 3.34, N = 12)
HC: 122.77 (SE +/- 0.53, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: Stock - Precision: float - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 459.92 (SE +/- 14.34, N = 15)
HBv3: 214.06 (SE +/- 5.19, N = 15)
HBv2: 205.21 (SE +/- 2.79, N = 12)
HC: 134.76 (SE +/- 0.57, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: FFTW - Precision: double - X Y Z: 128

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 80.25 (SE +/- 3.67, N = 15)
HBv2: 59.42 (SE +/- 1.72, N = 15)
HBv3: 59.38 (SE +/- 1.84, N = 15)
HC: 59.14 (SE +/- 0.65, N = 5)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: Stock - Precision: float-long - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 467.72 (SE +/- 17.46, N = 12)
HBv2: 211.42 (SE +/- 2.37, N = 15)
HBv3: 207.97 (SE +/- 7.34, N = 15)
HC: 131.96 (SE +/- 0.90, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: Stock - Precision: double - X Y Z: 128

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 87.66 (SE +/- 3.68, N = 14)
HBv2: 51.40 (SE +/- 1.33, N = 15)
HBv3: 50.61 (SE +/- 1.12, N = 15)
HC: 41.73 (SE +/- 0.30, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: FFTW - Precision: double - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 261.90 (SE +/- 5.66, N = 15)
HBv3: 103.25 (SE +/- 0.75, N = 15)
HBv2: 91.92 (SE +/- 1.31, N = 3)
HC: 57.31 (SE +/- 0.25, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: Stock - Precision: double-long - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 258.72 (SE +/- 2.84, N = 15)
HBv3: 105.50 (SE +/- 0.81, N = 15)
HBv2: 92.39 (SE +/- 1.27, N = 3)
HC: 60.89 (SE +/- 0.19, N = 3)
Note: (CXX) g++ options: -O3 -pthread

NAS Parallel Benchmarks

Test / Class: MG.C

NAS Parallel Benchmarks 3.4 (Total Mop/s, More Is Better)
HBv4 + Optimizations: 437417.16 (SE +/- 5249.92, N = 15)
HBv3 + Optimizations: 131635.41 (SE +/- 1313.15, N = 15)
HBv2 + Optimizations: 108985.72 (SE +/- 768.30, N = 3)
HBv4: 108125.86 (SE +/- 748.94, N = 13)
HC + Optimizations: 63404.01 (SE +/- 149.23, N = 3)
HBv3: 46705.47 (SE +/- 613.84, N = 15)
HBv2: 43410.71 (SE +/- 354.81, N = 3)
HC: 19508.00 (SE +/- 24.47, N = 3)
Note: (F9X) gfortran options: -O3 -march=native -pthread -lmpi_usempif08 -lmpi_mpifh -lmpi

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: FFTW - Precision: float-long - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 255.97 (SE +/- 3.64, N = 15)
HBv3: 105.09 (SE +/- 1.13, N = 3)
HBv2: 90.79 (SE +/- 0.74, N = 15)
HC: 58.55 (SE +/- 0.16, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: Stock - Precision: float-long - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 323.70 (SE +/- 0.96, N = 3)
HBv3: 124.60 (SE +/- 0.05, N = 3)
HBv2: 93.26 (SE +/- 0.23, N = 3)
HC: 57.92 (SE +/- 0.06, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: Stock - Precision: float - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 323.36 (SE +/- 0.80, N = 3)
HBv3: 123.24 (SE +/- 0.73, N = 3)
HBv2: 93.79 (SE +/- 0.34, N = 3)
HC: 57.76 (SE +/- 0.02, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: FFTW - Precision: double-long - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 273.12 (SE +/- 4.03, N = 14)
HBv3: 106.63 (SE +/- 1.05, N = 3)
HBv2: 88.61 (SE +/- 1.12, N = 15)
HC: 57.13 (SE +/- 0.12, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: Stock - Precision: double-long - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 311.27 (SE +/- 0.81, N = 3)
HBv3: 118.24 (SE +/- 0.49, N = 3)
HBv2: 95.20 (SE +/- 0.16, N = 3)
HC: 59.90 (SE +/- 0.03, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: Stock - Precision: double - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 311.80 (SE +/- 1.60, N = 3)
HBv3: 117.73 (SE +/- 0.40, N = 3)
HBv2: 94.53 (SE +/- 0.25, N = 3)
HC: 59.82 (SE +/- 0.05, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: FFTW - Precision: double-long - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 315.98 (SE +/- 1.65, N = 3)
HBv3: 120.96 (SE +/- 0.04, N = 3)
HBv2: 91.43 (SE +/- 0.07, N = 3)
HC: 60.82 (SE +/- 0.06, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: FFTW - Precision: double - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 314.34 (SE +/- 0.50, N = 3)
HBv3: 121.28 (SE +/- 0.86, N = 3)
HBv2: 91.48 (SE +/- 0.15, N = 3)
HC: 60.88 (SE +/- 0.05, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: Stock - Precision: double - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 264.95 (SE +/- 4.27, N = 12)
HBv3: 102.70 (SE +/- 0.80, N = 15)
HBv2: 93.31 (SE +/- 1.10, N = 4)
HC: 60.57 (SE +/- 0.08, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: FFTW - Precision: float - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 355.86 (SE +/- 1.24, N = 3)
HBv3: 135.69 (SE +/- 0.93, N = 3)
HBv2: 95.88 (SE +/- 0.47, N = 3)
HC: 62.98 (SE +/- 0.04, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: FFTW - Precision: float-long - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 355.51 (SE +/- 1.18, N = 3)
HBv3: 135.95 (SE +/- 0.58, N = 3)
HBv2: 96.49 (SE +/- 0.05, N = 3)
HC: 62.90 (SE +/- 0.04, N = 3)
Note: (CXX) g++ options: -O3 -pthread

NAS Parallel Benchmarks

Test / Class: CG.C

NAS Parallel Benchmarks 3.4 (Total Mop/s, More Is Better)
HBv4 + Optimizations: 74101.94 (SE +/- 599.32, N = 3)
HBv4: 40326.29 (SE +/- 77.41, N = 3)
HBv3 + Optimizations: 36681.43 (SE +/- 503.29, N = 3)
HBv2 + Optimizations: 36367.35 (SE +/- 778.45, N = 15)
HC + Optimizations: 27619.05 (SE +/- 218.98, N = 3)
HBv2: 22314.02 (SE +/- 108.02, N = 3)
HBv3: 21551.48 (SE +/- 20.87, N = 3)
HC: 14356.20 (SE +/- 233.39, N = 15)
Note: (F9X) gfortran options: -O3 -march=native -pthread -lmpi_usempif08 -lmpi_mpifh -lmpi

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: Stock - Precision: float - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 244.34 (SE +/- 3.04, N = 4)
HBv3: 103.41 (SE +/- 0.77, N = 15)
HBv2: 91.26 (SE +/- 0.61, N = 15)
HC: 59.73 (SE +/- 0.02, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: FFTW - Precision: float - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 256.35 (SE +/- 1.07, N = 3)
HBv3: 103.51 (SE +/- 1.41, N = 15)
HBv2: 91.54 (SE +/- 0.67, N = 15)
HC: 58.36 (SE +/- 0.07, N = 3)
Note: (CXX) g++ options: -O3 -pthread

NAS Parallel Benchmarks

Test / Class: FT.C

NAS Parallel Benchmarks 3.4 (Total Mop/s, More Is Better)
HBv4 + Optimizations: 230164.79 (SE +/- 1773.50, N = 3)
HBv3 + Optimizations: 102122.36 (SE +/- 339.33, N = 3)
HBv2 + Optimizations: 98485.23 (SE +/- 320.45, N = 3)
HBv4: 69051.63 (SE +/- 745.61, N = 3)
HC + Optimizations: 55288.19 (SE +/- 131.36, N = 3)
HBv2: 41977.69 (SE +/- 219.43, N = 3)
HBv3: 36619.29 (SE +/- 194.34, N = 3)
HC: 20188.89 (SE +/- 13.57, N = 3)
Note: (F9X) gfortran options: -O3 -march=native -pthread -lmpi_usempif08 -lmpi_mpifh -lmpi

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: FFTW - Precision: double-long - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 122.98 (SE +/- 1.21, N = 15)
HBv2: 51.20 (SE +/- 0.57, N = 3)
HBv3: 39.37 (SE +/- 0.33, N = 3)
HC: 30.22 (SE +/- 0.05, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: FFTW - Precision: double-long - X Y Z: 128

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 85.01 (SE +/- 4.77, N = 15)
HBv2: 61.14 (SE +/- 1.30, N = 15)
HC: 58.91 (SE +/- 0.23, N = 3)
HBv3: 56.87 (SE +/- 0.34, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: FFTW - Precision: float - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 442.83 (SE +/- 14.97, N = 12)
HBv2: 203.77 (SE +/- 1.85, N = 3)
HBv3: 198.66 (SE +/- 5.11, N = 15)
HC: 123.63 (SE +/- 0.52, N = 3)
Note: (CXX) g++ options: -O3 -pthread

Intel Open Image Denoise

Run: RT.hdr_alb_nrm.3840x2160 - Device: CPU-Only

Intel Open Image Denoise 2.0 (Images / Sec, More Is Better)
HBv4 + Optimizations: 3.11 (SE +/- 0.03, N = 3)
HBv4: 3.08 (SE +/- 0.02, N = 3)
HBv2: 2.08 (SE +/- 0.01, N = 3)
HBv2 + Optimizations: 2.03 (SE +/- 0.01, N = 3)
HC + Optimizations: 1.85 (SE +/- 0.01, N = 3)
HC: 1.82 (SE +/- 0.02, N = 3)
HBv3 + Optimizations: 1.72 (SE +/- 0.02, N = 4)
HBv3: 1.68 (SE +/- 0.02, N = 3)

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: Stock - Precision: float-long - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 247.73 (SE +/- 4.85, N = 15)
HBv3: 105.36 (SE +/- 1.07, N = 6)
HBv2: 92.13 (SE +/- 1.33, N = 3)
HC: 59.55 (SE +/- 0.27, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: FFTW - Precision: float - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 622.58 (SE +/- 2.25, N = 3)
HBv3: 254.25 (SE +/- 2.52, N = 6)
HBv2: 191.78 (SE +/- 1.03, N = 3)
HC: 114.03 (SE +/- 0.09, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: Stock - Precision: double - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 121.61 (SE +/- 1.20, N = 3)
HBv2: 50.71 (SE +/- 0.29, N = 3)
HBv3: 38.45 (SE +/- 0.29, N = 11)
HC: 30.17 (SE +/- 0.08, N = 3)
Note: (CXX) g++ options: -O3 -pthread

Remhos

Test: Sample Remap Example

Remhos 1.0 (Seconds, Fewer Is Better)
HBv2: 14.93 (SE +/- 0.07, N = 3)
HBv3: 15.26 (SE +/- 0.02, N = 3)
HBv4: 15.37 (SE +/- 0.14, N = 3)
HC: 27.38 (SE +/- 0.06, N = 3)
Note: (CXX) g++ options: -O3 -std=c++11 -lmfem -lHYPRE -lmetis -lrt -pthread -lmpi

Pennant

Test: sedovbig

Pennant 1.0.1 (Hydro Cycle Time - Seconds, Fewer Is Better)
HBv4: 3.581391 (SE +/- 0.018282, N = 3)
HBv2: 5.915805 (SE +/- 0.011742, N = 3)
HBv3: 6.277107 (SE +/- 0.027453, N = 3)
HC: 25.019560 (SE +/- 0.026763, N = 3)
Note: (CXX) g++ options: -fopenmp -pthread -lmpi

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: Stock - Precision: float - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 596.23 (SE +/- 2.14, N = 3)
HBv3: 232.17 (SE +/- 1.85, N = 3)
HBv2: 190.95 (SE +/- 2.04, N = 3)
HC: 110.05 (SE +/- 0.06, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: Stock - Precision: float-long - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 590.93 (SE +/- 2.49, N = 3)
HBv3: 233.80 (SE +/- 0.15, N = 3)
HBv2: 189.21 (SE +/- 1.02, N = 3)
HC: 110.20 (SE +/- 0.10, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: r2c - Backend: FFTW - Precision: float-long - X Y Z: 512

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 624.95 (SE +/- 4.23, N = 3)
HBv3: 257.42 (SE +/- 2.91, N = 3)
HBv2: 191.14 (SE +/- 1.39, N = 3)
HC: 113.94 (SE +/- 0.18, N = 3)
Note: (CXX) g++ options: -O3 -pthread

oneDNN

Harness: IP Shapes 1D - Data Type: f32 - Engine: CPU

oneDNN 3.1 (ms, Fewer Is Better)
HBv4: 0.752929 (SE +/- 0.001421, N = 3)
HC: 0.882446 (SE +/- 0.000702, N = 3)
HBv3: 0.910091 (SE +/- 0.013826, N = 12)
HBv2: 1.407580 (SE +/- 0.014464, N = 3)
MIN values reported: 0.69, 0.83, 1.11
Note: (CXX) g++ options: -O3 -march=native -fopenmp -msse4.1 -fPIC -pie -lpthread -ldl

Pennant

Test: leblancbig

Pennant 1.0.1 (Hydro Cycle Time - Seconds, Fewer Is Better)
HBv4: 2.122074 (SE +/- 0.029043, N = 3)
HBv2: 3.466885 (SE +/- 0.009233, N = 3)
HBv3: 3.649317 (SE +/- 0.006682, N = 3)
HC: 10.645480 (SE +/- 0.017495, N = 3)
Note: (CXX) g++ options: -fopenmp -pthread -lmpi

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: Stock - Precision: double-long - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 123.41 (SE +/- 1.16, N = 3)
HBv2: 50.08 (SE +/- 0.55, N = 3)
HBv3: 38.57 (SE +/- 0.14, N = 3)
HC: 30.27 (SE +/- 0.03, N = 3)
Note: (CXX) g++ options: -O3 -pthread

HeFFTe - Highly Efficient FFT for Exascale

Test: c2c - Backend: FFTW - Precision: double - X Y Z: 256

HeFFTe - Highly Efficient FFT for Exascale 2.3 (GFLOP/s, More Is Better)
HBv4: 123.39 (SE +/- 1.65, N = 3)
HBv2: 50.90 (SE +/- 0.55, N = 3)
HBv3: 39.81 (SE +/- 0.14, N = 3)
HC: 30.12 (SE +/- 0.08, N = 3)
Note: (CXX) g++ options: -O3 -pthread

oneDNN

Harness: IP Shapes 3D - Data Type: f32 - Engine: CPU

oneDNN 3.1 (ms, Fewer Is Better)
HBv4: 0.306141 (SE +/- 0.002422, N = 3)
HBv3: 0.624233 (SE +/- 0.039917, N = 15)
HC: 2.079200 (SE +/- 0.093711, N = 12)
HBv2: 6.838250 (SE +/- 0.032665, N = 3)
MIN values reported: 5.97
Note: (CXX) g++ options: -O3 -march=native -fopenmp -msse4.1 -fPIC -pie -lpthread -ldl

oneDNN

Harness: Convolution Batch Shapes Auto - Data Type: f32 - Engine: CPU

oneDNN 3.1 (ms, Fewer Is Better)
HBv4: 0.276472 (SE +/- 0.000440, N = 3)
HBv3: 0.556741 (SE +/- 0.001799, N = 3)
HBv2: 0.573878 (SE +/- 0.002431, N = 3)
HC: 3.111210 (SE +/- 0.015370, N = 3)
MIN values reported: 0.5, 0.47, 1.73
Note: (CXX) g++ options: -O3 -march=native -fopenmp -msse4.1 -fPIC -pie -lpthread -ldl

oneDNN

Harness: Deconvolution Batch shapes_3d - Data Type: f32 - Engine: CPU

oneDNN 3.1 (ms, Fewer Is Better)
HBv4: 0.582806 (SE +/- 0.001551, N = 3; MIN: 0.56)
HC: 1.244800 (SE +/- 0.002723, N = 3; MIN: 1.22)
HBv3: 1.408620 (SE +/- 0.003506, N = 3; MIN: 1.36)
HBv2: 1.610020 (SE +/- 0.021847, N = 3; MIN: 1.49)
Note: (CXX) g++ options: -O3 -march=native -fopenmp -msse4.1 -fPIC -pie -lpthread -ldl


Phoronix Test Suite v10.8.4