CUDA NVIDIA Tegra X1 GPGPU Linux Tests

Benchmarks by Michael Larabel for a future article on Phoronix.com, providing various GPGPU benchmark results for reference purposes.

Compare your own system(s) to this result file with the Phoronix Test Suite by running the command: phoronix-test-suite benchmark 1812291-SK-1812291SK27

Run Management

Result Identifier     Date Run            Test Duration
Jetson TX1            November 13 2015    -
Jetson TX2 Hogh-P     December 29 2018    3 Hours, 28 Minutes
NVIDIA GTX 650 Ti     December 29 2018    1 Minute
Average                                   1 Hour, 10 Minutes



System Details

Jetson TX1:
  Processor: Cortex A57 rev 1 @ 1.91GHz (4 Cores)
  Motherboard: jetson_tx1
  Memory: 4096MB
  Disk: 16GB 016G32 + 16GB SL16G
  Graphics: NVIDIA TEGRA
  OS: Ubuntu 14.04
  Kernel: 3.10.67-g3a5c467 (aarch64)
  Desktop: Unity 7.2.2
  Display Server: X Server 1.15.1
  Display Driver: NVIDIA 1.0.0
  Compiler: GCC 4.8.4 + CUDA 7.0
  File-System: ext4
  Screen Resolution: 3840x2160

Jetson TX2 Hogh-P:
  Processor: ARMv8 rev 3 @ 2.04GHz (6 Cores)
  Motherboard: quill
  Memory: 8192MB
  Disk: 31GB 032G34
  OS: Ubuntu 16.04
  Kernel: 4.4.38-tegra (aarch64)
  Desktop: Unity 7.4.5
  Display Server: X Server 1.18.4
  Compiler: GCC 5.4.0 20160609 + CUDA 9.0
  Screen Resolution: 1366x768

NVIDIA GTX 650 Ti:
  Processor: Intel Core i5-2400 @ 3.40GHz (4 Cores)
  Motherboard: ASUS P8H67-M PRO (3802 BIOS)
  Chipset: Intel 2nd Generation Core Family DRAM
  Memory: 16384MB
  Disk: 1000GB Western Digital WD10EALX-009 + 250GB Western Digital WD2500AAKX-7 + SSD 240GB
  Graphics: Intel 2nd Generation Core Family IGP 981MB
  Audio: Realtek ALC892
  Monitor: DELL E178WFP
  Network: Realtek RTL8111/8168/8411
  Kernel: 4.4.0-140-generic (x86_64)
  Display Driver: modesetting 1.18.4
  OpenCL: OpenCL 1.2 CUDA 10.0.211
  Compiler: GCC 5.5.0 20171010 + CUDA 9.0

Compiler Details:
- Jetson TX1: --build=arm-linux-gnueabihf --disable-browser-plugin --disable-libitm --disable-libmudflap --disable-libquadmath --disable-sjlj-exceptions --disable-werror --enable-checking=release --enable-clocale=gnu --enable-gnu-unique-object --enable-gtk-cairo --enable-java-awt=gtk --enable-java-home --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-multiarch --enable-multilib --enable-nls --enable-objc-gc --enable-plugin --enable-shared --enable-threads=posix --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf --with-arch-directory=arm --with-arch=armv7-a --with-float=hard --with-fpu=vfpv3-d16 --with-mode=thumb -v
- Jetson TX2 Hogh-P: --build=aarch64-linux-gnu --disable-browser-plugin --disable-libquadmath --disable-werror --enable-checking=release --enable-clocale=gnu --enable-fix-cortex-a53-843419 --enable-gnu-unique-object --enable-gtk-cairo --enable-java-awt=gtk --enable-java-home --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-multiarch --enable-nls --enable-plugin --enable-shared --enable-threads=posix --host=aarch64-linux-gnu --target=aarch64-linux-gnu --with-arch-directory=aarch64 --with-default-libstdcxx-abi=new -v
- NVIDIA GTX 650 Ti: --build=x86_64-linux-gnu --disable-browser-plugin --disable-vtable-verify --disable-werror --enable-checking=release --enable-clocale=gnu --enable-gnu-unique-object --enable-gtk-cairo --enable-java-awt=gtk --enable-java-home --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --enable-libmpx --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-multiarch --enable-multilib --enable-nls --enable-objc-gc --enable-plugin --enable-shared --enable-threads=posix --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-abi=m64 --with-arch-32=i686 --with-arch-directory=amd64 --with-default-libstdcxx-abi=new --with-multilib-list=m32,m64,mx32 --with-tune=generic -v

Processor Details:
- Jetson TX1: Scaling Governor: tegra interactive
- Jetson TX2 Hogh-P: Scaling Governor: tegra_cpufreq schedutil
- NVIDIA GTX 650 Ti: Scaling Governor: intel_pstate powersave

Security Details:
- NVIDIA GTX 650 Ti: KPTI + __user pointer sanitization + Full generic retpoline IBPB (Intel v4) IBRS_FW + SSB disabled via prctl and seccomp + PTE Inversion

Results Overview

Test                                       Jetson TX1   Jetson TX2 Hogh-P   NVIDIA GTX 650 Ti
shoc: CUDA - FFT SP                        3.92         8.24                -
shoc: CUDA - MD5 Hash                      0.62         0.98                -
shoc: CUDA - Texture Read Bandwidth        46.62        79.04               -
askap: Gridding                            263          905                 -
askap: Degridding                          649          1513                -
cuda-mini-nbody: Original                  513          781                 -
cuda-mini-nbody: Cache Blocking            277          176                 -
cuda-mini-nbody: Loop Unrolling            236          200                 2.14
cuda-mini-nbody: SOA Data Layout           530          557                 -
cuda-mini-nbody: Flush Denormals To Zero   538          554                 -
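The summary figures above can be turned into per-test Jetson TX2 over Jetson TX1 ratios. The following is a minimal Python sketch, not part of the original result file: the names `RESULTS`, `speedup` and `geometric_mean` are illustrative, the numbers are copied from this page, and the direction flag follows the "More Is Better" / "Fewer Is Better" labels shown on each graph. The GTX 650 Ti is left out because it has only one result.

```python
from math import prod

# (test, Jetson TX1 result, Jetson TX2 result, higher_is_better)
# Values copied from the results on this page; the direction flag follows
# the "More Is Better" / "Fewer Is Better" label of each graph.
RESULTS = [
    ("shoc: CUDA - FFT SP", 3.92, 8.24, True),
    ("shoc: CUDA - MD5 Hash", 0.62, 0.98, True),
    ("shoc: CUDA - Texture Read Bandwidth", 46.62, 79.04, True),
    ("askap: Gridding", 263, 905, True),
    ("askap: Degridding", 649, 1513, True),
    ("cuda-mini-nbody: Original", 513, 781, False),
    ("cuda-mini-nbody: Cache Blocking", 277, 176, False),
    ("cuda-mini-nbody: Loop Unrolling", 236, 200, False),
    ("cuda-mini-nbody: SOA Data Layout", 530, 557, False),
    ("cuda-mini-nbody: Flush Denormals To Zero", 538, 554, False),
]


def speedup(tx1, tx2, higher_is_better):
    """Return how many times better the TX2 result is (above 1.0 means TX2 wins)."""
    return tx2 / tx1 if higher_is_better else tx1 / tx2


def geometric_mean(values):
    """Geometric mean, the usual way to average ratios across unlike tests."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


ratios = {name: speedup(a, b, hib) for name, a, b, hib in RESULTS}

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name:42s} {ratio:5.2f}x")
    print(f"{'geometric mean':42s} {geometric_mean(ratios.values()):5.2f}x")
```

A ratio below 1.0 marks a test where the Jetson TX1 result is the better one under the page's stated direction.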

SHOC Scalable HeterOgeneous Computing

SHOC Scalable HeterOgeneous Computing 2015-11-10
Target: CUDA - Benchmark: FFT SP
GFLOPS, More Is Better
  Jetson TX2 Hogh-P: 8.24 (SE +/- 0.02, N = 3; Min: 8.2 / Avg: 8.24 / Max: 8.26)
  Jetson TX1: 3.92 (SE +/- 0.23, N = 6; Min: 3.06 / Avg: 3.92 / Max: 4.39)
  (CXX) g++ options: -O2 -lSHOCCommon -lcudadevrt -lcudart_static -lrt -lpthread -ldl -lcufft

SHOC Scalable HeterOgeneous Computing 2015-11-10
Target: CUDA - Benchmark: MD5 Hash
GHash/s, More Is Better
  Jetson TX2 Hogh-P: 0.98 (SE +/- 0.00, N = 3; Min: 0.98 / Avg: 0.98 / Max: 0.98)
  Jetson TX1: 0.62 (SE +/- 0.00, N = 3; Min: 0.62 / Avg: 0.62 / Max: 0.62)
  (CXX) g++ options: -O2 -lSHOCCommon -lcudadevrt -lcudart_static -lrt -lpthread -ldl -lcufft

SHOC Scalable HeterOgeneous Computing 2015-11-10
Target: CUDA - Benchmark: Texture Read Bandwidth
GB/s, More Is Better
  Jetson TX2 Hogh-P: 79.04 (SE +/- 0.77, N = 12; Min: 75.76 / Avg: 79.04 / Max: 83.43)
  Jetson TX1: 46.62 (SE +/- 0.84, N = 3; Min: 45.18 / Avg: 46.62 / Max: 48.11)
  (CXX) g++ options: -O2 -lSHOCCommon -lcudadevrt -lcudart_static -lrt -lpthread -ldl -lcufft

ASKAP tConvolveCuda

This is a CUDA benchmark of ATNF's ASKAP Benchmark, currently using the tConvolveCuda sub-test.

ASKAP tConvolveCuda 2015-11-10
Processing: Gridding
Million Grid Points Per Second, More Is Better
  Jetson TX2 Hogh-P: 905 (SE +/- 1.02, N = 3; Min: 902.56 / Avg: 904.61 / Max: 905.63)
  Jetson TX1: 263 (SE +/- 7.41, N = 6; Min: 239.44 / Avg: 262.83 / Max: 281.16)
  (CXX) g++ options: -fPIC -O3 -lcudadevrt -lcudart_static -lrt -lpthread -ldl

ASKAP tConvolveCuda 2015-11-10
Processing: Degridding
Million Grid Points Per Second, More Is Better
  Jetson TX2 Hogh-P: 1513 (SE +/- 4.96, N = 3; Min: 1504.27 / Avg: 1512.85 / Max: 1521.46)
  Jetson TX1: 649 (SE +/- 7.47, N = 3; Min: 641.58 / Avg: 649.05 / Max: 663.98)
  (CXX) g++ options: -fPIC -O3 -lcudadevrt -lcudart_static -lrt -lpthread -ldl

CUDA Mini-Nbody

CUDA Mini-Nbody 2015-11-10
Test: Original
Seconds, Fewer Is Better
  Jetson TX2 Hogh-P: 781 (SE +/- 9.10, N = 9; Min: 747.95 / Avg: 780.53 / Max: 817.53)
  Jetson TX1: 513 (SE +/- 9.09, N = 6; Min: 489.28 / Avg: 513.47 / Max: 540.29)

CUDA Mini-Nbody 2015-11-10
Test: Cache Blocking
Seconds, Fewer Is Better
  Jetson TX2 Hogh-P: 176 (SE +/- 0.10, N = 3; Min: 175.46 / Avg: 175.59 / Max: 175.8)
  Jetson TX1: 277 (SE +/- 7.87, N = 6; Min: 252.16 / Avg: 277.48 / Max: 295.53)

CUDA Mini-Nbody 2015-11-10
Test: Loop Unrolling
Seconds, Fewer Is Better
  NVIDIA GTX 650 Ti: 2.14 (SE +/- 0.02, N = 3; Min: 2.11 / Avg: 2.14 / Max: 2.19)
  Jetson TX2 Hogh-P: 200.00 (SE +/- 0.99, N = 3; Min: 198.75 / Avg: 199.79 / Max: 201.77)
  Jetson TX1: 236.00 (SE +/- 3.29, N = 6; Min: 224.42 / Avg: 236.4 / Max: 247.14)

CUDA Mini-Nbody 2015-11-10
Test: SOA Data Layout
Seconds, Fewer Is Better
  Jetson TX2 Hogh-P: 557 (SE +/- 7.78, N = 3; Min: 541.64 / Avg: 557.18 / Max: 565.52)
  Jetson TX1: 530 (SE +/- 8.35, N = 3; Min: 512.9 / Avg: 529.59 / Max: 538.32)

CUDA Mini-Nbody 2015-11-10
Test: Flush Denormals To Zero
Seconds, Fewer Is Better
  Jetson TX2 Hogh-P: 554 (SE +/- 9.27, N = 4; Min: 535.97 / Avg: 553.83 / Max: 573.33)
  Jetson TX1: 538 (SE +/- 5.74, N = 3; Min: 527.14 / Avg: 538.07 / Max: 546.56)
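Several of the Mini-Nbody averages above sit close together relative to their reported standard errors. As a rough screen, the gap between two averages can be expressed in units of their combined standard error. The Python sketch below is an illustration only and not something the result file computes: the helper name `difference_in_se_units` is hypothetical, it assumes the two runs are independent, and with only 3 to 9 samples per result it is not a formal significance test.

```python
from math import sqrt


def difference_in_se_units(mean_a, se_a, mean_b, se_b):
    """Gap between two averages divided by their combined standard error.

    Assumes the two results are independent, so their standard errors
    combine as the square root of the sum of squares.
    """
    return (mean_a - mean_b) / sqrt(se_a ** 2 + se_b ** 2)


# (average, SE) pairs copied from the Mini-Nbody results on this page:
# first entry is Jetson TX2 Hogh-P, second is Jetson TX1.
RUNS = {
    "Cache Blocking": ((175.59, 0.10), (277.48, 7.87)),
    "SOA Data Layout": ((557.18, 7.78), (529.59, 8.35)),
    "Flush Denormals To Zero": ((553.83, 9.27), (538.07, 5.74)),
}

gaps = {
    name: difference_in_se_units(tx2[0], tx2[1], tx1[0], tx1[1])
    for name, (tx2, tx1) in RUNS.items()
}

if __name__ == "__main__":
    for name, gap in gaps.items():
        print(f"{name:26s} TX2 - TX1 = {gap:+.2f} combined SE")
```

Gaps of only one or two combined standard errors, as with the Flush Denormals To Zero pair, are best read as inconclusive given the small sample counts.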

10 Results Shown

SHOC Scalable HeterOgeneous Computing:
  CUDA - FFT SP
  CUDA - MD5 Hash
  CUDA - Texture Read Bandwidth
ASKAP tConvolveCuda:
  Gridding
  Degridding
CUDA Mini-Nbody:
  Original
  Cache Blocking
  Loop Unrolling
  SOA Data Layout
  Flush Denormals To Zero