CUDA NVIDIA Tegra X1 GPGPU Linux Tests

Benchmarks by Michael Larabel for a future article on Phoronix.com, providing various GPGPU benchmark results for reference purposes.

Compare your own system(s) to this result file with the Phoronix Test Suite by running the command: phoronix-test-suite benchmark 1703105-RI-1511154HA83

Date Run:
  Jetson TX1: November 13 2015
  Desktop: March 10 2017

Jetson TX1:
  Processor: Cortex A57 rev 1 @ 1.91GHz (4 Cores)
  Motherboard: jetson_tx1
  Memory: 4096MB
  Disk: 16GB 016G32 + 16GB SL16G
  Graphics: NVIDIA TEGRA
  OS: Ubuntu 14.04
  Kernel: 3.10.67-g3a5c467 (aarch64)
  Desktop: Unity 7.2.2
  Display Server: X Server 1.15.1
  Display Driver: NVIDIA 1.0.0
  Compiler: GCC 4.8.4 + CUDA 7.0
  File-System: ext4
  Screen Resolution: 3840x2160
Desktop:
  Processor: Intel Core i7-7700K @ 4.20GHz (8 Cores)
  Motherboard: ASRock Z270 Extreme4
  Chipset: Intel Device 591f
  Memory: 32768MB
  Disk: 525GB Crucial_CT525MX3 + 3001GB TOSHIBA DT01ACA3
  Graphics: NVIDIA GeForce GTX 1080 8192MB (101/405MHz)
  Audio: Realtek Generic
  Network: Intel Connection
  OS: Ubuntu 16.04
  Kernel: 4.4.0-66-generic (x86_64)
  Desktop: Unity 7.4.0
  Display Server: X Server 1.18.4
  Display Driver: NVIDIA 375.26
  OpenGL: 4.5.0
  Vulkan: 1.0.24
  Compiler: GCC 5.4.0 20160609 + CUDA 8.0
  Screen Resolution: 3840x1080

Compiler Details:
  Jetson TX1: --build=arm-linux-gnueabihf --disable-browser-plugin --disable-libitm --disable-libmudflap --disable-libquadmath --disable-sjlj-exceptions --disable-werror --enable-checking=release --enable-clocale=gnu --enable-gnu-unique-object --enable-gtk-cairo --enable-java-awt=gtk --enable-java-home --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-multiarch --enable-multilib --enable-nls --enable-objc-gc --enable-plugin --enable-shared --enable-threads=posix --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf --with-arch-directory=arm --with-arch=armv7-a --with-float=hard --with-fpu=vfpv3-d16 --with-mode=thumb -v
  Desktop: --build=x86_64-linux-gnu --disable-browser-plugin --disable-vtable-verify --disable-werror --enable-checking=release --enable-clocale=gnu --enable-gnu-unique-object --enable-gtk-cairo --enable-java-awt=gtk --enable-java-home --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --enable-libmpx --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-multiarch --enable-multilib --enable-nls --enable-objc-gc --enable-plugin --enable-shared --enable-threads=posix --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-abi=m64 --with-arch-32=i686 --with-arch-directory=amd64 --with-default-libstdcxx-abi=new --with-multilib-list=m32,m64,mx32 --with-tune=generic -v
Processor Details:
  Jetson TX1: Scaling Governor: tegra interactive
  Desktop: Scaling Governor: acpi-cpufreq ondemand
OpenCL Details:
  Desktop: GPU Compute Cores: 2560
System Details:
  Desktop: GPU Compute Cores: 2560

Jetson TX1 vs. Desktop Comparison (Phoronix Test Suite; Desktop relative to the Jetson TX1 baseline):
  ASKAP tConvolveCuda - Gridding: +3167.9%
  ASKAP tConvolveCuda - Degridding: +2268.4%
  CUDA Mini-Nbody - Cache Blocking: +2041%
  SHOC Scalable HeterOgeneous Computing - CUDA - MD5 Hash: +1964.5%
  CUDA Mini-Nbody - Flush Denormals To Zero: +1913.7%
  CUDA Mini-Nbody - SOA Data Layout: +1885%
  CUDA Mini-Nbody - Loop Unrolling: +1668.1%
  CUDA Mini-Nbody - Original: +1585.2%
  SHOC Scalable HeterOgeneous Computing - CUDA - FFT SP: +12488.3%
  SHOC Scalable HeterOgeneous Computing - CUDA - Texture Read Bandwidth: +1083.4%
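
The comparison percentages are consistent with the Desktop result expressed relative to the Jetson TX1 result, with the ratio inverted for time-based tests where fewer is better. A minimal Python sketch under that assumption follows; the function name `percent_over_baseline` is illustrative and not part of the Phoronix Test Suite.

```python
def percent_over_baseline(baseline, other, higher_is_better=True):
    """Return how much better `other` is than `baseline`, in percent."""
    # For time-based results (fewer is better) the ratio is inverted,
    # so a shorter run time still yields a positive percentage.
    ratio = other / baseline if higher_is_better else baseline / other
    return (ratio - 1.0) * 100.0


# Averages taken from this result file: Jetson TX1 is the baseline, Desktop is `other`.
gridding = percent_over_baseline(262.83, 8588.90)  # Million Grid Points Per Second
original = percent_over_baseline(513.47, 30.47, higher_is_better=False)  # Seconds

print(f"ASKAP Gridding: +{gridding:.1f}%")
print(f"CUDA Mini-Nbody Original: +{original:.1f}%")
```

Rounded to one decimal place, these two values match the +3167.9% and +1585.2% figures reported in the comparison.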

Test                                       Jetson TX1    Desktop
shoc: CUDA - FFT SP                        3.92          493.46
shoc: CUDA - MD5 Hash                      0.62          12.80
shoc: CUDA - Texture Read Bandwidth        46.62         551.68
askap: Gridding                            262.83        8588.90
askap: Degridding                          649.05        15372.07
cuda-mini-nbody: Original                  513.47        30.47
cuda-mini-nbody: Cache Blocking            277.48        12.96
cuda-mini-nbody: Loop Unrolling            236.40        13.37
cuda-mini-nbody: SOA Data Layout           529.59        26.68
cuda-mini-nbody: Flush Denormals To Zero   538.07        26.72

SHOC Scalable HeterOgeneous Computing

SHOC Scalable HeterOgeneous Computing 2015-11-10

Target: CUDA - Benchmark: FFT SP (GFLOPS, More Is Better):
  Jetson TX1: 3.92 (SE +/- 0.23, N = 6; Min: 3.06 / Avg: 3.92 / Max: 4.39)
  Desktop: 493.46 (SE +/- 2.73, N = 3; Min: 488.14 / Avg: 493.46 / Max: 497.24)

Target: CUDA - Benchmark: MD5 Hash (GHash/s, More Is Better):
  Jetson TX1: 0.62 (SE +/- 0.00, N = 3; Min: 0.62 / Avg: 0.62 / Max: 0.62)
  Desktop: 12.80 (SE +/- 0.00, N = 3; Min: 12.8 / Avg: 12.8 / Max: 12.81)

Target: CUDA - Benchmark: Texture Read Bandwidth (GB/s, More Is Better):
  Jetson TX1: 46.62 (SE +/- 0.84, N = 3; Min: 45.18 / Avg: 46.62 / Max: 48.11)
  Desktop: 551.68 (SE +/- 2.47, N = 3; Min: 546.75 / Avg: 551.68 / Max: 554.32)

1. (CXX) g++ options: -O2 -lSHOCCommon -lcudadevrt -lcudart_static -lrt -lpthread -ldl -lcufft
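
The SE figures are easier to compare across the two systems when expressed relative to the mean. The sketch below assumes that SE denotes the standard error of the mean over N runs, i.e. SE = s / sqrt(N); the page itself does not define it. The helper names are illustrative.

```python
import math


def relative_se_percent(mean, se):
    """Standard error expressed as a percentage of the mean."""
    return se / mean * 100.0


def implied_std_dev(se, n):
    """Sample standard deviation implied by SE = s / sqrt(N) (assumed definition)."""
    return se * math.sqrt(n)


# SHOC FFT SP values reported in this result file: (mean, SE, N).
fft_sp = {
    "Jetson TX1": (3.92, 0.23, 6),
    "Desktop": (493.46, 2.73, 3),
}

for system, (mean, se, n) in fft_sp.items():
    print(f"{system}: SE is {relative_se_percent(mean, se):.2f}% of the mean, "
          f"implied std dev {implied_std_dev(se, n):.2f}")
```

On these numbers the Jetson TX1 FFT SP result is noticeably noisier in relative terms (roughly 5.9% of the mean) than the Desktop result (roughly 0.6%).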

ASKAP tConvolveCuda

This is a CUDA benchmark of ATNF's ASKAP Benchmark, currently using the tConvolveCuda sub-test. Learn more via the OpenBenchmarking.org test page.

ASKAP tConvolveCuda 2015-11-10

Processing: Gridding (Million Grid Points Per Second, More Is Better):
  Jetson TX1: 262.83 (SE +/- 7.41, N = 6; Min: 239.44 / Avg: 262.83 / Max: 281.16)
  Desktop: 8588.90 (SE +/- 0.00, N = 3; Min: 8588.9 / Avg: 8588.9 / Max: 8588.9)

Processing: Degridding (Million Grid Points Per Second, More Is Better):
  Jetson TX1: 649.05 (SE +/- 7.47, N = 3; Min: 641.58 / Avg: 649.05 / Max: 663.98)
  Desktop: 15372.07 (SE +/- 290.03, N = 3; Min: 14792 / Avg: 15372.07 / Max: 15662.1)

Additional option shown on the graphs: -m64
1. (CXX) g++ options: -fPIC -O3 -lcudadevrt -lcudart_static -lrt -lpthread -ldl

CUDA Mini-Nbody

CUDA Mini-Nbody 2015-11-10

Test: Original (Seconds, Fewer Is Better):
  Jetson TX1: 513.47 (SE +/- 9.09, N = 6; Min: 489.28 / Avg: 513.47 / Max: 540.29)
  Desktop: 30.47 (SE +/- 0.07, N = 3; Min: 30.34 / Avg: 30.47 / Max: 30.57)

Test: Cache Blocking (Seconds, Fewer Is Better):
  Jetson TX1: 277.48 (SE +/- 7.87, N = 6; Min: 252.16 / Avg: 277.48 / Max: 295.53)
  Desktop: 12.96 (SE +/- 0.02, N = 3; Min: 12.94 / Avg: 12.96 / Max: 13)

Test: Loop Unrolling (Seconds, Fewer Is Better):
  Jetson TX1: 236.40 (SE +/- 3.29, N = 6; Min: 224.42 / Avg: 236.4 / Max: 247.14)
  Desktop: 13.37 (SE +/- 0.03, N = 3; Min: 13.31 / Avg: 13.37 / Max: 13.41)

Test: SOA Data Layout (Seconds, Fewer Is Better):
  Jetson TX1: 529.59 (SE +/- 8.35, N = 3; Min: 512.9 / Avg: 529.59 / Max: 538.32)
  Desktop: 26.68 (SE +/- 0.01, N = 3; Min: 26.67 / Avg: 26.68 / Max: 26.69)

Test: Flush Denormals To Zero (Seconds, Fewer Is Better):
  Jetson TX1: 538.07 (SE +/- 5.74, N = 3; Min: 527.14 / Avg: 538.07 / Max: 546.56)
  Desktop: 26.72 (SE +/- 0.01, N = 3; Min: 26.7 / Avg: 26.72 / Max: 26.74)
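
Because the Mini-Nbody results are run times, each variant can also be read as a speedup over the Original test on the same system. The sketch below does that with the averages from this result file; it assumes every variant runs a comparable workload, which the page does not state, and the `speedups` helper is illustrative.

```python
# Average run times in seconds, taken from the CUDA Mini-Nbody results.
times = {
    "Jetson TX1": {
        "Original": 513.47,
        "Cache Blocking": 277.48,
        "Loop Unrolling": 236.40,
        "SOA Data Layout": 529.59,
        "Flush Denormals To Zero": 538.07,
    },
    "Desktop": {
        "Original": 30.47,
        "Cache Blocking": 12.96,
        "Loop Unrolling": 13.37,
        "SOA Data Layout": 26.68,
        "Flush Denormals To Zero": 26.72,
    },
}


def speedups(system_times, reference="Original"):
    """Speedup of each test relative to `reference` (values above 1.0 are faster)."""
    ref = system_times[reference]
    return {test: ref / seconds for test, seconds in system_times.items()}


for system, system_times in times.items():
    for test, factor in speedups(system_times).items():
        print(f"{system:<10} {test:<24} {factor:.2f}x")
```

A factor below 1.0 means the variant took longer than Original on that system, as is the case for SOA Data Layout and Flush Denormals To Zero on the Jetson TX1.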

10 Results Shown

SHOC Scalable HeterOgeneous Computing:
  CUDA - FFT SP
  CUDA - MD5 Hash
  CUDA - Texture Read Bandwidth
ASKAP tConvolveCuda:
  Gridding
  Degridding
CUDA Mini-Nbody:
  Original
  Cache Blocking
  Loop Unrolling
  SOA Data Layout
  Flush Denormals To Zero