How is the Geekbench score of the Snapdragon 865 greater than that of the Intel i7-7700HQ?

Also, are there any valid methods or metrics for comparing the equivalent performance (throughput) of an Intel (amd64) processor and a Snapdragon (ARM) processor?
The following are the links that show the Geekbench scores of i7-7700HQ and Snapdragon 865 (OnePlus 8):
i7-7700HQ: https://browser.geekbench.com/processors/intel-core-i7-7700hq (Score: 3300)
Snapdragon 865 (OP8): https://browser.geekbench.com/android_devices/oneplus-8 (Score: 3320)
Overall Rankings (check the multi-core scores):
Android: https://browser.geekbench.com/android-benchmarks
Processors: https://browser.geekbench.com/processor-benchmarks

Related

How to calculate speedup and efficiency in a hybrid CPU and GPU algorithm?

I have an algorithm that I have executed in parallel using only CPUs, achieving a speedup of 30x. That is, an efficiency of about 0.94 (efficiency = speedup / cores, i.e. 30/32 = 0.9375).
Later I added 2 GPUs (Tesla C2075, 448 CUDA cores each) alongside the 32 CPU cores.
To calculate the efficiency including CPUs and GPUs, should I add the amount of GPU cores to the CPU cores? That is, I would calculate the efficiency using 928 cores (32 + 448 + 448 = 928). Or should it be calculated differently?
Speedup and efficiency have been calculated based on what is said here:
https://software.intel.com/en-us/articles/predicting-and-measuring-parallel-performance
GPUs are built from larger "core complex" units, called SMs (Nvidia) or CUs (AMD), each containing tens of pipelines. They are not exactly equivalent to a CPU's SIMD units, but they similarly issue instructions in parallel to those pipelines from single-threaded kernel code.
You counted CPU "cores" rather than SIMD lanes (which number 4 to 16 per core), so it would be consistent to count the SM units of an Nvidia GPU, the CUs of an AMD GPU, the slices/subslices of an Intel GPU, etc.
The Tesla C2075 has 14 SM units, so you could add 14 for each GPU (32 + 14 + 14).
If your CPU code is also vectorised (SIMD), then it would be consistent to count each GPU pipeline instead, which is 32 to 192 times the number of SMs/CUs (448 per GPU in your case), giving 32*SIMD_WIDTH + 448 + 448.
At least, this is how I would compute "core efficiency" and "pipeline efficiency". If data transfer to/from the GPUs is not a bottleneck, efficiency should not drop much after the GPUs are added.
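As a rough sketch of the two counting conventions above (the hybrid CPU+GPU speedup is a placeholder, since it is not given in the question, and the CPU SIMD width of 4 is an assumption):

    % Efficiency under different ways of counting processing elements.
    cpu_cores      = 32;
    sm_per_gpu     = 14;    % Tesla C2075: 14 SMs
    pipes_per_gpu  = 448;   % 448 CUDA cores per GPU
    cpu_simd_width = 4;     % assumption: 4-wide SIMD lanes per CPU core

    cpu_speedup    = 30;    % measured, CPU-only
    hybrid_speedup = 100;   % placeholder: substitute your measured CPU+GPU speedup

    eff_cpu_only  = cpu_speedup / cpu_cores                                        % 0.9375
    eff_cores     = hybrid_speedup / (cpu_cores + 2*sm_per_gpu)                    % "core" efficiency
    eff_pipelines = hybrid_speedup / (cpu_cores*cpu_simd_width + 2*pipes_per_gpu)  % "pipeline" efficiency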

Building a Roofline Model

I'm trying to build a roofline model for a node in a supercomputer that I'm running simulations on. The node has 2x Intel Xeon E5-2650 v2 (Ivy Bridge) 8-core 2.6 GHz processors (16 cores per node), with 64 GB RAM total (4 GB each). The maximum memory bandwidth for the Intel Xeon E5-2650 is listed here as 59.7 GB/s.
Achieved GFLOPS = max mem bandwidth x arithmetic intensity.
Max GFLOPS = num cores x clock frequency in GHz x ops/cycle.
My code has arithmetic intensity of 1/3 and uses double precision floating point.
Here are my calculations for calculating the peak GFLOPs for the different types of program:
Sequential program (single core) no vectorisation:
1x2.6x1 (I assume without vectorisation, we can only achieve 1 op/cycle?) = 2.6 GFLOPs
Sequential program (single core) with vectorisation (SSE):
1x2.6x8 = 20.8 GFLOPs
All cores on one Xeon with vectorisation (SSE):
8x2.6x8 = 166.4 GFLOPs
All cores on both Xeons with vectorisation (SSE):
2x 8x2.6x8 = 332.8 GFLOPs
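A minimal sketch of the peak calculations above, using the question's own FLOPs-per-cycle figures (note, as a hedge, that 8 double-precision FLOPs per cycle on Ivy Bridge corresponds to AVX, i.e. a 4-wide DP add plus a 4-wide DP mul per cycle; 128-bit SSE would give 4):

    % Peak GFLOP/s = cores x clock (GHz) x FLOPs per cycle (question's assumptions)
    clock_ghz    = 2.6;
    flops_scalar = 1;   % no vectorisation: assume 1 FLOP/cycle
    flops_simd   = 8;   % 4-wide DP add + mul per cycle (AVX on Ivy Bridge)

    peak_scalar_1core  =  1 * clock_ghz * flops_scalar   %   2.6 GFLOPs
    peak_simd_1core    =  1 * clock_ghz * flops_simd     %  20.8 GFLOPs
    peak_simd_1socket  =  8 * clock_ghz * flops_simd     % 166.4 GFLOPs
    peak_simd_2sockets = 16 * clock_ghz * flops_simd     % 332.8 GFLOPs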
How does the memory bandwidth available to the program change between the different types of program shown above? I know that the max memory bandwidth for one Xeon E5-2650 is 59.7 GB/s, but is this achievable from a single core? Does this become 119.4 GB/s with two Xeon E5-2650s?
So would the achieved GFLOPs (using peak bandwidth x arithmetic intensity) be:
Sequential program w/o vectorisation:
59.7 * 1/3 = 19.9 GFLOPs; however, because our compute roofline is 2.6 GFLOPs, we are limited to 2.6 GFLOPs?
Sequential program with vectorisation:
59.7 * 1/3 = 19.9 GFLOPs. This is achievable because our roofline is 20.8 GFLOPs.
One Xeon (using all 8 cores) with vectorisation:
59.7 * 1/3 = 19.9 GFLOPs. I am suspicious of this, because surely our parallel program is capable of generating more memory requests than the sequential program, and surely the sequential program doesn't saturate the memory system?
Two Xeons (total of 16 cores) with vectorisation:
119.4 * 1/3 = 39.8 GFLOPs.
I feel like something is wrong with the achieved GFLOPs, have I made a mistake somewhere?
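For what it's worth, the roofline model is usually written as attainable GFLOP/s = min(peak GFLOP/s, bandwidth x arithmetic intensity). A sketch using the question's own numbers follows; whether a single core can actually draw the full 59.7 GB/s is exactly the open question above, so treating that figure as available to every configuration is an assumption:

    % Attainable GFLOP/s = min(compute roofline, memory bandwidth x arithmetic intensity)
    ai = 1/3;   % FLOP/byte, from the question
    roofline = @(peak_gflops, bw_gbs) min(peak_gflops, bw_gbs * ai);

    roofline(  2.6,  59.7)   % scalar, 1 core    ->  2.6  (compute-bound)
    roofline( 20.8,  59.7)   % SIMD,   1 core    -> 19.9  (memory-bound)
    roofline(166.4,  59.7)   % SIMD,   1 socket  -> 19.9  (memory-bound)
    roofline(332.8, 119.4)   % SIMD,   2 sockets -> 39.8  (memory-bound)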

Faster processor with a weaker graphics card, or a faster graphics card with a weaker processor?

So we have PC 1 and PC 2.
Which one would, for instance, be better at running a PS2 emulator?
PC1 has an Intel Pentium (newest generation) with a CPU clock of 2.58 GHz.
PC2 has an Intel Core i3 (4th generation) with a CPU clock of 2.30 GHz.
Which should I get?

Unenhanced performance of MATLAB GPU computing

With the intention of comparing the speed of GPU vs CPU computing, I ran the example code available here (a Mandelbrot set on the GPU) from MATLAB Central. Below are the results that I obtained:
Case 1 (without GPU): 6.2 secs
Case 2 (using parallel.gpu.GPUArray): 6.518 secs (1.39 secs in the example)
Case 3 (Using Element-wise Operation): 1.259 secs (0.14 secs in the example)
As can be seen, there is no improvement in case 2 and only a modest improvement of around 4x in case 3. As the example did not state the details of the GPU used, may I know if this is simply due to the "incompetency" of my graphics card, or am I missing something important?
The graphics card is also responsible for driving my display (HP Z Display Z23i 23-inch IPS LED Backlit Monitor).
CPU: Intel i7-4790, 3.6 GHz (4 cores / 8 threads)
GPU:
Name: 'NVS 510'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6
ToolkitVersion: 5
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.6934e+09
MultiprocessorCount: 1
ClockRateKHz: 797000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Thank you!
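(As an aside, the property listings in this question and in the answer below look like the output of MATLAB's gpuDevice; a minimal way to print the same information for your own card:)

    g = gpuDevice;   % query the currently selected CUDA device
    disp(g)          % prints the CUDADevice property listing shown above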
Edit
The GPU used in the example here is a Tesla C2050. (Credit to @Sam Roberts.)
The times on that link were most likely obtained on a different GPU from yours. They don't specify what kind of graphics card they were using, but my guess is that it was a higher-end card.
Googling the NVS 510, its specs are broadly comparable to the card in my machine; however, your card is geared towards business use while mine is geared towards gaming. I have a GTX 660, which is one of the higher-end GPUs available on the market.
These are the attributes of my graphics card:
CUDADevice with properties:
Name: 'GeForce GTX 660'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6.5000
ToolkitVersion: 5.5000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.5357e+09
MultiprocessorCount: 5
ClockRateKHz: 1084500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
The differences between my card and yours are that I have 5 multiprocessors, and my clock rate is about 300 MHz higher than yours. For a side-by-side comparison, check out my card versus yours:
NVS 510: http://www.nvidia.ca/object/nvs-510-graphics-card.html#pdpContent=2
GTX 660: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-660/specifications
Upon further inspection, my card has a much higher memory bandwidth than yours. I also have 960 GPU cores compared to your 192.
I decided to run those examples to compare my performance with your timings. My CPU is an Intel i7-4770 @ 3.6 GHz and I have 16 GB of RAM on my machine.
The times that I get by running those examples are the following:
Case #1 - Without GPU: 6.46 seconds
Case #2 - Naive GPU: 0.82 seconds - 7.9x faster
Case #3 - Through CUDA: 0.09 seconds - 71.7x faster
With this, my guess is that your graphics card is of a lower grade than the one used in the tests that MathWorks performed. Maybe try updating your graphics drivers and see if that helps. However, my guess is that my performance is much better due to the higher multiprocessor count, faster clock, greater number of cores and higher memory bandwidth.
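If you want a quick sanity check of the CPU/GPU gap on your own card, independent of the Mandelbrot example, here is a minimal element-wise timing sketch (the array size and the operation are arbitrary illustrative choices):

    % Rough CPU vs GPU element-wise timing comparison (illustrative only)
    N = 4000;
    x = rand(N);                                 % data on the CPU
    cpuTime = timeit(@() x.^2 + sin(x));         % element-wise work on the CPU

    xg = gpuArray(x);                            % copy the data to the GPU
    gpuTime = gputimeit(@() xg.^2 + sin(xg));    % same work on the GPU

    fprintf('CPU: %.3f s, GPU: %.3f s, speedup: %.1fx\n', ...
            cpuTime, gpuTime, cpuTime/gpuTime);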

How to derive the Peak performance in GFlop/s of Intel Xeon E5-2690?

I was able to find the theoretical DP peak performance of 371 GFlop/s for the Xeon E5-2690 in this processor comparison (interestingly, it is easier to find this information from Intel's competitor than from Intel's own support pages). However, when I try to derive that peak performance, my derivation doesn't match:
The frequency (in Turbo mode) for each core of the Xeon E5-2690 = 3.8Ghz
The processor can do an add and a mul operation per cycle, so we get: 3.8 x 2 = 7.6
Given that it has AVX support, it can do 4 double-precision operations per cycle: 7.6 x 4 = 30.4
Finally, it has 8 cores, therefore we get: 8 x 30.4 = 243.2
Thus, the peak performance in Gflop/s would be 243.2 GFlop/s and not 371 GFlop/s?
Turbo mode is not used to calculate theoretical peak performance; you have to consider something like:
CPU speed = 2.9 GHz
CPU Cores = 8
CPU FLOPs per cycle = 8 (AVX-256: a 256-bit unit can hold 8 single-precision values) x 2 (add and mul operations, as you said) = 16
Putting it all together:
2.9 x 8 x 16 = 371.2 ≈ 371 GFlop/s
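A minimal sketch of both derivations side by side (worth noting, as the answer itself hints, that the 371 GFlop/s figure uses 8 single-precision lanes; the double-precision peak at the base clock would be half of that):

    % Theoretical peak GFLOP/s = clock (GHz) x cores x FLOPs per cycle
    cores = 8;

    % Question's derivation: turbo clock, 4-wide DP, add + mul
    peak_turbo_dp = 3.8 * cores * (4 * 2)    % 243.2 GFlop/s

    % Answer's derivation: base clock, 8-wide SP, add + mul
    peak_base_sp  = 2.9 * cores * (8 * 2)    % 371.2 GFlop/s

    % Double-precision peak at the base clock, for comparison
    peak_base_dp  = 2.9 * cores * (4 * 2)    % 185.6 GFlop/s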
