Why is the CPU slower for calculations then the GPU when only Memory should matter? - algorithm

A modern CPU has a ethash hashrate from under 1MH/s (source: https://ethereum.stackexchange.com/questions/2325/is-cpu-mining-even-worth-the-ether ) while GPUs mine with over 20MH/s easily. With overclocked memory they reach rates up to 30MH/s.
GPUs have GDDR Memory with Clockrates of about 1000MHz while DDR4 runs with higher clock speeds. Bandwith of DDR4 seems also to be higher (sources: http://www.corsair.com/en-eu/blog/2014/september/ddr3_vs_ddr4_synthetic and https://en.wikipedia.org/wiki/GDDR5_SDRAM )
It is said for Dagger-Hashimoto/ethash bandwith of memory is the thing that matters (also experienced from overclocking GPUs) which I find reasonable since the CPU/GPU only has to do 2x sha3 (1x Keccak256 + 1x Keccak512) operations (source: https://github.com/ethereum/wiki/wiki/Ethash#main-loop ).
A modern Skylake processor can compute over 100M of Keccak512 operations per second (see here: https://www.cryptopp.com/benchmarks.html ) so then core count difference between GPUs and CPUs should not be the problem.
But why don't we get about ~50Mhash/s from 2xKeccak operations and memory loading on a CPU?

See http://www.nvidia.com/object/what-is-gpu-computing.html for an overview of the differences between CPU and GPU programming.
In short, a CPU has a very small number of cores, each of which can do different things, and each of which can handle very complex logic.
A GPU has thousands of cores, that operate pretty much in lockstep, but can only handle simple logic.
Therefore the overall processing throughput of a GPU can be massively higher. But it isn't easy to move logic from the CPU to the GPU.
If you want to dive in deeper and actually write code for both, one good starting place is https://devblogs.nvidia.com/gpu-computing-julia-programming-language/.

"A modern Skylake processor can compute over 100M of Keccak512 operations per second" is incorrect, it is 140 MiB/s. That is MiBs per second and a hash operation is more than 1 byte, you need to divide the 140 MiB/s by the number of bytes being hashed.

I found an article addressing my problem (the influence of Memory on the algorithm).
It's not only the computation problem (mentioned here: https://stackoverflow.com/a/48687460/2298744 ) it's also the Memorybandwidth which would bottelneck the CPU.
As described in the article every round fetches 8kb of data for calculation. This results in the following formular:
(Memory Bandwidth) / ( DAG memory fetched per hash) = Max Theoreticical Hashrate
(Memory Bandwidth) / ( 8kb / hash) = Max Theoreticical Hashrate
For a grafics card like the RX470 mentioned this results in:
(211 Gigabytes / sec) / (8 kilobytes / hash) = ~26Mhashes/sec
While for CPUs with DDR4 this will result in:
(12.8GB / sec) / (8 kilobytes / hash) = ~1.6Mhashes/sec
or (debending on clock speeds of RAM)
(25.6GB / sec) / (8 kilobytes / hash) = ~3.2Mhashes/sec
To sum up, a CPU or also GPU with DDR4 ram could not get more than 3.2MHash/s since it can't get the data fast enough needed for processing.
Source:
https://www.vijaypradeep.com/blog/2017-04-28-ethereums-memory-hardness-explained/

Related

How can CPU's have FLOPS much higher than their clock speeds?

For example, a modern i7-8700k can supposedly do ~60 GFLOPS (single-precision, source) while its maximum frequency is 4.7GHz. As far as I am aware, an instruction has to take at least one cycle to complete, so how is this possible?
There are multiple factors that are all multiplied together for this large effect:
SIMD, Intel 8700k and similar processors support AVX and AVX2, which includes many instructions that operate on registers that can hold 8 floats at the same time.
multiple cores, 8700k has 6 cores.
fused multiply-add, part of AVX2, has both a multiplication and addition in the same instruction.
high throughput execution. The latency (time an individual instruction takes) is not directly important to how much computation a processor can do in a unit of time. A modern CPU such as 8700k can start executing two (independent) FMAs in the same cycle (and keep in mind these are still SIMD instructions so that represents a lot of floating point operations) even through the latency of the operation is actually 4 cycles.
Multiplying all those factors together we get: 8 * 6 * 2 * 2 * 4.3 = 825 GFLOPS (matching the stats reported here). This calculation certainly does not mean that it can actually be attained. For example the processor may downclock significantly under such a workload in order to stay within its power budget, which is what Intel has been doing at least since Haswell (though the specifics have changed and it applied to server parts). Also, most real code has significant trouble feeding that many FMAs with data. Large matrix multiplications can get close though, and for example according to these stats the 8700k reached 496.7 Gflops in their SGEMM benchmark. Possibly the 8700k's max AVX2 turbo speed on 6 cores is 2.6GHz but as far as I can find it does not have an AVX offset by default (only needed when overclocked), or that GEMM is just not that close to hitting peak FLOPS.

Calculate CPU cycles and its handling bits per second

I've wondered about the amount of bits that a given CPU can handles and how should I manage to calculate this specific value.
I wanted to make sure that my thought and also my calculation are right.
Given that I have 64 bit CPU which feature 2.3 Ghz. The amount of bits that being handled per second should be the following :
(2.3 * 10^9) * 64
Is it that simple or I need consider any other variables?
Calculating throughput of a CPU is not as simple. Modern CPUs are pipelined and can execute multiple instructions concurrently if there are no data dependencies, yielding instructions per cycle (IPC) > 1. Some instructions may also have a higher latency than one cycle, meaning IPC can be < 1.
For some operations they might be constrained by memory bandwidth.
And then there are SIMD instructions which can process more data per instruction than the width of a single general purpose register.
http://www.agner.org/optimize/instruction_tables.pdf
https://en.wikipedia.org/wiki/Instruction_pipelining

Calculating actual flop/core when using actual memory bandwidth

I want to calculate the actual amount of mflop/s/core using the following information:
I have measured actual amount of memory bandwidth of each core in 1 node which is 4371 MB/s.
I have also measured mflop/s/core on one node if I use only one core on a node(in this case the whole memory of node would be available for that core), the result is 2094.45. So I measured the memory bandwidth which was available for that core = 10812.3 MB/s
So now I want to calculate the actual mflop/s/core when the core has it's real memory bandwidth (4371MB/s).
Do you think it would be correct if I calculate it like this:
actual mflop/s/core= (mflop/s/core * actual memory bw) / used memory bandwidth
Any help would be appreciated.

Problems about "cycles/byte" change to "MB/s" when measuring AES speed

I am reading the AES (Advanced Encryption Standard) article (see http://cr.yp.to/aes-speed/aesspeed-20080926.pdf)
They measure the encryption speed using "Cycles/byte", as I know, if the CPU is known, I can calculate this "Cycles/byte" to "MB/s" by "(number of processors × processor_utilization × processor clock frequency) / Throughput rate in bytes per second or transaction per second = cycles per Byte"
My question is what if I am using a four core Intel CPU, how should I calculate how many processors I am using or these four cores are treated as "one CPU". It seems like they cannot be treated as one CPU because a old CPU with 3.0 GHz will run faster than 2.2GHz dual core CPU then.
Thanks in advance.

How to avoid TLB miss (and high Global Memory Replay Overhead) in CUDA GPUs?

The title might be more specific than my actual problem is, although I believe answering this question would solve a more general problem, which is: how to decrease the effect of high latency (~700 cycle) that comes from random (but coalesced) global memory access in GPUs.
In general if one accesses the global memory with coalesced load (eg. I read 128 consecutive bytes), but with very large distance (256KB-64MB) between coalesced accesses, one gets a high TLB (Translation Lookaside Buffer) miss rate. This high TLB miss rate is due to the limited number (~512) and size (~4KB) of the memory pages used in the TLB lookup table.
I suppose the high TLB miss rate because of the fact that virtual memory is used by NVIDIA, the fact that I get high (98%) Global Memory Replay Overhead and low throughput (45GB/s, with a K20c) in the profiler and the fact that partition camping is not an issue since Fermi.
Is it possible to avoid high TLB miss rate somehow? Would 3D texture cache help if I'm accessing a (X x Y x Z) cube coalesced along X dimension and with a X*Y "stride" along the Z dimension?
Any comment on this topic is appreciated.
Constraints: 1) global data can not be reordered/transposed; 2) kernel is communication bound.
You can only avoid TLB misses by changing your memory access pattern. A different layout of your data in memory can help with this.
A 3D texture will not improve your situation, as it trades improved spatial locality in two additional dimensions against reduced spatial locality in the third dimension. Thus you would unnecessarily read data of neighbors along the Y axis.
What you can do however is mitigate the impact of the resulting latency on throughput. In order to hide t = 700 cycles of latency at a global memory bandwidth of b = 250GB/s, you need to have memory transactions for b / t = 175 KB of data in flight at any time (or 12.5 KB for each of the 14 SMX). With a fully loaded memory interface and a high ratio of TLB misses, you will however find that latency gets closer to 2000 cycles, requiring roughly 32 KB of transactions in flight per sm.
As each word of a memory read transaction in flight requires one register where the value will be stored once it arrives, hiding memory latency has to be balances against register pressure. Keeping 32 KB of data in flight requires 8192 registers, or 12.5% of the total registers available on an SMX.
(Note that for above rough estimates I have neglected the difference between KiB and KB).

Resources