Times of the Fortran MATMUL function with different multiplication sizes

I have measured the time spent by Fortran's MATMUL function for different multiplication sizes (32 × 32, 64 × 64, ...) and I have questions about the results.
These are the results:
SIZE     TIME IN SECONDS
32       0.000071
64       0.000032
128      0.001889
256      0.010866
512      0.043
1024     0.336
2048     2.878
4096     51.932
8192     405.921856
I would expect the times to increase by a factor of 8 each time the size doubles, since the work is proportional to m · n · k and each dimension doubles (2 · 2 · 2 = 8). I do not know whether it should behave like that; if it should, can someone explain why it does not?
In addition, there is a jump by a factor of about 18 when going from 2048 to 4096. Could someone tell me why?
I measured the times both with CALL CPU_TIME() and with CALL DATE_AND_TIME() from Fortran, and both give very similar results.
My processor is an AMD Phenom(tm) II X4 945 with 4 cores.

Steve is correct: there are many factors that affect performance, especially when data sizes are small. That is why all of your results at and below 2048 are semi-random and essentially irrelevant. All or most of the data likely sits in several layers of CPU cache, so thread scheduling and other hardware-related events skew these results badly. If you run these tests again you will get different results at these small sizes.
When you go from 2048 to 4096 you get a major jump: the data no longer fits into the CPU caches, and the computer has to load blocks of data from RAM into the caches. This explains the large jump in time.
At these sizes and larger, the computer has to follow the more typical pattern (load data from RAM, perform the operations, store the results back to RAM), and this is the performance you will get as the data grows even larger. It is also where performance becomes very consistent. Notice that going from 4096 to 8192 takes very close to exactly 8 times longer; at this point, going to 16384 will take almost exactly 8 times 406 seconds.
Any size smaller than 4096 is not giving your computer enough work to accurately measure the performance.

There should be a factor of 8 between each pair of timings, and deviations are generally due to memory effects such as cache alignment and cache size versus array size. For small arrays there may also be call overhead for matmul(). A triple do-loop can be faster, at least with some optimization (try -O3 -march=native), and should work equally well for small sizes.
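For reference, here is a minimal C sketch of that idea (the loop order, sizes, and use of clock() are just illustrative choices; a Fortran do-loop has the same structure). Compile with -O3 -march=native and compare against the library matmul at each size:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Naive triple-loop matrix multiply, C = C + A*B, row-major n x n matrices.
   The i-k-j order keeps the innermost loop unit-stride. */
static void naive_matmul(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}

int main(void)
{
    /* Same doubling sizes as the question; the naive loop gets slow past 2048. */
    for (int n = 32; n <= 2048; n *= 2) {
        double *a = calloc((size_t)n * n, sizeof *a);
        double *b = calloc((size_t)n * n, sizeof *b);
        double *c = calloc((size_t)n * n, sizeof *c);
        if (!a || !b || !c) return 1;
        for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; }

        clock_t t0 = clock();
        naive_matmul(a, b, c, n);
        clock_t t1 = clock();
        printf("n = %5d   %.6f s\n", n, (double)(t1 - t0) / CLOCKS_PER_SEC);

        free(a); free(b); free(c);
    }
    return 0;
}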

Related

How can CPUs have FLOPS much higher than their clock speeds?

For example, a modern i7-8700K can supposedly do ~60 GFLOPS (single-precision, source) while its maximum frequency is 4.7 GHz. As far as I am aware, an instruction has to take at least one cycle to complete, so how is this possible?
There are multiple factors that are all multiplied together for this large effect:
SIMD: the Intel 8700K and similar processors support AVX and AVX2, which include many instructions that operate on registers holding 8 single-precision floats at the same time.
Multiple cores: the 8700K has 6 cores.
Fused multiply-add: part of AVX2, it performs both a multiplication and an addition in the same instruction.
High-throughput execution: the latency (the time an individual instruction takes) is not directly important to how much computation a processor can do per unit of time. A modern CPU such as the 8700K can start executing two (independent) FMAs in the same cycle (and keep in mind these are still SIMD instructions, so that represents a lot of floating-point operations) even though the latency of the operation is actually 4 cycles.
Multiplying all those factors together we get 8 * 6 * 2 * 2 * 4.3 = 825 GFLOPS (matching the stats reported here). This calculation certainly does not mean that figure can actually be attained. For example, the processor may downclock significantly under such a workload in order to stay within its power budget, which is what Intel has been doing at least since Haswell (though the specifics have changed, and it mostly applied to server parts). Also, most real code has significant trouble feeding that many FMAs with data. Large matrix multiplications can get close, though: according to these stats the 8700K reached 496.7 GFLOPS in their SGEMM benchmark. Possibly the 8700K's max AVX2 turbo speed on 6 cores is 2.6 GHz (although, as far as I can find, it has no AVX offset by default, only when overclocked), or GEMM is simply not that close to hitting peak FLOPS.
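To make the arithmetic explicit, here is the same multiplication written as a tiny illustrative C helper (the function name and parameters are just for this sketch):

#include <stdio.h>

/* Theoretical peak = SIMD lanes * cores * FLOPs per FMA * FMAs started per cycle * clock (GHz). */
static double peak_gflops(int simd_lanes, int cores, int flops_per_fma,
                          int fmas_per_cycle, double clock_ghz)
{
    return (double)simd_lanes * cores * flops_per_fma * fmas_per_cycle * clock_ghz;
}

int main(void)
{
    /* 8 floats per AVX register, 6 cores, 2 FLOPs per FMA, 2 FMAs per cycle, 4.3 GHz */
    printf("%.1f GFLOPS\n", peak_gflops(8, 6, 2, 2, 4.3));   /* prints 825.6 */
    return 0;
}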

How does OpenCL distribute work items?

I'm testing and comparing GPU speed up with different numbers of work-items (no work-groups). The kernel I'm using is a very simple but long operation. When I test with multiple work-items, I use a barrier function and split the work in smaller chunks to get the same result as with just one work-item. I measure the kernel execution time using cl_event and the results are the following:
1 work-item: 35735 ms
2 work-items: 11822 ms (3 times faster than with 1 work-item)
10 work-items: 2380 ms (5 times faster than with 2 work-items)
100 work-items: 239 ms (10 times faster than with 10 work-items)
200 work-items: 122 ms (2 times faster than with 100 work-items)
CPU takes about 580 ms on average to do the same operation.
The only result I don't understand and can't explain is the one with 2 work items. I would expect the speed up to be about 2 times faster compared to the result with just one work item, so why is it 3?
I'm trying to make sense of these numbers by looking at how these work-items were distributed on processing elements. I'm assuming if I have just one kernel, only one compute unit (or multiprocessor) will be activated and the work items distributed on all processing elements (or CUDA cores) of that compute unit. What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
CL_DEVICE_MAX_WORK_ITEM_SIZES are 1024 / 1024 / 64 and CL_DEVICE_MAX_WORK_GROUP_SIZE 1024. Since I'm using just one dimension, does that mean I can have 1024 work-items running at the same time per processing element or per compute unit? When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
My GPU info: Nvidia GeForce GT 525M, 96 CUDA cores (2 compute units, 48 CUDA cores per unit)
The only result I don't understand and can't explain is the one with 2 work items. I would expect the speed up to be about 2 times faster compared to the result with just one work item, so why is it 3?
The exact reasons will probably be hard to pin down, but here are a few suggestions:
GPUs aren't optimised at all for small numbers of work items. Benchmarking that end of the scale isn't especially useful.
35 seconds is a very long time for a GPU. Your GPU probably has other things to do, so your work-item is probably being interrupted many times, with its context saved and resumed every time.
It will depend very much on your algorithm. For example, if your kernel uses local memory, or a work-size dependent amount of private memory, it might "spill" to global memory, which will slow things down.
Depending on your kernel's memory access patterns, you might be running into the effects of read/write coalescing: with more work items, neighbouring accesses can be combined into fewer memory transactions.
What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
Most GPU hardware supports a form of SMT to hide memory access latency. So a compute core will have up to some fixed number of work items in-flight at a time, and if one of them is blocked waiting for a memory access or barrier, the core will continue executing commands on another work item. Note that the maximum number of simultaneous threads can be further limited if your kernel uses a lot of local memory or private registers, because those are a finite resource shared by all cores in a compute unit.
Work-groups will normally run on only one compute unit at a time, because local memory and barriers don't work across units. So you don't want to make your groups too large.
One final note: compute hardware tends to be grouped in powers of 2, so it's usually a good idea to make your work-group sizes a multiple of e.g. 16 or 64. 1000 is a multiple of neither, which usually means some cores will be doing nothing.
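As an illustration of both points (work-group sizing and the cl_event timing mentioned in the question), here is a hedged host-side sketch. It assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and that the kernel arguments are already set; the function name is arbitrary:

#include <CL/cl.h>

/* Launch a 1-D kernel over n items and return its execution time in milliseconds.
   The global size is rounded up to a multiple of the local size, so the kernel
   itself should check get_global_id(0) < n before doing work. */
double run_and_time_ms(cl_command_queue queue, cl_kernel kernel, size_t n)
{
    size_t local  = 64;                                  /* a multiple of 64 keeps all lanes busy */
    size_t global = ((n + local - 1) / local) * local;   /* round up so global % local == 0 */

    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong t_start, t_end;                             /* device timestamps in nanoseconds */
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof t_start, &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof t_end,   &t_end,   NULL);
    clReleaseEvent(evt);

    return (t_end - t_start) * 1e-6;                     /* ns -> ms */
}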
When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
Please be more precise in this question, it's not clear what you're asking.

Why is the CPU slower for calculations than the GPU when only memory should matter?

A modern CPU has an ethash hashrate of under 1 MH/s (source: https://ethereum.stackexchange.com/questions/2325/is-cpu-mining-even-worth-the-ether ) while GPUs easily mine at over 20 MH/s. With overclocked memory they reach rates of up to 30 MH/s.
GPUs have GDDR memory with clock rates of about 1000 MHz, while DDR4 runs at higher clock speeds. The bandwidth of DDR4 also seems to be higher (sources: http://www.corsair.com/en-eu/blog/2014/september/ddr3_vs_ddr4_synthetic and https://en.wikipedia.org/wiki/GDDR5_SDRAM )
It is said that for Dagger-Hashimoto/ethash, memory bandwidth is what matters (which also matches what people see when overclocking GPUs). I find that reasonable, since the CPU/GPU only has to do 2 SHA-3 operations (1x Keccak-256 + 1x Keccak-512) per hash (source: https://github.com/ethereum/wiki/wiki/Ethash#main-loop ).
A modern Skylake processor can compute over 100M Keccak-512 operations per second (see here: https://www.cryptopp.com/benchmarks.html ), so the core-count difference between GPUs and CPUs should not be the problem.
So why don't we get roughly ~50 MH/s from the 2x Keccak operations plus memory loads on a CPU?
See http://www.nvidia.com/object/what-is-gpu-computing.html for an overview of the differences between CPU and GPU programming.
In short, a CPU has a very small number of cores, each of which can do different things, and each of which can handle very complex logic.
A GPU has thousands of cores that operate pretty much in lockstep, but can only handle simple logic.
Therefore the overall processing throughput of a GPU can be massively higher. But it isn't easy to move logic from the CPU to the GPU.
If you want to dive in deeper and actually write code for both, one good starting place is https://devblogs.nvidia.com/gpu-computing-julia-programming-language/.
"A modern Skylake processor can compute over 100M of Keccak512 operations per second" is incorrect, it is 140 MiB/s. That is MiBs per second and a hash operation is more than 1 byte, you need to divide the 140 MiB/s by the number of bytes being hashed.
I found an article addressing my problem (the influence of memory on the algorithm).
It's not only the computation problem (mentioned here: https://stackoverflow.com/a/48687460/2298744 ); it's also the memory bandwidth that bottlenecks the CPU.
As described in the article, every hash fetches 8 KB of data for the calculation. This results in the following formula:
(memory bandwidth) / (DAG memory fetched per hash) = max theoretical hashrate
(memory bandwidth) / (8 KB per hash) = max theoretical hashrate
For a graphics card like the RX 470 mentioned in the article this gives:
(211 GB/s) / (8 KB/hash) = ~26 MH/s
While for a CPU with DDR4 this gives:
(12.8 GB/s) / (8 KB/hash) = ~1.6 MH/s
or (depending on the clock speed of the RAM)
(25.6 GB/s) / (8 KB/hash) = ~3.2 MH/s
To sum up, a CPU (or even a GPU) using DDR4 RAM could not get more than about 3.2 MH/s, since it cannot fetch the data it needs for processing fast enough.
Source:
https://www.vijaypradeep.com/blog/2017-04-28-ethereums-memory-hardness-explained/
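The same formula as a small illustrative C helper (assuming 8 KB = 8192 bytes per hash, as in the article; the exact results differ slightly from the rounded figures above):

#include <stdio.h>

/* Max theoretical hashrate = memory bandwidth / DAG bytes fetched per hash. */
static double max_hashrate_mh(double bandwidth_gb_per_s, double bytes_per_hash)
{
    return bandwidth_gb_per_s * 1e9 / bytes_per_hash / 1e6;   /* result in MH/s */
}

int main(void)
{
    printf("RX 470, 211 GB/s:  %.1f MH/s\n", max_hashrate_mh(211.0, 8192.0)); /* ~25.8 */
    printf("DDR4,  12.8 GB/s:  %.1f MH/s\n", max_hashrate_mh(12.8, 8192.0));  /* ~1.6  */
    printf("DDR4,  25.6 GB/s:  %.1f MH/s\n", max_hashrate_mh(25.6, 8192.0));  /* ~3.1  */
    return 0;
}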

How can I force an L2 cache miss?

I want to study the effects of L2 cache misses on CPU power consumption. To measure this, I have to create benchmarks that gradually increase the working-set size such that core activity (micro-operations executed per cycle) and L2 activity (L2 requests per cycle) remain constant, while the ratio of L2 misses to L2 requests increases.
Can anyone show me an example of a C program which forces "N" L2 cache misses?
You can generally force cache misses at some cache level by randomly accessing a working set larger than that cache level [1].
You would expect the probability of any given load being a miss to be roughly p(hit) = min(1, C / W) and p(miss) = 1 - p(hit), where p(hit) and p(miss) are the probabilities of a hit and a miss, C is the relevant cache size, and W is the working-set size. So for a miss rate of 50%, use a working set of twice the cache size.
A quick look at the formula above shows that p(miss) will never reach 100%, since C/W only goes to 0 as W goes to infinity (and you probably can't afford an infinite amount of RAM). So your options are:
Getting "close enough" by using a very large working set (e.g., 4 GB gives you a 99%+ miss chance for a 256 KB), and pretending you have a miss rate of 100%.
Applying the formula to determine the actual expected number of misses. E.g., if you are using a working size of 2560 KB against an L2 cache of 256 KB, you have a miss rate of 90%. So if you want to examine the effect of 1,000 misses, you should make 1000 / 0.9 = ~1111 memory access to get about 1,000 misses.
Use any approximate approach but then actually count the number of misses you incur using the performance counter units on your CPU. For example, on Linux you could use PAPI or on Linux and Windows you could use Intel's PCM (if you are using Intel hardware).
Use an "almost random" approach to force the number of misses you want. The formula above is valid for random accesses, but if you choose you access pattern so that it is random with the caveat that it doesn't repeat "recent" accesses, you can get a 100% miss ratio. Here "recent" means accesses to cache lines that are likely to still be in the cache. Calculating what that means exactly is tricky, and depends in detail on the associativity and replacement algorithm of the cache, but if you don't repeat any access that has occurred in the last cache_size * 10 accesses, you should be pretty safe.
As for the C code, you should at least show us what you've tried. A basic outline is to create a vector of bytes or ints or whatever with the required size, then randomly access that vector. If you make each access dependent on the previous access (e.g., use the integer you read to calculate the index of the next read) you will also get a rough measurement of the latency of that level of cache. If the accesses are independent, you'll probably have several outstanding misses to the cache at once and get more misses per unit time. Which one you are interested in depends on what you are studying.
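Here is a minimal sketch of that outline under some arbitrary assumptions (a 16 MiB working set, 10 million accesses, timing with clock(), and the C library rand() as the shuffle source). It builds a random single-cycle permutation and chases it, so every load depends on the previous one and most of them should miss in a typical L2:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define WORKING_SET (16u * 1024 * 1024)            /* 16 MiB, assumed much larger than L2 */
#define N (WORKING_SET / sizeof(size_t))

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Build a random single-cycle permutation (Sattolo's algorithm) so the chase
       never revisits an element early. Use a better RNG than rand() for real runs. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    srand((unsigned)time(NULL));
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;             /* j in [0, i) */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    /* Dependent loads: each index comes from the previous load, so misses serialize
       and the time per access approximates the miss latency. */
    size_t idx = 0;
    const size_t accesses = 10 * 1000 * 1000;
    clock_t t0 = clock();
    for (size_t k = 0; k < accesses; k++) idx = next[idx];
    clock_t t1 = clock();

    printf("avg time per access: %.1f ns (idx = %zu)\n",
           1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)accesses, idx);
    free(next);
    return 0;
}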
For an open source project that does this kind of memory testing across different stride and working set sizes, take a look at TinyMemBench.
[1] This gets a bit trickier for cache levels that are shared among cores (usually L3 for recent Intel chips, for example), but it should work well if your machine is pretty quiet while testing.

How much memory do I need to have for 100 million records

How much memory do I need to load 100 million records into memory? Suppose each record needs 7 bytes. Here is my calculation:
each record = <int> <short> <byte>
4 + 2 + 1 = 7 bytes
needed memory in GB = 7 * 100,000,000 / 1,000,000,000 = 0.7 GB
Do you see any problem with this calculation?
With 100,000,000 records, you need to allow for overhead. Exactly what and how much overhead you'll have will depend on the language.
In C/C++, for example, fields in a structure or class are aligned onto specific boundaries. Details may vary depending on the compiler, but in general an int must begin at an address that is a multiple of 4, a short at a multiple of 2, and a char can begin anywhere.
So assuming that your 4+2+1 means an int, a short, and a char, then if you arrange them in that order the structure holds 7 bytes of data, but the next instance of the structure must begin at (at least) a 4-byte boundary, so each record gets a pad byte at the end. I think, in fact, some compilers align each struct as a whole to an 8-byte boundary, though in this case that doesn't matter.
Every time you allocate memory there is also some overhead for the allocation block: the allocator has to keep track of how much memory was allocated and sometimes where the next block is. If you allocate all 100,000,000 records as one big "new" or "malloc", this overhead is trivial. But if you allocate each one individually, then each record pays the overhead. Exactly how much that is depends on the runtime, but on one system I used it was 8 bytes per allocation. If that's the case, then here you'd need 16 bytes for each record: 8 bytes for the block header, 7 for data, 1 for padding. So it could easily take double what you expect.
Other languages will have different overhead. The easiest thing to do is probably to find out empirically: Look up what the system call is to find out how much memory you're using, then check this value, allocate a million instances, check it again and see the difference.
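A quick way to see the padding and the effect of one big allocation is a short C sketch (field names are arbitrary; the sizes are the ones from the question):

#include <stdio.h>
#include <stdlib.h>

struct record {
    int   a;   /* 4 bytes */
    short b;   /* 2 bytes */
    char  c;   /* 1 byte  */
};             /* typically sizeof == 8: 7 bytes of data plus 1 byte of tail padding */

int main(void)
{
    printf("sizeof(struct record) = %zu bytes\n", sizeof(struct record));

    /* One big allocation: you pay only for padding, not for per-record allocator headers. */
    size_t n = 100000000;                          /* 100 million records */
    struct record *all = malloc(n * sizeof *all);
    if (!all) { perror("malloc"); return 1; }
    printf("array of 100M records: %.2f GB\n", n * sizeof *all / 1e9);

    free(all);
    return 0;
}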
If you really need just 7 bytes per structure, then you are almost right.
For memory measurements, we usually use the factor of 1024, so you would need
700,000,000 / 1024² = 667.57 MiB ≈ 0.652 GiB

Resources