I would like to know if current CPUs avoid multiplying two numbers when at least one of them is zero. Thanks
This varies widely depending on the CPU and (in some cases) the type(s) of the operands.
Older/simpler CPUs typically use a multiplication algorithm something like this:
integer operator*(integer const &other) {
    unsigned temp1 = other.value;   // multiplier
    unsigned temp2 = value;         // multiplicand
    unsigned answer = 0;

    while (temp1 != 0) {
        if (temp1 & 1)              // if the current multiplier bit is set...
            answer += temp2;        // ...add the shifted multiplicand
        temp2 <<= 1;
        temp1 >>= 1;
    }
    return integer(answer);
}
Since the loop executes only when/if temp1 != 0, the loop obviously won't execute if temp1 starts out as 0 (but as written here, won't attempt any optimization for the other operand being 0).
That, however, is fundamentally a one-bit-at-a-time algorithm. For example, when multiplying 32-bit operands, if each bit has a 50:50 chance of being set, the loop runs once per significant bit of the multiplier, and roughly half of those iterations (about 16 on average) actually perform an addition.
A newer, high-end CPU will typically work with at least two bits at a time, and perhaps even more than that. Instead of a single piece of hardware executing multiple iterations, it'll typically pipeline the operation with separate (albeit, essentially identical) hardware for each stage of the multiplication (though these won't normally show up as separate stages on a normal pipeline diagram for the processor).
This means the execution will have the same latency (and throughput) regardless of the operands: on average it improves latency a little and throughput a lot, but every multiplication takes the same time, no matter what the operands are.
I would expect that a modern desktop CPU would have such a thing in it.
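To make the "more than one bit at a time" idea concrete, here is a sketch (my own illustration, not anything a CPU literally runs) of a radix-4 variant of the loop above, which consumes two multiplier bits per step:

// Sketch only: a radix-4 (two-bits-per-step) version of the shift-and-add loop
// above; real hardware uses dedicated pipelined stages rather than a software
// loop, but this shows why the number of steps halves.
unsigned mul_radix4(unsigned multiplier, unsigned multiplicand) {
    unsigned answer = 0;
    while (multiplier != 0) {
        switch (multiplier & 3) {                  // examine two multiplier bits at once
            case 1: answer += multiplicand;                       break;
            case 2: answer += multiplicand << 1;                  break;
            case 3: answer += multiplicand + (multiplicand << 1); break;  // 3x = x + 2x
        }
        multiplicand <<= 2;                        // advance two bit positions
        multiplier >>= 2;                          // consume two multiplier bits
    }
    return answer;
}

Hardware versions unroll and pipeline these steps rather than looping, which is what gives the fixed latency described above.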
Using this question as a basis, I implemented a pseudo-random number generator with a global state:
__global uint global_random_state;

void set_random_seed(uint seed) {
    global_random_state = seed;
}

uint get_random_number(uint range) {
    uint seed = global_random_state + get_global_id(0);
    uint t = seed ^ (seed << 11);
    uint result = seed ^ (seed >> 19) ^ (t ^ (t >> 8));
    global_random_state = result; /* race condition? */
    return result % range;
}
Since these functions will be used from multiple threads, there will be a race condition present when writing to global_random_state.
This might actually help the system to be more unpredictable, so it seems like a good thing, but I'd like to know if there are any consequences to this that might not surface immediately. Are there any side-effects inside the GPU which might cause problems later on when the kernel is run?
In theory you want atom_cmpxchg for correctness here (or the equivalent on your GPGPU platform). However, a grave note of warning: having the entire machine serialize through a single cacheline is going to strangle your performance fundamentally. Atomics on the same address must form a queue and wait. Atomics on different locations can parallelize (more details at the end).
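For reference, a sketch of what that compare-exchange retry loop could look like in OpenCL C; the helper name is mine, it assumes the cl_khr_global_int32_base_atomics extension (atom_cmpxchg), and the serialization warning above still applies in full:

/* Sketch: atomically-correct update of a single global state word. */
uint get_random_number_atomic(volatile __global uint *state, uint range) {
    uint old_val, new_val;
    do {
        old_val = *state;
        uint seed = old_val + (uint)get_global_id(0);
        uint t = seed ^ (seed << 11);
        new_val = seed ^ (seed >> 19) ^ (t ^ (t >> 8));
        /* retry if another work item changed the state in the meantime */
    } while (atom_cmpxchg(state, old_val, new_val) != old_val);
    return new_val % range;
}

Every work item that hits this loop queues up on the same cacheline, which is exactly the serialization described above.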
Generally, algorithms that leverage random variables on GPGPU keep their own copy of the random variable generator. This enables each work item to cache and potentially reuse its own random state without glutting the bus with memory traffic on every new random number. Search for "OpenCL Monte Carlo Simulation" or "OpenCL Monte Carlo Example" for samples. CUDA has some nice examples too.
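A minimal sketch of that per-work-item approach, reusing the xorshift step from the question (the buffer name and the host-side seeding are assumptions, not a specific library's API):

/* Sketch: one state word per work item, seeded by the host, so no two
   work items ever write the same location and no atomics are needed. */
uint next_random(__global uint *states, uint range) {
    size_t gid = get_global_id(0);
    uint seed = states[gid];
    uint t = seed ^ (seed << 11);
    uint result = seed ^ (seed >> 19) ^ (t ^ (t >> 8));
    states[gid] = result;   /* private slot: no race condition */
    return result % range;
}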
Another option is to use a random generator that allows one to skip ahead, and have different work items move forward in the sequence by different amounts. This can be more compute intensive, but the tradeoff is that you don't strain the memory hierarchy as much.
More gory details on atomics:
1. GPU cache atomics are designed to expect contiguous arrays, and the atomic ALUs are per bank.
2. Each dword in a cacheline will be processed by the same atomic ALU each time.
3. Neighboring cachelines will hash to different banks.
So, if every clock you are doing atomics on contiguous cachelines of data, the work should be perfectly spread out (or statistically so). Conversely, if every work item atomically modifies the same 32b location, the cache system cannot spread the 16/32/64-wide operation (whatever width your hardware uses) across different atomic ALU slots; it must break it up into 16/32/64 separate atomic operations and apply them iteratively (by #2 above). In a system with 512 ALUs to process atomics, you would be using 1 of those ALUs each clock (the same one). Spread the work out and you can use all 512 per clock.
I am trying to understand the architecture of GPUs and how we assess the performance of our programs on the GPU. I know that an application can be:
Compute-bound: performance limited by the FLOPS rate. The processor’s cores are fully utilized (always have work to do)
Memory-bound: performance limited by the memory bandwidth. The processor's cores are frequently idle because memory cannot supply data fast enough
The image below shows the FLOPS rate, peak memory bandwidth, and the desired compute-to-memory ratio, labeled (OP/B), for each microarchitecture.
I also have an example of how to compute this OP/B metric. Below is part of a CUDA code for matrix-matrix multiplication:
for (unsigned int i = 0; i < N; ++i) {
    sum += A[row*N + i] * B[i*N + col];
}
and the way to calculate OP/B for this matrix-matrix multiplication is as follows:
Matrix multiplication performs 0.25 OP/B
1 FP add and 1 FP mul for every 2 FP values (8B) loaded
Ignoring stores
and if we want to exploit the potential for reuse:
But matrix multiplication has high potential for reuse. For NxN matrices:
Data loaded: (2 input matrices)×(N^2 values)×(4 B) = 8N^2 B
Operations: (N^2 dot products)(N adds + N muls each) = 2N^3 OP
Potential compute-to-memory ratio: 0.25N OP/B
So, if I understand this correctly, I have the following questions:
Is it always the case that the greater the OP/B, the better?
How do we know how many FP operations we have? Is it the adds and the multiplications?
How do we know how many bytes are loaded per FP operation?
Is it always the case that the greater the OP/B, the better?
Not always. The target value balances the load between compute-pipe throughput and memory-pipe throughput (i.e. at that level of op/byte, both pipes will be fully loaded). As you increase op/byte beyond that level, your code switches from balanced to compute-bound. Once your code is compute-bound, performance is dictated by the compute pipe, which is the limiting factor. Additional op/byte increases beyond that point may have no effect on performance.
How do we know how many FP operations we have? Is it the adds and the multiplications?
Yes, for the simple code you have shown, it is the adds and multiplies. Other more complicated codes may have other factors (e.g. sin, cos, etc.) which may also contribute.
As an alternative to "manually counting" the FP operations, the GPU profilers can indicate the number of FP ops that a code has executed.
How do we know how many bytes are loaded per FP operation?
Similar to the previous question, for simple codes you can "manually count". For complex codes you may wish to try to use profiler capabilities to estimate. For the code you have shown:
sum += A[row*N + i]*B[i*N + col];
The values from A and B have to be loaded. If they are float quantities then they are 4 bytes each. That is a total of 8 bytes. That line of code will require 1 floating point multiplication (A * B) and one floating point add operation (sum +=). The compiler will fuse these into a single instruction (fused multiply-add) but the net effect is you are performing two floating point operations per 8 bytes. op/byte is 2/8 = 1/4. The loop does not change the ratio in this case. To increase this number, you would want to explore various optimization methods, such as a tiled shared-memory matrix multiply, or just use CUBLAS.
(Operations like row*N + i are integer arithmetic and don't contribute to the floating-point load, although it's possible they may be significant, performance-wise.)
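To illustrate the reuse argument, here is a hedged sketch of the tiled shared-memory variant mentioned above; it is the standard textbook pattern rather than code from the question, and it assumes N is a multiple of the tile size and a matching launch grid:

#define TILE 16

// Sketch: each block stages TILE x TILE sub-tiles of A and B in shared memory,
// so every value loaded from global memory is reused TILE times.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {               // assumes N % TILE == 0
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                               // wait until the tile is loaded

        for (int i = 0; i < TILE; ++i)
            sum += As[threadIdx.y][i] * Bs[i][threadIdx.x];
        __syncthreads();                               // wait before overwriting the tile
    }
    C[row * N + col] = sum;
}

Each value brought into shared memory is now used TILE times, so the op/byte ratio grows roughly with the tile size, which is the same mechanism behind the 0.25N potential quoted above.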
Let's imagine we have a software developer whose goal is to achieve the absolute maximum of a CPU's performance.
In today's CPUs we have many cores, we can load data into cache for faster processing, and we also have SIMD instructions (AVX, for example) that let us sum/multiply/do other operations on an array of items (multiply 8 integers per CPU clock). The disadvantage of these instructions is the cost of sending data and instructions to the SIMD unit, plus the overhead of converting between vector types and primitive types (sorry, I am familiar only with C#'s Vector). (We are not looking at code complexity for now.)
As far as I understand, while we are using SIMD, the CPU's main registers are used only for sending and receiving data to the SIMD registers, and the main ALU blocks used for general-purpose calculations are idle at this time.
And here is my question: will using SIMD instructions also load the main CPU blocks? For example, if we have a huge amount of different calculations (let's imagine 40% of them run best on SIMD and 60% of them run better as usual scalar code), will SIMD let us gain performance like this: 100% of all cores' performance + n% of SIMD's boost?
I'm asking because, for example, with GPGPU we can use the GPU for parallel calculations while the CPU is used only for sending and receiving data, so it is idle most of the time and we can use its capacity for latency-sensitive tasks.
Looks like this is a question about out-of-order execution? Modern x64 CPUs have a number of execution ports, and each can dispatch a new instruction per clock cycle (so roughly 8 operations can be issued in parallel on an Intel Skylake). Some of those ports handle memory loads/stores, some handle integer arithmetic, and some handle the SIMD instructions.
So, for example, you may be able to dispatch 2 AVX float multiplies, an AVX bitwise op, 2 AVX loads, a single AVX store, and a couple of bits of pointer arithmetic on the general purpose registers in a single cycle [you will have to wait for the operation to complete - the latency]. So in theory, as long as there aren't horrific dependency chains in the code, with some care you should be able to keep each of those ports busy (or at least, that's the basic aim!).
Simple Rule 1: The busier you can keep the execution ports, the faster your code goes. This should be self evident. If you can keep 8 ports busy, you're doing 8 times more than if you can only keep 1 busy. In general though, it's mostly not worth worrying about (yes, there are always exceptions to the rule)
Simple Rule 2: When the SIMD execution ports are in use, the ALU doesn't suddenly become idle [A slight terminology error on your part here: The ALU is simply the bit of the CPU that does arithmetic. The computation for general purpose ops is done on an ALU, but it's also correct to call a SIMD unit an ALU. What you meant to ask is: do the general purpose parts of the CPU power down when SIMD units are in use? To which the answer is no... ]. Consider this AVX2 optimised method (which does nothing interesting!)
#include <immintrin.h>
typedef __m256 float8;
#define mul8f _mm256_mul_ps
void computeThing(float8 a[], float8 b[], float8 c[], int count)
{
    for (int i = 0; i < count; ++i)
    {
        a[i] = mul8f(a[i], b[i]);
        b[i] = mul8f(b[i], c[i]);
    }
}
Since there are no dependencies between a, b, and c (which I should really be explicit about by specifying __restrict), then the two SIMD multiply instructions can both be dispatched in a single clock cycle (since there are two execution ports that can handle floating point multiply).
The General Purpose ALU doesn't suddenly power down here - The general purpose registers & instructions are still being used!
1. to compute memory addresses (for: a[i], b[i], c[i])
2. to load/store into those memory locations
3. to increment the loop counter
4. to test if the count has been reached?
It just so happens that we are also making use of the SIMD units to do a couple of multiplications...
Simple Rule 3: For floating point operations, using 'float' or '__m256' makes next to no difference. The CPU hardware used to compute float and float8 types is exactly the same; there are simply a couple of bits in the machine-code encoding that specify the choice between float/__m128/__m256.
i.e. https://godbolt.org/z/xTcLrf
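For comparison, here is a scalar sketch of the same routine (my rewrite, operating on plain floats rather than float8); compiled for x64 it exercises the same floating-point execution ports as the AVX version, just one lane at a time instead of eight:

// Scalar sketch of the same computation, one float per operation.
void computeThingScalar(float a[], float b[], float c[], int count)
{
    for (int i = 0; i < count; ++i)
    {
        a[i] = a[i] * b[i];
        b[i] = b[i] * c[i];
    }
}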
I am having a little trouble and am asking for a hint. I am on the Windows platform, doing calculations in the following manner:
int input = 0;
int output; // junk bytes here

while (true) {
    async_enqueue_upload(input);    // completes instantly, but transfer will take 10us
    async_enqueue_calculate();      // completes instantly, but computation will take 80us
    async_enqueue_download(output); // completes instantly, but transfer will take 10us
    sync_wait_finish();             // must wait while output is fully calculated, and there is no junk
    input = process(output);        // I cannot launch the next step without doing it on the host.
}
I am asking about the wait_finish() part. I must wait for all devices to finish so I can combine all results, process the data, and upload a new portion that is based on the previous computation step. I need to sync data between each step, so I can't parallelize the steps. I know this is not a very performant setup. So let's proceed to the question.
I have 2 ways of checking for completion within wait_finish(). The first is to put the thread to sleep until it is woken by a completion event:
while (!is_completed())
    Sleep(1);
This has very low performance, because the actual calculation takes, say, 100us, while the minimal Windows scheduler timestep is 1ms, so it gives an unacceptable 10x performance loss.
The second way is to check for completion in an empty infinite loop:
while (!is_completed())
    {} // do_nothing();
This gives the full 10x better computation performance, but it is also an unsuitable solution, because it keeps a CPU core fully utilised doing absolutely useless work. How can I make the CPU "sleep" for exactly the time I need? (Each step has an equal amount of work.)
How is this case usually solved, when the calculation time is too long for an active spin-wait but too short compared to the scheduler timestep? A related subquestion: how do you do this on Linux?
Fortunately, I have succeeded in finding an answer on my own. In short: I should use Linux for this.
My investigation shows the following. On Windows there is a hidden function in ntdll, NtDelayExecution(). It is not exposed through the SDK, but it can be loaded in the following manner:
static int (__stdcall *NtDelayExecution)(BOOL Alertable, PLARGE_INTEGER DelayInterval) =
    (int (__stdcall *)(BOOL, PLARGE_INTEGER))
        GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtDelayExecution");
It allows setting sleep intervals in 100ns units. However, even that did not work well, as shown in the following benchmark:
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS); // requires Admin privileges
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

uint64_t hpf = qpf(); // QueryPerformanceFrequency()
uint64_t s0 = qpc();  // QueryPerformanceCounter()
uint64_t n = 0;
while (1) {
    sleep_precise(1); // NtDelayExecution(-1); waits one 100-nanosecond interval
    auto s1 = qpc();
    n++;
    auto passed = s1 - s0;
    if (passed >= hpf) { // print the loop rate once per second
        std::cout << "freq=" << (n * hpf / passed) << " hz\n";
        s0 = s1;
        n = 0;
    }
}
That yields a loop rate of something less than 2000 Hz, and the result varies from line to line. That led me to the Windows thread-switching scheduler, which is totally unsuited for real-time tasks, and its minimum interval of 0.5ms (+ overhead). By the way, does anyone know how to tune that value?
Next was the Linux question: what can it do? I built a custom tiny 4.14 kernel with Buildroot and tested the benchmark code there. I replaced qpc() to return clock_gettime() data with the CLOCK_MONOTONIC clock, qpf() just returns the number of nanoseconds in a second, and sleep_precise() just calls clock_nanosleep(). I failed to find out what the difference is between CLOCK_MONOTONIC and CLOCK_REALTIME.
And I was quite surprised to get a whopping 18.4 kHz frequency just out of the box, and it was quite stable. Testing several intervals, I found that I can set the loop to almost any frequency up to 18.4 kHz, but also that the actual measured wait time differs from what I asked for by about 1.6x. For example, if I ask to sleep 100 us it actually sleeps for ~160 us, giving a ~6.25 kHz frequency. Nothing else is running on the system, just the kernel, BusyBox and this test. I am not an experienced Linux user, and I am still wondering how I can tune this to be more real-time and deterministic. Can I push that maximum frequency even higher?
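For reference, a minimal sketch of the Linux version of the benchmark described above, assuming POSIX clock_gettime()/clock_nanosleep(); it mirrors the structure of my Windows loop rather than reproducing it exactly:

#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Nanoseconds from the monotonic clock (plays the role of qpc()/qpf()). */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void) {
    struct timespec delay = { 0, 100 };       /* ask for a 100 ns relative sleep */
    uint64_t s0 = now_ns();
    uint64_t n = 0;
    while (1) {
        clock_nanosleep(CLOCK_MONOTONIC, 0, &delay, NULL);
        uint64_t s1 = now_ns();
        n++;
        uint64_t passed = s1 - s0;
        if (passed >= 1000000000ull) {        /* print the loop rate once per second */
            printf("freq=%llu hz\n", (unsigned long long)(n * 1000000000ull / passed));
            s0 = s1;
            n = 0;
        }
    }
}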
I am preparing for an exam and am doing some exercises without an answer key. I have been given this code and am wondering whether I have correctly turned it into SIMD-style code.
The code
int A[100000];
int B[100000];
int C=0;
for int(i=0; i < 100000; i++)
C += A[i] * B[i];
Since there is no remainder, we don't need to take care of it. We also assume a 128-bit register, which can therefore hold 4 single-precision floating-point values.
My result - using SIMD
int A[100000];
int B[100000];
int C=0;
for int(i=0; i < 100000/4; i += 4)
C += A[i] * B[i];
C += A[i+1] * B[i+1];
C += A[i+2] * B[i+2];
C += A[i+3] * B[i+3];
What advantages can you see for using SIMD instructions instead of writing programs with multiple threads?
Assuming the omitted curly braces around your second loop body and the malformed for statement are simply typos, and setting aside the fact that you ask about multiplying floats while your code shows arrays of ints, this won't get great vectorisation even if the compiler sees it. While the compiler might do the loads of 4 values from A and B as a single instruction each, and do the 4 multiplies in one instruction, your code forces the compiler to extract each of the 4 products and sum them sequentially, and getting individual values out of a SIMD register is typically quite slow.
If on the other hand you did this
float A[100000];
float B[100000];
float C0 = 0, C1 = 0, C2 = 0, C3 = 0;
for (size_t i = 0; i < 100000; i += 4)
{
    C0 += A[i+0] * B[i+0];
    C1 += A[i+1] * B[i+1];
    C2 += A[i+2] * B[i+2];
    C3 += A[i+3] * B[i+3];
}
float C = (C0 + C1) + (C2 + C3);
Then a good compiler could vectorise this, as it now sees that within each loop iteration it loads two SIMD registers, multiplies them, and adds the result to a SIMD register of running sums, only extracting those 4 sums and adding them together at the end.
A vectorising compiler can do this with SIMD without changing the order of evaluation of the individual sums (FP maths is NOT associative). The compiler is typically not allowed to reorder FP maths for this reason (not without extra flags that let it technically breach the language standard), so the code above can be represented exactly by SIMD instructions and will run much faster (in fact I'd unroll the loop a further stage, as the multiplication will be a bottleneck as it stands).
This is sort of the trick with SIMD: you have to understand how the operation would best be implemented with vector instructions, write your code so it executes the same sequence of operations, and hope the compiler spots what you've done.
Or you can write the vector instructions yourself with intrinsics, or use OpenMP or similar to tell the compiler more explicitly what to do.
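For example, a hedged SSE-intrinsics sketch of the same four-accumulator dot product (the function name is mine; it assumes, as the question does, that the element count is a multiple of 4) might look like this:

#include <immintrin.h>

// Sketch: four running sums kept in one SSE register, reduced once at the end.
float dot_sse(const float *A, const float *B, size_t n /* multiple of 4 */)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
    {
        __m128 a = _mm_loadu_ps(A + i);            // load 4 floats from A
        __m128 b = _mm_loadu_ps(B + i);            // load 4 floats from B
        acc = _mm_add_ps(acc, _mm_mul_ps(a, b));   // acc[k] += A[i+k] * B[i+k]
    }
    float partial[4];
    _mm_storeu_ps(partial, acc);                   // extract the 4 partial sums once
    return (partial[0] + partial[1]) + (partial[2] + partial[3]);
}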
Amongst the advantages of SIMD over threads for such an operation is the fact that you're making use of more of the silicon within a single core... so you're not preventing another thread from getting cycles. On our compute grid we typically run many single threaded processes on any one machine to keep all the cores busy at all times... in such a case doing this sum using more cores is a false economy, you'd simply be stealing cycles that another thread could usefully be running another job.
Yes, the provided code should compile into SIMD instructions with capable CPUs and compilers.
On vector-capable processors, SIMD exposes hardware features that greatly accelerate identical, parallel computations. For instance, SIMD typically makes better use of the cache on a single core due to streaming RAM access, assuming the data being processed is localized in contiguous areas of memory. Using multiprocessing, cache competition and other synchronization overhead could actually reduce performance as the various cores attempt to write data simultaneously. This is in addition to the intrinsic boost on von-Neumann machines from only having to read one, not four, separate instructions from the shared system memory.
The logic to do these arithmetic operations in parallel is always present, but requires specific SIMD instructions to utilize. As a result, SIMD tends to be used in hot loops where hand tuning makes overall optimization sense.