Let's say I have two variables.
x = 1
y = 2
The end result should be:
x = 2
y = 1
I thought about the following ways to do so:
temp = x // clone x
x = y
y = temp
or (XOR swap)
x = x XOR y
y = x XOR y
x = y XOR x
I'd like to get an answer in terms of low-level details (memory, registers, and so on).
What is the fastest way to do so?
Note:
I would like a bonus answer: hypothetically, ignoring side effects (of the code and of the CPU), which way is the fastest, and are there any other, faster ones?
The problem is that modern CPU architectures will not let you get this answer. They will hide many effects and will expose many very subtle effects.
If you have the values in CPU registers and you have a spare register, then the temp way is either the fastest way, or the way which consumes the least power.
Using the XOR or the +/- (very neat by the way!) method is for situations where you cannot afford to have an extra location (extra memory variable or extra register). This might seem strange but inside a C preprocessor macro one cannot (easily) declare new variables for example.
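For illustration, here's a minimal sketch of that situation as a C macro (the name XOR_SWAP and the do/while wrapper are my own, and it assumes integer lvalues that do not alias each other):

/* Hypothetical sketch: an XOR swap as a C macro, needing no temporary. */
/* Assumes integer lvalues; silently zeroes both if a and b alias. */
#define XOR_SWAP(a, b) do { (a) ^= (b); (b) ^= (a); (a) ^= (b); } while (0)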
When the variables are in memory, all variants are very likely to behave the same on any high-performance CPU. Even if the compiler does not optimize the code, the CPU will avoid virtually all memory accesses and make them as fast as register accesses.
In total I am inclined to say: Don't worry about the speed of this. It is unimportant to optimize at this level. Try to avoid the swap altogether, this will be the fastest!
http://en.wikipedia.org/wiki/XOR_swap_algorithm
Most modern compilers can optimize away the temporary variable in the naive swap, in which case the naive swap uses the same amount of memory and the same number of registers as the XOR swap and is at least as fast, and often faster. The XOR swap is also much less readable and completely opaque to anyone unfamiliar with the technique. On modern CPU architectures, the XOR technique is considerably slower than using a temporary variable to do swapping. One reason is that modern CPUs strive to execute instructions in parallel via instruction pipelines. In the XOR technique, the inputs to each operation depend on the results of the previous operation, so they must be executed in strictly sequential order.
Also see this question:
How fast is std::swap for integer types?
It's important to note that the XOR swap requires that you first check that the two variables do not reference the same memory location. If they did, you would end up setting it to zero.
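Here's a small runnable C sketch of that pitfall (the function name and the guard are mine):

#include <stdio.h>

/* Without the guard, xor_swap(&a, &a) would zero a: the first XOR
   turns the shared location into a ^ a == 0, and the value is lost. */
void xor_swap(int *x, int *y)
{
    if (x == y)     /* same location: nothing to swap, and XOR would zero it */
        return;
    *x ^= *y;
    *y ^= *x;
    *x ^= *y;
}

int main(void)
{
    int a = 5;
    xor_swap(&a, &a);
    printf("%d\n", a);   /* prints 5; without the guard it would print 0 */
    return 0;
}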
XOR swap isn't always the most efficient, since most modern CPU architectures try to parallelize instructions, but in the XOR swap each line depends on the previous result, so it can't be parallelized. With the temp-variable swap, most compilers will optimize the temporary variable away, so the naive way ends up running as fast or faster while using the same amount of memory.
Another swap alternative is:
x = x + y
y = x - y
x = x - y
Similarly, the arguments about efficiency and speed for the XOR swap apply here too.
EDIT: as hatchet said, the (+/-) approach can also cause overflow if not done carefully.
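As a hedged aside: in C the wraparound of unsigned arithmetic is well defined, so a sketch like the following avoids the overflow problem entirely (though, like the XOR variant, it still zeroes the value if the two locations alias):

/* Sketch: +/- swap on unsigned values, where wraparound is harmless. */
void add_sub_swap(unsigned *x, unsigned *y)
{
    *x = *x + *y;   /* may wrap; well defined for unsigned types */
    *y = *x - *y;   /* now holds the old *x */
    *x = *x - *y;   /* now holds the old *y */
}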
From the point of view of very low-level programming, how is the comparison between two numbers performed?
Using one byte, unsigned numbers 0, 1 and 255 are written:
0 -----> 00000000
1 -----> 00000001
255 ---> 11111111
Now, what happens during the comparison between these numbers?
Using my intuition as a human who has learned basic programming, I could imagine the following algorithm for the implementation of ==:
b = 0
while b < 8:
    if first_number[b] != second_number[b]:
        return False
    b += 1
return True
Basically this is like comparing each bit step by step, stopping before the end if two bits differ.
Thus we note that the comparison stops at the first iteration when comparing 0 and 255, while it stops at the last iteration when comparing 0 and 1.
The first comparison would be 8 times faster than the second.
In practice, I doubt that is the case. But is this theoretically true?
If not, how does the computer work?
A comparison between integers is typically implemented by the CPU as a subtraction, whose result's sign tells which number is bigger.
While a naive implementation of subtraction executes one bit at a time (because every bit needs to know the carry from the preceding one), typical implementations use a carry-lookahead circuit that allows several result bits to be calculated at the same time.
So, the answer is: no, every comparison takes almost the same time for every possible input.
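To make the subtraction idea concrete, here is a C sketch of a three-way compare built the way the hardware does it (the widening to 64 bits is my own precaution so the subtraction itself cannot overflow):

#include <stdint.h>

/* Sketch: compare by subtracting and inspecting the sign, as the ALU does.
   Returns -1, 0, or 1. */
int compare(int32_t a, int32_t b)
{
    int64_t d = (int64_t)a - (int64_t)b;
    return (d > 0) - (d < 0);
}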
Hardware is fundamentally different from the dominant programming paradigms in that all logic gates (or circuits in general) always do their work independently, in parallel, at all times. There is no such thing as "do this, then do that", only "do this here, feed the result into the circuit over there". If there's a circuit on the chip with input A and output B, then the circuit always, continuously, updates B in accordance with the current values of A — regardless of whether the result is needed right now "in the big picture".
Your pseudo code algorithm doesn't even begin to map to logic gates well. Instead, a comparator looks like this in Verilog (ignoring that there's a built-in == operator):
assign not_equal = (a[0] ^ b[0]) | (a[1] ^ b[1]) | ...;
Where each XOR is a separate logic gate and hence works independently from the others. The results are "reduced" with a logical or, i.e. the output is 1 if any of the XORs produces a 1 (this too does some work in parallel, but the critical path is longer than one gate). Furthermore, all these gates exist in silicon regardless of the specific bit values, and the signal has to propagate through about (1 + log w) gates for a w-bit integer. This propagation delay is again independent of the intermediate and final results.
On some CPU families, equality comparison is implemented by subtracting the two numbers and comparing the result to zero (using a circuit as described above), but the same principle applies. An adder/subtracter doesn't get slower or faster depending on the values.
Not to mention that instructions in a CPU can't take less than one clock cycle anyway, so even if the hardware would finish more quickly, the next instruction still wouldn't start until the next tick.
Now, some hardware operations can take a variable amount of time, but that's because they are state machines, i.e. sequential logic. Technically one could implement the moral equivalent of your algorithm with a state machine, but nobody does that, it's harder to implement than the naive, un-optimized combinatorial circuit above, and less efficient to boot.
State machine circuits are circuits with memory: They store their current state and always compute the outputs (depending on the current state) and the next state (depending on current state and inputs) each clock cycle. On some inputs they may go through N states until they produce an output, and N+x on other inputs. ALU operations generally don't do that though. Pipeline stalls, branch mispredictions, and cache misses are common reasons one instruction takes longer than usual in some circumstances. Properly reasoning about these in a way that helps programmers write faster code is hard though: You have to take into account all the tricks and quirks of real hardware, and there's a lot of those. Empirical evidence, i.e. benchmarking a real black box CPU, is vital.
When it gets down to the assembly the cmp instruction is used regardless of the contents of the variables.
So there is no performance difference.
I have seen that many people prefer to use in code:
while(i<1000000){
    ret+=a[i];
    i++;
    if(ret >= MOD)
        ret -= MOD;
}
instead of computing ret % MOD in the final step.
What is the difference between these two, and are they really equivalent?
How does this optimize the code?
Basically you can't tell without trying. There are two possible outcomes (considering my note further down below):
The compiler optimizes the code in some way that both solutions use either a conditional jump or a modulo operation. This does not only depend on how "bright" the compiler is, but it also has to consider the target architecture's available instruction set (but to be honest, it would be odd not having a modulo operation).
The compiler doesn't optimize the code (most probable for non-optimizing debug builds).
The basic difference is that - as mentioned already - the solution with the if() will use one conditional jump, which - again depending on your architecture - might slow you down a bit, since the compiler can't prefetch the next instruction without evaluating the jump condition first.
One further note:
Using a modulo operation and using your if() statement actually aren't equivalent (depending on the actual values), simply due to the fact that ret % MOD corresponds to the following code:
while (ret >= MOD)
    ret -= MOD;
Imagine a[i] being bigger than MOD and the new sum being bigger than two times MOD. In that case you'd end up with a ret bigger than MOD, something that won't happen when using modulo.
Take an example:
13 MOD 10
What it actually does is give you the remainder after dividing 13 by 10,
that is: 13 - (10 * (int)(13/10)) = 13 - (10 * 1) = 3
So if a[i] <= MOD then it will work fine. But if a[i] > MOD, then see what happens:
let a[] = {15, 15, 15}
MOD = 7
First step:
ret = 0 + 15 = 15
ret = 15 - 7 = 8
Second step:
ret = 8 + 15 = 23
ret = 23 - 7 = 16
Third step:
ret = 16 + 15 = 31
ret = 31 - 7 = 24
So your final result is 24, but it should be 3.
You have to do:
while (ret >= MOD)
    ret -= MOD;
if you want to use subtraction instead of mod.
And subtraction is indeed better than mod with respect to time, because mod means an integer division, which is really time consuming :(
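Putting the fixed version together as a runnable sketch (same numbers as above):

#include <stdio.h>

/* With while instead of if, the subtraction variant agrees with plain modulo. */
int main(void)
{
    const int MOD = 7;
    const int a[] = { 15, 15, 15 };
    int ret = 0;

    for (int i = 0; i < 3; i++) {
        ret += a[i];
        while (ret >= MOD)   /* a single if would leave ret at 24 */
            ret -= MOD;
    }
    printf("%d\n", ret);     /* prints 3, matching 45 % 7 */
    return 0;
}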
It is best not to try to optimise code unless you have a performance problem; then find out where it actually has the problems.
And to answer your question: the two are the same, but you need to check with your particular hardware/compiler.
The conditional test and subtraction is typically less expensive than a modulus operation, especially if the sum does not frequently exceed MOD. A modulus operation is effectively an integer division instruction, which typically has a latency which is an order of magnitude greater than that of compare/subtract. Having said that, unless the loop is a known performance bottleneck then you should just code for clarity and robustness.
Modulo requires integer division which is usually the slowest integer math operation on a CPU. Long ago before pipelines and branch prediction, this code was probably reliably faster than modulo. Nowadays branches can be very slow so its benefit is far from certain. If the values in a are always much smaller than MOD, it's probably still a win because the branch will be skipped most iterations and the branch predictor will mostly guess right. If they are not smaller, it's uncertain. You would need to benchmark both.
If you can write the program such that MOD is always a power of 2, you could use bit masking which is much faster than either.
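A sketch of the masking idea, assuming unsigned arithmetic and a power-of-two MOD (the function name is mine):

#include <stddef.h>

/* For unsigned x and a power-of-two mod, x % mod == x & (mod - 1),
   so the reduction is a single AND: no division, no branch. */
unsigned sum_mod_pow2(const unsigned a[], size_t n, unsigned mod)
{
    unsigned ret = 0;
    for (size_t i = 0; i < n; i++)
        ret = (ret + a[i]) & (mod - 1);   /* mod must be a power of two */
    return ret;
}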
If I saw this pattern in code that wasn't 1) from 1978 or 2) accompanied by a comment explaining how the author benchmarked it and found it was faster than modulo on the current compiler, typical user CPU, and a realistic data input, I'd roll my eyes hard.
Yes, both compute the same thing, but:
the % operation needs an integer division, which is more time-costly than - and if
but on modern parallel machines (meaning more pipelines, not more cores)
the CPU does more tasks at once unless they depend on each other or branching occurs
that is why on modern machines the % variant is usually faster (the if stalls the pipelines)
There are still platforms where the -=, if variant is faster,
like MCUs, so when you know you have just a single CPU/MCU pipeline
or a very slow division, then use this variant
you should always measure the resulting times during the optimization process
in your case you want to call just a single mod per whole loop, so it should be faster, but check the text below ...
Compilers
modern compilers optimize code for your target platform and usually detect this pattern and make the right choice
so you should not be consumed by low-level optimization instead of programming the task's functionality
but not all compilers are like that; for many platforms older compilers are still in use
also in some rare cases the optimizations are preferred to be turned off
because they could destroy specific desired timing, instruction patterns, or even the functionality of the task ...
in such cases there is no choice and this knowledge suddenly comes in handy
now the differences between your cases from the algorithmic side:
while(i<1000000){ ret+=a[i]; i++; if(ret>=MOD) ret-=MOD; }
the sub-result always stays near the modulo MOD
that means you do not need more bits than are needed for max(a[i]) + MOD*N, where N depends on a[i]
if sum(a[i]) would otherwise grow into bignums, then this variant is faster because there is no need to widen the sub-result
while(i<1000000){ ret+=a[i]; i++; } ret%=MOD;
this could overflow if the variable ret cannot hold the non-modulo result
while(i<1000000){ ret+=a[i]; i++; ret%=MOD; }
this is how it should be done for bigger non-modulo results
if (ret>=MOD) ret-=MOD; is not a modulo operation,
it is just one iteration of it.
Safer is while (ret>=MOD) ret-=MOD;
but if you know that the sub-result does not grow too much (so it will not overflow within a few iterations) then the if is OK
but in that case you should add a while or a modulo after the loop to ensure a correct result
I know not to use them, but there are techniques to swap two variables without using a third, such as
x ^= y;
y ^= x;
x ^= y;
and
x = x + y
y = x - y
x = x - y
In class the prof mentioned that these were popular 20 years ago when memory was very limited and are still used in high-performance applications today. Is this true? My understanding as to why it's pointless to use such techniques is that:
Using the third variable can never be the bottleneck.
The optimizer does this anyway.
So is there ever a good time to not swap with a third variable? Is it ever faster?
Compared to each other, is the method that uses XOR vs the method that uses +/- faster? Most architectures have a unit for addition/subtraction and XOR so wouldn't that mean they are all the same speed? Or just because a CPU has a unit for the operation doesn't mean they're all the same speed?
These techniques are still important to know for the programmers who write the firmware of your average washing machine or so. Lots of that kind of hardware still runs on Z80 CPUs or similar, often with no more than 4K of memory or so. Outside of that scene, knowing these kinds of algorithmic "trickery" has, as you say, as good as no real practical use.
(I do want to remark though that nonetheless, the programmers who remember and know this kind of stuff often turn out to be better programmers even for "regular" applications than their "peers" who won't bother. Precisely because the latter often take that attitude of "memory is big enough anyway" too far.)
There's no point to it at all. It is an attempt to demonstrate cleverness. Considering that it doesn't work in many cases (floating point, pointers, structs), is unreadable, and uses three dependent operations which will be much slower than just exchanging the values, it's absolutely pointless and demonstrates a failure to actually be clever.
You are right: if it were faster, optimising compilers would detect the pattern when two numbers are exchanged, and replace it; it's easy enough to do. But compilers do actually notice when you exchange two variables and may produce no code at all, instead just using the different variables from that point on. For example if you exchange x and y, then write a += x; b += y;, the compiler may just change this to a += y; b += x;. The XOR or add/subtract pattern, on the other hand, will not be recognised because it is so rare, and won't get improved.
Yes, there is, especially in assembly code.
Processors have only a limited number of registers. When the registers are pretty full, this trick can avoid spilling a register to another memory location (possibly in an unfetched cache line).
I've actually used the 3-way XOR to swap a register with a memory location in the critical path of high-performance hand-coded lock routines for x86, where the register pressure was high and there was no (lock-safe!) place to put the temp. (On the x86, it is useful to know that the XCHG instruction to memory has a high cost associated with it, because it includes its own lock, whose effect I did not want. Given that the x86 has a LOCK prefix opcode, this was really unnecessary, but historical mistakes are just that.)
Moral: every solution, no matter how ugly it looks standing in isolation, likely has some uses. It's good to know them; you can always not use them if inappropriate. And where they are useful, they can be very effective.
Such a construct can be useful on many members of the PIC series of microcontrollers which require that almost all operations go through a single accumulator ("working register") [note that while this can sometimes be a hindrance, the fact that it's only necessary for each instruction to encode one register address and a destination bit, rather than two register addresses, makes it possible for the PIC to have a much larger working set than other microcontrollers].
If the working register holds a value and it's necessary to swap its contents with those of RAM, the alternative to:
xorwf other,w ; w=(w ^ other)
xorwf other,f ; other=(w ^ other)
xorwf other,w ; w=(w ^ other)
would be
movwf temp1 ; temp1 = w
movf other,w ; w = other
movwf temp2 ; temp2 = w
movf temp1,w ; w = temp1 [old w]
movwf other ; other = w
movf temp2,w ; w = temp2 [old other]
Three instructions and no extra storage, versus six instructions and two extra registers.
Incidentally, another trick which can be helpful in cases where one wishes to make another register hold the maximum of its present value or W, and the value of W will not be needed afterward is
subwf other,w ; w = other-w
btfss STATUS,C ; Skip next instruction if carry set (other >= W)
subwf other,f ; other = other-w [i.e. other-(other-oldW), i.e. old W]
I'm not sure how many other processors have a subtract instruction but no non-destructive compare, but on such processors that trick can be a good one to know.
These tricks are not very likely to be useful if you want to exchange two whole words in memory or two whole registers. Still, you could take advantage of them if you have no free registers (or only one free register for a memory-to-memory swap) and there is no "exchange" instruction available (like when swapping two SSE registers in x86) or the "exchange" instruction is too expensive (like register-memory xchg in x86) and it is not possible to avoid the exchange or lower the register pressure.
But if your variables are two bitfields in single word, a modification of 3-XOR approach may be a good idea:
y = (x ^ (x >> d)) & mask
x = x ^ y ^ (y << d)
This snippet is from Knuth's "The Art of Computer Programming", Vol. 4A, Sec. 7.1.3. Here y is just a temporary variable. Both bitfields to exchange are in x. mask is used to select a bitfield, and d is the distance between the bitfields.
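A runnable sketch of that snippet, swapping two 4-bit fields inside one word (the field positions and the values of d and mask are my own example, not Knuth's):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 0x0A05;        /* field 0xA at bits 8-11, field 0x5 at bits 0-3 */
    const int d = 8;            /* distance between the two fields */
    const uint32_t mask = 0xF;  /* selects the lower field */

    uint32_t y = (x ^ (x >> d)) & mask;   /* y is just a temporary */
    x = x ^ y ^ (y << d);

    printf("0x%04X\n", (unsigned)x);      /* prints 0x050A: fields swapped */
    return 0;
}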
You could also use tricks like this in hardness proofs (to preserve planarity). See for example the crossover gadget from this slide (page 7). This is from recent lectures in "Algorithmic Lower Bounds" by prof. Erik Demaine.
Of course it is still useful to know. What is the alternative?
c = a
a = b
b = c
three operations with three resources rather than three operations with two resources?
Sure, the instruction set may have an exchange, but that only comes into play if you are 1) writing assembly or 2) the optimizer figures this out as a swap and then encodes that instruction. You could also use inline assembly, but that is not portable and a pain to maintain; and if you called an asm function instead, the compiler has to set up for the call, burning a bunch more resources and instructions. Although it can be done, you are not likely to actually exploit the instruction set's feature unless the language has a swap operation.
Now, the average programmer doesn't NEED to know this any more than back in the day; folks will bash this kind of premature optimization, and unless you know the trick and use it often, if the code isn't documented then it is not obvious, so it is bad programming because it is unreadable and unmaintainable.
It is still a valuable programming education and exercise, for example, to have one invent a test to prove that it actually swaps for all combinations of bit patterns. And just like doing an xor reg,reg on an x86 to zero a register, it has a small but real performance boost for highly optimized code.
A question that just popped into my head, and I don't think I've seen an answer on here: is the time taken by a binary addition algorithm proportional to the size of the operands?
Obviously, adding 1101011010101010101101010 and 10110100101010010101 is going to take longer than 1 + 1, but my question refers more to the smaller values. Is there a negligible difference, no difference, a theoretical difference?
At what point, with these sorts of rudimentary calculations should we start looking into more efficient methods of calculation? ie: Exponentiation by squaring with large exponents for calculating huge powers.
How we see the binary patterns...
1101011010101010101101010 (big)
10110100101010010101 (medium)
1 (small)
How a 32bit computer sees the binary patterns...
00000001101011010101010101101010 32bit,
00000000000010110100101010010101 32bit,
00000000000000000000000000000001 i'm lovin it
On a 32bit system, all the above numbers will take the same time (number of CPU instructions) to be added, as all of them fit within the basic computational block, i.e. the 32bit CPU register.
How a 16bit computer sees the binary patterns...
1
+1 = ?
0000000000000001 i'm lovin it
0000000000000001 i'm lovin it
00000001101011010101010101101010
+00000000000010110100101010010101 = ?
00000001101011010101010101101010 too BIG for me!
00000000000010110100101010010101 too BIG for me!
On a 16bit system, as the larger numbers will NOT fit in a 16bit register, the addition will need an additional pass (to add the significant bits that remain after the first 16 LSBs are added).
Step1: ADD Least significant bits
0101010101101010
0100101010010101
Step2: ADD the rest (remember carry bit from previous operation)
000000000000000C
0000000110101101
0000000000001011
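In C, the same two-pass idea looks roughly like this sketch (real 16-bit CPUs do it with an add / add-with-carry instruction pair):

#include <stdint.h>

/* Adding two 32-bit numbers using only 16-bit pieces, carrying between
   the halves, mirroring Step 1 and Step 2 above. */
uint32_t add32_via_16(uint32_t a, uint32_t b)
{
    uint16_t a_lo = (uint16_t)a, a_hi = (uint16_t)(a >> 16);
    uint16_t b_lo = (uint16_t)b, b_hi = (uint16_t)(b >> 16);

    uint32_t lo    = (uint32_t)a_lo + b_lo;           /* Step 1: low halves */
    uint16_t carry = (uint16_t)(lo >> 16);            /* carry out of Step 1 */
    uint16_t hi    = (uint16_t)(a_hi + b_hi + carry); /* Step 2: high halves */

    return ((uint32_t)hi << 16) | (uint16_t)lo;
}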
We can start thinking of optimising the mathematical operations on numbers once the numbers no longer fit in the basic computation unit of the system, i.e. the CPU register.
Modern hardware architectures are developed keeping this in mind and support SIMD instructions. Compilers will often employ them (SSE on x86, NEON on ARM) when they see such a case, e.g. 128bit decryption logic being run on a 32bit system.
Also instead of checking ONLY the size of the operands, the size of the result also determines whether the system can accomplish the mathematical operation within one step. Not only the operands involved, but the operation being performed needs to be taken into consideration as well.
For example, on a 32bit system, adding two 30bit numbers can be definitely carried out using the regular operations as the result is guaranteed to NOT exceed a 32bit register. But multiplying the same two 30bit numbers may result in a number that does NOT fit within 32bits.
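A minimal C sketch of that guarantee (the function names are mine):

#include <stdint.h>

/* Two 30-bit values: the sum needs at most 31 bits, so a plain 32-bit add
   is safe; the product needs up to 60 bits, so widen before multiplying. */
uint32_t add30(uint32_t a, uint32_t b) { return a + b; }
uint64_t mul30(uint32_t a, uint32_t b) { return (uint64_t)a * b; }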
In the absence of such a guarantee of being able to store the result in a single computational unit, to ensure the validity of the result for all possible values, the architecture (and the compiler) must:
go the long way i.e. multi-step mathematical operations
or
employ SIMD optimisations
or
define and implement custom mechanisms
(like register-pairs EDX:EAX to hold the result on x86)
In practice, there's no (or completely negligible) difference between adding different integers that fit in the processor words as that should always be a fixed-time operation.
In theory, the complexity for adding two unsigned integers should be O(log(n)) where n is the bigger of the two. As such, you need to go pretty high before mere additions become a problem.
As for where exactly to draw the line between simple and complex algorithms for computing numbers, I don't have an exact answer. However, the GMP library comes to mind. From what I understand, they've carefully chosen their algorithms and under what circumstances to use each in terms of performance. You may want to look into what they did.
I somewhat disagree with the above answers. It very much depends on the context.
For simple integer arithmetic (for loop counters etc), then on 64bit machines that computation will be done using 64bit general purpose registers (RSI/RCX/etc). In those cases, there is no difference in speed between an 8bit or 64bit addition.
If however you are processing arrays of integers, and assuming the compiler has been able to optimise the code nicely, then yes, smaller is faster (but not for the reason you think).
In the AVX2 instruction set, you have access to 4 integer addition instructions:
__m256i _mm256_add_epi8 (__m256i a, __m256i b); // 32 x 8bit
__m256i _mm256_add_epi16(__m256i a, __m256i b); // 16 x 16bit
__m256i _mm256_add_epi32(__m256i a, __m256i b); // 8 x 32bit
__m256i _mm256_add_epi64(__m256i a, __m256i b); // 4 x 64bit
You'll notice that all of them operate on 256 bits at a time, which means you can process 4 integer additions per instruction if you're using 64bit integers, compared to 32 additions if you are using 8bit integers. (As mentioned above, you'd need to make sure you have enough precision.) They all take the same number of clock cycles to compute - 1 clk.
There are also other effects of using smaller data types, which are mainly better CPU cache usage, and a reduced number of memory reads/writes.
However, back to your original question on bit-by-bit computation. Prior to the new AVX-512 instruction set, this might have seemed a little silly. However, the new instruction set contains a ternary logic instruction. With this instruction, it is possible to compute 512 additions in parallel on numbers of any bit length fairly easily.
inline __m512i add(__m512i x, __m512i y, __m512i carry_in)
{
    return _mm512_ternarylogic_epi32(carry_in, y, x, 0x96);   /* 0x96 = three-way XOR: the sum bit */
}
inline __m512i adc(__m512i x, __m512i y, __m512i carry_in)
{
    return _mm512_ternarylogic_epi32(carry_in, y, x, 0xE8);   /* 0xE8 = majority: the carry bit */
}
__m512i A[NUM_BITS];
__m512i B[NUM_BITS];
__m512i RESULT[NUM_BITS];
__m512i CARRY = _mm512_setzero_si512();
for (int i = 0; i < NUM_BITS; ++i)
{
    RESULT[i] = add(A[i], B[i], CARRY);
    CARRY = adc(A[i], B[i], CARRY);
}
In this particular example (which, to be honest, probably has very limited real-world usage!), the time it takes to perform the 512 additions is indeed directly proportional to NUM_BITS.
20-30 years ago arithmetic operations like division were one of the most costly operations for CPUs. Saving one division in a piece of repeatedly called code was a significant performance gain. But today CPUs have fast arithmetic operations and since they heavily use instruction pipelining, conditionals can disrupt efficient execution. If I want to optimize code for speed, should I prefer arithmetic operations in favor of conditionals?
Example 1
Suppose we want to implement operations modulo n. What will perform better:
int c = a + b;
result = (c >= n) ? (c - n) : c;
or
result = (a + b) % n;
?
Example 2
Let's say we're converting 24-bit signed numbers to 32-bit. What will perform better:
int32_t x = ...;
result = (x & 0x800000) ? (x | 0xff000000) : x;
or
result = (x << 8) >> 8;
?
All the low hanging fruits are already picked and pickled by authors of compilers and guys who build hardware. If you are the kind of person who needs to ask such question, you are unlikely to be able to optimize anything by hand.
While 20 years ago it was possible for a relatively competent programmer to make some optimizations by dropping down to assembly, nowadays it is the domain of experts, specializing in the target architecture; also, optimization requires not only knowing the program, but knowing the data it will process. Everything comes down to heuristics, tests under different conditions etc.
Simple performance questions no longer have simple answers.
If you want to optimise for speed, you should just tell your compiler to optimise for speed. Modern compilers will generally outperform you in this area.
I've sometimes been surprised trying to relate assembly code back to the original source for this very reason.
Optimise your source code for readability and let the compiler do what it's best at.
I expect that in example #1, the first will perform better. The compiler will probably apply some bit-twiddling trick to avoid a branch. But you're taking advantage of knowledge that it's extremely unlikely that the compiler can deduce: namely that the sum is always in the range [0:2*n-2] so a single subtraction will suffice.
For example #2, the second way is both faster on modern CPUs and simpler to follow. A judicious comment would be appropriate in either version. (I wouldn't be surprised to see the compiler convert the first version into the second.)
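For the curious, here is a small sketch checking that the two forms of example #2 agree (the test value is mine; note the shift form relies on arithmetic right shift of signed values, which is implementation-defined in C but universal in practice):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int32_t x = 0x00ABCDEF;   /* a 24-bit value whose sign bit (bit 23) is set */

    int32_t branchy = (x & 0x800000) ? (int32_t)(x | 0xff000000) : x;
    int32_t shifted = (int32_t)((uint32_t)x << 8) >> 8;   /* shift left as unsigned to avoid overflow */

    printf("%d %d\n", branchy, shifted);   /* both print -5517841 */
    return 0;
}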