Swapping Integers Efficiency - algorithm

Simply put:
X = an integer
Y = another integer
Z = a temporary integer (if used)
What's the most efficient way to swap X and Y?
Method I:
    Z = X
    X = Y
    Y = Z
Method II:
    X ^= Y
    Y ^= X
    X ^= Y
Edit I [Assembly View]
Method I:
    MOV
    MOV
    MOV
Method II:
    TEST (AND)
    JZ
    XOR
    XOR
    XOR
Notes:
MOV is slower than XOR (as far as I know)
TEST/JZ make the XOR swap safe when the two operands are equal (i.e. the same location)
Method I uses an extra register

In most cases, using a temporary variable (usually a register at assembly level) is the best choice, and the one that a compiler will tend to generate.
In most practical scenarios, the trivial swap algorithm using a temporary register is more efficient. Limited situations in which XOR swapping may be practical include: on a processor whose instruction set encoding permits the XOR swap to be encoded in a smaller number of bytes; in a region with high register pressure, where it may allow the register allocator to avoid spilling a register; and in microcontrollers where available RAM is very limited. Because these situations are rare, most optimizing compilers do not generate XOR swap code.
http://en.wikipedia.org/wiki/XOR_swap_algorithm
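For concreteness, a minimal sketch (my own snippet, not from the linked article) of the temporary-variable swap described above:
// The trivial swap: one temporary, three moves. Compilers typically
// lower this to plain register moves, or eliminate it entirely
// through register allocation.
void tempSwap(int *x, int *y) {
    int tmp = *x;
    *x = *y;
    *y = tmp;
}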
Also, your XOR Swap implementation fails if the same variable is passed as both arguments. A correct implementation (from the same link) would be:
void xorSwap (int *x, int *y) {
    if (x != y) {
        *x ^= *y;
        *y ^= *x;
        *x ^= *y;
    }
}
Note that the code does not swap the integers immediately, but first checks whether their addresses are distinct. This is because, if the addresses are equal, the algorithm folds to a triple *x ^= *x, resulting in zero.
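A quick illustration of that hazard (my own snippet):
int a = 10;
xorSwap(&a, &a); // the x != y guard skips the XORs; without it, a would end up 0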

Try this way of swapping numbers:
int a, b;
a = a + b - (b = a);
Be warned, though: this modifies b and reads it in the same expression with no sequencing between the two, which is undefined behavior in both C and C++, so it may appear to work on one compiler and break on another.


C++17 sequencing in assignment: still not implemented in GCC?

I tried the following code as a naive attempt to implement swapping of the R and B bytes in an ABGR word:
#include <stdio.h>
#include <stdint.h>

uint32_t ABGR_to_ARGB(uint32_t abgr)
{
    return ((abgr ^= (abgr >> 16) & 0xFF) ^= (abgr & 0xFF) << 16) ^= (abgr >> 16) & 0xFF;
}

int main()
{
    uint32_t tmp = 0x11223344;
    printf("%x %x\n", tmp, ABGR_to_ARGB(tmp));
}
To my surprise this code "worked" in GCC in C++17 mode - the bytes were swapped
http://coliru.stacked-crooked.com/a/43d0fc47f5539746
But it is not supposed to swap bytes! C++17 clearly states that the RHS of assignment is supposed to be [fully] sequenced before the LHS, which applies to compound assignment as well. This means that in the above expression each RHS of each ^= is supposed to use the original value of abgr. Hence the ultimate result in abgr should simply have B byte XORed by R byte. This is what Clang appears to produce (amusingly, with a sequencing warning)
http://coliru.stacked-crooked.com/a/eb9bdc8ced1b5f13
A quick look at GCC assembly
https://godbolt.org/g/1hsW5a
reveals that it seems to sequence it backwards: LHS before RHS. Is this a bug? Or is this some sort of conscious decision on GCC's part? Or am I misunderstanding something?
The exact same behavior is exhibited by int a = 1; (a += a) += a;, for which GCC calculates a == 4 afterwards and clang a == 3.
The underlying ambiguity arises from this part of the standard (from working draft N4762):
[expr.ass]: 7.6.18 Assignment and compound assignment operators
Paragraph 1: The assignment operator (=) and the compound assignment operators all group right-to-left. All require a
modifiable lvalue as their left operand; their result is an lvalue referring to the left operand. The result in all
cases is a bit-field if the left operand is a bit-field. In all cases, the assignment is sequenced after the value
computation of the right and left operands, and before the value computation of the assignment expression.
The right operand is sequenced before the left operand. With respect to an indeterminately-sequenced
function call, the operation of a compound assignment is a single evaluation.
Paragraph 7: The behavior of an expression of the form E1 op = E2 is equivalent to E1 = E1 op E2 except that E1 is
evaluated only once. In += and -=, E1 shall either have arithmetic type or be a pointer to a possibly
cv-qualified completely-defined object type. In all other cases, E1 shall have arithmetic type.
GCC seems to be using this rule to internally transform (a += a) += a to (a = a + a) += a and then to a = (a = a + a) + a (since a = a + a has to be evaluated only once) - and for this expression the sequencing rules are correctly applied.
Clang however seems to do that last transformation step differently: auto temp = a + a; temp = temp + a; a = temp;
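Spelled out as plain statements (variable and function names are mine), the two readings come out as:
int gcc_reading() {     // GCC: the inner assignment lands first
    int a = 1;
    a = a + a;          // inner (a += a): a == 2
    a = a + a;          // outer += reuses the updated a: a == 4
    return a;
}

int clang_reading() {   // Clang: the whole RHS uses the original a
    int a = 1;
    int tmp = a + a;    // tmp == 2
    tmp = tmp + a;      // a is still 1 here: tmp == 3
    a = tmp;            // single store: a == 3
    return a;
}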
Both compilers give a warning about this, though (from the original code):
GCC: warning: operation on 'abgr' may be undefined [-Wsequence-point]
clang: warning: unsequenced modification and access to 'abgr' [-Wunsequenced]
So the compiler writers know about this ambiguity and decided to prioritize differently (GCC: Paragraph 7 > Paragraph 1; clang: Paragraph 1 > Paragraph 7).
This seems to be a defect in the standard.
Do not make things more complicated than necessary. You can swap the 2 components in a fairly straightforward way without painting yourself into dark corners of the language:
uint32_t ABGR_to_ARGB(uint32_t abgr) {
    constexpr uint32_t mask = 0xff00ff00;
    uint32_t grab = abgr >> 16 | abgr << 16;
    return (abgr & mask) | (grab & ~mask);
}
It also generates much better assembly than the original version. On x86 it uses a single rol instruction for the three bitwise operators that produce grab:
ABGR_to_ARGB(unsigned int):
    mov eax, edi
    and edi, -16711936
    rol eax, 16
    and eax, 16711935
    or  eax, edi
    ret
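As a quick sanity check (my own snippet; it assumes the function above is in scope): 0x11223344 in ABGR is A=0x11, B=0x22, G=0x33, R=0x44, so swapping R and B should give 0x11443322.
#include <cassert>
#include <cstdint>

int main() {
    assert(ABGR_to_ARGB(0x11223344) == 0x11443322); // ARGB result
}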

Popcount of SSE vectors for binary correlation?

I have this simple binary correlation method. It beats the table-lookup and HAKMEM bit-twiddling methods by a factor of 3-4, and is 25% better than GCC's __builtin_popcount (which I think maps to a popcnt instruction when SSE4 is enabled).
Here is the much simplified code:
int correlation(uint64_t *v1, uint64_t *v2, int size64) {
    __m128i* a = reinterpret_cast<__m128i*>(v1);
    __m128i* b = reinterpret_cast<__m128i*>(v2);
    int count = 0;
    for (int j = 0; j < size64 / 2; ++j, ++a, ++b) {
        union { __m128i s; uint64_t b[2]; } x;
        x.s = _mm_xor_si128(*a, *b);
        count += _mm_popcnt_u64(x.b[0]) + _mm_popcnt_u64(x.b[1]);
    }
    return count;
}
I tried unrolling the loop, but I think GCC already does this automatically, so I ended up with the same performance. Do you think performance can be further improved without making the code too complicated? Assume v1 and v2 are of the same size and that the size is even.
I am happy with its current performance but I was just curious to see if it could be further improved.
Thanks.
Edit: fixed an error in the union, and it turned out this error was what made this version faster than __builtin_popcount. I modified the code again; it is again slightly faster than the builtin (15%), but I don't think it is worth investing more time in this. Thanks for all the comments and suggestions.
for (int j = 0; j < size64 / 4; ++j, a += 2, b += 2) {
    __m128i x0 = _mm_xor_si128(_mm_load_si128(a), _mm_load_si128(b));
    count += _mm_popcnt_u64(_mm_extract_epi64(x0, 0))
           + _mm_popcnt_u64(_mm_extract_epi64(x0, 1));
    __m128i x1 = _mm_xor_si128(_mm_load_si128(a + 1), _mm_load_si128(b + 1));
    count += _mm_popcnt_u64(_mm_extract_epi64(x1, 0))
           + _mm_popcnt_u64(_mm_extract_epi64(x1, 1));
}
Second edit: it turned out that the builtin is the fastest after all, sigh, especially with the -funroll-loops and -fprefetch-loop-arrays flags. Something like this:
for (int j = 0; j < size64; ++j) {
    count += __builtin_popcountll(a[j] ^ b[j]);
}
Third edit:
This is an interesting SSSE3 parallel 4-bit lookup algorithm. The idea is from Wojciech Muła; the implementation is from Marat Dukhan's answer. Thanks to @Apriori for reminding me of this algorithm. Below is the heart of the algorithm. It is very clever: it counts bits per byte by using an SSE register as a 16-way lookup table, with the lower nibbles selecting which table cells are read, and then sums the counts.
static inline __m128i hamming128(__m128i a, __m128i b) {
    static const __m128i popcount_mask = _mm_set1_epi8(0x0F);
    static const __m128i popcount_table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m128i x = _mm_xor_si128(a, b);
    const __m128i pcnt0 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(x, popcount_mask));
    const __m128i pcnt1 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(_mm_srli_epi16(x, 4), popcount_mask));
    return _mm_add_epi8(pcnt0, pcnt1);
}
In my tests this version is on par with the hardware popcount: slightly faster on smaller inputs, slightly slower on larger ones. I think it would really shine if implemented in AVX. But I don't have time for that; if anyone is up for it, I would love to hear their results.
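hamming128 returns per-byte counts, so the caller still needs a horizontal sum. A minimal sketch of one way to do that (my addition, not part of the quoted implementation): _mm_sad_epu8 against zero sums each 8-byte half into a 64-bit lane.
#include <immintrin.h>
#include <cstdint>

static inline uint64_t hamming_sum(__m128i per_byte_counts) {
    // Sum of absolute differences against zero == horizontal byte sum,
    // produced separately for the low and high 8 bytes.
    const __m128i sums = _mm_sad_epu8(per_byte_counts, _mm_setzero_si128());
    return (uint64_t)_mm_cvtsi128_si64(sums)
         + (uint64_t)_mm_extract_epi64(sums, 1);
}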
The problem is that popcnt (which is what __builtin_popcount compiles to on Intel CPUs) operates on the integer registers. This causes the compiler to issue instructions to move data between the SSE and integer registers. I'm not surprised that the non-SSE version is faster, since the ability to move data between the vector and integer registers is quite limited/slow.
// popcnt() here is presumably a 64-bit popcount wrapper,
// e.g. _mm_popcnt_u64 or __builtin_popcountll.
uint64_t count_set_bits(const uint64_t *a, const uint64_t *b, size_t count)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += popcnt(a[i] ^ b[i]);
    }
    return sum;
}
This runs at approx. 2.36 clocks per loop iteration on small data sets (those that fit in cache). I think it runs slowly because of the long dependency chain on sum, which restricts the CPU's ability to handle more things out of order. We can improve it by manually pipelining the loop:
uint64_t count_set_bits_2(const uint64_t *a, const uint64_t *b, size_t count)
{
    uint64_t sum = 0, sum2 = 0;
    for (size_t i = 0; i < count; i += 2) {
        sum  += popcnt(a[i  ] ^ b[i  ]);
        sum2 += popcnt(a[i+1] ^ b[i+1]);
    }
    return sum + sum2;
}
This runs at 1.75 clocks per item. My CPU is a Sandy Bridge model (i7-2820QM, fixed at 2.4GHz).
How about four-way pipelining? That's 1.65 clocks per item. What about 8-way? 1.57 clocks per item. We can derive that the runtime per item is (1.5n + 0.5) / n, where n is the number of pipelines in our loop. I should note that for some reason 8-way pipelining performs worse than the others when the dataset grows; I have no idea why. The generated code looks okay.
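The four-way version is just a mechanical extension of count_set_bits_2 with four independent accumulators (my sketch, under the same assumption about popcnt):
uint64_t count_set_bits_4(const uint64_t *a, const uint64_t *b, size_t count)
{
    // Four independent chains shorten the dependency on any single sum;
    // count is assumed to be a multiple of 4.
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < count; i += 4) {
        s0 += popcnt(a[i    ] ^ b[i    ]);
        s1 += popcnt(a[i + 1] ^ b[i + 1]);
        s2 += popcnt(a[i + 2] ^ b[i + 2]);
        s3 += popcnt(a[i + 3] ^ b[i + 3]);
    }
    return (s0 + s1) + (s2 + s3);
}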
Now if you look carefully at the assembly below, there is one xor, one add, one popcnt, and one mov instruction per item. There is also one lea instruction per loop iteration (and one branch and decrement, which I'm ignoring because they're pretty much free).
$LL3#count_set_:
; Line 50
    mov    rcx, QWORD PTR [r10+rax-8]
    lea    rax, QWORD PTR [rax+32]
    xor    rcx, QWORD PTR [rax-40]
    popcnt rcx, rcx
    add    r9, rcx
; Line 51
    mov    rcx, QWORD PTR [r10+rax-32]
    xor    rcx, QWORD PTR [rax-32]
    popcnt rcx, rcx
    add    r11, rcx
; Line 52
    mov    rcx, QWORD PTR [r10+rax-24]
    xor    rcx, QWORD PTR [rax-24]
    popcnt rcx, rcx
    add    rbx, rcx
; Line 53
    mov    rcx, QWORD PTR [r10+rax-16]
    xor    rcx, QWORD PTR [rax-16]
    popcnt rcx, rcx
    add    rdi, rcx
    dec    rdx
    jne    SHORT $LL3#count_set_
You can check with Agner Fog's optimization manuals that an lea costs half a clock cycle in throughput, and that the mov/xor/popcnt/add combo is apparently 1.5 clock cycles, although I don't fully understand why exactly.
Unfortunately, I think we're stuck here. The PEXTRQ instruction is what's usually used to move data from the vector registers to the integer registers, and we can fit this instruction and one popcnt instruction neatly into one clock cycle. Add one integer add instruction and our pipeline is at minimum 1.33 cycles long, and we still need to fit a vector load and xor in there somewhere... If Intel offered instructions to move multiple registers between the vector and integer registers at once, it would be a different story.
I don't have an AVX2 CPU at hand (xor on 256-bit vector registers is an AVX2 feature), but my vectorized-load implementation performs quite poorly with small data sizes and reached a minimum of 1.97 clock cycles per item.
For reference, in my benchmarks "pipe 2", "pipe 4" and "pipe 8" are the 2-, 4- and 8-way pipelined versions of the code shown above. The poor showing of "sse load" appears to be a manifestation of the lzcnt/tzcnt/popcnt false-dependency bug, which gcc avoided by using the same register for input and output. "sse load 2" follows below:
uint64_t count_set_bits_4sse_load(const uint64_t *a, const uint64_t *b, size_t count)
{
    uint64_t sum1 = 0, sum2 = 0;
    for (size_t i = 0; i < count; i += 4) {
        __m128i tmp = _mm_xor_si128(
            _mm_load_si128(reinterpret_cast<const __m128i*>(a + i)),
            _mm_load_si128(reinterpret_cast<const __m128i*>(b + i)));
        sum1 += popcnt(_mm_extract_epi64(tmp, 0));
        sum2 += popcnt(_mm_extract_epi64(tmp, 1));
        tmp = _mm_xor_si128(
            _mm_load_si128(reinterpret_cast<const __m128i*>(a + i + 2)),
            _mm_load_si128(reinterpret_cast<const __m128i*>(b + i + 2)));
        sum1 += popcnt(_mm_extract_epi64(tmp, 0));
        sum2 += popcnt(_mm_extract_epi64(tmp, 1));
    }
    return sum1 + sum2;
}
Have a look here. There is an SSSE3 version that beats the popcnt instruction by a lot. I'm not sure but you may be able to extend it to AVX as well.

Convert bit vector to one bit

Is there an efficient way to get 0x00000001 or 0xFFFFFFFF for a non-zero unsigned integer value, and 0 for zero, without branching?
I want to test several masks and create another mask based on that. Basically, I want to optimize the following code:
unsigned getMask(unsigned x, unsigned masks[4])
{
    return (x & masks[0] ? 1 : 0) | (x & masks[1] ? 2 : 0) |
           (x & masks[2] ? 4 : 0) | (x & masks[3] ? 8 : 0);
}
I know that some optimizing compilers can handle this, but even if that's the case, how exactly do they do it? I looked through the Bit Twiddling Hacks page, but found only a description of conditionally setting or clearing a mask using a boolean condition, so the conversion from int to bool would have to be done outside that method.
If there is no generic way to solve this, how can I do that efficiently using x86 assembler code?
x86 SSE2 can do this in a few instructions, the most important being movmskps, which extracts the top bit of each 4-byte element of a SIMD vector into an integer bitmap.
Intel's intrinsics guide is pretty good; see also the SSE tag wiki.
#include <immintrin.h>

static inline
unsigned getMask(unsigned x, unsigned masks[4])
{
    __m128i vx = _mm_set1_epi32(x);
    __m128i vm = _mm_load_si128((const __m128i*)masks); // or loadu if this can inline where masks[] isn't aligned
    __m128i vand = _mm_and_si128(vx, vm);               // renamed: "and" is a reserved token in C++
    __m128i eqzero = _mm_cmpeq_epi32(vand, _mm_setzero_si128()); // vector of 0 or -1 elems
    unsigned zeromask = _mm_movemask_ps(_mm_castsi128_ps(eqzero));
    return zeromask ^ 0xf; // flip the low 4 bits
}
Until AVX512, there's no SIMD cmpneq, so the best option is a scalar XOR after extracting the mask. (We want to flip just the low 4 bits, not all of them with a NOT.)
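A hypothetical usage check (my own; note that masks[] must be 16-byte aligned for the aligned load to be valid):
alignas(16) unsigned masks[4] = {0x1, 0x2, 0x4, 0x8};
unsigned m = getMask(0x5, masks); // x & masks[0] and x & masks[2] are non-zero, so m == 0x5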
The usual way to do this in x86 is:
    test eax, eax
    setne al
You can use !! to coerce a value to 0 or 1 and rewrite the expression like this:
return !!(x & masks[0]) | (!!(x & masks[1]) << 1) |
       (!!(x & masks[2]) << 2) | (!!(x & masks[3]) << 3);

Swapping in place

How do you swap two numbers in place, without using any additional space?
You can do it using the XOR operator:
if (x != y) { // this check is very important
    x ^= y;
    y ^= x;
    x ^= y;
}
EDIT:
Without the additional check, the above logic fails to swap a number with itself.
Example:
int x = 10;
If I apply the above logic to swap x with itself without the check, I end up with x == 0, which is incorrect.
Similarly, if I put the logic without the check in a function and call that function to swap two references to the same variable, it fails.
If you have 2 variables a and b, each occupying its own memory address:
a = a xor b
b = a xor b
a = a xor b
There are also some arithmetic variations of this trick, but they can fail on overflow (signed integer overflow is undefined behavior, and the multiply/divide version also breaks if a variable is zero):
a = a + b
b = a - b
a = a - b

a = a * b
b = a / b
a = a / b
The plus and minus variation may work if you have custom types that have + and - operators that make sense.
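As a side note (my own addition): with unsigned types, the add/subtract variant is actually safe even when the addition wraps, because unsigned arithmetic is defined modulo 2^N and the subtractions cancel exactly:
unsigned a = 0xFFFFFF F0u, b = 0x20u; // (ignore the space; a = 0xFFFFFFF0u)
a = a + b; // wraps modulo 2^32: a == 0x10
b = a - b; // b == 0xFFFFFFF0 (the original a)
a = a - b; // a == 0x20       (the original b)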
Note: to avoid confusion, if you have only 1 variable and 2 references or pointers to it, then all of the above will fail. A check should be made to avoid this.
Contrary to what a lot of people are saying, it does not matter whether the two variables hold different values. It only matters that the values live at 2 distinct memory addresses.
I.e. this is perfectly valid:
int a = 3;
int b = 3;
a = a ^ b;
b = a ^ b;
a = a ^ b;
assert(a == b);
assert(a == 3);
The XOR trick is the standard answer:
int x, y;
/* ... x and y get their values ... */
x ^= y;
y ^= x;
x ^= y;
XORing is considerably less clear than just using a temp, though, and it fails if x and y are the same location.
Since no language was mentioned, in Python:
y, x = x, y

Stack Overflow during SWAP

How can we take care of overflow happening during the swapping of two variables without using a third variable? I believe the XOR solution can be used only for integers. What about other variable types?
This isn't an answer but it doesn't fit in a comment.
Under what circumstances would you be running so close to the edge of your available stack storage that the additional use of a temporary variable for the swap is going to cause you difficulties?
I could see some embedded scenarios, but I'm hard pressed to imagine a scenario where you'd be so tight on stack space that this would matter (where you're not writing code in assembly language).
XOR will work for anything you can get your XOR operator to process; it's a property of binary data, not of binary data used to represent integers.
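For instance, here is a sketch (my own, not from any answer here) of XOR-swapping two arbitrary bitwise-copyable objects byte by byte; it assumes the two objects do not overlap in memory:
#include <cstddef>

void xorSwapBytes(unsigned char *p, unsigned char *q, std::size_t n) {
    if (p == q) return;           // self-swap would zero the object
    for (std::size_t i = 0; i < n; ++i) {
        p[i] ^= q[i];
        q[i] ^= p[i];
        p[i] ^= q[i];
    }
}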
By not doing it at all. The XOR swap algorithm is a cool hack; it shouldn't be used in production code.
The XOR solution works with any type that can be bitwise copied, not just integers. However, do not XOR a variable with itself, i.e.:
int x = 10;
int *p1 = &x;
int *p2 = p1;
*p1 = *p1 ^ *p2;
*p2 = *p1 ^ *p2;
*p1 = *p1 ^ *p2;
/* now x == 0 :( */
What's wrong with XCHG? No stack needed, and no overflow: the carry flag isn't even set. :)
a = a + b;
b = a - b;
a = a - b;
This will work for integers; with floats the result may be slightly off, because the intermediate sum can lose precision to rounding.
