C++17 sequencing in assignment: still not implemented in GCC? - gcc

I tried the following code as a naive attempt to implement swapping of R and B bytes in an ABGR word
#include <stdio.h>
#include <stdint.h>

uint32_t ABGR_to_ARGB(uint32_t abgr)
{
    return ((abgr ^= (abgr >> 16) & 0xFF) ^= (abgr & 0xFF) << 16) ^= (abgr >> 16) & 0xFF;
}

int main()
{
    uint32_t tmp = 0x11223344;
    printf("%x %x\n", tmp, ABGR_to_ARGB(tmp));
}
To my surprise this code "worked" in GCC in C++17 mode - the bytes were swapped
http://coliru.stacked-crooked.com/a/43d0fc47f5539746
But it is not supposed to swap bytes! C++17 clearly states that the RHS of assignment is supposed to be [fully] sequenced before the LHS, which applies to compound assignment as well. This means that in the above expression each RHS of each ^= is supposed to use the original value of abgr. Hence the ultimate result in abgr should simply have B byte XORed by R byte. This is what Clang appears to produce (amusingly, with a sequencing warning)
http://coliru.stacked-crooked.com/a/eb9bdc8ced1b5f13
A quick look at GCC assembly
https://godbolt.org/g/1hsW5a
reveals that it seems to sequence it backwards: LHS before RHS. Is this a bug? Or is this some sort of conscious decision on GCC's part? Or am I misunderstanding something?

The exact same behavior is exhibited by int a = 1; (a += a) += a;, for which GCC calculates a == 4 afterwards and clang a == 3.
The underlying ambiguity arises from this part of the standard (from working draft N4762):
[expr.ass]: 7.6.18 Assignment and compound assignment operators
Paragraph 1: The assignment operator (=) and the compound assignment operators all group right-to-left. All require a
modifiable lvalue as their left operand; their result is an lvalue referring to the left operand. The result in all
cases is a bit-field if the left operand is a bit-field. In all cases, the assignment is sequenced after the value
computation of the right and left operands, and before the value computation of the assignment expression.
The right operand is sequenced before the left operand. With respect to an indeterminately-sequenced
function call, the operation of a compound assignment is a single evaluation.
Paragraph 7: The behavior of an expression of the form E1 op = E2 is equivalent to E1 = E1 op E2 except that E1 is
evaluated only once. In += and -=, E1 shall either have arithmetic type or be a pointer to a possibly
cv-qualified completely-defined object type. In all other cases, E1 shall have arithmetic type.
GCC seems to be using this rule to internally transform (a += a) += a to (a = a + a) += a to a = (a = a + a) + a (since a = a + a has to be evaluated only once) - and for this expression the sequencing rules are correctly applied.
Clang however seems to do that last transformation step differently: auto temp = a + a; temp = temp + a; a = temp;
Both compilers give a warning about this, though (from the original code):
GCC: warning: operation on 'abgr' may be undefined [-Wsequence-point]
clang: warning: unsequenced modification and access to 'abgr' [-Wunsequenced]
So the compiler writers know about this ambiguity and decided to prioritize differently (GCC: Paragraph 7 > Paragraph 1; clang: Paragraph 1 > Paragraph 7).
This seems to be a defect in the standard.
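To make the two readings concrete, here is a minimal sketch of what each compiler effectively evaluates for int a = 1; (a += a) += a;, following the transformations described above:

#include <cstdio>

int main()
{
    // GCC's reading: a = (a = a + a) + a, where the outer RHS sees the updated a
    {
        int a = 1;
        a = a + a;          // inner compound assignment first: a == 2
        a = a + a;          // outer assignment reuses the updated a: a == 4
        std::printf("GCC-style result:   %d\n", a);
    }
    // Clang's reading: every right-hand side is evaluated with the original a
    {
        int a = 1;
        int temp = a + a;   // inner RHS with the original a: temp == 2
        temp = temp + a;    // outer RHS still sees the original a: temp == 3
        a = temp;
        std::printf("clang-style result: %d\n", a);
    }
}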

Do not make things more complicated than necessary. You can swap the 2 components in a fairly straightforward way without painting yourself into dark corners of the language:
uint32_t ABGR_to_ARGB(uint32_t abgr) {
    constexpr uint32_t mask = 0xff00ff00;
    uint32_t grab = abgr >> 16 | abgr << 16;
    return (abgr & mask) | (grab & ~mask);
}
It also generates much better assembly than the original version. On x86 it uses a single rol instruction for the three bitwise operators that produce grab:
ABGR_to_ARGB(unsigned int):
    mov eax, edi
    and edi, -16711936
    rol eax, 16
    and eax, 16711935
    or eax, edi
    ret
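As a quick sanity check, here is a small self-contained snippet (repeating the function above) using the value from the question; 0x11223344 interpreted as A=0x11, B=0x22, G=0x33, R=0x44 should become 0x11443322:

#include <cassert>
#include <cstdint>

uint32_t ABGR_to_ARGB(uint32_t abgr) {
    constexpr uint32_t mask = 0xff00ff00;
    uint32_t grab = abgr >> 16 | abgr << 16;
    return (abgr & mask) | (grab & ~mask);
}

int main() {
    // swapping the R and B bytes while keeping A and G in place
    assert(ABGR_to_ARGB(0x11223344) == 0x11443322);
}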

Related

CRC32 of an appended block

I'm computing CRC32 in a rolling fashion on the contents of a file. If the file has 3 blocks ABC, CRC32 is computed linearly CRC(CRC(CRC(A, 0xffffffff), B), C). This is done with code that looks like:
uint32_t crc32(unsigned char const *buf, uint32_t buf_size, uint32_t crc) {
    for (uint32_t i = 0; i < buf_size; i++)
        crc = (crc >> 8) ^ table[(crc ^ buf[i]) & 0xff];
    return crc;
}
Even though I write the entire content ABC at once and compute the CRC as above (which gets verified at the server), reads are normally done on a specific block. So I would like to track the CRC32 of each individual block as it is written.
Based on my limited understanding of how the CRC32 polynomial works,
A mod G = CRC1
AB mod G = CRC2
If I want the CRC32 of B, I'm thinking the following should do the trick:
(CRC2 - CRC1) mod G
or
(CRC2 ^ CRC1) mod G
Of course, the following code doesn't work.
uint32_t
crc32sw_diff(uint32_t crc1, uint32_t crc2)
{
    uint32_t delta = crc1 ^ crc2;
    return crc32((unsigned char const *)&delta, 4, 0xffffffff);
}
Another option is probably to compute the CRC32 of the individual blocks and combine them with something like zlib's crc32_combine() to get the CRC32 of the entire file.
See this answer for how CRC combination works. CRC(A) ^ CRC(B) is not equal to CRC(AB). However (for pure CRCs), using the notation that AB is the concatenated message of A followed by B, and 0 means an equal-length message of all zeros, CRC(A0) ^ CRC(0B) is equal to CRC(AB).
This also means that CRC(A0) ^ CRC(AB) == CRC(0B). Since CRC(0B) == CRC(B) (feeding zeros doesn't change a pure CRC), you can find it using crc32_combine() from zlib.
So, crc32_combine(crca, crcab, lenb) will return crcb.
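A minimal sketch of that identity using zlib (note that zlib's crc32() handles the 0xffffffff pre/post-conditioning internally, unlike the hand-rolled crc32() above; the block contents here are made-up placeholders):

#include <cassert>
#include <zlib.h>

int main() {
    // Two placeholder blocks A and B; the file content is the concatenation AB.
    const char a[] = "first block ";
    const char b[] = "second block";
    const z_off_t lenb = sizeof(b) - 1;

    // CRC of A, then rolled forward over B to get the CRC of AB.
    uLong crca  = crc32(crc32(0L, Z_NULL, 0), (const Bytef *)a, (uInt)(sizeof(a) - 1));
    uLong crcab = crc32(crca, (const Bytef *)b, (uInt)lenb);

    // CRC of B computed on its own.
    uLong crcb  = crc32(crc32(0L, Z_NULL, 0), (const Bytef *)b, (uInt)lenb);

    // As stated above: combining CRC(A) with CRC(AB) over B's length recovers CRC(B).
    assert(crc32_combine(crca, crcab, lenb) == crcb);
    return 0;
}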

Fastest way to swap alternate bytes on ARM Cortex M4 using gcc

I need to swap alternate bytes in a buffer as quickly as possible in an embedded system using an ARM Cortex M4 processor. I use gcc. The amount of data is variable but the max is a little over 2K. It doesn't matter if a few extra bytes are converted because I can use an over-sized buffer.
I know that the ARM has the REV16 instruction, which I can use to swap alternate bytes in a 32-bit word. What I don't know is:
1. Is there a way of getting at this instruction in gcc without resorting to assembler? The __builtin_bswap16 intrinsic appears to operate on 16-bit words only. Converting 4 bytes at a time will surely be faster than converting 2 bytes.
2. Does the Cortex M4 have a reorder buffer and/or do register renaming? If not, what do I need to do to minimise pipeline stalls when I convert the dwords of the buffer in a partially-unrolled loop?
For example, is this code efficient, where REV16 is appropriately defined to resolve (1):
uint32_t *buf = ... ;
size_t n = ... ; // (number of bytes to convert + 15)/16
for (size_t i = 0; i < n; ++i)
{
    uint32_t a = buf[0];
    uint32_t b = buf[1];
    uint32_t c = buf[2];
    uint32_t d = buf[3];
    REV16(a, a);
    REV16(b, b);
    REV16(c, c);
    REV16(d, d);
    buf[0] = a;
    buf[1] = b;
    buf[2] = c;
    buf[3] = d;
    buf += 4;
}
You can't use the __builtin_bswap16 intrinsic for the reason you stated: it works on 16-bit words, so it would zero the other halfword. I guess the reason for this is to keep the intrinsic behaving the same on processors which don't have an instruction similar to ARM's REV16.
The function
uint32_t swap(uint32_t in)
{
    in = __builtin_bswap32(in);
    in = (in >> 16) | (in << 16);
    return in;
}
compiles to (ARM GCC 5.4.1 -O3 -std=c++11 -march=armv7-m -mtune=cortex-m4 -mthumb)
    rev r0, r0
    ror r0, r0, #16
    bx lr
And you could probably ask the compiler to inline it, which would give you 2 instructions per 32-bit word. I can't think of a way to get GCC to generate REV16 with a 32-bit operand, without declaring your own function with inline assembly.
EDIT
As a follow-up, and based on artless noise's comment about the non-portability of the __builtin_bswap functions, the compiler recognizes
uint32_t swap(uint32_t in)
{
    in = ((in & 0xff000000) >> 24) | ((in & 0x00FF0000) >> 8) | ((in & 0x0000FF00) << 8) | ((in & 0xFF) << 24);
    in = (in >> 16) | (in << 16);
    return in;
}
and creates the same 3 instruction function as above, so that is a more portable way to achieve it. Whether different compilers would produce the same output though...
EDIT EDIT
If inline assembler is allowed, the following function
inline uint32_t Rev16(uint32_t a)
{
asm ("rev16 %1,%0"
: "=r" (a)
: "r" (a));
return a;
}
gets inlined, and acts as a single instruction as can be seen here.
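For reference, here is a self-contained sketch of how the questioner's partially-unrolled loop could use such a helper; the swap_alternate_bytes wrapper and the non-ARM fallback are illustrative additions, not part of the answer above:

#include <stddef.h>
#include <stdint.h>

static inline uint32_t Rev16(uint32_t a)
{
#if defined(__arm__)
    asm ("rev16 %0, %1" : "=r" (a) : "r" (a));                 // swap the bytes within each halfword
#else
    a = ((a & 0xFF00FF00u) >> 8) | ((a & 0x00FF00FFu) << 8);   // portable fallback, same effect
#endif
    return a;
}

// n is the number of 16-byte groups, i.e. (number of bytes to convert + 15) / 16
void swap_alternate_bytes(uint32_t *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i)
    {
        buf[0] = Rev16(buf[0]);
        buf[1] = Rev16(buf[1]);
        buf[2] = Rev16(buf[2]);
        buf[3] = Rev16(buf[3]);
        buf += 4;
    }
}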

Golang bitwise operations as well as general byte manipulation

I have some C# code that performs some bitwise operations on a byte. I am trying to do the same in golang but am having difficulties.
Example in C#
byte a, c;
byte[] data;
int j;
c = data[j];
c = (byte)(c + j);
c ^= a;
c ^= 0xFF;
c += 0x48;
I have read that golang cannot perform bitwise operations on the byte type. Therefore will I have to modify my code to a type uint8 to perform these operations? If so is there a clean and correct/standard way to implement this?
Go certainly can do bitwise operations on the byte type, which is simply an alias of uint8. The only changes I had to make to your code were:
The syntax of the variable declarations
Converting j to byte before adding it to c, since Go lacks (by design) integer promotion conversions when doing arithmetic
Removing the semicolons
Here you go
var a, c byte
var data []byte
var j int
c = data[j]
c = c + byte(j)
c ^= a
c ^= 0xFF
c += 0x48
If you're planning to do bitwise-not in Go, note that the operator for that is ^, not the ~ that is used in most other contemporary programming languages. This is the same operator that is used for xor, but the two are not ambiguous, since the compiler can tell which is which by determining whether the ^ is used as a unary or binary operator.

Convert bit vector to one bit

Is there an efficient way to get 0x00000001 or 0xFFFFFFFF for a non-zero unsigned integer value, and 0 for zero, without branching?
I want to test several masks and create another mask based on that. Basically, I want to optimize the following code:
unsigned getMask(unsigned x, unsigned masks[4])
{
    return (x & masks[0] ? 1 : 0) | (x & masks[1] ? 2 : 0) |
           (x & masks[2] ? 4 : 0) | (x & masks[3] ? 8 : 0);
}
I know that some optimizing compilers can handle this, but even if that's the case, how exactly do they do it? I looked through the Bit twiddling hacks page, but found only a description of conditional setting/clearing of a mask using a boolean condition, so the conversion from int to bool should be done outside the method.
If there is no generic way to solve this, how can I do that efficiently using x86 assembler code?
x86 SSE2 can do this in a few instructions, the most important being movmskps which extracts the top bit of each 4-byte element of a SIMD vector into an integer bitmap.
Intel's intrinsics guide is pretty good, see also the SSE tag wiki
#include <immintrin.h>

static inline
unsigned getMask(unsigned x, unsigned masks[4])
{
    __m128i vx = _mm_set1_epi32(x);
    __m128i vm = _mm_loadu_si128((const __m128i *)masks); // unaligned load; use _mm_load_si128 if masks[] is known to be 16-byte aligned
    __m128i anded = _mm_and_si128(vx, vm);                // "and" renamed: it is a keyword in C++
    __m128i eqzero = _mm_cmpeq_epi32(anded, _mm_setzero_si128()); // vector of 0 or -1 elems
    unsigned zeromask = _mm_movemask_ps(_mm_castsi128_ps(eqzero));
    return zeromask ^ 0xf; // flip the low 4 bits
}
Until AVX512, there's no SIMD cmpneq, so the best option is scalar XOR after extracting a mask. (We want to just flip the low 4 bits, not all of them with a NOT.)
The usual way to do this in x86 is:
test eax, eax
setne al
You can use !! to coerce a value to 0 or 1 and rewrite the expression like this
return !!(x & masks[0]) | (!!(x & masks[1]) << 1) |
(!!(x & masks[2]) << 2) | (!!(x & masks[3]) << 3);
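For the title question itself (turning any non-zero word into an all-ones mask, and zero into zero, without branching), here is a small sketch building on the same !! idea; the to_bit/to_mask names are just for illustration:

#include <stdint.h>

// 0 stays 0; any non-zero value becomes 0x00000001
static inline uint32_t to_bit(uint32_t x)  { return !!x; }

// 0 stays 0; any non-zero value becomes 0xFFFFFFFF (0 - 1 wraps around for unsigned)
static inline uint32_t to_mask(uint32_t x) { return (uint32_t)0 - (uint32_t)!!x; }

Compilers generally lower !!x to the test/setne sequence shown above.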

Swapping Integers Efficiency

Simply,
X = integer
Y = another integer
Z = (if used) a temporary integer
What's the most efficient method?
Method I:
Z = X
X = Y
Y = Z
Method II:
X ^= Y
Y ^= X
X ^= Y
Edit I [Assembly View]
Method I:
MOV
MOV
MOV
Method II:
TEST (AND)
JZ
XOR
XOR
XOR
Notes:
MOV is slower than XOR
TEST, JZ are used to make the XOR swap safe when both operands are the same
Method I uses an extra register
In most cases, using a temporary variable (usually a register at assembly level) is the best choice, and the one that a compiler will tend to generate.
In most practical scenarios, the trivial swap algorithm using a temporary register is more efficient. Limited situations in which XOR swapping may be practical include:
- on a processor where the instruction set encoding permits the XOR swap to be encoded in a smaller number of bytes;
- in a region with high register pressure, where it may allow the register allocator to avoid spilling a register;
- in microcontrollers where available RAM is very limited.
Because these situations are rare, most optimizing compilers do not generate XOR swap code.
http://en.wikipedia.org/wiki/XOR_swap_algorithm
Also, your XOR Swap implementation fails if the same variable is passed as both arguments. A correct implementation (from the same link) would be:
void xorSwap (int *x, int *y) {
    if (x != y) {
        *x ^= *y;
        *y ^= *x;
        *x ^= *y;
    }
}
Note that the code does not swap the integers passed immediately, but
first checks if their addresses are distinct. This is because, if the
addresses are equal, the algorithm will fold to a triple *x ^= *x
resulting in zero.
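For comparison, here is a minimal sketch of Method I written as a function (the tempSwap name is just for illustration); this is essentially what std::swap does for plain ints and what compilers optimize best:

void tempSwap(int *x, int *y) {
    int z = *x;   // Method I: copy through a temporary
    *x = *y;
    *y = z;
}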
Try this way of swapping numbers:
int a, b;
a = a + b - (b = a);
(Note that this reads and modifies b without sequencing in the same expression, so it is undefined behavior and may not work reliably on every compiler.)
