Convert bit vector to one bit - performance

Is there an efficient way to get 0x00000001 or 0xFFFFFFFF for non-zero unsigned integer values, and 0 for zero, without branching?
I want to test several masks and create another mask based on that. Basically, I want to optimize the following code:
unsigned getMask(unsigned x, unsigned masks[4])
{
    return (x & masks[0] ? 1 : 0) | (x & masks[1] ? 2 : 0) |
           (x & masks[2] ? 4 : 0) | (x & masks[3] ? 8 : 0);
}
I know that some optimizing compilers can handle this, but even if that's the case, how exactly do they do it? I looked through the Bit Twiddling Hacks page, but found only a description of conditional setting/clearing of a mask using a boolean condition, so the conversion from int to bool would have to be done outside the method.
If there is no generic way to solve this, how can I do that efficiently using x86 assembler code?

x86 SSE2 can do this in a few instructions, the most important being movmskps, which extracts the top bit of each 4-byte element of a SIMD vector into an integer bitmap.
Intel's intrinsics guide is pretty good; see also the SSE tag wiki.
#include <immintrin.h>
static inline
unsigned getMask(unsigned x, unsigned masks[4])
{
    __m128i vx = _mm_set1_epi32(x);
    __m128i vm = _mm_loadu_si128((const __m128i*)masks); // use _mm_load_si128 if masks[] is known to be 16-byte aligned
    __m128i vand = _mm_and_si128(vx, vm);                // named vand: "and" is a keyword in C++
    __m128i eqzero = _mm_cmpeq_epi32(vand, _mm_setzero_si128()); // vector of 0 or -1 elems
    unsigned zeromask = _mm_movemask_ps(_mm_castsi128_ps(eqzero));
    return zeromask ^ 0xf; // flip the low 4 bits
}
Until AVX-512, there's no SIMD compare-for-not-equal, so the best option is a scalar XOR after extracting the mask. (We want to flip just the low 4 bits, not all of them with a NOT.)
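As a concrete example (mask values invented for illustration): with masks = {1u<<0, 1u<<8, 1u<<16, 1u<<31} and x = 0x80000100, the tests succeed for masks[1] and masks[3] only, so getMask returns 0b1010 = 0xA.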

The usual way to do this in x86 is:
    test eax, eax
    setne al
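In portable C the same booleanization can be done branchlessly; a minimal sketch (my own illustration, assuming a 32-bit unsigned int):
#include <stdint.h>

// Branchless "booleanize": 1 for any non-zero input, 0 for zero.
// For non-zero v, either v or -v (or both) has the top bit set.
static inline uint32_t to_bit(uint32_t v)
{
    return (v | (0u - v)) >> 31;
}

// Branchless 0x00000000 / 0xFFFFFFFF mask, as asked in the question.
static inline uint32_t to_mask(uint32_t v)
{
    return 0u - to_bit(v); // negate 0/1 to get all-zeros/all-ones
}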

You can use !! to coerce a value to 0 or 1 and rewrite the expression like this:
return !!(x & masks[0]) | (!!(x & masks[1]) << 1) |
       (!!(x & masks[2]) << 2) | (!!(x & masks[3]) << 3);

Related

C++17 sequencing in assignment: still not implemented in GCC?

I tried the following code as a naive attempt to implement swapping of R and B bytes in an ABGR word
#include <stdio.h>
#include <stdint.h>

uint32_t ABGR_to_ARGB(uint32_t abgr)
{
    return ((abgr ^= (abgr >> 16) & 0xFF) ^= (abgr & 0xFF) << 16) ^= (abgr >> 16) & 0xFF;
}

int main()
{
    uint32_t tmp = 0x11223344;
    printf("%x %x\n", tmp, ABGR_to_ARGB(tmp));
}
To my surprise this code "worked" in GCC in C++17 mode - the bytes were swapped
http://coliru.stacked-crooked.com/a/43d0fc47f5539746
But it is not supposed to swap bytes! C++17 clearly states that the RHS of assignment is supposed to be [fully] sequenced before the LHS, which applies to compound assignment as well. This means that in the above expression each RHS of each ^= is supposed to use the original value of abgr. Hence the ultimate result in abgr should simply have B byte XORed by R byte. This is what Clang appears to produce (amusingly, with a sequencing warning)
http://coliru.stacked-crooked.com/a/eb9bdc8ced1b5f13
A quick look at GCC assembly
https://godbolt.org/g/1hsW5a
reveals that it seems to sequence it backwards: LHS before RHS. Is this a bug? Or is this some sort of conscious decision on GCC's part? Or am I misunderstanding something?
The exact same behavior is exhibited by int a = 1; (a += a) += a;, for which GCC calculates a == 4 afterwards and clang a == 3.
The underlying ambiguity arises from this part of the standard (from working draft N4762):
[expr.ass]: 7.6.18 Assignment and compound assignment operators
Paragraph 1: The assignment operator (=) and the compound assignment operators all group right-to-left. All require a modifiable lvalue as their left operand; their result is an lvalue referring to the left operand. The result in all cases is a bit-field if the left operand is a bit-field. In all cases, the assignment is sequenced after the value computation of the right and left operands, and before the value computation of the assignment expression. The right operand is sequenced before the left operand. With respect to an indeterminately-sequenced function call, the operation of a compound assignment is a single evaluation.
Paragraph 7: The behavior of an expression of the form E1 op = E2 is equivalent to E1 = E1 op E2 except that E1 is evaluated only once. In += and -=, E1 shall either have arithmetic type or be a pointer to a possibly cv-qualified completely-defined object type. In all other cases, E1 shall have arithmetic type.
GCC seems to be using this rule to internally transform (a += a) += a to (a = a + a) += a to a = (a = a + a) + a (since a = a + a has to be evaluated only once) - and for this expression the sequencing rules are correctly applied.
Clang however seems to do that last transformation step differently: auto temp = a + a; temp = temp + a; a = temp;
Both compilers give a warning about this, though (from the original code):
GCC: warning: operation on 'abgr' may be undefined [-Wsequence-point]
clang: warning: unsequenced modification and access to 'abgr' [-Wunsequenced]
So the compiler writers know about this ambiguity and decided to prioritize differently (GCC: Paragraph 7 > Paragraph 1; clang: Paragraph 1 > Paragraph 7).
This seems to be a defect in the standard.
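A minimal test program to reproduce the divergence (compile with -std=c++17; the outputs in the comment are the GCC and Clang behaviors described above):
#include <cstdio>

int main()
{
    int a = 1;
    (a += a) += a;          // per C++17, the RHS 'a' should be evaluated first
    std::printf("%d\n", a); // GCC prints 4, Clang prints 3
}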
Do not make things more complicated than necessary. You can swap the 2 components in a fairly straightforward way without painting yourself into dark corners of the language:
uint32_t ABGR_to_ARGB(uint32_t abgr) {
    constexpr uint32_t mask = 0xff00ff00;
    uint32_t grab = abgr >> 16 | abgr << 16;
    return (abgr & mask) | (grab & ~mask);
}
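For example, ABGR_to_ARGB(0x11223344) yields 0x11443322: A (0x11) and G (0x33) stay put while B (0x22) and R (0x44) trade places.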
It also generates much better assembly than the original version. On x86 it uses a single rol instruction for the 3 bitwise operators that produce grab:
ABGR_to_ARGB(unsigned int):
    mov eax, edi
    and edi, -16711936
    rol eax, 16
    and eax, 16711935
    or  eax, edi
    ret

Fastest way to swap alternate bytes on ARM Cortex M4 using gcc

I need to swap alternate bytes in a buffer as quickly as possible in an embedded system using an ARM Cortex M4 processor. I use gcc. The amount of data is variable, but the max is a little over 2K. It doesn't matter if a few extra bytes are converted, because I can use an over-sized buffer.
I know that the ARM has the REV16 instruction, which I can use to swap alternate bytes in a 32-bit word. What I don't know is:
Is there a way of getting at this instruction in gcc without resorting to assembler? The __builtin_bswap16 intrinsic appears to operate on 16-bit words only. Converting 4 bytes at a time will surely be faster than converting 2 bytes.
Does the Cortex M4 have a reorder buffer and/or do register renaming? If not, what do I need to do to minimise pipeline stalls when I convert the dwords of the buffer in a partially-unrolled loop?
For example, is this code efficient, where REV16 is appropriately defined to resolve (1):
uint32_t *buf = ... ;
size_t n = ... ; // (number of bytes to convert + 15)/16
for (size_t i = 0; i < n; ++i)
{
    uint32_t a = buf[0];
    uint32_t b = buf[1];
    uint32_t c = buf[2];
    uint32_t d = buf[3];
    REV16(a, a);
    REV16(b, b);
    REV16(c, c);
    REV16(d, d);
    buf[0] = a;
    buf[1] = b;
    buf[2] = c;
    buf[3] = d;
    buf += 4;
}
You can't use the __builtin_bswap16 function, for the reason you stated: it works on 16-bit words, so it will zero the other halfword. I guess the reason for this is to keep the intrinsic working the same on processors which don't have an instruction behaving similarly to REV16 on ARM.
The function
uint32_t swap(uint32_t in)
{
    in = __builtin_bswap32(in);   // full byte reverse
    in = (in >> 16) | (in << 16); // rotate the halves back: net effect is a swap within each halfword
    return in;
}
compiles to (ARM GCC 5.4.1 -O3 -std=c++11 -march=armv7-m -mtune=cortex-m4 -mthumb)
    rev r0, r0
    ror r0, r0, #16
    bx  lr
And you could probably ask the compiler to inline it, which would give you 2 instructions per 32-bit word. I can't think of a way to get GCC to generate REV16 with a 32-bit operand, without declaring your own function with inline assembly.
EDIT
As a follow-up, and based on artless noise's comment about the non-portability of the __builtin_bswap functions, the compiler recognizes
uint32_t swap(uint32_t in)
{
    in = ((in & 0xff000000) >> 24) | ((in & 0x00FF0000) >> 8) |
         ((in & 0x0000FF00) << 8) | ((in & 0xFF) << 24);
    in = (in >> 16) | (in << 16);
    return in;
}
and creates the same 3-instruction function as above, so that is a more portable way to achieve it. Whether different compilers would produce the same output is another question, though.
EDIT EDIT
If inline assembler is allowed, the following function
inline uint32_t Rev16(uint32_t a)
{
    asm ("rev16 %0,%1"   // REV16 Rd, Rm: the output operand must be the destination
         : "=r" (a)
         : "r" (a));
    return a;
}
gets inlined, and acts as a single instruction.
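Plugged into the loop from the question, the whole conversion becomes (a sketch reusing the question's buffer/count conventions):
#include <stdint.h>
#include <stddef.h>

// n counts groups of four 32-bit words, i.e. (bytes + 15) / 16.
void swap_alternate_bytes(uint32_t *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i)
    {
        // Four independent REV16s per iteration, as in the original unrolling.
        buf[0] = Rev16(buf[0]);
        buf[1] = Rev16(buf[1]);
        buf[2] = Rev16(buf[2]);
        buf[3] = Rev16(buf[3]);
        buf += 4;
    }
}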

Checksum for short data on microcontroller?

I'm looking for a good checksum for short binary data messages (3-5 bytes typical) on a microcontroller. I would like something that detects the kinds of errors that can sometimes happen on an SPI bus, for example off-by-ones and repeats ("abc" -> "bcd", and "abc"->"aab"). Also it should catch the edge cases of all-zeros, all-ones and all-same-value. The checksum can add 2-4 bytes.
Running speed is not as critical as this will not process very much data; but code size is somewhat important.
I ended up using CRC16 CCITT. This is only ~50 bytes of compiled code on the target system (not using any lookup tables!), runs reasonably fast, and handles all-zero and all-one cases pretty decently.
Code (from http://www.sal.wisc.edu/st5000/documents/tables/crc16.c):
unsigned short int
crc16(unsigned char *p, int n)
{
    unsigned short int crc = 0xffff;

    while (n-- > 0) {
        crc = (unsigned char)(crc >> 8) | (crc << 8);
        crc ^= *p++;
        crc ^= (unsigned char)(crc & 0xff) >> 4;
        crc ^= (crc << 8) << 4;
        crc ^= ((crc & 0xff) << 4) << 1;
    }
    return crc;
}
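Typical use for short SPI messages might look like this (message bytes invented for the example):
unsigned char msg[4] = {0x10, 0x22, 0x33, 0x44};
unsigned short crc = crc16(msg, sizeof msg);
// transmit msg followed by the two CRC bytes, e.g. high byte first;
// the receiver recomputes crc16 over the payload and compares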
See http://pubs.opengroup.org/onlinepubs/009695299/utilities/cksum.html for the algorithm used by cksum, which is itself based on the one used in the Ethernet standard. Its use within Ethernet is to catch errors that are similar to the ones you face.
That algorithm will give you a 4-byte checksum for any size of data you wish.

Algorithm to find the most significant bit [duplicate]

This question already has answers here: bitwise most significant set bit (10 answers). Closed 9 years ago.
A friend of mine was asked at an interview the following question: "Given a binary number, find the most significant bit". I immediately thought of the following solution but am not sure if it is correct.
Namely, divide the string into two parts and convert both parts into decimal. If the left subarray is 0 in decimal, then do a binary search in the right subarray, looking for 1.
That leads to my other question: is the most significant bit the left-most 1 in a binary number? Can you show me an example, with an explanation, where a 0 is the most significant bit?
EDIT:
There seems to be a bit of confusion in the answers below, so I am updating the question to make it more precise. The interviewer said: "You have a website that you receive data from until the most significant bit indicates to stop transmitting data. How would you go about telling the program to stop the data transfer?"
You could also use bit shifting. Pseudo-code:
number = gets
bitpos = 0
while number != 0
    bitpos++              # increment the bit position
    number = number >> 1  # shift the whole thing to the right once
end
puts bitpos
If the number is zero, bitpos is zero.
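A direct C rendering of that pseudo-code (my translation, assuming an unsigned 32-bit input):
#include <stdint.h>

// Returns the 1-based position of the most significant set bit,
// or 0 when the input is zero.
unsigned msb_position(uint32_t number)
{
    unsigned bitpos = 0;
    while (number != 0) {
        bitpos++;     /* count one more bit position */
        number >>= 1; /* shift the whole thing right once */
    }
    return bitpos;
}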
Finding the most significant bit in a word (i.e. calculating log2 with rounding down) using only C-like language constructs can be done with a rather well-known method based on De Bruijn sequences. For example, for a 32-bit value:
#include <stdint.h>

unsigned ulog2(uint32_t v)
{ /* Evaluates floor(log2(v)) */
    static const unsigned MUL_DE_BRUIJN_BIT[] =
    {
        0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
        8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
    };

    v |= v >> 1;  /* smear the highest set bit downward... */
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16; /* ...so v becomes 2^(k+1) - 1 for highest bit k */
    return MUL_DE_BRUIJN_BIT[(v * 0x07C4ACDDu) >> 27];
}
However, in practice simpler methods (like an unrolled binary search) usually work just as well or better.
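For comparison, an unrolled binary search of the kind mentioned above might look like this (my sketch, not from the answer; the input must be non-zero):
#include <stdint.h>

// floor(log2(v)) for v != 0, via unrolled binary search.
unsigned ulog2_bsearch(uint32_t v)
{
    unsigned r = 0;
    if (v & 0xFFFF0000u) { v >>= 16; r += 16; }
    if (v & 0x0000FF00u) { v >>= 8;  r += 8;  }
    if (v & 0x000000F0u) { v >>= 4;  r += 4;  }
    if (v & 0x0000000Cu) { v >>= 2;  r += 2;  }
    if (v & 0x00000002u) {           r += 1;  }
    return r;
}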
The edited question is really quite different, though not very clear. Who are "you"? The website or the programmer of the program that reads data from the website? If you're the website, you make the program stop by sending a value (but what, a byte, probably?) with its most-significant bit set. Just OR or ADD that bit in. If you're the programmer, you test the most-significant bit of the values you receive, and stop reading when it becomes set. For unsigned bytes, you could do the test like
bool stop = received_byte >= 128;
or
bool stop = received_byte & 128;
For signed bytes, you could use
bool stop = received_byte < 0;
or
bool stop = received_byte & 128;
If you're not reading bytes but, say, 32-bit words, the 128 changes to (1 << 31).
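For example, for unsigned 32-bit words (received_word and should_stop are names made up for this sketch):
#include <cstdint>

bool should_stop(uint32_t received_word)
{
    return (received_word & (1UL << 31)) != 0; // top bit set => stop
}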
This is one approach (not necessarily the most efficient, though, especially if your platform has a single-instruction solution such as find-first-one or count-leading-zeros), assuming two's complement signed integers and a 32-bit integer width.
int mask = (int)(1U << 31); // signed integer with only the top bit (bit 31) set
while (!(n & mask))         // n is the int we're testing against
    mask >>= 1;             // take advantage of sign fill on right shift of a negative number
mask = mask ^ (mask << 1);  // isolate the lowest set bit of mask, i.e. the highest bit of n
If you want the bit position of that first one, simply add an integer counter that starts at 31 and gets decremented on each loop iteration.
One downside to this is if n == 0, it's an infinite loop, so test for zero beforehand.
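Worked example: for n = 0x00001234 the loop stops with mask = 0xFFFFF000 (sign fill kept the upper ones set), and mask ^ (mask << 1) leaves exactly 0x00001000, the highest set bit of n.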
If you are interested in a C/C++ solution you can have a look at the book "Matters Computational" by Jörg Arndt where you have these functions defined in section "1.6.1 Isolating the highest one and finding its index":
static inline ulong highest_one_idx(ulong x)
// Return index of highest bit set.
// Return 0 if no bit is set.
{
#if defined BITS_USE_ASM
    return asm_bsr(x);
#else  // BITS_USE_ASM
#if BITS_PER_LONG == 64
#define MU0 0x5555555555555555UL // MU0 == ((-1UL)/3UL) == ...01010101_2
#define MU1 0x3333333333333333UL // MU1 == ((-1UL)/5UL) == ...00110011_2
#define MU2 0x0f0f0f0f0f0f0f0fUL // MU2 == ((-1UL)/17UL) == ...00001111_2
#define MU3 0x00ff00ff00ff00ffUL // MU3 == ((-1UL)/257UL) == (8 ones)
#define MU4 0x0000ffff0000ffffUL // MU4 == ((-1UL)/65537UL) == (16 ones)
#define MU5 0x00000000ffffffffUL // MU5 == ((-1UL)/4294967297UL) == (32 ones)
#else
#define MU0 0x55555555UL // MU0 == ((-1UL)/3UL) == ...01010101_2
#define MU1 0x33333333UL // MU1 == ((-1UL)/5UL) == ...00110011_2
#define MU2 0x0f0f0f0fUL // MU2 == ((-1UL)/17UL) == ...00001111_2
#define MU3 0x00ff00ffUL // MU3 == ((-1UL)/257UL) == (8 ones)
#define MU4 0x0000ffffUL // MU4 == ((-1UL)/65537UL) == (16 ones)
#endif
    ulong r = (ulong)ld_neq(x, x & MU0)
            + ((ulong)ld_neq(x, x & MU1) << 1)
            + ((ulong)ld_neq(x, x & MU2) << 2)
            + ((ulong)ld_neq(x, x & MU3) << 3)
            + ((ulong)ld_neq(x, x & MU4) << 4);
#if BITS_PER_LONG > 32
    r += ((ulong)ld_neq(x, x & MU5) << 5);
#endif
    return r;
#undef MU0
#undef MU1
#undef MU2
#undef MU3
#undef MU4
#undef MU5
#endif // BITS_USE_ASM
}
where asm_bsr is implemented depending on your processor architecture
// i386
static inline ulong asm_bsr(ulong x)
// Bit Scan Reverse: return index of highest one.
{
    asm ("bsrl %0, %0" : "=r" (x) : "0" (x));
    return x;
}
or
// AMD64
static inline ulong asm_bsr(ulong x)
// Bit Scan Reverse
{
    asm ("bsrq %0, %0" : "=r" (x) : "0" (x));
    return x;
}
Go here for the code: http://jjj.de/bitwizardry/bitwizardrypage.html
EDIT:
This is the definition in the source for function ld_neq:
static inline bool ld_neq(ulong x, ulong y)
// Return whether floor(log2(x))!=floor(log2(y))
{ return ( (x^y) > (x&y) ); }
I don't know if this is too tricky :)
I would convert the binary number to decimal and then return the base-2 logarithm of the number directly (converted from float to int).
The most significant bit is then bit (returned number + 1), counting from the right.
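A quick C sketch of that idea (my illustration; float rounding can bite near powers of two, so treat it as illustrative only):
#include <math.h>

// 1-based MSB position from the right via log2; 0 for a zero input.
int msb_via_log2(unsigned v)
{
    return v ? (int)log2((double)v) + 1 : 0;
}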
To answer your question: as far as I know, it's the left-most 1.
I think this is kind of a trick question. The most significant bit is always going to be a 1 :-). If interviewers like lateral thinking, that answer should be a winner!

How to wrap a number using mod operator

Not sure if this is possible, but is there an automatic way, using mod or something similar, to correct bad input values? For example:
If r>255, then set r=255 and
if r<0, then set r=0
So basically what I'm asking is: what's a clever mathematical way to do this rather than using
if (r > 255)
    r = 255;
if (r < 0)
    r = 0;
How about:
r = std::max(0, std::min(r, 255));
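For what it's worth, C++17 added std::clamp, which expresses the same thing directly:
#include <algorithm>

r = std::clamp(r, 0, 255); // same as std::max(0, std::min(r, 255))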
The following function will output what you are looking for:
f(x) = (510*(1 + Sign[-255 + x]) + x*(1 + Sign[255 - x])*(1 + Sign[x]))/4
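For example, f(100) = 100, f(300) = 255, and f(-5) = 0.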
Could you do something like --
R = MIN(r, 255);
R = MAX(R, 0);
Depending on how your hardware and possibly how your interpreter deal with ints, you can do this:
Assuming that an unsigned int is 16 bits (to keep my masks short), the mask is 0000000011111111 in binary:
r = r & 0xFF; /* written in hex: a literal with a leading 0 would be parsed as octal in C */
If an int were 32 bits, the binary mask would just gain 16 more zeros at the front; the hex form stays 0xFF.
After that bitwise AND, the maximum value r can have is 255. Depending on the hardware, an unsigned int might do something odd given a value below zero. I believe that case is already handled by the bitmask (at least on the hardware that I've used). If not, you can do r = max(r, 0); first.
I had a similar problem when dealing with images. For some special values (like these, 0 and 255) you can use this nonportable method:
static inline int trim_8bit(unsigned i){
    // -!!(i & ~0xff) is all-ones whenever i is outside 0..255; adding
    // (i >> 31) then wraps the "negative" case (top bit set) to 0, while
    // large positive values stay all-ones and mask down to 0xff.
    // Note: + binds tighter than &, so this is 0xff & (... + (i >> 31)).
    return 0xff & ((i | -!!(i & ~0xff)) + (i >> 31));
    // where "0xff &" can be omitted if you return unsigned char
}
In real cases the clamping has to be performed only rarely, so you could write:
static inline unsigned char trim_8bit_v2(unsigned i){
    if (__builtin_expect(i & ~0xFF, 0)) // it's for gcc, use __assume for MSVC
        return (i >> 31) - 1;           // 0 for "negative" inputs, 0xFF for too-large ones
    return i;
}
And to be sure which is fastest, measure.
