Insertion Sort of uint16_t halfwords? - sorting

I started from code for sorting whole signed words (int32_t) and tried to change it to sort uint16_t unsigned halfwords. I'm almost done with the code, but something is missing; the problem is in the sort itself. The architecture is ARM (ARM mode, not Thumb). This is what I have done so far:
sort:   STMFD SP!,{R4-R6,LR}
        MOV   R2,#1                // for (unsigned i = 1;
L1:     CMP   R2,R1
        BHS   L4                   //      i < toSort.size();
        MOV   R3,R2                //   for (unsigned j = i
L2:     SUBS  R3,R3,#1             //        - 1; --j)
        BLO   L3                   //        j != -1;
        ADD   R6,R0,R3,LSL #2
        LDRH  R4,[R6]
        LDRH  R5,[R6,#2]
        CMP   R5,R4                //     if (toSort[j+1] < toSort[j])
        STRHT R4,[R6,#2]
        STRHT R5,[R6]              //       swap(toSort[j], toSort[j+1]);
        BLT   L2
L3:     ADD   R2,R2,#1             //     else break; ++i)
        B     L1
L4:     LDMFD SP!,{R4-R6,PC}
void insertionSort(vector<int>& toSort)
{
    for (int i = 1; i < toSort.size(); ++i)
        for (int j = i - 1; j >= 0; --j)
            if (toSort[j+1] < toSort[j])
                swap(toSort[j], toSort[j+1]);
            else
                break;
}
This is the C++ code that the assembly is supposed to implement.

Based on the original, your changes look reasonable, except that you need to shift by #1 instead of #2 in ADD R6,R0,R3,LSL #2. The #2 is for 4-byte integers; #1 would be for 2-byte integers. The shift count is the power of 2 by which you want to scale the index: for 4-byte integers we want to multiply/scale by 4, which is 2^2, whereas for 2-byte integers we want to multiply/scale by 2, which is 2^1. Shifting left by n bits multiplies by 2^n.
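To make the scaling concrete, the same address computation spelled out in C looks like this (my illustration, not code from the original post):
#include <stdint.h>
#include <stddef.h>

// equivalent of ADD R6,R0,R3,LSL #2 -- scale the index by sizeof(int32_t) == 4
int32_t *word_addr(int32_t *base, size_t j)
{
    return (int32_t *)((char *)base + (j << 2));    // same as &base[j]
}

// equivalent of ADD R6,R0,R3,LSL #1 -- scale the index by sizeof(uint16_t) == 2
uint16_t *halfword_addr(uint16_t *base, size_t j)
{
    return (uint16_t *)((char *)base + (j << 1));   // same as &base[j]
}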
However, I find it very difficult to believe that the original insertion sort actually works.  (Did you test it independently of your modifications?)
In particular, as I have said, "the swap is happening before the condition test", and, further, the instructions told you that the STRH should be conditional, which is another way of describing the same problem.
So that code is not following the insertion sort as described in the comments.
The intent of the if-statement within the inner loop is to detect when the insertion point has been reached and immediately stop the inner loop; otherwise, if the insertion point hasn't been reached, swap the elements and continue the inner loop.
That code is instead doing the following, at the end of the inner loop:
swap( toSort[j], toSort[j+1] );
if ( toSort[j+1] >= toSort[j] )
    break;
Or more specifically:
conditionCodes = toSort[j+1] < toSort[j];
swap( toSort[j], toSort[j+1] );
if ( conditionCodes is less than )
continue;
break;
The effect of this is that when the proper stopping point is reached, it will stop, but it will also have made one unwanted swap of two elements.
You can stop that swap from happening in the last iteration either by changing the flow of control so that those store instructions are not executed, or, as @fuz says, on ARM you can make those store instructions themselves conditional, which is probably what the original author intended, given how far apart the CMP and BLT are. However, the conditional operation of those stores was lost somewhere along the way.
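To make the intended order of operations concrete, here is the reference algorithm restated in C++ for uint16_t halfwords, with the compare clearly happening before any store (just a paraphrase of the C++ above, not new logic):
#include <cstdint>
#include <cstddef>
#include <utility>
#include <vector>

void insertionSortHalfwords(std::vector<uint16_t>& toSort)
{
    for (size_t i = 1; i < toSort.size(); ++i)
        for (size_t j = i - 1; j != (size_t)-1; --j)
            if (toSort[j+1] < toSort[j])
                std::swap(toSort[j], toSort[j+1]);  // stores happen only after the compare says "less"
            else
                break;                              // insertion point reached: no stores at all
}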

Are the traditional Fletcher/Adler checksum algorithms optimal?

Are the traditional Fletcher/Adler checksum algorithms optimal?
I am not referring to the common optimizations applied to these
algorithms. For example, the controlling of loop iterations to
avoid sum overflows then truncating outside of the main loop.
I am referring to the design itself. I understand the second sum (sum2) was introduced to make the algorithm position-dependent, but is it truly sub-optimal? Sum2 is just a modification of sum1 (a running sum of the successive sum1 values).
If we take the 16-bit version as our example, sum1 is an 8-bit product of the input data, while sum2 is a product of sum1, so the final 16-bit checksum is in fact an 8-bit product of the data, widened to 16 bits to catch out-of-sequence input. This means our final checksum, an 8-bit sum of our input data, can take only 256 possible values. As we are passing a 16-bit value with our data, we could have a checksum that is one of 65536 possible values.
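(For reference, a simplified textbook Fletcher-16, without the usual deferred-modulo optimization, looks roughly like the sketch below; it is shown only to make the sum1/sum2 discussion concrete and is not part of my proposal.)
#include <stdint.h>

uint16_t Fletcher16_reference( uint8_t *data, int count )
{
    uint16_t sum1 = 0, sum2 = 0;
    for ( int index = 0; index < count; ++index )
    {
        sum1 = (sum1 + data[index]) % 255;  // plain running sum of the data bytes
        sum2 = (sum2 + sum1) % 255;         // running sum of sum1 -> position dependence
    }
    return (uint16_t)((sum2 << 8) | sum1);
}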
It occurred to me that if we had a means of ensuring a position-dependent check without sacrificing any bits of the checksum, we would have an exponentially better validation.
As I understand it, sum2 was introduced solely to identify out-of-sequence transmission, and so it can be discarded if we find an alternative way of producing a position-dependent checksum. And as it turns out, we do have an alternative, and it comes cheap.
It is cheap because it does not add extra coding to the process compared to current 'sum2' designs - it is the index position in the sequence that, when hashed with each corresponding byte, ensures a position-dependent check.
One final note - the design below is free of overflow checks, lossless reduction, and possible loop optimizations, just for clarity, as this is about a new out-of-sequence error check technique, not about implementation details. Fletch64 is not demonstrated below as it may require a more complicated implementation, but the byte/index hash applies just the same.
Revision - because a checksum algorithm can process a large data count, the index position check value could itself cause premature overflow and require a higher number of reduction operations with a lower inner loop count. The fix was to truncate the index position check to 8 bits. Now the checksum can process a much greater data count in the inner loop before overflow.
The only downside to this change is that if a contiguous string of exactly 256 bytes of data is displaced by any multiple of 256 positions from its original position, the error will go undetected.
uint8_t Fletch8( uint8_t *data, int count )
{
    uint32_t sum = 0;
    int index;
    for ( index = 0; index < count; ++index )
        sum = sum + ( data[index] ^ ( index & 0xFF ) );
    return sum & 0xFF;
}

uint16_t Fletch16( uint8_t *data, int count )
{
    uint32_t sum = 0;
    int index;
    for ( index = 0; index < count; ++index )
        sum = sum + ( data[index] ^ ( index & 0xFF ) );
    return sum & 0xFFFF;
}

uint32_t Fletch32( uint8_t *data, int count )
{
    uint64_t sum = 0;
    int index;
    for ( index = 0; index < count; ++index )
        sum = sum + ( data[index] ^ ( index & 0xFF ) );
    return sum & 0xFFFFFFFF;
}

Formal proof of a recursive Quicksort using frama-c

As homework, I've decided to try to verify an implementation of quicksort (taken and adapted from here) using Frama-C with the WP and RTE plugins. Note that initially leftmost is 0 and rightmost is equal to size-1. Here is my proof attempt:
/*@
requires \valid(a);
requires \valid(b);
ensures *a == \old(*b);
ensures *b == \old(*a);
assigns *a,*b;
*/
void swap(int *a, int *b)
{
int temp = *a;
*a = *b;
*b = temp;
}
/*@
requires \valid(t +(leftmost..rightmost));
requires 0 <= leftmost;
requires 0 <= rightmost;
decreases (rightmost - leftmost);
assigns *(t+(leftmost..rightmost));
*/
void quickSort(int * t, int leftmost, int rightmost)
{
// Base case: No need to sort arrays of length <= 1
if (leftmost >= rightmost)
{
return;
} // Index indicating the "split" between elements smaller than pivot and
// elements greater than pivot
int pivot = t[rightmost];
int counter = leftmost;
/*@
loop assigns i, counter, *(t+(leftmost..rightmost));
loop invariant 0 <= leftmost <= i <= rightmost + 1 <= INT_MAX ;
loop invariant 0 <= leftmost <= counter <= rightmost;
loop invariant \forall int i; leftmost <= i < counter ==> t[i] <= pivot;
loop variant rightmost - i;
*/
for (int i = leftmost; i <= rightmost; i++)
{
if (t[i] <= pivot)
{
/*@ assert \valid(&t[counter]); */
/*@ assert \valid(&t[i]); */
swap(&t[counter], &t[i]);
counter++;
}
}
// NOTE: counter is currently at one plus the pivot's index
// (Hence, the counter-2 when recursively sorting the left side of pivot)
quickSort(t, leftmost, counter-2); // Recursively sort the left side of pivot
quickSort(t, counter, rightmost); // Recursively sort the right side of pivot
}
As a side note, I know that WP doesn't support recursion, hence the ignored decreases clause when running frama-c -wp -wp-rte.
Here is the result in the GUI (screenshot omitted):
As you can see, my loop invariants are not verified even though they make sense to me.
Frama-C is able to verify, under hypotheses, the second recursive call even though it doesn't support recursion. To my understanding, the call quickSort(t, leftmost, counter-2) isn't verified since it can violate the precondition requires 0 <= rightmost. I am not too sure about Frama-C's behaviour in that case, though, or how to tackle it.
I would like some input about what is going on. I think that the invariants not being verified has nothing to do with recursion, as even after removing the recursive calls they aren't verified. And finally, could you explain to me what Frama-C's behaviour is in the case of the recursive calls? Are they treated like any other function call, or is there some behaviour that I am unaware of?
Thanks
First, unlike Eva, WP has no real problem with recursive functions, apart from proving termination, which is completely orthogonal to proving that the post-condition holds each time the function returns (meaning that we don't have to prove anything for the non-terminating cases): in the literature, this is referred to as partial correctness, vs. total correctness when you can also prove that the function always terminates. The decreases clause only serves to prove termination, so the fact that it is unsupported is only an issue if you want total correctness. For partial correctness, everything is fine.
Namely for partial correctness, a recursive call is treated like any other call: you take the contract of the callee, prove that the pre-condition holds at this point, and try to prove the post-condition of the caller assuming that the post-condition of the callee holds after the call. Recursive calls are in fact easier for the developer: since the caller and the callee are the same, you get to write only one contract 😛.
Now regarding the proof obligations that fail: when the 'established' part of a loop invariant fails, it is often a good idea to start investigating that. This is usually a simpler proof obligation than the preservation: for the established part, you want to prove that the annotation holds when you encounter the loop for the first time (i.e. this is the base case), while for the preservation, you have to prove that if you assume the invariant true at the beginning of an arbitrary loop step, it stays true at the end of said step (i.e. this is the inductive case). In particular, you cannot deduce from your pre-conditions that rightmost + 1 <= INT_MAX. Namely, if you have rightmost == INT_MAX, you will encounter issues, especially as the final i++ will overflow. In order to avoid such arithmetic subtleties, it is probably simpler to use size_t for leftmost and to consider rightmost to be one past the greatest offset to consider. However, if you require that both leftmost and rightmost be strictly less than INT_MAX, then you will be able to proceed.
However, that is not all. First, your invariant for bounding counter is too weak. You want counter <= i, not merely counter <= rightmost. Finally, it is necessary to guard the recursive calls to avoid violating the pre-conditions on leftmost or rightmost in case the pivot was ill-chosen and your original indices were close to the limits (i.e. counter ends up being 0 or 1 because the pivot was too small, or INT_MAX because it was too big; in any case, this can only happen if the corresponding side would be empty).
In the end, the following code gets completely proved by WP (Frama-C 20.0 Calcium, using -wp -wp-rte):
#include <limits.h>
/*@
requires \valid(a);
requires \valid(b);
ensures *a == \old(*b);
ensures *b == \old(*a);
assigns *a,*b;
*/
void swap(int *a, int *b)
{
int temp = *a;
*a = *b;
*b = temp;
}
/*@
requires \valid(t +(leftmost..rightmost));
requires 0 <= leftmost < INT_MAX;
requires 0 <= rightmost < INT_MAX;
decreases (rightmost - leftmost);
assigns *(t+(leftmost..rightmost));
*/
void quickSort(int * t, int leftmost, int rightmost)
{
// Base case: No need to sort arrays of length <= 1
if (leftmost >= rightmost)
{
return;
} // Index indicating the "split" between elements smaller than pivot and
// elements greater than pivot
int pivot = t[rightmost];
int counter = leftmost;
/*@
loop assigns i, counter, *(t+(leftmost..rightmost));
loop invariant 0 <= leftmost <= i <= rightmost + 1;
loop invariant 0 <= leftmost <= counter <= i;
loop invariant \forall int i; leftmost <= i < counter ==> t[i] <= pivot;
loop variant rightmost - i;
*/
for (int i = leftmost; i <= rightmost; i++)
{
if (t[i] <= pivot)
{
/*@ assert \valid(&t[counter]); */
/*@ assert \valid(&t[i]); */
swap(&t[counter], &t[i]);
counter++;
}
}
// NOTE: counter is currently at one plus the pivot's index
// (Hence, the counter-2 when recursively sorting the left side of pivot)
if (counter >= 2)
quickSort(t, leftmost, counter-2); // Recursively sort the left side of pivot
if (counter < INT_MAX)
quickSort(t, counter, rightmost); // Recursively sort the right side of pivot
}

Using SIMD to find the biggest difference of two elements

I wrote an algorithm to get the biggest difference between two elements in an std::vector where the bigger of the two values must be at a higher index than the lower value.
unsigned short int min = input.front();
unsigned short res = 0;
for (size_t i = 1; i < input.size(); ++i)
{
if (input[i] <= min)
{
min = input[i];
continue;
}
int dif = input[i] - min;
res = dif > res ? dif : res;
}
return res != 0 ? res : -1;
Is it possible to optimize this algorithm using SIMD? I'm new to SIMD and so far I've been unsuccessful with this one.
You didn't specify any particular architecture so I'll keep this mostly architecture neutral with an algorithm described in English. But it requires a SIMD ISA that can efficiently branch on SIMD compare results to check a usually-true condition, like x86 but not really ARM NEON.
This won't work well for NEON because it doesn't have a movemask equivalent, and SIMD -> integer causes stalls on many ARM microarchitectures.
The normal case while looping over the array is that an element, or a whole SIMD vector of elements, is not a new min and not a new max-diff candidate. We can quickly fly through those elements, only slowing down to get the details right when there's a new min. This is like a SIMD strlen or SIMD memcmp, except instead of stopping at the first search hit, we just go scalar for one block and then resume.
For each vector v[0..7] of the input array (assuming 8 int16_t elements per vector (16 bytes), but that's arbitrary):
SIMD compare vmin > v[0..7], and check for any elements being true. (e.g. x86 _mm_cmpgt_epi16 / if(_mm_movemask_epi8(cmp) != 0)) If there's a new min somewhere, we have a special case: the old min applies to some elements, but the new min applies to others. And it's possible there are multiple new-min updates within the vector, and new-diff candidates at any of those points.
So handle this vector with scalar code (updating a scalar diff which doesn't need to be in sync with the vector diffmax because we don't need position).
Broadcast the final min to vmin when you're done. Or do a SIMD horizontal min so out-of-order execution of later SIMD iterations can get started without waiting for a vmin from scalar. Should work well if the scalar code is branchless, so there are no mispredicts in the scalar code that cause later vector work to be thrown out.
As an alternative, a SIMD prefix-sum type of thing (actually prefix-min) could produce a vmin where every element is the min up to that point. (parallel prefix (cumulative) sum with SSE). You could always do this to avoid any branching, but if new-min candidates are rare then it's expensive. Still, it could be viable on ARM NEON where branching is hard.
If there's no new min, SIMD packed max diffmax[0..7] = max(diffmax[0..7], v[0..7]-vmin). (Use saturating subtraction so you don't get wrap-around to a large unsigned difference, if you're using unsigned max to handle the full range.)
At the end of the loop, do a SIMD horizontal max of the diffmax vector. Notice that since we don't need the position of the max-difference, we don't need to update all elements inside the loop when one finds a new candidate. We don't even need to keep the scalar special-case diffmax and SIMD vdiffmax in sync with each other, just check at the end to take the max of the scalar and SIMD max diffs.
SIMD min/max is basically the same as a horizontal sum, except you use packed-max instead of packed-add. For x86, see Fastest way to do horizontal float vector sum on x86.
Or on x86 with SSE4.1 for 16-bit integer elements, phminposuw / _mm_minpos_epu16 can be used for min or max, signed or unsigned, with appropriate tweaks to the input. max = -min(-diffmax). You can treat diffmax as unsigned because it's known to be non-negative, but Horizontal minimum and maximum using SSE shows how to flip the sign bit to range-shift signed to unsigned and back.
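A sketch of that trick for a horizontal max of unsigned 16-bit lanes (mine, untested): for unsigned values max(x) == ~min(~x), so complement the input, take the min with phminposuw, and complement the result.
#include <smmintrin.h>   // SSE4.1: _mm_minpos_epu16
#include <cstdint>

static inline uint16_t hmax_epu16(__m128i v)
{
    __m128i inv = _mm_xor_si128(v, _mm_set1_epi16(-1));   // ~v
    __m128i m   = _mm_minpos_epu16(inv);                  // min(~v) ends up in lane 0
    return (uint16_t)~_mm_extract_epi16(m, 0);            // ~min(~v) == max(v)
}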
We probably get a branch mispredict every time we find a new min candidate, or else we're finding new min candidates too often for this to be efficient.
If new min candidates are expected frequently, using shorter vectors could be good. Or on discovering there's a new-min in a current vector, then use narrower vectors to only go scalar over fewer elements. On x86, you might use bsf (bit-scan forward) to find which element had the first new-min. That gives your scalar code a data dependency on the vector compare-mask, but if the branch to it was mispredicted then the compare-mask will be ready. Otherwise if branch-prediction can somehow find a pattern in which vectors need the scalar fallback, prediction+speculative execution will break that data dependency.
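To make that concrete, here is a rough, untested sketch (mine, not drop-in code) of the branchy approach for uint16_t elements, assuming SSE4.1 for _mm_min_epu16 / _mm_max_epu16; the function and variable names are placeholders:
#include <immintrin.h>   // SSE4.1 intrinsics
#include <vector>
#include <cstdint>
#include <algorithm>

int maxDiffBranchy(const std::vector<uint16_t> &input)
{
    if (input.empty()) return -1;
    const uint16_t *p = input.data();
    size_t n = input.size();

    unsigned curmin = p[0];                     // running minimum (scalar copy)
    unsigned res = 0;                           // best difference so far (scalar path)
    __m128i vmin = _mm_set1_epi16((short)curmin);
    __m128i vdiffmax = _mm_setzero_si128();     // best differences so far (SIMD path)

    size_t i = 1;
    for (; i + 8 <= n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i*)(p + i));
        // a lane compares equal to vmin after an unsigned min exactly when it is >= vmin
        __m128i ge = _mm_cmpeq_epi16(_mm_min_epu16(v, vmin), vmin);
        if (_mm_movemask_epi8(ge) != 0xFFFF) {
            // at least one new min in this block: go scalar for these 8 elements
            for (size_t k = i; k < i + 8; ++k) {
                if (p[k] <= curmin) curmin = p[k];
                else res = std::max(res, (unsigned)p[k] - curmin);
            }
            vmin = _mm_set1_epi16((short)curmin);   // re-broadcast the updated min
        } else {
            // no new min: saturating subtract keeps the differences non-negative
            __m128i diff = _mm_subs_epu16(v, vmin);
            vdiffmax = _mm_max_epu16(vdiffmax, diff);
        }
    }
    for (; i < n; ++i) {                        // scalar tail
        if (p[i] <= curmin) curmin = p[i];
        else res = std::max(res, (unsigned)p[i] - curmin);
    }
    // horizontal max of the SIMD accumulator, then combine with the scalar result
    vdiffmax = _mm_max_epu16(vdiffmax, _mm_srli_si128(vdiffmax, 8));
    vdiffmax = _mm_max_epu16(vdiffmax, _mm_srli_si128(vdiffmax, 4));
    vdiffmax = _mm_max_epu16(vdiffmax, _mm_srli_si128(vdiffmax, 2));
    res = std::max(res, (unsigned)(_mm_cvtsi128_si32(vdiffmax) & 0xFFFF));
    return res != 0 ? (int)res : -1;
}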
Unfinished / broken (by me) example adapted from @harold's deleted answer of a fully branchless version that constructs a vector of min-up-to-that-element on the fly, for x86 SSE2.
(@harold wrote it with suffix-max instead of min, which is I think why he deleted it. I partially converted it from max to min.)
A branchless intrinsics version for x86 could look something like this. But branchy is probably better unless you expect some kind of slope or trend that makes new min values frequent.
// BROKEN, see FIXME comments.
// converted from #harold's suffix-max version
int broken_unfinished_maxDiffSSE(const std::vector<uint16_t> &input) {
const uint16_t *ptr = input.data();
// construct suffix-min
// find max-diff at the same time
__m128i min = _mm_set1_epi32(-1);   // all-ones, i.e. 0xFFFF in every u16 lane
__m128i maxdiff = _mm_setzero_si128();
size_t i = input.size();
for (; i >= 8; i -= 8) {
__m128i data = _mm_loadu_si128((const __m128i*)(ptr + i - 8));
// FIXME: need to shift in 0xFFFF, not 0, for min.
// or keep the old data, maybe with _mm_alignr_epi8
__m128i d = data;
// link with suffix
d = _mm_min_epu16(d, _mm_slli_si128(min, 14));
// do suffix-min within block.
d = _mm_min_epu16(d, _mm_srli_si128(d, 2));
d = _mm_min_epu16(d, _mm_shuffle_epi32(d, 0xFA));
d = _mm_min_epu16(d, _mm_shuffle_epi32(d, 0xEE));
min = d;
// update max-diff
__m128i diff = _mm_subs_epu16(data, min); // with saturation to 0
maxdiff = _mm_max_epu16(maxdiff, diff);
}
// horizontal max
maxdiff = _mm_max_epu16(maxdiff, _mm_srli_si128(maxdiff, 2));
maxdiff = _mm_max_epu16(maxdiff, _mm_shuffle_epi32(maxdiff, 0xFA));
maxdiff = _mm_max_epu16(maxdiff, _mm_shuffle_epi32(maxdiff, 0xEE));
int res = _mm_cvtsi128_si32(maxdiff) & 0xFFFF;
uint16_t scalarmin = (uint16_t)_mm_extract_epi16(min, 7); // last element of last vector
for (; i != 0; i--) {
scalarmin = std::min(scalarmin, ptr[i - 1]);
res = std::max(res, ptr[i - 1] - scalarmin);
}
return res != 0 ? res : -1;
}
We could replace the scalar cleanup with a final unaligned vector, if we handle the overlap with the last full vector's min.

Popcount of SSE vectors for binary correlation?

I have this simple binary correlation method. It beats table lookup and HAKMEM bit-twiddling methods by 3-4x, and is 25% better than GCC's __builtin_popcount (which I think maps to a popcnt instruction when SSE4 is enabled).
Here is the much simplified code:
int correlation(uint64_t *v1, uint64_t *v2, int size64) {
    __m128i* a = reinterpret_cast<__m128i*>(v1);
    __m128i* b = reinterpret_cast<__m128i*>(v2);
    int count = 0;
    for (int j = 0; j < size64 / 2; ++j, ++a, ++b) {
        union { __m128i s; uint64_t b[2]; } x;
        x.s = _mm_xor_si128(*a, *b);
        count += _mm_popcnt_u64(x.b[0]) + _mm_popcnt_u64(x.b[1]);
    }
    return count;
}
I tried unrolling the loop, but I think GCC already does this automatically, so I ended up with the same performance. Do you think performance can be further improved without making the code too complicated? Assume v1 and v2 are of the same size and the size is even.
I am happy with its current performance but I was just curious to see if it could be further improved.
Thanks.
Edit: Fixed an error in the union, and it turned out this error was what made this version faster than __builtin_popcount. Anyway, I modified the code again; it is again slightly faster than the builtin now (15%), but I don't think it is worth investing more time in this. Thanks for all the comments and suggestions.
for (int j = 0; j < size64 / 4; ++j, a+=2, b+=2) {
__m128i x0 = _mm_xor_si128(_mm_load_si128(a), _mm_load_si128(b));
count += _mm_popcnt_u64(_mm_extract_epi64(x0, 0))
+_mm_popcnt_u64(_mm_extract_epi64(x0, 1));
__m128i x1 = _mm_xor_si128(_mm_load_si128(a + 1), _mm_load_si128(b + 1));
count += _mm_popcnt_u64(_mm_extract_epi64(x1, 0))
+_mm_popcnt_u64(_mm_extract_epi64(x1, 1));
}
Second Edit: it turned out that the builtin is the fastest after all, sigh, especially with the -funroll-loops and -fprefetch-loop-arrays args. Something like this:
for (int j = 0; j < size64; ++j) {
count += __builtin_popcountll(a[j] ^ b[j]);
}
Third Edit:
This is an interesting SSSE3 parallel 4-bit lookup algorithm. The idea is from Wojciech Muła, the implementation is from Marat Dukhan's answer. Thanks to @Apriori for reminding me of this algorithm. Below is the heart of the algorithm; it is very clever: it basically counts bits per byte using an SSE register as a 16-way lookup table, with the nibbles of each byte as indices into the table. Then it sums the counts.
static inline __m128i hamming128(__m128i a, __m128i b) {
static const __m128i popcount_mask = _mm_set1_epi8(0x0F);
static const __m128i popcount_table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
const __m128i x = _mm_xor_si128(a, b);
const __m128i pcnt0 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(x, popcount_mask));
const __m128i pcnt1 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(_mm_srli_epi16(x, 4), popcount_mask));
return _mm_add_epi8(pcnt0, pcnt1);
}
On my tests this version is on par with using hw popcount: slightly faster on smaller inputs, slightly slower on larger ones. I think this should really shine if it is implemented in AVX. But I don't have time for this; if anyone is up to it, I would love to hear their results.
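For completeness, this is roughly how I would accumulate the per-byte counts that hamming128 (above) returns (my own untested addition, not part of Marat Dukhan's answer): _mm_sad_epu8 against zero sums each 8-byte half into a 64-bit lane, which is then added to a running vector total.
#include <tmmintrin.h>   // SSSE3 (hamming128 uses _mm_shuffle_epi8)
#include <cstdint>

uint64_t correlation_ssse3(const uint64_t *v1, const uint64_t *v2, int size64)
{
    const __m128i *a = reinterpret_cast<const __m128i*>(v1);
    const __m128i *b = reinterpret_cast<const __m128i*>(v2);
    __m128i acc = _mm_setzero_si128();
    for (int j = 0; j < size64 / 2; ++j) {
        __m128i pcnt = hamming128(_mm_loadu_si128(a + j), _mm_loadu_si128(b + j));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(pcnt, _mm_setzero_si128()));
    }
    // add the two 64-bit halves of the accumulator
    return (uint64_t)_mm_cvtsi128_si64(acc)
         + (uint64_t)_mm_cvtsi128_si64(_mm_srli_si128(acc, 8));
}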
The problem is that popcnt (which is what __builtin_popcount compiles to on Intel CPUs) operates on the integer registers. This causes the compiler to issue instructions to move data between the SSE and integer registers. I'm not surprised that the non-SSE version is faster, since the ability to move data between the vector and integer registers is quite limited/slow.
uint64_t count_set_bits(const uint64_t *a, const uint64_t *b, size_t count)
{
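// popcnt() here is assumed to be a thin wrapper around the hardware popcount
// (e.g. _mm_popcnt_u64 or __builtin_popcountll)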
uint64_t sum = 0;
for(size_t i = 0; i < count; i++) {
sum += popcnt(a[i] ^ b[i]);
}
return sum;
}
This runs at approx. 2.36 clocks per loop on small data sets (fits in cache). I think it runs slowly because of the 'long' dependency chain on sum, which restricts the CPU's ability to handle more things out of order. We can improve it by manually pipelining the loop:
uint64_t count_set_bits_2(const uint64_t *a, const uint64_t *b, size_t count)
{
uint64_t sum = 0, sum2 = 0;
for(size_t i = 0; i < count; i+=2) {
sum += popcnt(a[i ] ^ b[i ]);
sum2 += popcnt(a[i+1] ^ b[i+1]);
}
return sum + sum2;
}
This runs at 1.75 clocks per item. My CPU is a Sandy Bridge model (i7-2820QM, fixed @ 2.4 GHz).
How about four-way pipelining? That's 1.65 clocks per item. What about 8-way? 1.57 clocks per item. We can derive that the runtime per item is (1.5n + 0.5) / n where n is the number of pipelines in our loop. I should note that for some reason 8-way pipelining performs worse than the others when the dataset grows; I have no idea why. The generated code looks okay.
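For illustration, the four-way version is just the obvious extension of the two-way code above (same popcnt() helper, count assumed to be a multiple of 4):
uint64_t count_set_bits_4(const uint64_t *a, const uint64_t *b, size_t count)
{
    uint64_t sum = 0, sum2 = 0, sum3 = 0, sum4 = 0;
    for(size_t i = 0; i < count; i+=4) {
        sum  += popcnt(a[i  ] ^ b[i  ]);
        sum2 += popcnt(a[i+1] ^ b[i+1]);
        sum3 += popcnt(a[i+2] ^ b[i+2]);
        sum4 += popcnt(a[i+3] ^ b[i+3]);
    }
    return (sum + sum2) + (sum3 + sum4);
}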
Now if you look carefully, there is one xor, one add, one popcnt, and one mov instruction per item. There is also one lea instruction per loop (and one branch and decrement, which I'm ignoring because they're pretty much free).
$LL3@count_set_:
; Line 50
mov rcx, QWORD PTR [r10+rax-8]
lea rax, QWORD PTR [rax+32]
xor rcx, QWORD PTR [rax-40]
popcnt rcx, rcx
add r9, rcx
; Line 51
mov rcx, QWORD PTR [r10+rax-32]
xor rcx, QWORD PTR [rax-32]
popcnt rcx, rcx
add r11, rcx
; Line 52
mov rcx, QWORD PTR [r10+rax-24]
xor rcx, QWORD PTR [rax-24]
popcnt rcx, rcx
add rbx, rcx
; Line 53
mov rcx, QWORD PTR [r10+rax-16]
xor rcx, QWORD PTR [rax-16]
popcnt rcx, rcx
add rdi, rcx
dec rdx
jne SHORT $LL3@count_set_
You can check with Agner Fog's optimization manual that an lea is half a clock cycle in throughput and that the mov/xor/popcnt/add combo is apparently 1.5 clock cycles, although I don't fully understand why exactly.
Unfortunately, I think we're stuck here. The PEXTRQ instruction is what's usually used to move data from the vector registers to the integer registers, and we can fit this instruction and one popcnt instruction neatly in one clock cycle. Add one integer add instruction and our pipeline is at minimum 1.33 cycles long, and we still need to add a vector load and xor in there somewhere... If Intel offered instructions to move multiple registers between the vector and integer registers at once, it would be a different story.
I don't have an AVX2 CPU at hand (xor on 256-bit vector registers is an AVX2 feature), but my vectorized-load implementation performs quite poorly with low data sizes and reached a minimum of 1.97 clock cycles per item.
For reference, these are my benchmarks (chart omitted):
"pipe 2", "pipe 4" and "pipe 8" are 2-, 4- and 8-way pipelined versions of the code shown above. The poor showing of "sse load" appears to be a manifestation of the lzcnt/tzcnt/popcnt false dependency bug, which gcc avoided by using the same register for input and output. "sse load 2" follows below:
uint64_t count_set_bits_4sse_load(const uint64_t *a, const uint64_t *b, size_t count)
{
uint64_t sum1 = 0, sum2 = 0;
for(size_t i = 0; i < count; i+=4) {
__m128i tmp = _mm_xor_si128(
_mm_load_si128(reinterpret_cast<const __m128i*>(a + i)),
_mm_load_si128(reinterpret_cast<const __m128i*>(b + i)));
sum1 += popcnt(_mm_extract_epi64(tmp, 0));
sum2 += popcnt(_mm_extract_epi64(tmp, 1));
tmp = _mm_xor_si128(
_mm_load_si128(reinterpret_cast<const __m128i*>(a + i+2)),
_mm_load_si128(reinterpret_cast<const __m128i*>(b + i+2)));
sum1 += popcnt(_mm_extract_epi64(tmp, 0));
sum2 += popcnt(_mm_extract_epi64(tmp, 1));
}
return sum1 + sum2;
}
Have a look here. There is an SSSE3 version that beats the popcnt instruction by a lot. I'm not sure but you may be able to extend it to AVX as well.

Calculating the number of bits using the K&R method with infinite memory

I got the answer for the question of counting the number of set bits from here:
How to count the number of set bits in a 32-bit integer?
long count_bits(long n) {
unsigned int c; // c accumulates the total bits set in v
for (c = 0; n; c++)
n &= n - 1; // clear the least significant bit set
return c;
}
It is also simple to understand. I found the best answer to be Brian Kernighan's method, posted by hoyhoy, and he adds the following at the end:
Note that this is a question used during interviews. The interviewer will add the caveat that you have "infinite memory". In that case, you basically create an array of size 2^32 and fill in the bit counts for the numbers at each location. Then, this function becomes O(1).
Can somebody explain how to do this, if I have infinite memory?
The fastest way I have ever seen to populate such an array is ...
array[0] = 0;
for (i = 1; i < NELEMENTS; i++) {
array[i] = array[i >> 1] + (i & 1);
}
Then to count the number of set bits in a given number (provided the given number is less than NELEMENTS) ...
numSetBits = array[givenNumber];
If your memory is not infinite, I often see NELEMENTS set to 256 (one byte's worth), and you add up the number of set bits in each byte of your integer, as sketched below.
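A sketch of that per-byte variant for a 32-bit value (my illustration of the idea; the byte table can be filled with the same array[i >> 1] + (i & 1) loop shown above):
#include <stdint.h>

static uint8_t byte_counts[256];   // bit counts for every possible byte value, filled once at startup

int count_bits_bytes(uint32_t n)
{
    return byte_counts[ n         & 0xFF]
         + byte_counts[(n >> 8)   & 0xFF]
         + byte_counts[(n >> 16)  & 0xFF]
         + byte_counts[(n >> 24)  & 0xFF];
}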
int counts[MAX_LONG];
void init() {
for (int i= 0; i < MAX_LONG; i++)
{
counts[i] = count_bits(i); // as given above
}
}
int count_bits_o1(long number)
{
return counts[number];
}
You could probably pre-populate the array more wisely, i.e. fill it with zeros, then add one to every second index, then add 1 to every fourth index, then to every eighth index, etc., which might be a bit faster, although I doubt it...
Also, you might account for unsigned values.
