Pseudorandom generator in Assembly Language - algorithm

I need a pseudorandom number generator algorithm for a assembler program assigned in a course, and I would prefer a simple algorithm. However, I cannot use an external library.
What is a good, simple pseudorandom number generator algorithm for assembly?

Easy one is to just choose two big relative primes a and b, then keep multiplying your random number by a and adding b. Use the modulo operator to keep the low bits as your random number and keep the full value for the next iteration.
This algorithm is known as the linear congruential generator.

Volume 2 of The Art of Computer Programming has a lot of information about pseudorandom number generation. The algorithms are demonstrated in assembler, so you can see for yourself which are simplest in assembler.
If you can link to an external library or object file, though, that would be your best bet. Then you could link to, e.g., Mersenne Twister.
Note that most pseudorandom number generators are not safe for cryptography, so if you need secure random number generation, you need to look beyond the basic algorithms (and probably should tap into OS-specific crypto APIs).

Simple code for testing, don't use with Crypto
From Testing Computer Software, page 138
With is 32 bit maths, you don't need the operation MOD 2^32
RNG = (69069*RNG + 69069) MOD 2^32

Well - Since I haven't seen a reference to the good old Linear Feedback Shift Register I post some SSE intrinsic based C-Code. Just for completenes. I wrote that thing a couple of month ago to sharpen my SSE-skills again.
#include <emmintrin.h>
static __m128i LFSR;
void InitRandom (int Seed)
{
LFSR = _mm_cvtsi32_si128 (Seed);
}
int GetRandom (int NumBits)
{
__m128i seed = LFSR;
__m128i one = _mm_cvtsi32_si128(1);
__m128i mask;
int i;
for (i=0; i<NumBits; i++)
{
// generate xor of adjecting bits
__m128i temp = _mm_xor_si128(seed, _mm_srli_epi64(seed,1));
// generate xor of feedback bits 5,6 and 62,61
__m128i NewBit = _mm_xor_si128( _mm_srli_epi64(temp,5),
_mm_srli_epi64(temp,61));
// Mask out single bit:
NewBit = _mm_and_si128 (NewBit, one);
// Shift & insert new result bit:
seed = _mm_or_si128 (NewBit, _mm_add_epi64 (seed,seed));
}
// Write back seed...
LFSR = seed;
// generate mask of NumBit ones.
mask = _mm_srli_epi64 (_mm_cmpeq_epi8(seed, seed), 64-NumBits);
// return random number:
return _mm_cvtsi128_si32 (_mm_and_si128(seed,mask));
}
Translating this code to assembler is trivial. Just replace the intrinsics with the real SSE instructions and add a loop around it.
Btw - the sequence this code genreates repeats after 4.61169E+18 numbers. That's a lot more than you'll get via the prime method and 32 bit arithmetic. If unrolled it's faster as well.

#jjrv
What you're describing is actually a linear congrential generator. The most random bits are the highest bits. To get a number from 0..N-1 you multiply the full value by N (32 bits by 32 bits giving 64 bits) and use the high 32 bits.
You shouldn't just use any number for a (the multiplier for progressing from one full value to the next), the numbers recommended in Knuth (Table 1 section 3.3.4 TAOCP vol 2 1981) are 1812433253, 1566083941, 69069 and 1664525.
You can just pick any odd number for b. (the addition).

Why not use an external library??? That wheel has been invented a few hundred times, so why do it again?
If you need to implement an RNG yourself, do you need to produce numbers on demand -- i.e. are you implementing a rand() function -- or do you need to produce streams of random numbers -- e.g. for memory testing?
Do you need an RNG that is crypto-strength? How long does it have to go before it repeats? Do you have to absolutely, positively guarantee uniform distribution of all bits?
Here's simple hack I used several years ago. I was working in embedded and I needed to test RAM on power-up and I wanted really small, fast code and very little state, and I did this:
Start with an arbitrary 4-byte constant for your seed.
Compute the 32-bit CRC of those 4 bytes. That gives you the next 4 bytes
Feed back those 4 bytes into the CRC32 algorithm, as if they had been appended. The CRC32 of those 8 bytes is the next value.
Repeat as long as you want.
This takes very little code (although you need a table for the crc32 function) and has very little state, but the psuedorandom output stream has a very long cycle time before it repeats. Also, it doesn't require SSE on the processor. And assuming you have the CRC32 function handy, it's trivial to implement.

Using masm615 to compiler:
delay_function macro
mov cx,0ffffh
.repeat
push cx
mov cx,0f00h
.repeat
dec cx
.until cx==0
pop cx
dec cx
.until cx==0
endm
random_num macro
mov cx,64 ;assum we want to get 64 random numbers
mov si,0
get_num:
push cx
delay_function ;since cpu clock is fast,so we use delay_function
mov ah,2ch
int 21h
mov ax,dx ;get clock 1/100 sec
div num ;assume we want to get a number from 0~num-1
mov arry[si],ah ;save to array you set
inc si
pop cx
loop get_num ;here we finish the get_random number

also you probably can emulate shifting register with XOR sum elements between separate bits, which will give you pseudo-random sequence of numbers.

Linear congruential (X = AX+C mod M) PRNG's might be a good one to assign for an assembler course as your students will have to deal with carry bits for intermediate AX results over 2^31 and computing a modulus. If you are the student they are fairly straightforward to implement in assembler and may be what the lecturer had in mind.

Related

Uniform random bit from a mask

Suppose I have a 64 bit unsigned integer (u64) mask, with one or more bits set.
I want to select one of the set bits uniformly at random from m to give a new mask x such that x & mask has one bit set. Some pseudocode that does this might be:
def uniform_random_bit_from_mask(mask):
assert mask > 0
set_indices = get_set_indices(mask)
random_index = uniform_random_choice(set_indices)
new_mask = set_bit(random_index, 0)
return new_mask
However I need to do this as fast as possible (code similar to the above in a low-level language is slowing a hot loop). Does anyone have a more efficient scheme?
The details how to optimize this depend on several factors you did not specify – the target architecture, the expected number of set bits in the mask, the language you want to use, the requirements on the randomness and many more. Without knowing further details, it's hard to give a useful answer, but I'll give a few hints that may prove useful anyway.
Most modern architectures have an instruction to count the number of set bits in an integer, generally called "popcount", and this instruction is exposed in most low-level languages. In Rust, you can use the count_ones() method. This gives you the total number k of bits to select from.
You can then generate a random number i between 0 and k - 1 (inclusive). The next step is to select the ith set bit in mask. An efficient approach to do so is this loop (Rust code):
for _ in 0..i {
mask &= mask - 1;
}
let new_mask = 1 << mask.trailing_zeros();
The loop clears the least significant set bit in each iteration. Since i < k, we know that mask can't be zero after the loop. The last line generates a new mask from the least significant bit of mask that is still set.
On common architectures, it is likely that the bottleneck will be the random number generator. If you are using Rust's rand crate, you can use SmallRng for improved performance, at the cost of being cryptographically insecure, which may not be relevant for your use case.

Iterating over bits in FPGA

Now I'm trying to figure out best method for iterating over bits in FPGA. I'm using some variation of fast powering algorithm, a.k.a exponentiation by squaring (more precisely it's doubling and add algorithm for elliptic curve mathematics). To implement it on hardware, I know I must use FSM which does iteration. My problem is how to properly "handle" moving from bit to bit. My first thought was to switch order of bytes, but when my k = 17 is 32bit, I must discard first 27 bits, so it's rather stupid idea. Another concept was with "moving" 0001000 pattern and bitwise & it with number, but it also requires to find first nonzero bit.
TL&DR
Got for example k = 17 (32bits, so: 17x0 10001) and want to iterate 5 times (that means I start iteration on first "real" bit of number) knowing each bit I iterate over.
Language doesn't matter - I need only the algorithm, not solution in specific language. However, if it is easily done in Verilog, I wouldn't mind. :P
A dedicated combinatorial circuit to find the first nonzero bit, shift it to the first position and tell you the shift amount should be fairly light on resources.
In principle, the compiler should be able to find this solution on its own and improve on it:
if none of the top 16 bits are set, set bit 4 of the shift amount, and shift by 16.
if none of the top 8 bits are set, set bit 3 of the shift amount, and shift by 8.
...
The compiler should be able to find further optimizations on this.
Don not code for FPGA but still:
rewrite algorithm to iterate number x from LSB to MSB
then in each iteration bit shift x right by 1 bit
stop if x==0.
this way you have bit-scan inside your main loop and do not need additional cycles for it.
x!=0 is done easily by ORing all its bits together
C++ code example:
DWORD x = ...;
for (; x != 0; x >>= 1)
{
//here is your iteration loop stuff like:
if (DWORD(x & 1) !=0 ) ...;
}
Something like:
always # *
casex(num)
8XXX_XXXX: k = 32;
4XXX_XXXX: k = 31;
2XXX_XXXX: k = 30;
...
Should give you the value of k.
You can have a shift register which can be parallel loaded so you can write a 1 to the kth bit, so you know when your iterations have ended.
If you loop from 0 to 31 and discard the 27 leading zeros...you aren't necessarily wasting cycles. Depends on whether you've surrounded this with a synchronous process, or a asynchronous one.
One gives you a rather small clocked circuit with a 32 clock latency.
The other gives you a giant rats nest of ANDs and ORs which won't run at a very high frequency.
Depends on what you want. Remember though, that even if you do decide to loop over 32 clocks, you can PIPELINE it such that you start a new calculation every clock. It might take you 32 clocks to get an answer, but you CAN do them at high speed.

Addition efficiency proportional to size of operands?

Question that just popped into my head, and I don't think I've seen an answer on here. Is the time taken by a binary addition algorithm, proportional to the size of the operands?
Obviously, adding 1101011010101010101101010 and 10110100101010010101 is going to take longer than 1 + 1, but my question refers more to the smaller values. Is there a negligible difference, no difference, a theoretical difference?
At what point, with these sorts of rudimentary calculations should we start looking into more efficient methods of calculation? ie: Exponentiation by squaring with large exponents for calculating huge powers.
How we see the binary patterns...
1101011010101010101101010 (big)
10110100101010010101 (medium)
1 (small)
How a 32bit computer sees the binary patterns...
00000001101011010101010101101010 32bit,
00000000000010110100101010010101 32bit,
00000000000000000000000000000001 i'm lovin it
On a 32bit system, all the above numbers will take the same time (no. of CPU instructions) to be added. As all of them fit within the basic computational block i.e. the 32bit CPU register.
How a 16bit computer sees the binary patterns...
1
+1 = ?
0000000000000001 i'm lovin it
0000000000000001 i'm lovin it
00000001101011010101010101101010
+00000000000010110100101010010101 = ?
00000001101011010101010101101010 too BIG for me!
00000000000010110100101010010101 too BIG for me!
On a 16bit system, as the larger numbers will NOT fit in a 16bit register, it will need an additional pass(to add the significant bits that remain after the first 16LSBs are added).
Step1: ADD Least significant bits
0101010101101010
0100101010010101
Step2: ADD the rest (remember carry bit from previous operation)
000000000000000C
0000000110101101
0000000000001011
We can start thinking of optimising the mathematical operations on
numbers once the numbers no longer fit in the basic computation unit
of the system i.e. the CPU-register.
Modern hardware architectures are developed keeping this in mind and support SIMD instructions. Compilers will often employ them (SSE on x86, NEON on ARM) when they see such a case being made i.e. 128bit decryption logic being run on a 32bit system.
Also instead of checking ONLY the size of the operands, the size of the result also determines whether the system can accomplish the mathematical operation within one step. Not only the operands involved, but the operation being performed needs to be taken into consideration as well.
For example, on a 32bit system, adding two 30bit numbers can be definitely carried out using the regular operations as the result is guaranteed to NOT exceed a 32bit register. But multiplying the same two 30bit numbers may result in a number that does NOT fit within 32bits.
In the absence of such a guarantee of being able to store the result in a single computational unit, to ensure validity of the result for all possible values, the architecture(and the compiler) must :
go the long way i.e. multi-step mathematical operations
or
employ SIMD optimisations
or
define and implement custom mechanisms
(like register-pairs EDX:EAX to hold the result on x86)
In practice, there's no (or completely negligible) difference between adding different integers that fit in the processor words as that should always be a fixed-time operation.
In theory, the complexity for adding two unsigned integers should be O(log(n)) where n is the bigger of the two. As such, you need to go pretty high before mere additions become a problem.
As for where exactly to draw the line between simple and complex algorithms for computing numbers, I don't have an exact answer. However, the GMP library comes to mind. From what I understand, they've carefully chosen their algorithms and under what circumstances to use each in terms of performance. You may want to look into what they did.
I somewhat disagree with the above answers. It very much depends on the context.
For simple integer arithmetic (for loop counters etc), then on 64bit machines that computation will be done using 64bit general purpose registers (RSI/RCX/etc). In those cases, there is no difference in speed between an 8bit or 64bit addition.
If however you are processing arrays of integers, and assuming the compiler has been able to optimise the code nicely, then yes, smaller is faster (but not for the reason you think).
In the AVX2 instruction set, you have access to 4 integer addition instructions:
__m256i _mm256_add_epi8 (__m256i a, __m256i b); // 32 x 8bit
__m256i _mm256_add_epi16(__m256i a, __m256i b); // 16 x 16bit
__m256i _mm256_add_epi32(__m256i a, __m256i b); // 8 x 32bit
__m256i _mm256_add_epi64(__m256i a, __m256i b); // 4 x 64bit
You'll notice that all of them operate on 256bits at a time, which means you can process 4 integer additions if you're using 64bit, compared to 32 additions if you are using 8bit integers. (As mentioned above, you'd need to make sure you have enough precision). They all take the same number of clock cycles to compute - 1clk.
There are also other effects of using smaller data types, which are mainly better CPU cache usage, and a reduced number of memory reads/writes.
However, back to your original question on bit-by-bit computation. Prior to the new AVX-512 instruction set, it might not have seemed a little silly. However, the new instruction set contains a ternary logic instruction. With this instruction, it is possible to compute 512 additions on numbers of any bit length fairly easily.
inline __m512i add(__m512i x, __m512i x, __m512i carry_in)
{
return _mm512_ternarylogic_epi32(carry_in, y, x, 0x96);
}
inline __m512i adc(__m512i x, __m512i x, __m512i carry_in)
{
return _mm512_ternarylogic_epi32(carry_in, y, x, 0xE8);
}
__m512i A[NUM_BITS];
__m512i B[NUM_BITS];
__m512i RESULT[NUM_BITS];
__m512i CARRY = _mm512_setzero_ps();
for(int i = 0; i < NUM_BITS; ++i)
{
RESULT[i] = add(A[i], B[i], CARRY);
CARRY = adc(A[i], B[i], CARRY);
}
In this particular example (which to be honest, probably has very limited real world usage!), The time it takes to perform the 512 additions, is indeed directly proportional to NUM_BITS.

Finding seeds for a 5 byte PRNG

An old idea, but ever since then I couldn't get around finding some reasonably good way to solve the problem it raised. So I "invented" (see below) a very compact, and in my opinion, reasonably well performing PRNG, but I can't get to figure out algorithms to build suitable seed values for it at large bit depths. My current solution is simply brute-forcing, it's running time is O(n^3).
The generator
My idea came from XOR taps (essentially LFSRs) some old 8bit machines used for sound generation. I fiddled with XOR as a base on a C64, tried to put together opcodes, and experienced with the result. The final working solution looked like this:
asl
adc #num1
eor #num2
This is 5 bytes on the 6502. With a well chosen num1 and num2, in the accumulator it iterates over all 256 values in a seemingly random order, that is, it looks reasonably random when used to fill the screen (I wrote a little 256b demo back then on this). There are 40 suitable num1 & num2 pairs for this, all giving decent looking sequences.
The concept can be well generalized, if expressed in pure C, it may look like this (BITS being the bit depth of the sequence):
r = (((r >> (BITS-1)) & 1U) + (r << 1) + num1) ^ num2;
r = r & ((1U<<BITS)-1U);
This C code is longer since it is generalized, and even if one would use the full depth of an unsigned integer, C wouldn't have the necessary carry logic to transfer the high bit of the shift to the add operation.
For some performance analysis and comparisons, see below, after the question(s).
The problem / question(s)
The core problem with the generator is finding suitable num1 and num2 which would make it iterate over the whole possible sequence of a given bit depth. At the end of this section I attach my code which just brute-forces it. It will finish in reasonable time for up to 12 bits, you may wait for all 16 bits (there are 5736 possible pairs for that by the way, acquired with an overnight full search a while ago), and you may get a few 20 bits if you are patient. But O(n^3) is really nasty...
(Who will get to find the first full 32bit sequence?)
Other interesting questions which arise:
For both num1 and num2 only odd values are able to produce full sequences. Why? This may not be hard (simple logic, I guess), but I never reasonably proved it.
There is a mirroring property along num1 (the add value), that is, if 'a' with a given 'b' num2 gives a full sequence, then the 2 complement of 'a' (in the given bit depth) with the same num2 is also a full sequence. I only observed this happening reliably with all the full generations I calculated.
A third interesting property is that for all the num1 & num2 pairs the resulting sequences seem to form proper circles, that is, at least the number zero seems to be always part of a circle. Without this property my brute force search would die in an infinite loop.
Bonus: Was this PRNG already known before? (and I just re-invented it)?
And here is the brute force search's code (C):
#define BITS 16
#include "stdio.h"
#include "stdlib.h"
int main(void)
{
unsigned int r;
unsigned int c;
unsigned int num1;
unsigned int num2;
unsigned int mc=0U;
num1=1U; /* Only odd add values produce useful results */
do{
num2=1U; /* Only odd eor values produce useful results */
do{
r= 0U;
c=~0U;
do{
r=(((r>>(BITS-1)) & 1U)+r+r+num1)^num2;
r&=(1U<<(BITS-1)) | ((1U<<(BITS-1))-1U); /* 32bit safe */
c++;
}while (r);
if (c>=mc){
mc=c;
printf("Count-1: %08X, Num1(adc): %08X, Num2(eor): %08X\n", c, num1, num2);
}
num2+=2U;
num2&=(1U<<(BITS-1)) | ((1U<<(BITS-1))-1U);
}while(num2!=1U);
num1+=2U;
num1&=((1U<<(BITS-1))-1U); /* Do not check complements */
}while(num1!=1U);
return 0;
}
This, to show it is working, after each iteration will output the pair found if it's sequence length is equal or longer than the previous. Modify the BITS constant for sequences of other depths.
Seed hunting
I did some graphing relating to the seeds. Here is a nice image showing all the 9bit sequence lengths:
The white dots are the full length sequences, X axis is for num1 (add), Y axis is for num2 (xor), the brighter the dot, the longer the sequence. Other bit depth look very similar in pattern: they all seem to be broken up to sixteen major tiles with two patterns repeating with mirroring. The similarity of the tiles is not complete, for example above a diagonal from the up-left corner to the bottom-right is clearly visible while it's opposite is absent, but for the full-length sequences this property seems to be reliable.
Relying on this it is possible to reduce the work even more than by the previous assumptions, but that's still O(n^3)...
Performance analysis
As of current the longest sequences possible to be generated are 24bits: on my computer it takes at about 5 hours to brute-force a full 24bit sequence for this. This is still just so-so for real PRNG tests such as Diehard, so as of now I rather gone by an own approach.
First it's important to understand the role of the generator. This by no means would be a very good generator for it's simplicity, it's goal is rather to produce decent numbers blazing fast. On this region not needing multiply / divide operations, a Galois LFSR can produce similar performance. So my generator is of any use if it is capable to outperform this one.
The test I performed were all of 16bit generators. I chose this depth since it gives an useful sequence length while the numbers may still be broken up in two 8bit parts making it possible to present various bit-exact graphs for visual analysis.
The core of the tests were looking for correlations along previous and currently generated numbers. For this I used X:Y plots where the previous generation was the Y, the current the X, both broken up to low / high parts as above mentioned for two graphs. I created a program capable of plotting these stepped in real time so to also make it possible to roughly examine how the numbers follow each other, how the graphs fill up. Here obviously only the end results are shown as the generators ran through their full 2^16 or 2^16-1 (Galois) cycle.
The explanation of the fields:
The images consist 8x2 256x256 graphs making the total image size 2048x512 (check them at original size).
The top left graph just confirms that indeed a full sequence was plotted, it is simply an X = r % 256; Y = r / 256; plot.
The bottom left graph shows every second number only plotted the same way as the top, just confirming that the numbers occur reasonably randomly.
From the second graph the top row are the high byte correlation graphs. The first of them uses the previous generation, the next skips one number (so uses 2nd previous generation), and so on until the 7th previous generation.
From the second the bottom row are the low byte correlation graphs, organized the same way as above.
Galois generator, 0xB400 tap set
This is the generator found in the Wikipedia Galois example. It's performance is not the worst, but it is still definitely not really good.
Galois generator, 0xA55A tap set
One of the decent Galois "seeds" I found. Note that the low part of the 16bit numbers seem to be a lot better than the above, however I couldn't find any Galois "seed" which would fuzz up the high byte.
My generator, 0x7F25 (adc), 0x00DB (eor) seed
This is the best of my generators where the high byte of the EOR value is zero. Limiting the high byte is useful on 8bit machines since then this calculation can be omitted for smaller code and faster execution if the loss of randomness performance is affordable.
My generator, 0x778B (adc), 0x4A8B (eor) seed
This is one of the very good quality seeds by my measurements.
To find seeds with good correlation, I built a small program which would analyse them to some degree, the same way for Galois and mine. The "good quality" examples were pinpointed by that program, and then I tested several of them and selected one from those.
Some conclusions:
The Galois generator seems to be more rigid than mine. On all the correlation graphs definite geometrical patterns are observable (some seeds produce "checkerboard" patterns, not shown here) even if it is not composed of lines. My generator also shows patterns, but with more generations they grow less defined.
A portion of the Galois generator's result which include the bits in the high byte seems to be inherently rigid which property seems to be absent from my generator. This is a weak assumption yet probably needing some more research (to see if this is always so with the Galois generator and not with mine on other bit combinations).
The Galois generator lacks zero (maximal period being 2^16-1).
As of now it is impossible to generate a good set of seeds for my generator above 20 bits.
Later I might get in this subject deeper seeking to test the generator with Diehard, but as of now the lack of the ability of generating large enough seeds for it makes it impossible.
This is some form of a non-linear shift feedback register. I don't know if it has been used as such, but it resembles linear shift feedback registers somewhat. Read this Wikipedia page as an introduction to LSFRs. They are used frequently in pseudo random number generation.
However, your pseudo random number generator is inherently bad in that there is a linear correlation between the highest order bit of a previously generated number and the lowest order bit of a number generated next. You shift the highest bit B out, and then the lowest order bit of the new number will be the XOR or B, the lowest order bit of the additive constant num1 and the lowest order bit of the XORed constant num2, because binary addition is equivalent to exclusive or at the lowest order bit. Most likely your PRNG has other similar deficiencies. Creating good PRNGs is hard.
However, I must admit that the C64 code is pleasingly compact!

Sum reduction of binary sequence

Consider a binary sequence:
11000111
I have to find sum of this series (actually in parallel)
Sum =1+1+0+0+0+1+1+1= 5
This is a waste of resource as why invest time in adding 0s?
Is there any clever way to sum this sequence so I can avoid unnecessary additions?
Operate at the byte level rather than the bit level. Use a small LUT to convert a byte to a population count. That way you're only doing one lookup and one add per 8 bits. Unless your data is likely to be very sparse this should be quite efficient.
Well it depends on how you store your bitset.
If it's an array, then you can't do more than a plain for. If you want to do this in parallel, just split the array in chunks and process them concurrently.
If we are talking about a bitset (storing the bits in a native (32/64-bit) integer type), then the simplest way to count bits would be this one:
int bitset;
int s = 0;
for (; bitset; s++)
bitset &= bitset-1;
This removes the last bit of 1 at every step, so you have O(s).
Of course, you can combine these two methods if you need more than 32/64 bits
I dunno why people are answering, not even looking into link from the 1st comment to the question. You can easily make it under O(size_of_bitset). At lewast when it comes to constant factor.
You could use this method (found in link by J.F. Sebastian):
inline int count_bits(int num){
int sum = 0;
for (; bitset; sum++) bitset &= bitset-1;
return sum;
}
int main (void){
int array[N];
int total_sum = 0;
#pragma omp parallel for reduction(+:total_sum)
for (size_t i = 0; i < N, i++){
total_sum += count_bits(array[i]);
}
}
This will count number of bits in memory range of array in parallel. The inline is important to avoid unnecessary copying, also the compiler should optimize it much better.
You can swap the count_bits with anything better that counts bits in an integer to get faster if you find anything. This version has complexity of O(bits_set) (not size of the bit set!).
Invoking the parallel construct will introduce quite a lot of overhead compared to a single summation that it does need to be quite large to compensate.
The parallelism is done via OpenMP. The partial sum of each thread is summed at the end of the parallel loop and stored in total_sum. Note the total_sum will be private inside the loop for each thread reduction due to reduction clause.
You could alter the code to make it count bits set in arbitrary memory region but it is quite important for it to be memory aligned when you perform operations on such low level.
As far as I can see, it would be wasteful to try to handle the zeros specially. As #bdares said, addition is really cheap. At a minimum, you'll need to execute N instructions to sum up the an N-bit sequence, that would be if you unconditionally sum ever bit. If you add a test to see whether the bit is a 0 or 1, that's another instruction that needs to be executed for each bit. Even if there's no branch penalty, you're executing minimum 1 instruction for every bit (the conditional test), and then you're also executing the original instruction (the add) for any bits that are equal to 1. So even without branch penalty, this takes more time to execute.
#bdares mentions that the compiler will optimize out the branches, but that's only if the value of each bit is known at compile time, and if you know the values of the bits at compile time, you should just add them up yourself in advance.
There might be some cute things you can do with bit twiddling. For instance, if you take the bits two at a time you're adding up values of 0, 1, 2, or 3, and only have half as many additions to do. There may by something you can then do with the result to convert it into the value you want, but I haven't actually thought about how to do that.

Resources