Population Count specific algorithm explanation/Implementation on Assembly

I found an algorithm for population count which goes like this:
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v
for (c = 0; v; c++)
{
v &= v - 1; // clear the least significant bit set
}
My question is: is it possible to implement this in assembly (MIPS)? I don't understand how the "for" loop works; could anyone explain what the loop condition is (I suspect it's 0 < v)? There is another question about this algorithm, but it does not explain the algorithm at instruction-level depth:
Cast some light on population count algorithm
Side note: my homework is to implement a popcount subroutine on MIPS that counts the set bits of a 32-bit integer, and I am not allowed to use multiplication or division of any kind. I hope my question is not a duplicate.
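For reference, in C the bare condition v in for (c = 0; v; c++) means v != 0, which for an unsigned value is the same test as 0 < v. The following is a minimal Python sketch of the same loop, assuming a 32-bit value (the name popcount32 is made up for illustration). Each step needs only an AND, a subtract-by-one and a branch-if-not-zero, so no multiplication or division is involved:

```python
def popcount32(v: int) -> int:
    """Count set bits by repeatedly clearing the lowest set bit."""
    v &= 0xFFFFFFFF          # assumption: treat the input as a 32-bit unsigned value
    c = 0
    while v != 0:            # same test as the C condition `v`
        v &= v - 1           # v - 1 flips the lowest set bit and everything below it
        c += 1               # one iteration per set bit
    return c


print(popcount32(0b1101))    # 3
```

The loop runs once per set bit, so at most 32 times for a 32-bit input.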

Related

PyOpenCL - Multi-dimensional reduction kernel

I'm a total newbie to OpenCL.
I'm trying to code a reduction kernel that sums along one axis for a multi-dimensional array. I have stumbled upon that code which comes from here: https://tmramalho.github.io/blog/2014/06/16/parallel-programming-with-opencl-and-python-parallel-reduce/
__kernel void reduce(__global float *a, __global float *r, __local float *b) {
uint gid = get_global_id(0);
uint wid = get_group_id(0);
uint lid = get_local_id(0);
uint gs = get_local_size(0);
b[lid] = a[gid];
barrier(CLK_LOCAL_MEM_FENCE);
for(uint s = gs/2; s > 0; s >>= 1) {
if(lid < s) {
b[lid] += b[lid+s];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if(lid == 0) r[wid] = b[lid];
}
I don't understand the for loop part. I get that uint s = gs/2 means that we split the array in half, but then it is a complete mystery. Without understanding it, I can't really implement another version for taking the maximum of an array for instance, even less for multi-dimensional arrays.
Furthermore, as far as I understand, the reduce kernel needs to be rerun another time if "N is bigger than the number of cores in a single unit".
Could you give me further explanations on that whole piece of code? Or even guidance on how to implement it for taking the max of an array?
Complete code can be found here: https://github.com/tmramalho/easy-pyopencl/blob/master/008_localreduce.py

Your first question about the meaning of the for loop:
for(uint s = gs/2; s > 0; s >>= 1)
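It means that s starts at half the local size gs and is halved on every iteration (the shift s >>= 1 is equivalent to s = s/2) for as long as s > 0; the last iteration runs with s = 1, after which s becomes 0 and the loop stops. On each iteration, the first s work-items each add the element s positions further along to their own element, so after the last iteration b[0] holds the sum of the whole work-group. This algorithm depends on the local size being a power of 2; otherwise you'd have to deal separately with the elements in excess of a power of 2, or you'd have to pad your array with neutral values for the reduction (0 for a sum) up to a power-of-2 size.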
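To make the halving pattern concrete, here is a minimal sequential Python sketch that simulates what one work-group does with its local buffer b. The function name and the combine parameter are assumptions for illustration; the inner loop stands in for the work-items with lid < s, and finishing it plays the role of the barrier:

```python
def work_group_reduce(values, combine):
    """Simulate the local-memory tree reduction of a single work-group."""
    b = list(values)                      # local buffer, one slot per work-item
    gs = len(b)                           # local size
    if gs == 0 or gs & (gs - 1) != 0:
        raise ValueError("local size must be a power of 2")
    s = gs // 2
    while s > 0:
        for lid in range(s):              # only work-items with lid < s are active
            b[lid] = combine(b[lid], b[lid + s])
        s >>= 1                           # halve the active range
    return b[0]                           # work-item 0 holds the group result


data = [3, 1, 4, 1, 5, 9, 2, 6]
print(work_group_reduce(data, lambda x, y: x + y))  # 31
print(work_group_reduce(data, max))                 # 9
```

Swapping the combining operation is all that changes between sum and max; in the kernel that would correspond to replacing the += line with something like b[lid] = fmax(b[lid], b[lid+s]).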
As for your second concern, about N being bigger than the capacity of your GPU, you are right: you have to run your reduction in portions that fit and then merge the partial results.
Finally, when you ask for guidance on how to implement a reduction to get the max of an array, I would suggest the following:
For a simple reduction like max or sum, try using numpy, especially if you are dealing with programming the reduction by axis.
If you think that the GPU would give you an advantage, try first using pyopencl's Multidimensional Array functionality, e.g. max.
If the reduction is more math intensive, try using pyopencl's Parallel Algorithms, e.g. reduction
I think that the whole point of using pyopencl is to avoid dealing with the underlying GPU's architecture. Otherwise, it is easier to deal with CUDA or HIP directly instead of OpenCL.

Finding all numbers which are less than a given no with less no of set bits than the given no but in the same position

How can I find all the numbers that are less than a given number and have fewer set bits than it, where every set bit of a generated number is at a position that is also set in the given number? For example, if the given number is 13 (1101 in binary), the generated numbers are 12 (1100 in binary), 9 (1001 in binary), 8 (1000 in binary), 5 (0101 in binary), 4 (0100 in binary) and 1 (0001 in binary).
For instance, in 1100 the set bits are at positions 2 and 3, both of which are also set in the given number 1101.
I want an efficient algorithm for this.

#include <vector>

std::vector<int> subset(int x) {
    std::vector<int> res;
    // i uses only bits of x exactly when masking it with x changes nothing
    for (int i = 1; i < x; ++i)
        if (i == (i & x))
            res.push_back(i);
    return res;
}
Its complexity is optimal in the worst case (if x = 2^k-1).

This seems to be equivalent to enumerating the subsets of the set defined by the set bits. This is answered here: https://stackoverflow.com/a/29043170/378360
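One well-known way to enumerate those subsets directly, without testing every integer below x, is to step from each submask to the next smaller one with (sub - 1) & x. A minimal Python sketch follows; the function name is made up for illustration, and it skips x itself and 0 to match the example in the question:

```python
def proper_submasks(x: int):
    """Return every non-zero submask of x other than x, in decreasing order."""
    result = []
    sub = (x - 1) & x            # largest submask strictly below x
    while sub > 0:
        result.append(sub)
        sub = (sub - 1) & x      # drop to the next smaller submask
    return result


print(proper_submasks(13))       # [12, 9, 8, 5, 4, 1]
```

The loop body runs once per generated number, so the work is proportional to the size of the output rather than to x.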

Dynamic Programming +Bit-Masks

I recently learned bit manipulation for competitive programming, so I'm quite new to the concept. I also read many tutorials on bitmasking + dynamic programming on HackerEarth, CodeChef and others.
I also solved a couple of problems on CodeChef, including this one problem,
and I have a couple of doubts regarding bitmasks after going through some questions.
The problems I solved were mostly focused on manipulating subsets, but I wonder how to work on permutations with bitmasks, i.e. when I have to work on a state where all bits in the mask need to be set.
For example: given a number A, how many of the numbers formed by arranging all digits of A are divisible by a given number B (where A, B <= 10**6)? How can this be done with bitmasks? (I hope this can be done with bitmask + DP.)
If A = 514 and B = 2
The question expects the answer to be
514
154
Which are both divisible by 2 .
So the answer is 2.
With the knowledge I have: 514 and 154 correspond to the same mask 111, where all bits are set. So how do I use bitmasks here, when the mask is the same for two or more answers? (I hope you understand this.)
Also, since there can be n! permutations of the digits, it is impossible to allocate n!*n memory for even moderately large n. How can this problem be done using bitmasks, where we need only (2**n)*n space (if I'm not wrong)?
So how do I approach the above problem iteratively? Or is there a DP state equation that I can understand? I couldn't understand the recursive approach of some similar problems I read.
I also tried to think about a similar problem, TSHIRTS, but I couldn't understand the logic behind the recursion.

You don't actually need DP for this one, but you can use bit manipulation nicely :) Since A <= 10^6, A has at most 7 digits, so you only have to check at most 7! = 5040 orderings of its digits.
const int A = 514;
const int B = 2;
vector <int> v; //contains digits of A (e.g. 5, 1, 4) this can be done before the recursive function in a while loop.
int rec(int mask, int current_number){
if(mask == (1 << v.size()) - 1){ //no digit left to pick
if(current_number % B == 0) return 1;
else return 0;
}
int ret = 0;
for(int i = 0; i < v.size(); i++){
if(mask & (1 << i)) continue; //this is already picked
ret += rec(mask | (1 << i), current_number * 10 + v[i]);
}
return ret;
}
Note that the reason I didn't use DP here is that current_number might differ even if mask is the same, so you can't actually say that the situation has been repeated, unless you memoize on mask AND current_number, which requires much more space.
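If a DP formulation is still wanted, one common variation (not what the code above does) is to memoize on the mask together with the remainder of current_number modulo B instead of the full number, since divisibility by B depends only on that remainder. That gives at most 2^n * B states. A minimal Python sketch under that assumption follows; the function name is made up for illustration, and like the recursion above it counts orderings of digit positions, so repeated digits are counted more than once:

```python
from functools import lru_cache


def count_divisible_arrangements(a: int, b: int) -> int:
    """Count orderings of the digits of a whose value is divisible by b."""
    digits = [int(ch) for ch in str(a)]
    n = len(digits)
    full = (1 << n) - 1                    # mask with every digit picked

    @lru_cache(maxsize=None)
    def rec(mask: int, rem: int) -> int:
        if mask == full:                   # no digit left to pick
            return 1 if rem == 0 else 0
        total = 0
        for i in range(n):
            if mask & (1 << i):            # digit i is already used
                continue
            total += rec(mask | (1 << i), (rem * 10 + digits[i]) % b)
        return total

    return rec(0, 0)


print(count_divisible_arrangements(514, 2))  # 2
```

Two partial orderings that use the same digits and leave the same remainder are now the same state, which is what lets the memoization pay off.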

Integer division without using the / or * operator

I am going through an algorithms and data structures textbook and came across this question:
1-28. Write a function to perform integer division without using
either the / or * operators. Find a fast way to do it.
How can we come up with a fast way to do it?

I like this solution: https://stackoverflow.com/a/34506599/1008519, but I find it somewhat hard to reason about (especially the |-part). This solution makes a little more sense in my head:
var divide = function (dividend, divisor) {
// Handle 0 divisor
if (divisor === 0) {
return NaN;
}
// Handle negative numbers
var isNegative = false;
if (dividend < 0) {
// Change sign
dividend = ~dividend+1;
isNegative = !isNegative;
}
if (divisor < 0) {
// Change sign
divisor = ~divisor+1;
isNegative = !isNegative;
}
/**
* Main algorithm
*/
var result = 1;
var denominator = divisor;
// Double denominator value with bitwise shift until bigger than dividend
while (dividend > denominator) {
denominator <<= 1;
result <<= 1;
}
// Subtract divisor value until denominator is smaller than dividend
while (denominator > dividend) {
denominator -= divisor;
result -= 1;
}
// If one of dividend or divisor was negative, change sign of result
if (isNegative) {
result = ~result+1;
}
return result;
}
We initialize our result to 1 (since we are going to double our denominator until it is at least as big as the dividend)
Double our denominator (with bitwise shifts), doubling result along with it, until it is at least as big as the dividend
Since our denominator is now at least as big as our dividend, we can subtract the divisor from it, decrementing result each time, until it is no bigger than the dividend
Return result: denominator is now the largest multiple of the divisor that does not exceed the dividend, and result is the number of times the divisor fits into it
Here are some test runs:
console.log(divide(-16, 3)); // -5
console.log(divide(16, 3)); // 5
console.log(divide(16, 33)); // 0
console.log(divide(16, 0)); // NaN
console.log(divide(384, 15)); // 25
Here is a gist of the solution: https://gist.github.com/mlunoe/e34f14cff4d5c57dd90a5626266c4130

Typically, when an algorithms textbook says "fast", it means fast in terms of computational complexity, that is, the number of operations as a function of the size of the input in bits. In general, constants are ignored: if you have an input of n bits, whether the algorithm takes two operations per bit or a hundred operations per bit, we say it takes O(n) time. The reason is that if we compare an algorithm that runs in O(n^2) time (polynomial; in this case, quadratic time) with a constant factor of 1 against an O(n) algorithm that does 100 operations per bit, then once the input size passes 100 bits the quadratic algorithm falls behind really quickly. Essentially, you can imagine two curves, y=100x and y=x^2. Your teacher probably made you do an exercise in algebra (maybe it was calculus?) where you had to say which one is bigger as x approaches infinity. This is actually a key concept in divergence/convergence in calculus, if you have gotten there already in mathematics. Regardless, with a little algebra, you can see that the graphs intersect at x=100, with y=x^2 being larger at all points where x is greater than 100.
As far as most textbooks are concerned, O(nlgn) or better is considered "fast". One example of a really bad algorithm to solve this problem would be the following:
crappyMultiplicationAlg(int a, int b)
    int product = 0
    while (b > 0)
        product = product + a
        b = b - 1
    return product
This algorithm basically uses "b" as a counter and just keeps adding "a" to some variable each time b counts down. To calculate how "fast" the algorithm is (in terms of algorithmic complexity), we count how many times its different components run. In this case, we only have a loop and some initialization (which is negligible here; ignore it). How many times does the loop run? You may be saying "Hey, guy! It only runs 'b' times! That may not even be half the input. That's way better than O(n) time!"
The trick here is that we are concerned with the size of the input in terms of storage... and we all (should) know that to store an integer of value n, we need about lg n bits. In other words, if we have x bits, we can store any (unsigned) number up to (2^x)-1, so the loop count 'b' can be exponential in the number of input bits. As a result, if we are using a standard 4-byte integer, that number could be up to 2^32 - 1, which is a number well into the billions, if my memory serves me right. If you don't trust me, run this algorithm with a number like 10,000,000 and see how long it takes. Still not convinced? Use a long and try a number like 1,000,000,000.
Since you didn't ask for help with the algorithm, I'll leave it for you as a homework exercise (not trying to be a jerk, I am a total geek and love algorithm problems). If you need help with it, feel free to ask! I already typed up some hints by accident since I didn't read your question properly at first.
EDIT: I accidentally did a crappy multiplication algorithm. An example of a really terrible division algorithm (I cheated) would be:
AbsolutelyTerribleDivisionAlg(int a, int b)
    int quotient = 0
    while crappyMultiplicationAlg(b, quotient + 1) <= a
        quotient = quotient + 1
    return quotient
This algorithm is bad for a whole bunch of reasons, not the least of which is the use of my crappy multiplication algorithm (which will be called more than once even on a relatively "tame" run). Even if we were allowed to use the * operator, though, this is still a really bad algorithm, largely due to the same mechanism used in my awful mult alg.
PS There may be a fence-post error or two in my two algs... I posted them more for conceptual clarity than correctness. No matter how accurate they are at doing multiplication or division, though, never use them. They will give your laptop herpes and then cause it to burn up in a sulfur-y implosion of sadness.

I don't know what you mean by fast... and this seems like a basic question to test your thought process.
A simple function can use a counter and keep subtracting the divisor from the dividend until the dividend goes negative. This is an O(n) process in the value of the dividend n.
int divide(int n, int d){
int c = 0;
while(1){
n -= d;
if(n >= 0)
c++;
else
break;
}
return c;
}
Another way is to use the shift operator, which should do it in log(n) steps.
int divide(int n, int d){
if(d <= 0)
return -1;
int k = d;
int i, c, index=1;
c = 0;
while(n > d){
d <<= 1;
index <<= 1;
}
while(1){
if(k > n)
return c;
if(n >= d){
c |= index;
n -= d;
}
index >>= 1;
d >>= 1;
}
return c;
}
This is just like the long division we do in high-school mathematics.
PS: If you need a better explanation, I will provide one. Just ask in the comments.
EDIT: edited the code with respect to Erobrere's comment.

The simplest way to perform a division is by successive subtractions: subtract b from a as long as the result remains non-negative. The quotient is the number of subtractions performed.
This can be pretty slow, as you will perform q subtractions and tests.
With a=28 and b=3,
28-3-3-3-3-3-3-3-3-3=1
the quotient is 9 and the remainder 1.
The next idea that comes to mind is to subtract several multiples of b in a single go. We can try with 2b or 4b or 8b... as these numbers are easy to compute with additions. We can go as far as possible, as long as the multiple of b does not exceed a.
In the example, 2³.3 is the largest multiple which is possible
28>=2³.3
So we subtract 8 times 3 in a single go, getting
28-2³.3=4
Now we continue to reduce the remainder with the lower multiples, 2², 2 and 1, when possible
4-2².3<0
4-2.3 <0
4-1.3 =1
Then our quotient is 2³+1=9 and the remainder 1.
As you can check, every multiple of b is tried once only, and the total number of attempts equals the number of doublings required to reach a. This number is just the number of bits required to write q, which is much smaller than q itself.
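The procedure described above can be sketched in Python as follows. This is only an illustration under the assumption that a >= 0 and b > 0; the function and variable names are made up here, and shifts and additions stand in for the doublings:

```python
def shift_subtract_divide(a: int, b: int):
    """Return (quotient, remainder) of a by b without using / or *."""
    if a < 0 or b <= 0:
        raise ValueError("sketch assumes a >= 0 and b > 0")
    multiple = b                   # current multiple of b, always weight times b
    weight = 1                     # the power of two that multiple represents
    while (multiple << 1) <= a:    # double as far as possible without exceeding a
        multiple <<= 1
        weight <<= 1
    quotient = 0
    remainder = a
    while weight > 0:              # try each multiple once, largest first
        if remainder >= multiple:
            remainder -= multiple
            quotient += weight
        multiple >>= 1
        weight >>= 1
    return quotient, remainder


print(shift_subtract_divide(28, 3))  # (9, 1)
```

For 28 and 3 the doubling stops at 24 = 2³.3, and the second loop tries 24, 12, 6 and 3 in turn, exactly as in the worked example.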

This is not the fastest solution, but I think it's readable enough and works:
def weird_div(dividend, divisor):
    if divisor == 0:
        return None
    dend = abs(dividend)
    dsor = abs(divisor)
    result = 0
    # This is the core algorithm, the rest is just for ensuring it works with negatives and 0
    while dend >= dsor:
        dend -= dsor
        result += 1
    # Let's handle negative numbers too
    if (dividend < 0 and divisor > 0) or (dividend > 0 and divisor < 0):
        return -result
    else:
        return result

# Let's test it:
print("49 divided by 7 is {}".format(weird_div(49,7)))
print("100 divided by 7 is {} (Discards the remainder) ".format(weird_div(100,7)))
print("-49 divided by 7 is {}".format(weird_div(-49,7)))
print("49 divided by -7 is {}".format(weird_div(49,-7)))
print("-49 divided by -7 is {}".format(weird_div(-49,-7)))
print("0 divided by 7 is {}".format(weird_div(0,7)))
print("49 divided by 0 is {}".format(weird_div(49,0)))
It prints the following results:
49 divided by 7 is 7
100 divided by 7 is 14 (Discards the remainder)
-49 divided by 7 is -7
49 divided by -7 is -7
-49 divided by -7 is 7
0 divided by 7 is 0
49 divided by 0 is None

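unsigned bitdiv (unsigned a, unsigned d)
{
    unsigned res, c;
    /* double c until it exceeds a (assumes d > 0 and that c does not overflow) */
    for (c = d; c <= a; c <<= 1) {;}
    /* walk back down, producing one quotient bit per halving of c */
    for (res = 0; (c >>= 1) >= d; ) {
        res <<= 1;
        if (a >= c) { res++; a -= c; }
    }
    return res;
}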

The pseudo code:
count = 0
while (dividend >= divisor)
dividend -= divisor
count++
//Get count, your answer

Optimizing this query based search

We have two N-bit numbers (0< N< 100000). We have to perform q queries (0< q<500000) over these numbers. The query can be of following three types:
set_a idx x: Set A[idx] to x, where 0 <= idx < N, where A[idx] is idx'th least significant bit of A.
set_b idx x: Set B[idx] to x, where 0 <= idx < N.
get_c idx: Print C[idx], where C=A+B, and 0<=idx
Now, I have optimized the code to the best extent I can.
First, I tried with an int array for a, b and c. For every update, I calculate c and return the ith bit when queried. It was damn slow. Cleared 4/11 test cases only.
I moved over to using boolean array. It was around 2 times faster than int array approach. Cleared 7/11 testcases.
Next, I figured out that I need not calculate c for calculating idx th bit of A+B. I will just scan A and B towards right from idx until I find either a[i]=b[i]=0 or a[i]=b[i]=1. If a[i]=b[i]=0, then I just add up towards left to idx th bit starting with initial carry=0. And if a[i]=b[i]=1, then I just add up towards left to idx th bit starting with initial carry=1.
This was faster but cleared only 8/11 testcases.
Then, I figured out once, I get to the position i, a[i]=b[i]=0 or a[i]=b[i]=1, then I need not add up towards idx th position. If a[i]=b[i]=0, then answer is (a[idx]+b[idx])%2 and if a[i]=b[i]=1, then the answer is (a[idx]+b[idx]+1)%2. It was around 40% faster but still cleared only 8/11 testcases.
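A minimal Python sketch of that last idea, assuming A and B are stored as lists of 0/1 with index 0 as the least significant bit (the function name and the representation are assumptions for illustration, not taken from the linked code):

```python
def get_c_bit(a_bits, b_bits, idx):
    """Return bit idx of A + B by looking only for the nearest lower carry source."""
    carry = 0                                # no carry enters bit 0
    for i in range(idx - 1, -1, -1):         # scan towards the least significant bit
        if a_bits[i] == b_bits[i]:           # 1,1 generates a carry; 0,0 absorbs one
            carry = a_bits[i]
            break                            # positions with a != b just pass it on
    n = len(a_bits)
    a = a_bits[idx] if idx < n else 0        # idx == N asks for the final carry-out
    b = b_bits[idx] if idx < n else 0
    return a ^ b ^ carry


# A = 0b0111, B = 0b0001, so A + B = 0b1000
print(get_c_bit([1, 1, 1, 0], [1, 0, 0, 0], 3))  # 1
```

The scan still takes time proportional to idx when A and B differ at every lower position.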
Now my question is: how do I get past those 3 'hard' test cases? I don't know what they are, but the program is taking >3 sec to solve the problem.
Here is the code: http://ideone.com/LopZf

One possible optimization is to replace
(a[pos]+b[pos]+carry)%2
with
a[pos]^b[pos]^carry
The XOR operator (^) performs addition modulo 2, making the potentially expensive mod operation (%) unnecessary. Depending on the language and compiler, the compiler may make optimizations for you when doing a mod with a power of 2. But since you are micro-optimizing it is a simple change to make that removes dependence on that optimization being made for you behind the scenes.
http://en.wikipedia.org/wiki/Exclusive_or
This is just one suggestion that is simple to make. As others have suggested, using packed ints to represent your bit array will likely also improve what is probably the worst-case test for your code: a get_c query on the most significant bit, with exactly one of A and B being 1 at every other position, which requires scanning every bit position down to the least significant bit to determine the carry. If you were using packed ints for your bits, only approximately 1/32 as many operations would be necessary (assuming 32-bit ints). Using packed ints would, however, be somewhat more complicated than your use of a simple boolean array (which is likely just an array of bytes).
C/C++ Bit Array or Bit Vector
Convert bit array to uint or similar packed value
http://en.wikipedia.org/wiki/Bit_array
There are lots of other examples on Stackoverflow and the net for using ints as if they were bit arrays.
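As a rough illustration of what "packed ints" means here, the sketch below stores 32 bits per list element in Python. The helper names are hypothetical, and a real solution would also need the carry scan to work a word at a time rather than a bit at a time:

```python
WORD_BITS = 32


def make_bits(n):
    """Allocate enough 32-bit words to hold n bits, all cleared."""
    return [0] * ((n + WORD_BITS - 1) >> 5)


def set_bit(words, idx, bit):
    """Set or clear bit idx (0 is the least significant bit)."""
    word, offset = idx >> 5, idx & 31       # which word, and which bit inside it
    if bit:
        words[word] |= 1 << offset
    else:
        words[word] &= ~(1 << offset)


def get_bit(words, idx):
    """Read bit idx."""
    return (words[idx >> 5] >> (idx & 31)) & 1


bits = make_bits(100)
set_bit(bits, 37, 1)
print(get_bit(bits, 37), get_bit(bits, 36))  # 1 0
```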

Here is a solution that looks a bit like your algorithm. I demonstrate it with bytes, but of course you can easily optimize the algorithm using 32-bit words (I suppose your machine has 64-bit arithmetic nowadays).
void setbit( unsigned char*x,unsigned int idx,unsigned int bit)
{
unsigned int digitIndex = idx>>3;
unsigned int bitIndex = idx & 7;
if( ((x[digitIndex]>>bitIndex)&1) ^ bit) x[digitIndex]^=(1u<<bitIndex);
}
unsigned int getbit(unsigned char *a,unsigned char *b,unsigned int idx)
{
unsigned int digitIndex = idx>>3;
unsigned int bitIndex = idx & 7;
unsigned int c = a[digitIndex]+b[digitIndex];
unsigned int bit = (c>>bitIndex) & 1;
/* a zero bit on the right will absorb a carry, let's check if any */
if( (c^(c+1))>>bitIndex )
{
/* none, we must check if there's a carry propagating from the right digits */
for(;digitIndex-- > 0;)
{
c=a[digitIndex]+b[digitIndex];
if( c > 255 ) return bit^1; /* yes, a carry */
if( c < 255 ) return bit; /* no carry possible, a zero bit will absorb it */
}
}
return bit;
}
If you find anything cryptic, just ask.
Edit: oops, I inverted the zero bit condition...
