Convert integer matrix to bitstring?

I have a 4x4 integer matrix (called tb) which I can create from an int64_t bitstring (called state) as follows:
for(int i = 0; i < 4; i++) {
    for(int j = 0; j < 4; j++) {
        pos -= 1;
        tb[i][j] = (state >> (4*pos)) & 0xf;
    }
}
Once I start with a matrix, however, how can I change it to a bitstring? I was hoping to go through the integer matrix, get the element, create a 4 bit hex representation of it, then shift it over (<<4) the correct number of times and bitwise or (|) the bitstring with the new state bitstring, but I'm not sure how to do this or if it is the best way. Ideas?

Sure, just do it exactly how you said, something like this (not tested):
uint64_t state = 0;
for (int i = 0, pos = 16; i < 4; i++) {
    for (int j = 0; j < 4; j++) {
        state |= (uint64_t)tb[i][j] << (4 * --pos);
    }
}
This has a fairly long dependency chain, which is not great, especially if you're in HPC. You could chop it up into parts, say the first half and the second half, then combine them at the end. As a bonus, the shifts then operate on 32 bits rather than 64, which may be faster on some platforms.
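As a quick way to check the packing order, here is a minimal Python sketch of the round trip, plus the "two halves" variant. The names `pack`, `unpack` and `pack_halves` are made up for this illustration, and it assumes every entry fits in 4 bits with tb[0][0] ending up in the most significant nibble.

```python
MASK64 = (1 << 64) - 1  # emulate a 64-bit unsigned integer


def pack(tb):
    """Pack a 4x4 matrix of values 0..15 into 64 bits, tb[0][0] in the top nibble."""
    state = 0
    pos = 16
    for row in tb:
        for value in row:
            pos -= 1
            state |= (value & 0xF) << (4 * pos)
    return state & MASK64


def unpack(state):
    """Inverse of pack(): rebuild the 4x4 matrix from the 64-bit value."""
    tb = [[0] * 4 for _ in range(4)]
    pos = 16
    for i in range(4):
        for j in range(4):
            pos -= 1
            tb[i][j] = (state >> (4 * pos)) & 0xF
    return tb


def pack_halves(tb):
    """Same result as pack(), built from two independent 32-bit accumulators."""
    high = 0  # rows 0 and 1
    low = 0   # rows 2 and 3
    for k in range(8):
        shift = 4 * (7 - k)
        high |= (tb[k // 4][k % 4] & 0xF) << shift
        low |= (tb[2 + k // 4][k % 4] & 0xF) << shift
    return (high << 32) | low
```

With tb = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 0]], both pack(tb) and pack_halves(tb) give 0x123456789ABCDEF0, and unpack returns the original matrix.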
Depending on the type of tb there may be other tricks. For example, if every entry is a byte and you can alias the matrix with two uint64_t's, then you can combine the entries using straight-line bit manipulation (though they are "reversed" compared to the most convenient order).
For example, maybe something like this (not tested) (this assumes the ordering is reversed, it can also be done with the same order)
uint64_t low, high; // inputs
uint64_t even = 0x00FF00FF00FF00FFULL;
uint64_t odd = ~even;
low = (low & even) | ((low & odd) >> 4);
high = (high & even) | ((high & odd) >> 4);
even = 0x0000FFFF0000FFFFULL;
odd = ~even;
low = (low & even) | ((low & odd) >> 8);
high = (high & even) | ((high & odd) >> 8);
low = (low & 0xFFFF) | (low >> 16);
high = (high & 0xFFFF) | (high >> 16);
return low | (high << 32);
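The mask sequence above can be tried out without a C compiler. The following Python emulation is only a sketch: `compact_nibbles` and `pack_bytes` are hypothetical names, every entry is assumed to be in 0..15, and the 16 bytes are assumed to be aliased as two little-endian 64-bit words.

```python
MASK64 = (1 << 64) - 1  # emulate uint64_t


def compact_nibbles(word):
    """Squeeze the low nibble of each of the 8 bytes into the low 32 bits."""
    even = 0x00FF00FF00FF00FF
    odd = ~even & MASK64
    word = (word & even) | ((word & odd) >> 4)   # pairs of nibbles -> bytes
    even = 0x0000FFFF0000FFFF
    odd = ~even & MASK64
    word = (word & even) | ((word & odd) >> 8)   # pairs of bytes -> 16-bit groups
    return (word & 0xFFFF) | (word >> 16)        # two 16-bit groups -> 32 bits


def pack_bytes(entries):
    """entries: 16 values in 0..15, in memory order."""
    low = int.from_bytes(bytes(entries[:8]), "little")
    high = int.from_bytes(bytes(entries[8:]), "little")
    return compact_nibbles(low) | (compact_nibbles(high) << 32)
```

For entries 0, 1, ..., 15 this yields 0xFEDCBA9876543210: entry k lands in nibble k counted from the least significant end, which is the "reversed" order mentioned above.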
If you allow special instructions there is an even shorter way, (not tested, and again reverses the order)
low = _pext_u64(low, 0x0F0F0F0F0F0F0F0FULL);
high = _pext_u64(high, 0x0F0F0F0F0F0F0F0FULL);
return low | (high << 32);
The related conversion the other way is equally simple,
low = _pdep_u64(bitstring & 0xFFFFFFFF, 0x0F0F0F0F0F0F0F0FULL);
high = _pdep_u64(bitstring >> 32, 0x0F0F0F0F0F0F0F0FULL);
Both of these also apply to the reversed order if you just reverse the nibbles first, which can be done with bit manipulation as well.


How does double-hashing work in leveldb?

When I read the implementation of the bloom filter in leveldb, I found that it uses a trick to apply double-hashing.
// Use double-hashing to generate a sequence of hash values.
// See analysis in [Kirsch,Mitzenmacher 2006].
uint32_t h = BloomHash(keys[i]);
const uint32_t delta = (h >> 17) | (h << 15); // Rotate right 17 bits
for (size_t j = 0; j < k_; j++) {
    const uint32_t bitpos = h % bits;
    array[bitpos/8] |= (1 << (bitpos % 8));
    h += delta;
}
There is no second hash function, just a rotation by 17 bits.
How was 17 chosen?
And if I want to apply the same trick to a uint64_t, how many bits should I rotate?
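To make the probe sequence in the snippet concrete, here is a small Python sketch of what the loop computes. It does not answer why 17 was picked. `probe_positions` and `rotate_right` are hypothetical helper names, and the width parameter is only there so the rotation can be tried on 64-bit values; no particular rotation amount is being recommended for that case.

```python
def rotate_right(x, r, width):
    """Rotate an unsigned integer of the given bit width right by r bits."""
    mask = (1 << width) - 1
    x &= mask
    return ((x >> r) | (x << (width - r))) & mask


def probe_positions(h, bits, k):
    """Return the k bit positions derived from one 32-bit hash value h."""
    h &= 0xFFFFFFFF
    delta = rotate_right(h, 17, 32)      # stands in for a second hash function
    positions = []
    for _ in range(k):
        positions.append(h % bits)
        h = (h + delta) & 0xFFFFFFFF     # uint32_t addition wraps around
    return positions
```

The j-th probe is (h + j * delta) mod 2^32, reduced modulo the number of bits in the filter, so a single hash evaluation yields all k positions.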

Calculate the number of unordered pairs in an array whose bitwise "AND" is a power of 2 in O(n) or O(n*log(n))

How can I calculate the number of unordered pairs in an array whose bitwise AND is a power of 2? For example, if the array is [10,7,2,8,3], the answer is 6.
Explanation(0-based index):
a[0]&a[1] = 2
a[0]&a[2] = 2
a[0]&a[3] = 8
a[0]&a[4] = 2
a[1]&a[2] = 2
a[2]&a[4] = 2
The only approach that comes to my mind is brute force. How can I optimize it to run in O(n) or O(n*log(n))?
The array can have at most 10^5 elements, and each value can be up to 10^12.
Here is the brute force code that I tried.
int ans = 0;
for (int i = 0; i < a.length; i++) {
    for (int j = i + 1; j < a.length; j++) {
        long and = a[i] & a[j];
        if ((and & (and - 1)) == 0 && and != 0)
            ans++;
    }
}
Although this answer is for a smaller range constraint (possibly suited up to about 2^20), I thought I'd add it since it may add some useful information.
We can adapt the bit-subset dynamic programming idea to have a solution with O(2^N * N^2 + n * N) complexity, where N is the number of bits in the range, and n is the number of elements in the list. (So if the integers were restricted to [1, 1048576] or 2^20, with n at 100,000, we would have on the order of 2^20 * 20^2 + 100000*20 = 421,430,400 iterations.)
The idea is that we want to count instances for which we have overlapping bit subsets, with the twist of adding a fixed set bit. Given Ai -- for simplicity, take 6 = b110 -- if we were to find all partners that AND to zero, we'd take Ai's negation,
110 -> ~110 -> 001
Now we can build a dynamic program that takes a diminishing mask, starting with the full number and diminishing the mask towards the left.
Each set bit on the negation of Ai represents a zero, which can be ANDed with either 1 or 0 to the same effect. Each unset bit on the negation of Ai represents a set bit in Ai, which we'd like to pair only with zeros, except for a single set bit.
We construct this set bit by examining each possibility separately. So whereas to count pairs that would AND with Ai to zero we'd do something like
001 ->
we now want to enumerate
011 ->
101 ->
fixing a single bit each time.
We can achieve this by adding a dimension to the inner iteration. When the mask does have a set bit at the end, we "fix" the relevant bit by counting only the result for the previous DP cell that would have the bit set, and not the usual union of subsets that could either have that bit set or not.
Here is some JavaScript code to demonstrate with testing at the end comparing to the brute-force solution.
var debug = 0;
function bruteForce(a){
  let answer = 0;
  for (let i = 0; i < a.length; i++) {
    for (let j = i + 1; j < a.length; j++) {
      let and = a[i] & a[j];
      // a nonzero value with a single set bit is a power of two
      if ((and & (and - 1)) == 0 && and != 0){
        answer++;
        if (debug)
          console.log(a[i], a[j], a[i].toString(2), a[j].toString(2))
      }
    }
  }
  return answer;
}
function f(A, N){
  const n = A.length;
  const hash = {};
  const dp = new Array(1 << N);
  for (let i=0; i<1<<N; i++){
    dp[i] = new Array(N + 1);
    for (let j=0; j<N+1; j++)
      dp[i][j] = new Array(N + 1).fill(0);
  }
  // count occurrences of each value
  for (let i=0; i<n; i++){
    if (hash.hasOwnProperty(A[i]))
      hash[A[i]] = hash[A[i]] + 1;
    else
      hash[A[i]] = 1;
  }
  for (let mask=0; mask<1<<N; mask++){
    // j is an index where we fix a 1
    for (let j=0; j<=N; j++){
      if (mask & 1){
        if (j == 0)
          dp[mask][j][0] = hash[mask] || 0;
        else
          dp[mask][j][0] = (hash[mask] || 0) + (hash[mask ^ 1] || 0);
      } else {
        dp[mask][j][0] = hash[mask] || 0;
      }
      for (let i=1; i<=N; i++){
        if (mask & (1 << i)){
          if (j == i)
            dp[mask][j][i] = dp[mask][j][i-1];
          else
            dp[mask][j][i] = dp[mask][j][i-1] + dp[mask ^ (1 << i)][j][i - 1];
        } else {
          dp[mask][j][i] = dp[mask][j][i-1];
        }
      }
    }
  }
  let answer = 0;
  for (let i=0; i<n; i++){
    for (let j=0; j<N; j++)
      if (A[i] & (1 << j))
        answer += dp[((1 << N) - 1) ^ A[i] | (1 << j)][j][N];
  }
  // an element that is itself a power of two was paired with itself above
  for (let i=0; i<N + 1; i++)
    if (hash[1 << i])
      answer = answer - hash[1 << i];
  return answer / 2;
}
var As = [
  [5, 4, 1, 6], // 4
  [10, 7, 2, 8, 3], // 6
  [2, 3, 4, 5, 6, 7, 8, 9, 10],
  [1, 6, 7, 8, 9]
];
for (let A of As){
  console.log(`DP, brute force: ${ f(A, 4) }, ${ bruteForce(A) }`);
}
var numTests = 1000;
for (let i=0; i<numTests; i++){
  const N = 6;
  const A = [];
  const n = 10;
  for (let j=0; j<n; j++){
    const num = Math.floor(Math.random() * (1 << N));
    A.push(num);
  }
  const fA = f(A, N);
  const brute = bruteForce(A);
  if (fA != brute){
    console.log(fA, brute);
  }
}
console.log("Done testing.");
Transform your array of values into an array of index sets, where each set corresponds to a particular bit and contains the indexes of the values from the original array that have that bit set. For example, your example array A = [10,7,2,8,3] becomes B = [{1,4}, {0,1,2,4}, {1}, {0,3}]. A fixed-size array of bitvectors is an ideal data structure for this, as it makes set union/intersection/setminus relatively easy and efficient.
Once you have that array of sets B (which takes O(nm) time, where m is the size of your integers in bits), iterate over every element i of A again, computing $\sum_{j \,:\, i \in B_j} \left| B_j \setminus \{i\} \setminus \bigcup_{k \ne j,\; i \in B_k} B_k \right|$. Add those all together and divide by 2, and that should be the number of pairs (the "divide by 2" is because this counts each pair twice, as what it is counting is the number of numbers each number pairs with). This should only take O(nm^2) assuming you count the setminus operations as O(1); if you count them as O(n), then you're back to O(n^2), but at least your constant factor should be small if you have efficient bitsets.
foreach A[i] in A:
    foreach bit in A[i]:
        B[bit] += {i}
pairs = 0
foreach A[i] in A:
    foreach B[j] in B:
        if i in B[j]:
            tmp = B[j] - {i}
            foreach B[k] in B:
                if k != j && i in B[k]:
                    tmp -= B[k]
            pairs += |tmp|
return pairs/2
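The pseudocode above translates fairly directly into Python if each index set is stored as an arbitrary-precision integer used as a bitset. This is a sketch under that assumption. `count_pairs` and `brute_force` are names made up here, and the bitset operations are not O(1), so it does not reach the asymptotic bound the question asks for.

```python
def count_pairs(a):
    """Count unordered pairs (i, j), i < j, whose bitwise AND is a power of two."""
    m = max(a).bit_length() if a else 0
    # b[bit] is a bitset of the indexes whose value has that bit set
    b = [0] * m
    for i, value in enumerate(a):
        for bit in range(m):
            if (value >> bit) & 1:
                b[bit] |= 1 << i
    pairs = 0
    for i, value in enumerate(a):
        for j in range(m):
            if not (value >> j) & 1:
                continue
            tmp = b[j] & ~(1 << i)           # B[j] minus {i}
            for k in range(m):
                if k != j and (value >> k) & 1:
                    tmp &= ~b[k]             # drop partners sharing another bit
            pairs += bin(tmp).count("1")     # |tmp|
    return pairs // 2                        # every pair was counted twice


def brute_force(a):
    """Reference O(n^2) count, used only to check count_pairs."""
    total = 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            x = a[i] & a[j]
            if x and x & (x - 1) == 0:
                total += 1
    return total
```

For the array from the question, count_pairs([10, 7, 2, 8, 3]) returns 6, matching the brute-force count.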

Portable efficient alternative to PDEP without using BMI2?

The documentation for the parallel deposit instruction (PDEP) in Intel's Bit Manipulation Instruction Set 2 (BMI2) describes the following serial implementation for the instruction (C-like pseudocode):
U64 _pdep_u64(U64 val, U64 mask) {
    U64 res = 0;
    for (U64 bb = 1; mask; bb += bb) {
        if (val & bb)
            res |= mask & -mask;
        mask &= mask - 1;
    }
    return res;
}
See also Intel's pdep insn ref manual entry.
This algorithm is O(n), where n is the number of set bits in mask, which obviously has a worst case of O(k) where k is the total number of bits in mask.
Is a more efficient worst case algorithm possible?
Is it possible to make a faster version that assumes that val has at most one bit set, ie either equals 0 or equals 1<<r for some value of r from 0 to 63?
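For experimenting with the serial algorithm quoted above, a direct Python port can serve as a reference to compare any faster variant against. This is only a sketch: `pdep64` is a name chosen here, and Python integers are masked to 64 bits to mimic U64.

```python
MASK64 = (1 << 64) - 1  # emulate a 64-bit unsigned integer


def pdep64(val, mask):
    """Deposit the low bits of val into the positions of the set bits of mask."""
    res = 0
    bb = 1
    mask &= MASK64
    while mask:
        if val & bb:
            res |= mask & -mask   # lowest remaining set bit of mask
        mask &= mask - 1          # clear that bit
        bb <<= 1                  # move on to the next bit of val
    return res
```

For example, pdep64(0b101, 0b11010) gives 0b10010: bit 0 of val goes to the lowest set bit of the mask (bit 1), and bit 2 of val goes to the third set bit (bit 4).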
The second part of the question, about the special case of a 1-bit deposit, requires two steps. In the first step, we need to determine the bit index r of the single 1-bit in val, with a suitable response in case val is zero. This can easily be accomplished via the POSIX function ffs, or if r is known by other means, as alluded to by the asker in comments. In the second step we need to identify bit index i of the r-th 1-bit in mask, if it exists. We can then deposit the r-th bit of val at bit i.
One way of finding the index of the r-th 1-bit in mask is to tally the 1-bits using a classical population count algorithm based on binary partitioning, and record all of the intermediate group-wise bit counts. We then perform a binary search on the recorded bit-count data to identify the position of the desired bit.
The following C-code demonstrates this using 64-bit data. Whether this is actually faster than the iterative method will very much depend on typical values of mask and val.
#include <stdint.h>
#include <string.h>  /* ffsll: a glibc/BSD extension; the declaring header may vary by platform */
/* Find the index of the n-th 1-bit in mask, n >= 0
   The index of the least significant bit is 0
   Return -1 if there is no such bit
*/
int find_nth_set_bit (uint64_t mask, int n)
{
    int t, i = n, r = 0;
    const uint64_t m1 = 0x5555555555555555ULL; // even bits
    const uint64_t m2 = 0x3333333333333333ULL; // even 2-bit groups
    const uint64_t m4 = 0x0f0f0f0f0f0f0f0fULL; // even nibbles
    const uint64_t m8 = 0x00ff00ff00ff00ffULL; // even bytes
    uint64_t c1 = mask;
    uint64_t c2 = c1 - ((c1 >> 1) & m1);
    uint64_t c4 = ((c2 >> 2) & m2) + (c2 & m2);
    uint64_t c8 = ((c4 >> 4) + c4) & m4;
    uint64_t c16 = ((c8 >> 8) + c8) & m8;
    uint64_t c32 = (c16 >> 16) + c16;
    int c64 = (int)(((c32 >> 32) + c32) & 0x7f);
    t = (c32 ) & 0x3f; if (i >= t) { r += 32; i -= t; }
    t = (c16>> r) & 0x1f; if (i >= t) { r += 16; i -= t; }
    t = (c8 >> r) & 0x0f; if (i >= t) { r += 8; i -= t; }
    t = (c4 >> r) & 0x07; if (i >= t) { r += 4; i -= t; }
    t = (c2 >> r) & 0x03; if (i >= t) { r += 2; i -= t; }
    t = (c1 >> r) & 0x01; if (i >= t) { r += 1; }
    if (n >= c64) r = -1;
    return r;
}
/* val is either zero or has a single 1-bit.
   Return -1 if val is zero, otherwise the index of the 1-bit
   The index of the least significant bit is 0
*/
int find_bit_index (uint64_t val)
{
    return ffsll (val) - 1;
}
uint64_t deposit_single_bit (uint64_t val, uint64_t mask)
{
    uint64_t res = (uint64_t)0;
    int r = find_bit_index (val);
    if (r >= 0) {
        int i = find_nth_set_bit (mask, r);
        if (i >= 0) res = (uint64_t)1 << i;
    }
    return res;
}
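The binary search over group-wise bit counts can be illustrated in Python as well. The sketch below recomputes each group count on demand instead of recording them once as the C code does, so it only demonstrates the search logic, not the performance. The function names mirror the C code, but the bodies are an illustration.

```python
MASK64 = (1 << 64) - 1


def find_nth_set_bit(mask, n):
    """Index of the n-th (0-based) set bit of a 64-bit mask, or -1 if absent."""
    mask &= MASK64
    if n < 0 or n >= bin(mask).count("1"):
        return -1
    r = 0
    width = 32
    while width:
        # number of set bits in the lower half of the current window
        low_count = bin((mask >> r) & ((1 << width) - 1)).count("1")
        if n >= low_count:       # the wanted bit lies in the upper half
            r += width
            n -= low_count
        width >>= 1
    return r


def deposit_single_bit(val, mask):
    """val must be 0 or a single bit 1 << r; deposit it at the r-th set bit of mask."""
    r = val.bit_length() - 1     # -1 when val == 0
    if r < 0:
        return 0
    i = find_nth_set_bit(mask, r)
    return (1 << i) if i >= 0 else 0
```

With mask = 0b11010, the set bits with ranks 0, 1, 2 sit at indexes 1, 3, 4, so deposit_single_bit(1 << 2, 0b11010) returns 1 << 4.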

Generate all n-bit binary numbers in the fastest way possible

How do I generate all possible combinations of n-bit strings? I need to generate all combinations of 20-bit strings in the fastest way possible. (My current implementation uses bitwise AND and right-shift operations, but I am looking for a faster technique.)
I need to store the bit-strings in an array (or list) indexed by the corresponding decimal numbers, like this:
0 --> 0 0 0
1 --> 0 0 1
2 --> 0 1 0 ... etc.
Any ideas?
>>> n = 3
>>> l = [bin(x)[2:].rjust(n, '0') for x in range(2**n)]
>>> print(l)
['000', '001', '010', '011', '100', '101', '110', '111']
for (unsigned long i = 0; i < (1<<20); ++i) {
    // do something with it
}
An unsigned long is a sequence of bits.
If what you want is a string of characters '0' and '1', then you could convert i to that format each time. You might be able to get a speed-up taking advantage of the fact that consecutive numbers normally share a long initial substring. So you could do something like this:
char bitstring[21];
for (unsigned int i = 0; i < (1<<10); ++i) {
    write_bitstring10(i, bitstring);
    for (unsigned int j = 0; j < (1<<10); ++j) {
        write_bitstring10(j, bitstring + 10);
        // do something with bitstring
    }
}
I've only increased from 1 loop to 2 there, but I do a little over 50% as much converting from bits to chars as before. You could experiment with the following:
use even more loops
split the loops unevenly, maybe 15-5 instead of 10-10
write a function that takes a string of zeros and ones, and adds 1 to it. It's pretty easy: find the last '0', change it to a '1', and change all the '1's after it to '0'.
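The last suggestion, adding 1 directly to a string of zeros and ones, could look like the following Python sketch. `increment` and `all_bitstrings` are hypothetical names, and a list of characters is used because Python strings are immutable.

```python
def increment(bits):
    """Add 1 in place to a list of '0'/'1' characters.

    Returns False when the value wraps around from all ones to all zeros.
    """
    i = len(bits) - 1
    while i >= 0 and bits[i] == "1":   # trailing ones become zeros
        bits[i] = "0"
        i -= 1
    if i < 0:
        return False
    bits[i] = "1"                      # the last '0' becomes '1'
    return True


def all_bitstrings(n):
    """All n-bit strings in increasing numeric order."""
    bits = ["0"] * n
    out = ["".join(bits)]
    while increment(bits):
        out.append("".join(bits))
    return out
```

On average an increment touches only about two characters, which is the saving the suggestion is after. Whether it beats a table-driven conversion would need measuring.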
To fiendishly optimize write_bitstring, multiples of 4 are good because on most architectures you can blit 4 characters at a time in a word write:
To start:
assert(CHAR_BIT == 8);
uint32_t bitstring[(21 + 3) / 4]; // not char array, we need to ensure alignment; rounded up so the terminator fits
((char*)bitstring)[20] = 0; // nul terminate
Function definition:
const uint32_t little_endian_lookup[16] = {
    ('0' << 24) | ('0' << 16) | ('0' << 8) | ('0' << 0), // "0000"
    ('1' << 24) | ('0' << 16) | ('0' << 8) | ('0' << 0), // "0001"
    ('0' << 24) | ('1' << 16) | ('0' << 8) | ('0' << 0), // "0010"
    // etc.
};
// might need big-endian version too
#define lookup little_endian_lookup // example of configuration
void write_bitstring20(unsigned long value, uint32_t *dst) {
    dst[0] = lookup[(value & 0xF0000) >> 16];
    dst[1] = lookup[(value & 0x0F000) >> 12];
    dst[2] = lookup[(value & 0x00F00) >> 8];
    dst[3] = lookup[(value & 0x000F0) >> 4];
    dst[4] = lookup[(value & 0x0000F)];
}
I haven't tested any of this: obviously you're responsible for writing a benchmark that you can use to experiment.
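The table-driven idea (one lookup per 4-bit group instead of one test per bit) can be prototyped in Python before committing to the word-write version. This is a sketch, not a port: `NIBBLES` is a table name chosen here, `write_bitstring20` reuses the C function's name for convenience, and the endianness concern disappears because whole strings are concatenated.

```python
# Precomputed 4-character strings for every nibble value: "0000" .. "1111"
NIBBLES = [format(v, "04b") for v in range(16)]


def write_bitstring20(value):
    """Return the 20-character binary string of value, most significant nibble first."""
    return "".join(NIBBLES[(value >> shift) & 0xF] for shift in (16, 12, 8, 4, 0))


def all_bitstrings20():
    """All 2**20 bit-strings, indexed by their numeric value."""
    return [write_bitstring20(i) for i in range(1 << 20)]
```

For example, write_bitstring20(5) returns '00000000000000000101'.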
Just output the numbers from 0 to 2^n - 1 in binary representation with exactly n digits.
for (i = 0; i < 1048576; i++) {
    printf("%d", i);
}
Conversion of the integer i to a binary string is left as an exercise to the OP.
This solution is in Python. (versions 2.7 and 3.x should work)
>>> from pprint import pprint as pp
>>> def int2bits(n):
...     return [(i, '{i:0>{n}b}'.format(i=i, n=n)) for i in range(2**n)]
...
>>> pp(int2bits(n=4))
[(0, '0000'),
(1, '0001'),
(2, '0010'),
(3, '0011'),
(4, '0100'),
(5, '0101'),
(6, '0110'),
(7, '0111'),
(8, '1000'),
(9, '1001'),
(10, '1010'),
(11, '1011'),
(12, '1100'),
(13, '1101'),
(14, '1110'),
(15, '1111')]
It pairs each integer from 0 to 2^n - 1 with its binary representation, left-padded with zeros to n digits. (The pprint stuff is just to get a neat printout for this forum and could be left out.)
You can do it by generating every n-digit binary number from 0 to 2^n-1, one digit at a time:
static int[] res;
static int n;
static void Main(string[] args)
{
    n = Convert.ToInt32(Console.ReadLine());
    res = new int [n];
    Generate(0);
}
static void Generate(int start)
{
    if (start > n)
        return;
    if(start == n)
    {
        for(int i=0; i < start; i++)
            Console.Write(res[i] + " ");
        Console.WriteLine();
        return;
    }
    for(int i=0; i< 2; i++)
    {
        res[start] = i;
        Generate(start + 1);
    }
}

Write a function to divide a number by 3 without using /, % and * operators. itoa() available?

I tried to solve it myself but I could not get any clue.
Please help me to solve this.
Are you supposed to use itoa() for this assignment? Because then you could use that to convert to a base 3 string, drop the last character, and then restore back to base 10.
Using the mathematical relation:
1/3 == Sum[1/2^(2n), {n, 1, Infinity}]
We have
int div3 (int x) {
    int64_t blown_up_x = x;
    for (int power = 1; power < 32; power += 2)
        blown_up_x += ((int64_t)x) << power;
    return (int)(blown_up_x >> 33);
}
If you can only use 32-bit integers,
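int div3 (int x) {
    int two_third = 0, four_third = 0;
    for (int power = 0; power < 31; power += 2) {
        four_third += x >> power;        // x + x/4 + x/16 + ... is about 4x/3
        two_third += x >> (power + 1);   // x/2 + x/8 + ... is about 2x/3
    }
    return (four_third - two_third) >> 1;   // (4x/3 - 2x/3) / 2 = x/3, up to truncation error
}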
The 4/3 - 2/3 treatment is used because x >> 1 is floor(x/2) instead of round(x/2).
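The 64-bit variant can be checked quickly in Python, where integers never overflow. This is a sketch restricted to non-negative inputs below 2^31. `div3` keeps the name from the C code, and the range restriction is an assumption made here, since for negative x the final shift rounds toward minus infinity rather than toward zero.

```python
def div3(x):
    """floor(x / 3) for 0 <= x < 2**31, using only shifts and additions."""
    blown_up = x                       # the 2**0 term
    for power in range(1, 32, 2):      # 2**1 + 2**3 + ... + 2**31
        blown_up += x << power
    # blown_up == x * (2**33 + 1) / 3, so shifting by 33 leaves x / 3
    return blown_up >> 33
```

The multiplier built by the loop is 1 + 2 + 8 + ... + 2^31 = (2^33 + 1)/3, slightly more than 2^33/3. That small excess is what keeps the result from being rounded down one too far.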
EDIT: Oops, I misread the title's question. Multiply operator is forbidden as well.
Anyway, I believe it's good not to delete this answer for those who didn't know about dividing by non power of two constants.
The solution is to multiply by a magic number and then to extract the 32 leftmost bits:
dividing by 3 is equivalent to multiplying by 1431655766 and then shifting right by 32. In C:
int divideBy3(int n)
{
    return (int)(((int64_t)n * 1431655766) >> 32); // 64-bit product, so the high bits are not lost; for non-negative n
}
See Hacker's Delight Magic number calculator.
x/3 = e^(ln(x) - ln(3))
Here's a solution implemented in C++:
#include <iostream>
int letUserEnterANumber()
{
    int numberEnteredByUser;
    std::cin >> numberEnteredByUser;
    return numberEnteredByUser;
}
int divideByThree(int x)
{
    std::cout << "What is " << x << " divided by 3?" << std::endl;
    int answer = 0;
    while ( answer + answer + answer != x )
    {
        answer = letUserEnterANumber();
    }
    return answer;
}
if(number<0){ // Edited after comments
    number = -(number);
}
quotient = 0;
while (number-3 >= 0){ //Edited after comments..
    number = number-3;
    quotient++;
}//after loop exits value in number will give you remainder
EDIT: Tested and working perfectly fine :(
Hope this helped. :-)
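long divByThree(int x)
{
    char buf[100];
    itoa(x, buf, 3);                 // non-standard; writes x as a base-3 string
    buf[ strlen(buf) - 1] = 0;       // dropping the last base-3 digit divides by 3
    char* tmp;
    long res = strtol(buf, &tmp, 3); // parse the shortened string back from base 3
    return res;
}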
Sounds like homework :)
I imagine you can write a function which iteratively divides a number. E.g. you can model what you do with a pen and a piece of paper to divide numbers. Or you can use shift operators and + to figure out if your intermediate result is too small/big and iteratively apply corrections. I'm not going to write down the code though ...
unsigned int div3(unsigned int m) {
    unsigned long long n = m;
    n += n << 2;    // n = m * 0x5
    n += n << 4;    // n = m * 0x55
    n += n << 8;    // n = m * 0x5555
    n += n << 16;   // n = m * 0x55555555
    return (n+m) >> 32;   // (m * 0x55555556) >> 32
}
int divideby3(int n)
{
    int x=0;
    if(n<3) { return 0; }
    while(n>=3) { n=n-3; x++; }   // repeated subtraction
    return x;
}
You can use a property of numbers: a number is divisible by 3 if the sum of its digits is divisible by 3.
Take the individual digits from itoa() and then apply this recursively, using a switch statement, additions and itoa().
Hope this helps.
This is very easy, so easy I'm only going to hint at the answer --
Basic boolean logic gates (and,or,not,xor,...) don't do division. Despite this handicap CPUs can do division. Your solution is obvious: find a reference which tells you how to build a divisor with boolean logic and write some code to implement that.
How about this, in some kind of Python like pseudo-code. It divides the answer into an integer part and a fraction part. If you want to convert it to a floating point representation then I am not sure of the best way to do that.
x = <a number>
total = x
intpart = 0
fracpart = 0
% Find the integer part
while total >= 3
    total = total - 3
    intpart = intpart + 1
% Fraction is what remains
fracpart = total
print "%d / 3 = %d + %d/3" % (x, intpart, fracpart)
Note that this will not work for negative numbers. To fix that you need to modify the algorithm:
total = abs(x)
is_neg = abs(x) != x
if is_neg
    print "%d / 3 = -(%d + %d/3)" % (x, intpart, fracpart)
For positive integer division:
result = 0
while (result + result + result + 3 <= input)
    result += 1
return result
Convert 1/3 into binary,
so 1/3 = 0.01010101010101010101010101 (base 2),
and then just "multiply" by this number using shifts and sums.
There is a solution posted on
int DividedBy3(int A) {
    int p = 0;
    for (int i = 2; i <= 32; i += 2)
        p += A << i;
    return (-p);
}
Please say something about that, thanks:)
Here's a O(log(n)) way to do it with no bit shifting, so it can handle numbers up-to and including your biggest register size.
(c-style code)
long long unsigned Div3 (long long unsigned n)
{
    // base case:
    if (n < 6)
        return (n >= 3);
    long long unsigned division = 0;
    long long unsigned remainder = 0;
    // Used for results for only a single power of 2
    // Initialise for 2^0
    long long unsigned tmp_div = 0;
    long long unsigned tmp_rem = 1;
    for (long long unsigned pow_2 = 1; pow_2 && (pow_2 <= n); pow_2 += pow_2)
    {
        if (n & pow_2)
        {
            division += tmp_div;
            remainder += tmp_rem;
        }
        if (tmp_rem == 1)
        {
            tmp_div += tmp_div;
            tmp_rem = 2;
        }
        else
        {
            tmp_div += tmp_div + 1;
            tmp_rem = 1;
        }
    }
    return division + Div3(remainder);
}
It uses recursion, but note that the number drops exponentially in size at each iteration, so the time complexity (TC) is really:
O(TC) = O(log(n) + log(log(n)) + log(log(log(n))) + ... + z)
where z < 6.
Proof that it's O(log(n)):
We note that the number at each recursion strictly decreases (by at least 1):
So series = [log(log(n))] + [log(log(log(n)))] + [...] + [z]) has at most log(log(n)) sums.
series <= log(log(n))*log(log(n))
O(TC) = O(log(n) + log(log(n))*log(log(n)))
Now we note for n sufficiently large:
sqrt(x) > log(x)
x/sqrt(x) > log(x)
x/log(x) > log(x)
x > log(x)*log(x)
So O(x) > O(log(x)*log(x))
Now let x = log(n)
O(log(n)) > O(log(log(n))*log(log(n)))
and given:
O(TC) = O(log(n) + log(log(n))*log(log(n)))
O(TC) = O(log(n))
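To see the decomposition at work, here is a Python transcription of the same idea with a check against the built-in division. It is a sketch: `div3_by_powers` is a name chosen here, and the comments spell out the invariant that each power of two is written as 3 * tmp_div + tmp_rem.

```python
def div3_by_powers(n):
    """floor(n / 3) for n >= 0 using only additions, comparisons and bit tests."""
    if n < 6:                            # base case: the quotient is 0 or 1
        return 1 if n >= 3 else 0
    division = 0
    remainder = 0
    tmp_div, tmp_rem = 0, 1              # 2**0 == 3*0 + 1
    pow_2 = 1
    while pow_2 <= n:
        if n & pow_2:                    # this power of two is part of n
            division += tmp_div
            remainder += tmp_rem
        if tmp_rem == 1:
            tmp_div += tmp_div           # 2*(3q + 1) == 3*(2q) + 2
            tmp_rem = 2
        else:
            tmp_div += tmp_div + 1       # 2*(3q + 2) == 3*(2q + 1) + 1
            tmp_rem = 1
        pow_2 += pow_2
    # the accumulated remainders are at most 2 per set bit of n, so the argument shrinks quickly
    return division + div3_by_powers(remainder)
```

For n = 100 (binary 1100100) the set bits are 4, 32 and 64, which contribute quotients 1, 10 and 21 and remainders 1, 2 and 1. The recursion then divides the remainder sum 4, giving 32 + 1 = 33.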
Slow and naive, but it should work if an exact divisor exists. Addition is allowed, right?
for number from 1 to input
    if number+number+number == input
        return number
Extending it for fractional divisors is left as an exercise to the reader.
Basically test for +1 and +2 I think...
