Compare two numbers for "likeness" - algorithm

This is part of a search function on a website, so I'm trying to find a way to get to the end result as fast as possible.
I have a binary number where digit order matters.
Input Number = 01001
I have a database of other binary numbers, all the same length.
01000, 10110, 00000, 11111
I don't know how to describe what I'm doing, so I'm going to show it more visually below.
// Zeros mean nothing & the location of a 1 matters, not the total number of 1's.
input num > 0 1 0 0 1 = 2 possible matches
number[1] > 0 1 0 0 0 = 1 match = 50% match
number[2] > 1 0 1 1 0 = 0 match = 0% match
number[3] > 0 0 0 0 0 = 0 match = 0% match
number[4] > 1 1 1 1 1 = 2 match = 100% match
Now obviously, you could go digit by digit, number by number, and compare them that way (using a loop and whatnot). But I was hoping there might be an algorithm or something that will help, mostly because in the above example I only used 5-digit numbers. I'm going to be routinely comparing around 100,000 numbers with 200 digits each, and that's a lot of calculating.
I usually deal with PHP and MySQL, but if something spectacular comes up I could always learn.

If it's possible to somehow chop up your bitstrings into integer-size chunks, some elementary boolean arithmetic would do, and that kind of instruction is generally pretty fast:
$matchmask = ~ ($inputval ^ $tomatch) & $inputval
What this does:
the XOR determines the bits that differ between inputval and tomatch
negation gives a value where all bits that are equal in inputval and tomatch are set
AND that with inputval, and only the bits that are 1 in both inputval and tomatch remain set.
Then count the number of bits set in the result. See the question "How to count the number of set bits in a 32-bit integer?" for an optimal solution, which is easily translated into PHP.
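As a rough illustration of the mask-and-count idea, here is a minimal Python sketch run on the example from the question. The names likeness and database are made up for this sketch. Python integers are arbitrary precision, so no chunking is needed here, and bin(...).count("1") stands in for whichever bit-counting routine you pick. Treating an all-zero input as a 0% match is an assumption.

```python
def likeness(input_bits, candidate_bits):
    """Percentage of the 1 bits in input_bits that are also 1 in candidate_bits."""
    inputval = int(input_bits, 2)
    tomatch = int(candidate_bits, 2)
    # Same mask as in the answer; it reduces to inputval & tomatch.
    matchmask = ~(inputval ^ tomatch) & inputval
    possible = bin(inputval).count("1")
    if possible == 0:
        return 0.0  # assumption: an all-zero input matches nothing
    return 100.0 * bin(matchmask).count("1") / possible


database = ["01000", "10110", "00000", "11111"]
scores = [likeness("01001", row) for row in database]
print(scores)  # [50.0, 0.0, 0.0, 100.0]
```

The scores agree with the percentages in the question's table.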

Instead of checking each bit, you could pre-process the input and determine which bits need checking. In the worst case, this devolves into processing each bit, but for inputs with a typical mix of 0s and 1s, you'll save some processing.
That is, for input 01001, iterate over the database and determine whether bit 0 and bit 3 of each number are set, i.e. whether number1 & 1 is non-zero and (number1 >> 3) & 1 is non-zero, assuming 0 as the index of the LSB. How you get fast bit-shifting and ANDing with the large numbers is on you, however. If you detect more 1s than 0s in the input, you could always invert the input and test for zero to detect coverage.
This will give you a modest improvement, but it's at best a constant-factor reduction of the problem. If most of your inputs are balanced between 0s and 1s, you'll halve the number of required operations. If they're more biased, you'll get better results.
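One way to read the pre-processing idea is sketched below in Python. The function names are invented for this sketch, and it makes no claim about speed; it only shows the positions of the input's 1 bits being computed once and reused for every database entry.

```python
def set_bit_positions(value):
    """Indices (LSB = 0) of the 1 bits in value, computed once per search."""
    positions = []
    index = 0
    while value:
        if value & 1:
            positions.append(index)
        value >>= 1
        index += 1
    return positions


def count_matches(positions, candidate):
    """Number of the pre-computed positions at which candidate also has a 1."""
    return sum((candidate >> p) & 1 for p in positions)


positions = set_bit_positions(0b01001)
matches = [count_matches(positions, n)
           for n in (0b01000, 0b10110, 0b00000, 0b11111)]
print(positions, matches)  # [0, 3] [1, 0, 0, 2]
```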

Well, the first thing I can think of is a simple bitwise AND between the two numbers; you can then analyze the result to get the match percentage:
if( result >= input ) {
    //100% match
} else {
    result ^= input;
    /* The number of 1's in result is the number of 1's of "input"
     * that are missing in "result".
     */
}
Of course, for your 200-digit numbers you'll need to implement your own AND and XOR functions, since this works as written only for 32-bit integers. Note that it works only with unsigned numbers.

Suppose the input number is called A (so in your example A = 01001) and the other number is x. You'll have a 100% match when (x & A) == A. Otherwise, for partial matches, first set x = x & A; the number of 1 bits in x is then given by (taken from Hacker's Delight):
x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
x = (x & 0x0000FFFF) + ((x >>16) & 0x0000FFFF);
Note that this works for 32-bit integers only.
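For illustration, the same five steps can be written in Python and applied to x & A. This is only a sketch: popcount32 and matching_ones are names chosen here, and the initial mask is an added assumption that forces the value into 32 bits, since Python integers are unbounded.

```python
def popcount32(x):
    """Count the 1 bits of a 32-bit value with the five add-and-mask steps."""
    x &= 0xFFFFFFFF  # assumption: only the low 32 bits are of interest
    x = (x & 0x55555555) + ((x >> 1) & 0x55555555)   # sums of adjacent bits
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # sums per 4-bit group
    x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F)   # sums per byte
    x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF)   # sums per 16-bit half
    x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF)  # final total
    return x


A = 0b01001


def matching_ones(x):
    """Number of 1 bits of A that are also set in x."""
    return popcount32(x & A)


print([matching_ones(n) for n in (0b01000, 0b10110, 0b00000, 0b11111)])
# [1, 0, 0, 2]
```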

Let's assume you have a function bit1count, then from what you describe, the "likeness" formula should be:
100.0 / min(bit1count(n1), bit1count(n2)) * bit1count(n1 & n2)
With n1 and n2 being the two numbers and & being the bitwise AND operator.
bit1count can easily be implemented using a loop or, more elegantly, using the algorithm provided in BigBear's answer.
There is actually a BIT_COUNT function in MySQL, so something like this should work:
SELECT 100.0 / IF(BIT_COUNT(n1) < BIT_COUNT(n2), BIT_COUNT(n1), BIT_COUNT(n2)) * BIT_COUNT(n1 & n2) FROM table
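A small Python sketch of this formula follows, with bit1count implemented via bin() for brevity. The name likeness_min and the zero guard are additions made for this sketch; the formula as written would divide by zero when either number has no 1 bits.

```python
def bit1count(n):
    return bin(n).count("1")


def likeness_min(n1, n2):
    """100.0 / min(bit1count(n1), bit1count(n2)) * bit1count(n1 & n2)."""
    smaller = min(bit1count(n1), bit1count(n2))
    if smaller == 0:
        return 0.0  # assumption: avoid dividing by zero for an all-zero number
    return 100.0 / smaller * bit1count(n1 & n2)


results = [likeness_min(0b01001, n)
           for n in (0b01000, 0b10110, 0b00000, 0b11111)]
print(results)  # [100.0, 0.0, 0.0, 100.0]
```

Note that 01000 scores 100% here, whereas the question's table gives it 50%. Dividing by bit1count of the input alone, instead of the minimum, would give the figures from the question.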

Related

Verify that a number can be decomposed into powers of 2

Is it possible to verify that a number can be decomposed into a sum of powers of 2 where the exponents are sequential?
Is there an algorithm to check this?
Example: where and
The binary representation would have a single, consecutive group of 1 bits.
To check this, you could first identify the value of the least significant bit, add that bit to the original value, and then check whether the result is a power of 2.
This leads to the following formula for a given x:
(x & (x + (x & -x))) == 0
This expression is also true when x is zero. If that case needs to be rejected as a solution, you need an extra condition for that.
In Python:
def f(x):
    return x > 0 and (x & (x + (x & -x))) == 0
This can be done in an elegant way using bitwise operations to check whether the binary representation of the number is a single block of consecutive 1 bits, followed by perhaps some 0s.
The expression x & (x - 1) replaces the lowest 1 in the binary representation of x with a 0. If we call that number y, then y | (y >> 1) sets each bit to be a 1 if it had a 1 to its immediate left. If the original number x was a single block of consecutive 1 bits, then the result is the same as the number x that we started with, because the 1 which was removed will be replaced by the shift. On the other hand, if x is not a single block of consecutive 1 bits, then the shift will add at least one other 1 bit that wasn't there in the original x, and they won't be equal.
That works if x has more than one 1 bit, so the shift can put back the one that was removed. If x has only a single 1 bit, then removing it will result in y being zero. So we can check for that, too.
In Python:
def is_sum_of_consecutive_powers_of_two(x):
    y = x & (x - 1)
    z = y | (y >> 1)
    return x == z or y == 0
Note that this returns True when x is zero, and that's the correct result if "a sum of consecutive powers of two" is allowed to be the empty sum. Otherwise, you will have to write a special case to reject zero.
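As a sanity check, both bit tricks can be compared against a direct reading of the definition. The sketch below uses function names chosen here and rejects zero in all three versions. The reference test simply strips trailing zeros from the binary string and checks that only 1s remain.

```python
def by_definition(x):
    """True if bin(x) is a single block of 1s followed by any number of 0s."""
    stripped = bin(x)[2:].rstrip("0")
    return x > 0 and "0" not in stripped


def lsb_trick(x):
    # Adding the lowest set bit carries through the whole block of 1s.
    return x > 0 and (x & (x + (x & -x))) == 0


def clear_lowest_trick(x):
    y = x & (x - 1)      # x with its lowest 1 bit removed
    z = y | (y >> 1)     # the shift puts the removed bit back for a single block
    return x > 0 and (x == z or y == 0)


disagreements = [x for x in range(1, 1 << 12)
                 if not (by_definition(x) == lsb_trick(x) == clear_lowest_trick(x))]
print(disagreements)  # []
```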
A number can be represented as the sum of powers of 2 with sequential exponents iff its binary representation has all 1s adjacent.
E.g. the set of numbers that can be represented as 2^n + 2^(n-1), n >= 1, is exactly those with two adjacent ones in the binary representation.
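just like this:
#include <stdbool.h>
bool check(int x) { /* the number you want to check, assumed non-negative */
    int flag = 0; /* 0: no 1 seen yet, 1: inside the run of 1s, 2: the run has ended */
    for (; x > 0; x >>= 1) {
        if (x & 1) {
            if (flag == 2) return false; /* a second, separate run of 1s */
            flag = 1;
        } else if (flag == 1) {
            flag = 2; /* first 0 after the run */
        }
    }
    return true;
}
O(log n).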

Shifting and Masking Binary Bits

I came across this snippet of code in a book:
public static short countBits(int x) {
    short numBit = 0;
    while(x != 0) {
        numBit += (x&1);
        x >>>= 1;
    }
    return numBit;
}
However, I'm not really sure how numBit += (x&1); and x >>>= 1 works.
I think that numBit += (x&1) is comparing AND for a single digit and 1. Does it mean that if my binary number is 10001, the function is ANDing the 1000"1" bit with 1 on the first iteration of the while loop?
Also, what's the point of >>>= 1 ? I think that ">>>" is shifting the bits to the right by three but I can't figure out the purpose of doing so in this function.
Any help would be much appreciated. Thank you!
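This function counts the number of bits that are set to "1". x & 1 is a bitwise AND that isolates the least significant bit of x's current value (1 if x is odd, 0 if it's even), so it makes perfect sense to add it to numBit. x >>>= 1 is equivalent to x = x >>> 1. Here >>> is Java's unsigned (logical) right shift operator, not a shift by three; the 1 is the shift distance. So it means "shift the bits in x one position to the right, filling the vacated top bit with 0" (for non-negative x, this divides x by 2). With the signed shift >> a negative x would never reach 0, and the loop would not terminate.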
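To watch the loop at work outside Java, here is a minimal Python rendering. It is a sketch only: Python has no >>> operator, so the value is first masked to 32 bits (an assumption that mimics Java's int), after which the ordinary >> on a non-negative number fills with zeros just as >>> does.

```python
def count_bits(x):
    """Python rendering of countBits for a 32-bit pattern."""
    x &= 0xFFFFFFFF       # view x as 32 bits, so -1 becomes 32 ones
    num_bit = 0
    while x != 0:
        num_bit += x & 1  # adds 1 when the lowest bit is set, else 0
        x >>= 1           # drop the lowest bit; zeros come in from the left
    return num_bit


print(count_bits(0b10001))  # 2
print(count_bits(-1))       # 32
```

For 10001, the first iteration does AND the rightmost 1 with 1, then the shift moves the next bit into the lowest position, and so on until nothing but zeros is left.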

Is there a name for this algorithm? (I've been calling it changeBinary)

DESCRIPTION:
You take a binary string as input.
The first bit of the output is the same as the first bit of the input.
Every bit after that is 0 if the bit at that index of the input string is the same as the bit at the previous index in the input string. Otherwise, it's 1.
For example,
Input: 00011000001010100001001000010011
Output: 00010100001111110001101100011010
Here is a simple javascript implementation:
var changeBinary = function(binaryString){
    // the first output bit is a copy of the first input bit (kept as a string)
    var output = binaryString[0] === '0' ? '0' : '1';
    for (var i = 1; i < binaryString.length; i++){
        var nextBit = binaryString[i] === binaryString[i - 1] ? '0' : '1';
        output += nextBit;
    }
    return output;
}
OBSERVATIONS:
First, it seems that if you keep applying the algorithm to a string, it eventually returns to its original value. Second, the number of iterations it takes to do so seems to always be a power of 2 (including 2^0 = 1). For example, if you apply the changeBinary function above 32 times to the string above, it will return to the original value.
Has anyone ever encountered this before, and if so, do you know of any other information about it?
It just seems to me like this is something so simple and basic that someone must have studied it more in depth.
Any feedback would be greatly appreciated.
It may be interesting to know that this is x ^ (x << 1) on a BigInteger (or, if you limit the length of the strings, the same thing but on a fixed-size integer), also describable as clmul(x, 3).
Carryless multiplication, which is essentially just like normal multiplication, but instead of adding the partial products you XOR them, has some fairly nice properties, such as being commutative and associative. The associative property is especially of interest since it allows you to reason easily about what composing your algorithm with itself a couple of times does: for example
changeBinary o changeBinary is clmul(clmul(x, 3), 3) = clmul(x, clmul(3, 3)) = clmul(x, 5)
That it's a carryless multiplication by 3 also explains why it "undoes" itself when applied often enough, as the carryless multiplicative inverse of 3 is the number with all bits set, which with 32 bits is 0xffffffff, which can be formed as 3^31 (with carryless exponentiation). This also follows from the equivalence of a carryless square to a "bit-spread", which takes a bit string abcd to a0b0c0d, and thus clpow(3, 32) = 1: 5 spreads have spread the bits so far apart that only the original lsb is left over; the rest does not fit in a 32-bit number.
And that also gives a faster inversion, because the number with all bits set can be decomposed into small number of (carryless) factors:
3 x 5 x 17 x 257 x 65537 ...
With a number of factors that is the base two logarithm of the number of bits (rounded up).
Since x ^ (x >> 1) converts a number to Gray Code, I suppose you might call this a "mirrored" Gray Code. The same trick with the factors is used "in the mirror image" to convert a Gray Code back to binary:
x ^= x >> 1 // this is like a "mirror" of x = clmul(x, 3)
x ^= x >> 2 // 5
x ^= x >> 4 // 17
x ^= x >> 8
x ^= x >> 16
Here we just flip the direction of the shift to get:
x ^= x << 1
x ^= x << 2
x ^= x << 4
x ^= x << 8
x ^= x << 16
Which is clmul(x, 0xffffffff) and has also been called PS-XOR(x)
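These claims can be checked on the 32-bit example from the question. The Python sketch below uses helper names chosen here and assumes that index 0 of the string maps to bit 0 of the integer, which is why the string is reversed before parsing.

```python
MASK32 = 0xFFFFFFFF


def change(x):
    """One changeBinary step on a 32-bit value: x ^ (x << 1), i.e. clmul(x, 3)."""
    return (x ^ (x << 1)) & MASK32


def unchange(x):
    """Inverse step: the five left-shift XORs, i.e. clmul(x, 0xffffffff)."""
    for shift in (1, 2, 4, 8, 16):
        x = (x ^ (x << shift)) & MASK32
    return x


def from_string(bits):
    return int(bits[::-1], 2)        # string index 0 becomes bit 0


def to_string(x):
    return format(x, "032b")[::-1]


original = "00011000001010100001001000010011"
x = from_string(original)
once = to_string(change(x))

after_32 = x
for _ in range(32):
    after_32 = change(after_32)

print(once)                      # 00010100001111110001101100011010
print(after_32 == x)             # True
print(unchange(change(x)) == x)  # True
```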
The algorithm you described is an example of Delta Encoding.

Most efficient way to evaluate a binary scalar product mod 2

I am currently performing Fourier transforms for some physics problem, and a huge bottleneck of my algorithm comes from the evaluation of a scalar product modulo 2.
For a given integer N, I have to represent all the numbers in binary up to 2^N-1.
For each of these numbers, represented as a binary vector (e.g. 15 = 2^3 + 2^2 + 2^1 + 2^0 = (1,1,1,1,0,...,0)) I have to evaluate its scalar products with all numbers from 0 to 2^N-1 in binary form modulo 2.
(for example, the scalar product 1.15 =(1,0,0,...,0).(1,1,1,1,0,...,0)=1*1+0*1+...=1 mod 2)
Note that the components are kept in binary form during the reducing modulo 2
(1,1).(1,1)=1*1+1*1 and not 1*1+2*2
This is basically 2^(2N) scalar products that I have to perform and reduce modulo 2.
I am having difficulty getting beyond N = 18.
I was wondering whether some clever mathematical trick can be used to greatly reduce the time spent doing them.
I was thinking of some kind of recursion (i.e. saving results for N in a file and deduce the results for N+1) but I am not sure this would help. Indeed, with this recursion, knowing the results for N, I could cut the vector for N+1 corresponding to the N part plus an additional digit, but then at each scalar product, instead of evaluating the scalar product, I would have to tell my computer to go and read a big file (because I probably wouldn't be able to keep it all in dynamic memory), which is probably time-consuming, perhaps more than the ~20 multiplications I have to perform for each of the products.
Is there any known optimized number-theoretical algorithm that allows such a scalar product modulo 2 to be evaluated very quickly? Are there any rules or ideas I am not aware of that I could exploit?
Sorry for the terrible formatting, I just can't get LateX to work in here.
The sum of the product of corresponding bits, modulo 2, will be equal to the number of 1 bits in the AND of the two numbers, modulo 2.
As you can get the binary representation of a number easily, it might not be necessary to actually create an array of bits for them; you can just use the integer data type in your programming language, which allows for at least 32 bits. Many languages offer bit operators, such as AND (&) and XOR (^).
Counting the 1 bits in a number can be done with the variable-precision SWAR algorithm.
Here is a program in Python that calculates this product modulo 2 for 2 numbers:
def numberOfSetBits(i):
    # valid for 0 <= i < 2**32
    i = i - ((i >> 1) & 0x55555555)
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
    # Python integers do not wrap at 32 bits, so mask the product before shifting
    return ((((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) & 0xFFFFFFFF) >> 24
def product(a, b):
    return numberOfSetBits(a & b) % 2
Instead of counting the bits with numberOfSetBits, you could fold the bits together with XORs, first the 16 most significant bits with the 16 least significant bits, then of that result the 8 most significant with the 8 least significant bits, until you have one bit left. Again in Python:
def bitParity(i):
    i = (i >> 16) ^ i
    i = (i >> 8) ^ i
    i = (i >> 4) ^ i
    i = (i >> 2) ^ i
    i = (i >> 1) ^ i
    return i % 2
def product(a, b):
    return bitParity(a & b)
If you change the order in which you evaluate these pairs (a matrix of size 2^N x 2^N), then you can efficiently figure out which products-mod-2 change in each row of your evaluation.
Using Gray code, you can iterate over each value from 0 ... 2^N-1 in a special order where only 1 bit of the outer-loop value changes each time. You can store 1 bit for each value from 0 ... 2^N-1 representing the previous row's product-mod-2 values, and then change it based on whether the changing bit has any effect, which it only does when the corresponding bit in the other (inner loop) number is 1 (if it's 0 then the binary AND will be 0 no matter what the value of the other bit).
In C:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    int N = 5;
    int max = (1 << N) - 1;
    // one bit per value of b, holding the previous row's result (assumes N >= 3)
    unsigned char* prev = calloc((1 << N) / 8, 1);
    if (prev == NULL) return 1;
    // for the first row all the products will be zero, so start at row 1
    for(int a = 1; a <= max; a++)
    {
        int grey = a ^ (a >> 1); // compute the grey code
        int prev_grey = (a - 1) ^ ((a - 1) >> 1);
        int changed_bit = grey ^ prev_grey;
        for(int b = 0; b <= max; b++)
        {
            // the product will be changed only if b has a 1 at the same place
            // (otherwise it will be 0 regardless)
            if(b & changed_bit)
            {
                prev[b >> 3] ^= (1 << (b & 7));
            }
            int mod = (prev[b >> 3] & (1 << (b & 7))) != 0;
            printf("mod value of %d and %d is %d\n", grey, b, mod);
        }
    }
    free(prev);
    return 0;
}
The inner loop can be optimized even more because you can easily figure out which values of b have a non-zero value in the position of the changed bit: for example if it's in position 10 then there will be runs of 1024 in a row of 0 then 1 etc. So you know that you have 1024 values where the product-mod-2 is the same as in the previous row etc. It's not clear to me if this helps you though because I don't know what you are doing with these products.
The inner loop could also be unrolled (e.g. 32 or 64 times) so that you don't read and write to the prev array each time, but rather process blocks of 32 or 64 bits at a time.
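The block-at-a-time idea can be sketched in Python by storing a whole row of products in one big integer (bit b of the row holds the product for inner value b), so that each Gray-code step becomes a single XOR with a precomputed mask. All names below are invented for this sketch, and it is meant for small N only, since every row has 2^N bits.

```python
def parity(x):
    return bin(x).count("1") & 1


def flip_mask(k, n):
    """Integer whose bit b is 1 exactly when b has bit k set, for 0 <= b < 2**n."""
    mask = 0
    for b in range(1 << n):
        if (b >> k) & 1:
            mask |= 1 << b
    return mask


def all_rows(n):
    """Map each outer value g to its row of products mod 2, packed into one int."""
    masks = [flip_mask(k, n) for k in range(n)]
    row = 0                      # row for 0: every product is 0
    rows = {0: row}
    for a in range(1, 1 << n):
        gray = a ^ (a >> 1)
        prev_gray = (a - 1) ^ ((a - 1) >> 1)
        k = (gray ^ prev_gray).bit_length() - 1  # index of the single changed bit
        row ^= masks[k]          # flip every entry whose b has bit k set
        rows[gray] = row
    return rows


rows = all_rows(4)
mismatches = [(g, b) for g in range(16) for b in range(16)
              if (rows[g] >> b) & 1 != parity(g & b)]
print(mismatches)  # []
```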

Rational comparison of bits

I have a number of type int. It lies within [0,255], that is, it fits in 8 bits. I often need to check conditions like this one:
2(int) = 00000010(Binary)
1. Bit 6 and bit 7 must be equal to 0 and 1 respectively.
And I check it like this:
if ((!(informationOctet_ & (1 << 6))) && (informationOctet_ & (1 << 7)))
{
    ...
}
But it is not very readable. Is it possible to write something more "beautiful"?
I cannot use std::bitset; my lead says it's a waste of resources and that we can do without it.
There are two reasonable solutions: either set all don't-care bits to zero, and then test the result, or set the don't-care bits to one and test the result:
(x & 0xC0) == 0x80
(x | ~0xC0) == ~0x40
As harold pointed out in the comment, the first form is far more common. This pattern is so common that the optimizer of your compiler will recognize it.
Other forms exist, but they're obscure: (((x ^ 0x80) & 0xC0) == 0) works just as well but is less clear. Note the parentheses around the & expression, which are needed because == binds more tightly than & in C and C++. Some ISAs cannot load large constants directly, so they use the equivalent of ((x>>6) & 0x3) == 0x2. Don't bother with this; your optimizer will.
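A quick exhaustive check over [0,255] shows that these forms agree with the test from the question. The sketch is in Python purely for convenience, with names chosen here. Python's ~ on its unbounded integers behaves like the C expression for this comparison, and the XOR variant is written with explicit parentheses as ((x ^ 0x80) & 0xC0) == 0.

```python
def original_check(x):
    """The question's test: bit 6 clear and bit 7 set."""
    return (not (x & (1 << 6))) and bool(x & (1 << 7))


variants = {
    "clear don't-care bits": lambda x: (x & 0xC0) == 0x80,
    "set don't-care bits": lambda x: (x | ~0xC0) == ~0x40,
    "xor form": lambda x: ((x ^ 0x80) & 0xC0) == 0,
    "shift form": lambda x: ((x >> 6) & 0x3) == 0x2,
}

failures = [(name, x)
            for name, check in variants.items()
            for x in range(256)
            if check(x) != original_check(x)]
print(failures)  # []
```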
You can apply a masking technique such as:
int i = 246; // Let's say any value.
int chk = (i & 0x06); // 0x06 is 00000110 in binary: keeps only the 6th and 7th bits, counting from the left
if (chk == 2) // we want the 6th bit to be 0 and the 7th to be 1, which is the value 2 in decimal
    printf("The 6th bit is 0 & 7th bit is 1");
else
    printf("Either 6th bit is not 0 or 7th bit is not 1, or both are not 0 & 1 respectively");
