There's a relatively new data structure (2020) called the XOR filter that's being used as a replacement for a Bloom filter.
What is an XOR filter? What advantages does it offer over the Bloom filter? And how does it work?
An XOR filter is designed as a drop-in replacement for a Bloom filter in the case where all the items to store in the filter are known in advance. Like the Bloom filter, it represents an approximation of a set where false negatives are not allowed, but false positives are.
Like a Bloom filter, an XOR filter stores a large array of bits. Unlike a Bloom filter, though, where we think of each bit as being its own array slot, in an XOR filter the bits are grouped together into L-bit sequences, for some parameter L we'll pick later. For example, an XOR filter might look like this:
+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| 11011 | 10010 | 11101 | 11100 | 01001 | 10101 | 01011 | 11001 | 11011 |
+-------+-------+-------+-------+-------+-------+-------+-------+-------+
Next, we pick three hash functions h1, h2, and h3 that, as in a Bloom filter, hash items to slots in the array. Those hash functions let us take an item x and compute its table code, which we do by XORing together the values stored in slots h1(x), h2(x), and h3(x). An example of this is shown here:
+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| 11011 | 10010 | 11101 | 11100 | 01001 | 10101 | 01011 | 11001 | 11011 |
+-------+-------+-------+-------+-------+-------+-------+-------+-------+
            ^                       ^       ^
            |                       |       |
          h3(x)                   h1(x)   h2(x)
Table code for x: 10010 xor 01001 xor 10101
                = 01110
To complete the picture, we need one more hash function called the fingerprinting function, denoted f(x). The fingerprinting function takes a value as input and outputs an L-bit number called the fingerprint of x. To see whether x is stored in the table, we check whether the table code for x matches the fingerprint for x. If so, we say that x is (probably) in the table. If not, we say that x is (definitely) not in the table.
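To make the membership test concrete, here is a minimal Python sketch. The names table, h1, h2, h3, and f are placeholders for whatever table, hash functions, and fingerprinting function an actual implementation provides.

def xor_filter_contains(table, h1, h2, h3, f, x):
    # The table code is the XOR of the three slots that x hashes to.
    table_code = table[h1(x)] ^ table[h2(x)] ^ table[h3(x)]
    # x is (probably) present iff its table code equals its fingerprint.
    return table_code == f(x)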
It's helpful to compare this idea against a Bloom filter. With a Bloom filter, we hash x to a number of positions, derive a value from those positions by AND-ing everything together, and finally check whether the value we get is equal to 1. With an XOR filter, we hash x to three positions, derive a value from those positions by XOR-ing them together, and finally check whether the value we get is equal to f(x).
To change the false positive rate for the XOR filter, we simply change the value of L. Specifically, the likelihood that f(x) coincidentally happens to match the XOR of the three locations given by h1(x), h2(x), and h3(x) is 2^-L, since that's the probability that a random L-bit value matches another. Therefore, to get a false positive rate of ε, we simply set L = log2(1/ε).
The challenging part is filling in the table. It turns out that there's a really simple strategy for doing so. To store a list of n elements, create a table of size 1.23n. Then, use this recursive procedure:
1. If there are no items left to place, you're done.
2. Pick an item x with the following property: x hashes to a table slot (say, slot k) that no other items hash to.
3. Remove x from the list of items to place and recursively place the remaining items.
4. Set the value of slot k to a number such that the XOR of the table slots x hashes to equals f(x). (This is always possible: simply XOR together the contents of the other two table slots and f(x), then store that in slot k.)
This procedure has a slight chance of getting stuck in step (2) if every table slot has at least two items hashing to it, but it can be shown that as long as you use at least 1.23n table slots the probability that this occurs is extremely small. If that happens, simply choose new hash functions and try again.
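Here is a rough Python sketch of that construction under some simplifying assumptions: Python's built-in hash(), salted with a seed, stands in for the three hash functions and the fingerprinting function; the table is split into three equal blocks (one per hash function) so that an item's three slots are always distinct; and when the peeling gets stuck, the seed is bumped to pick new hash functions.

def build_xor_filter(items, L=8):
    n = len(items)                        # items are assumed to be distinct
    block = max(1, int(1.23 * n / 3) + 1)
    size = 3 * block                      # roughly 1.23n slots, split into 3 blocks
    seed = 0

    def h(i, x):                          # slot for hash function i (i = 0, 1, 2)
        return i * block + hash((seed, i, x)) % block

    def f(x):                             # L-bit fingerprint of x
        return hash((seed, "fp", x)) % (1 << L)

    while True:
        # Record which (not yet placed) items hash to each slot.
        occupants = [set() for _ in range(size)]
        for x in items:
            for i in range(3):
                occupants[h(i, x)].add(x)

        # Peel items whose slot is singly occupied, remembering the order.
        stack = [k for k in range(size) if len(occupants[k]) == 1]
        order = []                        # (item, its singly occupied slot), in peel order
        while stack:
            k = stack.pop()
            if len(occupants[k]) != 1:
                continue
            (x,) = occupants[k]
            order.append((x, k))
            for i in range(3):            # removing x may create new singly occupied slots
                j = h(i, x)
                occupants[j].discard(x)
                if len(occupants[j]) == 1:
                    stack.append(j)

        if len(order) == n:
            break
        seed += 1                         # stuck: choose new hash functions and try again

    # Assign slots in reverse peel order so each item's three slots XOR to f(x).
    table = [0] * size
    for x, k in reversed(order):
        table[k] = 0
        table[k] = f(x) ^ table[h(0, x)] ^ table[h(1, x)] ^ table[h(2, x)]
    return table, h, f

def contains(table, h, f, x):
    return (table[h(0, x)] ^ table[h(1, x)] ^ table[h(2, x)]) == f(x)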
XOR filters have several advantages over regular Bloom filters.
To decrease the false positive rate of a Bloom filter, we have to add more hash functions. Specifically, for an error rate of ε, we need to use log2(1/ε) hash functions. An XOR filter, on the other hand, always uses exactly three hash functions.
As a consequence of this, lookups in a Bloom filter are typically slower than in an XOR filter, since each table slot probed is essentially in a random location and probably causes a cache miss. With Bloom filters, we have log2(1/ε) cache misses per lookup. With XOR filters, we have three cache misses per lookup.
Bloom filters use more space. A Bloom filter with error rate ε needs a table of 1.44n log2(1/ε) bits. An XOR filter has an array of 1.23n slots, each of which is log2(1/ε) bits long, for a total space usage of 1.23n log2(1/ε) bits.
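For a rough sense of scale (numbers derived from the formulas above, not from the original write-up): with ε = 1% we have log2(1/ε) ≈ 6.64, so a Bloom filter needs about 1.44 · 6.64 ≈ 9.6 bits per item, while an XOR filter with L = 7 (the next whole bit) uses about 1.23 · 7 ≈ 8.6 bits per item, at a slightly better false positive rate of 2^-7 ≈ 0.8%.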
XOR filters have one main disadvantage relative to Bloom filters, and that's that all the items to store in the XOR filter must be known in advance before the filter is constructed. This contrasts with Bloom filters, where items can be added incrementally over a long period of time. Aside from this, though, the XOR filter offers better performance and memory usage.
For more information about XOR filters, along with how they compare in terms of time and space to Bloom filters and cuckoo filters, check out this set of lecture slides, which explains how they work, along with where the 1.23 constant comes from and why we always use three hash functions.
Related
I have two arrays as follows:
A = [1,2,35,4,32,1,2,56,43,2,21]
B = [1,2,35,4,32,1,2,56,43,45,1]
As we can see, A and B share the same initial subsequence up to the element 43. My end goal is to calculate the XOR of the last uncommon elements of both of these sequences. Here, my goal is to find the XOR of {2,21,45,1}.
Currently, my approach is to store the running XOR of both of these arrays in two separate arrays (say, RESA[] and RESB[]) and then, whenever I am asked to find the XOR of A[0-10] and B[0-9], I just quickly perform a single XOR operation as follows:
RESA[10] ^ RESB[9]
This works because while XORing, common elements cancel out.
My problem here is: what if in every query a threshold T is passed? For example, in this case, if the threshold passed is 32, then I have to filter out elements that are less than 32 in both A and B and then XOR all the remaining elements. This definitely increases the complexity, and I cannot apply my earlier logic of keeping running XORs of elements.
Please let me know if you have any ideas on how to leverage XOR properties to come up with a constant-time approach as before, when there was no threshold.
You have already worked out that you can find the XOR of the uncommon elements by computing the XOR of every element in the two arrays.
XOR is a commutative and associative operator so we can reorder the arrays in any way we like and still have the same total XOR.
In particular, we can reverse sort each array, and then compute the running XOR of each sorted array.
With this preprocessing we can now compute the XOR of all elements above a threshold by using binary search on each sorted array to find how many elements are above T, followed by a lookup into the running XOR array.
This gives an O(log n) complexity for each query.
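A minimal Python sketch of this preprocessing and query, assuming "above the threshold" means strictly greater than T (swap bisect_right for bisect_left if elements equal to T should also be kept):

from bisect import bisect_right

def preprocess(arr):
    # Sort ascending; suffix[k] is the XOR of s[k:], i.e. of the largest elements.
    s = sorted(arr)
    suffix = [0] * (len(s) + 1)
    for k in range(len(s) - 1, -1, -1):
        suffix[k] = suffix[k + 1] ^ s[k]
    return s, suffix

def xor_above(s, suffix, t):
    # Elements strictly greater than t start at index bisect_right(s, t).
    return suffix[bisect_right(s, t)]

# Preprocess once; each query is then two binary searches and one XOR.
A = [1, 2, 35, 4, 32, 1, 2, 56, 43, 2, 21]
B = [1, 2, 35, 4, 32, 1, 2, 56, 43, 45, 1]
sa, xa = preprocess(A)
sb, xb = preprocess(B)
print(xor_above(sa, xa, 0) ^ xor_above(sb, xb, 0))   # XOR of the uncommon elements {2, 21, 45, 1}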
Extension
The above answer assumes that the query is just the threshold 32: i.e. the start is always 0, and the end is always the length of each sequence. (I assume this because the question says the final goal is to compute the XOR of all uncommon elements.)
If the query also consisted of the start and end of the region to be XORed I would suggest a different approach that requires more storage (because it requires all queries to be buffered and sorted):
Sort all the queries by threshold
Maintain a segment tree of the XOR for each sequence, initialized to 0.
Add the values into the sequences in decreasing order, and perform the queries as soon as all values above their threshold have been inserted.
For example, the segment tree for a sequence C=[1,2,35,4,32,1,2,56] would contain:
Leaves:   1     2     35    4     32    1     2     56
Pairs:    1^2         35^4        32^1        2^56
Quads:    1^2^35^4                32^1^2^56
Root:     1^2^35^4^32^1^2^56
Once we have these values we can compute the XOR of any range using log(n) steps. For example, suppose we wanted to compute the XOR of C[1:3] = [2,35,4]. We can do this by xoring 2 with 35^4.
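For reference, a minimal point-update / range-XOR segment tree along these lines could look like the sketch below (an iterative bottom-up variant; the class name and layout are illustrative, not part of the answer above). For the offline trick, you would insert values in decreasing order and answer each buffered query once everything above its threshold has been set; here the usage just builds the full tree for the example sequence C.

class XorSegmentTree:
    def __init__(self, n):
        self.n = n
        self.tree = [0] * (2 * n)          # leaves live at indices n .. 2n-1

    def set(self, i, value):
        i += self.n
        self.tree[i] = value
        while i > 1:                       # walk upwards, recombining parents
            i //= 2
            self.tree[i] = self.tree[2 * i] ^ self.tree[2 * i + 1]

    def xor_range(self, lo, hi):
        # XOR of elements lo..hi inclusive, in O(log n) steps.
        result = 0
        lo += self.n
        hi += self.n + 1
        while lo < hi:
            if lo & 1:
                result ^= self.tree[lo]
                lo += 1
            if hi & 1:
                hi -= 1
                result ^= self.tree[hi]
            lo //= 2
            hi //= 2
        return result

t = XorSegmentTree(8)
for i, v in enumerate([1, 2, 35, 4, 32, 1, 2, 56]):
    t.set(i, v)
print(t.xor_range(1, 3))                   # C[1..3] = [2, 35, 4], giving 2 ^ 35 ^ 4 = 37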
I want to compute the OR over all pairs formed by taking the XOR of all the numbers in a list.
e.g.: 10, 15, 17
ans = (10^15) | (15^17) | (10^17) = 31. I have made an O(n*k) algorithm but need something better than that (n is the number of entries and k is the number of bits in each number).
It may be easiest to think in negatives here.
XOR is basically "not equal to"--i.e., it produces a result of 1 if and only if the two input bits are not equal to each other.
Since you're ORing all those results together, it means you get a 1 bit in the result anywhere there are at least two inputs that have different values at that bit position.
Inverting that, it means that we get a zero in the result only where every input has the same value at that bit position.
To compute that we can accumulate two intermediate values. For one, we AND together all the inputs. This will give us the positions at which every input had a one. For the other, we invert every input, and AND together all those results. This will tell us every position at which all the inputs had the value 0.
OR those together, and we have a value with a 1 where every input was equal, and a zero otherwise.
Invert that, and we get the desired result: 0 where all inputs were equal, and 1 where any was different.
This lets us compute the result with linear complexity (assuming each input value fits into a single word).
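A small Python sketch of this linear pass (the fixed word width is an assumption; pick whatever width your inputs actually have):

def or_of_all_pairwise_xors(values, width=32):
    mask = (1 << width) - 1
    all_ones = mask          # AND of all inputs: bits where every input has a 1
    all_zeros = mask         # AND of all inverted inputs: bits where every input has a 0
    for v in values:
        all_ones &= v
        all_zeros &= ~v & mask
    equal_everywhere = all_ones | all_zeros
    return ~equal_everywhere & mask        # 1 exactly where at least two inputs differ

print(or_of_all_pairwise_xors([10, 15, 17]))   # 31, matching the example above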
Given a bit array of fixed length and the number of 0s and 1s it contains, how can I arrange all possible combinations such that returning the i-th combinations takes the least possible time?
It is not important the order in which they are returned.
Here is an example:
array length = 6
number of 0s = 4
number of 1s = 2
possible combinations (6! / 4! / 2!)
000011 000101 000110 001001 001010
001100 010001 010010 010100 011000
100001 100010 100100 101000 110000
problem
1st combination = 000011
5th combination = 001010
9th combination = 010100
With a different arrangement such as
100001 100010 100100 101000 110000
001100 010001 010010 010100 011000
000011 000101 000110 001001 001010
it shall return
1st combination = 100001
5th combination = 110000
9th combination = 010100
Currently I am using an O(n) algorithm which tests for each bit whether it is a 1 or a 0. The problem is I need to handle lots of very long arrays (on the order of 10000 bits), and so it is still very slow (and caching is out of the question). I would like to know if you think a faster algorithm may exist.
Thank you
I'm not sure I understand the problem, but if you only want the i-th combination without generating the others, here is a possible algorithm:
There are C(M,N)=M!/(N!(M-N)!) combinations of N bits set to 1 having at most highest bit at position M.
You want the i-th: you iteratively increment M until C(M,N)>=i
while( C(M,N) < i ) M = M + 1
That will tell you the highest bit that is set.
Of course, you compute the binomial coefficient iteratively with
C(M+1,N) = C(M,N)*(M+1)/(M+1-N)
Once found, you have a problem of finding (i-C(M-1,N))th combination of N-1 bits, so you can apply a recursion in N...
Here is a possible variant with D = C(M+1,N) - C(M,N), and i = i - 1 to make it start at zero:
SOL = 0
i = i - 1
while (N > 0)
    M = N
    C = 1
    D = 1
    while (i >= D)
        i = i - D
        M = M + 1
        D = N * C / (M - N)
        C = C + D
    SOL = SOL + (1 << (M - 1))
    N = N - 1
return SOL
This will require large integer arithmetic if you have that many bits...
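For reference, here is a direct Python translation of that pseudocode (Python's integers already take care of the large-number arithmetic). The indexing convention matches the first arrangement in the question, where combination 1 is 000011:

def ith_combination(i, n_ones):
    sol = 0
    i -= 1                      # zero-based index, as in the pseudocode
    n = n_ones
    while n > 0:
        m = n                   # lowest possible (1-based) position of the current highest 1
        c = 1                   # running binomial coefficient C(m, n)
        d = 1                   # combinations consumed before the highest 1 moves up
        while i >= d:
            i -= d
            m += 1
            d = n * c // (m - n)
            c += d
        sol |= 1 << (m - 1)     # place this 1 at bit m-1
        n -= 1
    return sol

# The examples from the question (length 6, four 0s, two 1s):
assert format(ith_combination(1, 2), "06b") == "000011"
assert format(ith_combination(5, 2), "06b") == "001010"
assert format(ith_combination(9, 2), "06b") == "010100"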
If the ordering doesn't matter (it just needs to remain consistent), I think the fastest thing to do would be to have combination(i) return anything you want that has the desired density the first time combination() is called with argument i. Then store that value in a member variable (say, a hashmap that has the value i as key and the combination you returned as its value). The second time combination(i) is called, you just look up i in the hashmap, figure out what you returned before and return it again.
Of course, when you're returning the combination for argument i, you'll need to make sure it's not something you have already returned for some other argument.
If the number of distinct combinations you will ever be asked to return is significantly smaller than the total number of combinations, an easy implementation for the first call to combination(i) would be to make a value of the right length with all 0s, randomly set num_ones of the bits to 1, and then make sure it's not one you've already returned for a different value of i.
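A sketch of that idea in Python (combination, _cache, and the retry-based uniqueness check are all illustrative; with very long arrays and many calls you would want something smarter than retrying random patterns):

import random

_cache = {}          # maps an index i to the pattern previously returned for it

def combination(i, length, num_ones):
    if i in _cache:
        return _cache[i]                     # second and later calls: return the stored pattern
    used = set(_cache.values())
    while True:
        ones = random.sample(range(length), num_ones)
        pattern = sum(1 << b for b in ones)  # arbitrary pattern with the desired density
        if pattern not in used:              # don't reuse a pattern given to another index
            _cache[i] = pattern
            return pattern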
Your problem appears to be constrained by the binomial coefficient. In the example you give, the problem can be translated as follows:
there are 6 items that can be chosen 2 at a time. By using the binomial coefficient, the total number of unique combinations can be calculated as N! / (K! (N - K)!), which for the case of K = 2 simplifies to N(N-1)/2. Plugging 6 in for N, we get 15, which is the same number of combinations that you calculated with 6! / 4! / 2! - which appears to be another way to calculate the binomial coefficient that I have never seen before. I have tried other combinations as well and both formulas generate the same number of combinations. So, it looks like your problem can be translated to a binomial coefficient problem.
Given this, it looks like you might be able to take advantage of a class that I wrote to handle common functions for working with the binomial coefficient:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes.
Uses Mark Dominus's method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coefficient.
It should not be hard to convert this class to the language of your choice.
There may be some limitations since you are using a very large N that could end up creating larger numbers than the program can handle. This is especially true if K can be large as well. Right now, the class is limited to the size of an int. But, it should not be hard to update it to use longs.
Given a pseudorandom number generator int64 rand64(), I would like to build a set of pseudo random numbers. This set should have the property that the XOR combinations of each subset should not result in the value 0.
I'm thinking of following algorithm:
count = 0
set = {}
while (count < desiredSetSize)
    set[count] = rand64()
    if propertyIsNotFullfilled(set[0] to set[count])
        continue
    count = count + 1
The question is: How can propertyIsNotFullfilled be implemented?
Notes: The reason why I would like to generate such a set is the following: I have a hash table where the hash values are generated via Zobrist hashing. Instead of keeping a boolean value with each hash table entry indicating whether the entry is filled, I thought the hash value – which is stored with each entry – would be sufficient for this information (0 ... empty, != 0 ... set). There is another reason to carry this information as a sentinel value inside the hash-key table. I'm trying to switch from an AoS (Array of Structures) to a SoA (Structure of Arrays) memory layout. I'm trying this to avoid padding and to test whether there are fewer cache misses. I hope that in most cases access to the hash-key table is enough (assuming that the hash value provides the information whether the entry is empty or not).
I also thought about reserving the most significant bit of the hash values for this information, but this would reduce the range of possible hash values more than necessary. Theoretically the range would be reduced from 2^64 (minus the sentinel 0-value) to 2^63.
One can read the question in the other way: Given a set of 84 pseudorandom numbers, is there any number which can't be generated by XORing any subset of this set, and how to get it? This number can be used as sentinel value.
Now, for what I need it: I have developed a connect four game engine. There are 6 x 7 moves possible for player A and also for player B. Thus there are 84 possible moves (therefore 84 random values needed). The hash value of a board-state is generated by the precalculated random values in the following manner: hash(board) = randomset[move1] XOR randomset[move2] XOR randomset[move3] ...
This set should have the property that the XOR combinations of each subset should not result in the value 0.
IMHO this would restrict the maximum size of the set to 64 (pigeonhole principle); with more than 64 values, there will always be a (non-empty) subset that XORs to zero. For smaller sets, the property can be fulfilled.
To further illustrate my point: consider a system of 64 equations over 64 unknown variables. Then, add one extra equation. The fact that the equations and variables are booleans does not make the problem different.
--EDIT/UPDATE--: Since the application appears to be the game "connect-four", you could instead enumerate all possible configurations. Excluding the impossible board configurations saves enough coding space to fit any valid board position in 64 bits:
Encoding the colored stones as {A,B} and the irrelevant ones as {X}, the configuration of a (height=6) column can be one of:
                  X
               X  X
            X  X  X
         X  X  X  X
      X  X  X  X  X
_  A  A  A  A  A  A    <<-- possible configurations for one pile
--+--+--+--+--+--+--+
1  1  2  4  8  16 32   <<-- number of combinations of the Xs
               -2 -5   <<-- number of impossible Xs
(and similar for B instead of A). The numbers below the piles are the number of possibilities for the Xs on top, and the negative numbers are the number of forbidden/impossible configurations. For the column with one A and 4 Xs, every value for the Xs is valid, except the three Xs directly above the A all being A (the game would already have ended). The same for the rightmost pile: the bottom 3 Xs cannot all be A, and the Xs cannot all be B.
This leads to a total of 1 + 2 * (63-7) := 113.
(1 is for the empty column, 2 is the number of colors). So: 113 is the number of configurations for one column, fitting well within 7 bits. For 7 columns we'll need 7*7 := 49 bits. (We might save one bit for the L/R mirror symmetry, maybe even one for the color symmetry, but that would only complicate things, IMHO.)
There will still be a lot of coding space wasted (the columns are not independent, the number of As on the board is equal to the number of Bs or one more, etc.), but I don't think it would be easy to avoid that. Fortunately, it will not be necessary.
To amplify wildplasser: every hash function that can be used to distinguish every n-bit string from every other n-bit string cannot have output shorter than n bits. Shorter hash functions are usable because we only have to avoid collisions in the strings that actually arrive, but we cannot hope to make an intelligent choice offline. Just use a cryptographically secure RNG and one of two things will happen: (i) your code will work as though the RNG were truly random, or (ii, unlikely) your code will break and (if it's not bugged) it will act as a distinguisher between the crypto RNG and true randomness, bringing you fame and notoriety.
Amplifying the answer by wildplasser a little bit more, here is an idea of how to implement propertyIsNotFullfilled.
Represent the set of pseudo-random numbers as a {0,1}-matrix. Perform Gaussian elimination (use XOR instead of the usual multiply/subtract operations). If you get a matrix where the last row is zero, return true; otherwise return false.
Naturally, this function will return true very frequently when the size of the set is close to 64, so the algorithm in the OP is efficient only for relatively small sizes.
To optimize this algorithm, you can keep the result of the last Gaussian elimination.
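A Python sketch of that check, written here as a one-shot test over the whole set (to use the optimization just mentioned, keep basis between calls and only reduce the newest value against it):

def propertyIsNotFullfilled(values):
    # True iff some non-empty subset of the 64-bit values XORs to zero,
    # i.e. the values are linearly dependent over GF(2).
    basis = [0] * 64                 # basis[b] holds a reduced value whose highest set bit is b
    for v in values:
        for b in range(63, -1, -1):  # reduce v against the basis, highest bit first
            if not (v >> b) & 1:
                continue
            if basis[b] == 0:
                basis[b] = v         # v is independent of everything kept so far
                break
            v ^= basis[b]            # cancel bit b and keep reducing
        else:
            return True              # v was reduced to zero: a zero-XOR subset exists
    return False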
I want to store lots of data so that
they can be accessed by an index,
each datum is just yes or no (so probably one bit is enough for each).
I am looking for the data structure which has the highest performance and occupies the least space.
Storing the data in flat memory, one bit per item, is probably not a good choice; on the other hand, using various tree structures still uses lots of memory (e.g. pointers are needed in each node to build the tree, even though each node has just one bit of data).
Does anyone have any idea?
What's wrong with using a single block of memory and either storing 1 bit per byte (easy indexing, but wastes 7 bits per byte) or packing the data (slightly trickier indexing, but more memory efficient)?
Well, in Java the BitSet might be a good choice: http://download.oracle.com/javase/6/docs/api/java/util/BitSet.html
If I understand your question correctly you should store them in an unsigned integer where you assign each value to a bit of the integer (flag).
Say you represent 3 values and they can be on or off. Then you assign the first to 1, the second to 2 and the third to 4. Your unsigned int can then be 0,1,2,3,4,5,6 or 7 depending on which values are on or off and you check the values using bitwise comparison.
Depends on the language and how you define 'index'. If you mean that the index operator must work, then your language will need to be able to overload the index operator. If you don't mind using an index macro or function, you can access the nth element by dividing the given index by the number of bits in your type (say 8 for char, 32 for uint32_t and variants), then returning the result of arr[n / n_bits] & (1 << (n % n_bits)).
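For example, a packed bit array along those lines might look like the following Python sketch (one bit per entry, eight entries per byte; the class and method names are just for illustration):

class PackedBits:
    def __init__(self, n):
        self.bits = bytearray((n + 7) // 8)   # eight entries per byte

    def get(self, i):
        return (self.bits[i // 8] >> (i % 8)) & 1

    def set(self, i, value):
        if value:
            self.bits[i // 8] |= 1 << (i % 8)
        else:
            self.bits[i // 8] &= 0xFF ^ (1 << (i % 8))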
Have a look at a Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
It performs very well and is space-efficient. But make sure you read the fine print below ;-). Quote from the above wiki page:
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1. To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements.
The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate (Dillinger & Manolios (2004a), Kirsch & Mitzenmacher (2006)). Specifically, Dillinger & Manolios (2004b) show the effectiveness of using enhanced double hashing or triple hashing, variants of double hashing, to derive the k indices using simple arithmetic on two or three indices computed with independent hash functions.
Removing an element from this simple Bloom filter is impossible. The element maps to k bits, and although setting any one of these k bits to zero suffices to remove it, this has the side effect of removing any other elements that map onto that bit, and we have no way of determining whether any such elements have been added. Such removal would introduce a possibility for false negatives, which are not allowed.
One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which are not permitted. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter. However, it is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.