How to measure the rate of false positives in a Bloom Filter

You have a bloom filter, and you want to measure the rate of false positives, practically (not theoretically).
How do you go about it?
Do you insert N elements into it and count the number of hash collisions and divide by N, and that's it?
Or do you insert N elements and then do membership tests for all other elements which were not inserted (which is infinite in general)?
Or something else?

Imagine that your calculations say you'll have a false positive rate of X when there are N items in the Bloom filter.
Generate 2N unique random keys. Place half of them into the Bloom filter. Now test with the other half. You know the keys are unique, so any "positive" hit you get will be a false positive.
Compare your experimental result with the computed result.
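
A minimal sketch of that experiment, assuming a hypothetical filter object with add() and might_contain() methods (those names, and make_filter, are made up here, not a specific library):

import random
import string

def measure_fp_rate(make_filter, n, key_len=16):
    # make_filter, add and might_contain are assumed names, not a real API.
    keys = set()
    while len(keys) < 2 * n:                       # 2N unique random keys
        keys.add(''.join(random.choices(string.ascii_lowercase, k=key_len)))
    keys = list(keys)
    inserted, probes = keys[:n], keys[n:]
    bf = make_filter()
    for k in inserted:                             # insert half of the keys
        bf.add(k)
    hits = sum(1 for k in probes if bf.might_contain(k))
    return hits / n                                # every hit is a false positive

You can then compare measure_fp_rate(...) against the rate X predicted for N items.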

Related

Efficiently search for pairs of numbers in various rows

Imagine you have N distinct people and a record of where these people have been, with exactly M such records.
For example
1,50,299
1,2,3,4,5,50,287
1,50,299
So you can see that 'person 1' is at the same place as 'person 50' three times. Here M = 3, since there are only 3 lines. My question is: given M of these lines and a threshold value (i.e. person A and B have been at the same place more than threshold times), what do you suggest as the most efficient way of returning these co-occurrences?
So far I've built an N by N table and looped through each row, incrementing table(N,M) every time N co-occurs with M in a row. Obviously this is an awful approach and takes O(n^2) to O(n^3) depending on how you implement it. Any tips would be appreciated!
There is no need to create the table. Just create a hash/dictionary/whatever your language calls it. Then, in Python-style pseudocode:
from collections import defaultdict
from itertools import combinations

count = defaultdict(int)
answer = []
for S in sets:
    for (i, j) in combinations(sorted(S), 2):   # every pair of people in one record
        count[(i, j)] += 1
        if count[(i, j)] == threshold:
            answer.append((i, j))
If you have M sets each of size K, the running time will be O(M*K^2).
If you want you can actually keep the list of intersecting sets in a data structure parallel to count without changing the big-O.
Furthermore the same algorithm can be readily implemented in a distributed way using a map-reduce. For the count you just have to emit a key of (i, j) and a value of 1. In the reduce you count them. Actually generating the list of sets is similar.
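For example, the map and reduce steps might look roughly like this (a framework-agnostic sketch; the generator shape and names are assumptions, not tied to any particular MapReduce library):

from itertools import combinations

def map_pairs(record):
    # Mapper: for one record (a set of person ids), emit ((i, j), 1) per pair.
    for i, j in combinations(sorted(record), 2):
        yield (i, j), 1

def reduce_pairs(pair, counts, threshold):
    # Reducer: sum the 1s for one pair and keep it if it meets the threshold.
    total = sum(counts)
    if total >= threshold:
        yield pair, total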
The relevant concept for your case is Market Basket Analysis. In this context there are different algorithms; for example, the Apriori algorithm can be applied to your case, restricted to itemsets of size 2.
Moreover, to find association rules with a given support (which in your case is the threshold value), techniques such as LSH and MinHash can also be used.
You could use sampling to speed it up, e.g. only check each pair with 1/50 probability. That will give you a 50x speed-up. Then double-check any pairs that get close enough to 1/50th of M.
To double-check any pairs, you can either go through the whole list again, or you could double-check more efficiently if you do some clever kind of reverse indexing as you go. E.g. encode each person's row indices into 64-bit integers; you could use binary search / merge sort type techniques to see which 64-bit integers to compare, and use bit operations to compare 64-bit integers for matches. Other things to look up could be reverse indexing and binary indexed range trees / Fenwick trees.
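A rough sketch of the sampling pass, under the assumption that "close enough to 1/50th" means close to 1/50th of the threshold (the cutoff margin here is a guess; candidates would still need an exact re-count):

import random
from itertools import combinations
from collections import defaultdict

def sampled_candidates(sets, threshold, rate=1/50):
    sampled = defaultdict(int)
    for s in sets:
        for pair in combinations(sorted(s), 2):
            if random.random() < rate:          # count each pair with probability `rate`
                sampled[pair] += 1
    cutoff = 0.5 * rate * threshold             # guessed "close enough" margin
    return [p for p, c in sampled.items() if c >= cutoff]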

Alternatives to Bloom Filter

I have tried using a Bloom filter to perform membership tests. I wish to perform membership tests on 80 billion entries while allowing only around 100 collisions, i.e. only 100 entries may be given a false positive result.
I know this can be achieved with Bloom filters, but using the formulas for the number of bits required per entry and the number of hash functions given the allowed false positive rate, I figured I would end up using 270 GB of memory and 19 hash functions.
I also had a look at the Cuckoo filter, but its memory usage doesn't fit my constraints. My requirements are as follows:
Use at most 6 bits per element
Use no more than 7-8 hash functions
Could someone suggest a probabilistic data structure other than the ones mentioned above that can help achieve these requirements?
The number of hash functions is not really a problem: just pick a single hash function with many bits of output and divide the bits up as if they came from separate hash functions. Your real problem here is the trade-off between false positive rate and storage space.
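A minimal sketch of that trick, assuming the key is already bytes (SHA-256 is just one convenient wide hash to slice up):

import hashlib

def k_hashes(key, k, m):
    # Slice one 256-bit digest into k smaller hashes, each reduced mod m (table size).
    digest = hashlib.sha256(key).digest()
    value = int.from_bytes(digest, 'big')
    bits_per_hash = (len(digest) * 8) // k
    mask = (1 << bits_per_hash) - 1
    return [((value >> (i * bits_per_hash)) & mask) % m for i in range(k)]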
You have said
I wish to perform membership tests on 80 billion entries while allowing only around 100 collisions, i.e. only 100 entries may be given a false positive result.
Entries that are in the map can, by definition, not be false positives. They are true positives.
The question then is: 100 false positives taken over how many entries that you intend to test? If the answer is also, oddly enough, 80 billion, then you are asking for a false positive rate of around 100/80,000,000,000 = 1/800,000,000, which is less than 2^-29.
The minimum space for any approximate membership data structure like Bloom filters or cuckoo filters is n lg 1/ε bits, where n is the number of elements in the structure, lg is the logarithm base 2, and ε is the false positive rate. In other words, you need more than 29 bits per element to achieve a false positive rate like 100 per 80 billion. 6 bits per element will get you 1.56% false positive rate at best. That's 1.25 billion per 80 billion, or 100 per 6400.
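Spelling out that arithmetic as a quick check:

import math

n = 80_000_000_000
eps = 100 / n                                    # desired false positive rate: 1.25e-9
min_bits = math.log2(1 / eps)                    # ~29.6 bits per element at minimum
eps_at_6_bits = 2 ** -6                          # ~1.56%, the best possible at 6 bits/element
expected_fps = n * eps_at_6_bits                 # ~1.25 billion false positives over 80 billion tests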
As far as I know there are no known practical data structures that come close to achieving this. Bloom filters don't, for instance, because they use more than lg 1/ε bits per item. Cuckoo filters don't because they use at least two additional metadata bits per item and have a bits-per-item rate that scales with lg n.

Best algorithm to find N unique random numbers in VERY large array

I have an array with, for example, 1,000,000,000,000 elements (integers). What is the best approach to pick, for example, only 3 random and unique elements from this array? The elements must be unique within the whole array, not merely distinct among the N (3 in my example) picked elements.
I read about reservoir sampling, but it only provides a way to pick random elements, which can be non-unique.
If the odds of hitting a non-unique value are low, your best bet will be to select 3 random numbers from the array, then check each against the entire array to ensure it is unique - if not, choose another random sample to replace it and repeat the test.
If the odds of hitting a non-unique value are high, this increases the number of times you'll need to scan the array looking for uniqueness and makes the simple solution non-optimal. In that case you'll want to split the task of ensuring unique numbers from the task of making a random selection.
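A minimal sketch of the low-duplicate case (re-drawing any candidate whose value is not unique; note this loops forever if fewer than 3 unique values exist):

import random

def pick_three_by_rejection(a):
    picked = []
    while len(picked) < 3:
        v = a[random.randrange(len(a))]            # random candidate
        if v in picked:
            continue
        if sum(1 for x in a if x == v) == 1:       # full scan to confirm uniqueness
            picked.append(v)
    return picked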
Sorting the array is the easiest way to find duplicates. Most sorting algorithms are O(n log n), but since your keys are integers Radix sort can potentially be faster.
Another possibility is to use a hash table to find duplicates, but that will require significant space. You can use a smaller hash table or Bloom filter to identify potential duplicates, then use another method to go through that smaller list.
import random

counts = [0] * (MAXINT - MININT + 1)      # assumes a known, bounded integer range
for value in elements:
    counts[value - MININT] += 1
uniques = [v + MININT for v, c in enumerate(counts) if c == 1]
result = random.sample(uniques, 3)
I assume that you have a reasonable idea what fraction of the array values are likely to be unique. So you would know, for instance, that if you picked 1000 random array values, the odds are good that one is unique.
Step 1. Pick 3 random hash algorithms. They can all be the same algorithm, except that you add different integers to each as a first step.
Step 2. Scan the array. Hash each integer all three ways, and for each hash algorithm, keep track of the X lowest hash codes you get (you can use a priority queue for this), and keep a hash table of how many times each of those integers occurs.
Step 3. For each hash algorithm, look among its X lowest-hash candidates for an element that occurs exactly once in the array. If that element was already picked for another hash algorithm, find another. (That should be a rare boundary case.)
That is your set of three random unique elements. Every unique triple should have even odds of being picked.
(Note: For many purposes it would be fine to just use one hash algorithm and find 3 things from its list...)
This algorithm will succeed with high likelihood in one pass through the array. What is better yet is that the intermediate data structure that it uses is fairly small and is amenable to merging. Therefore this can be parallelized across machines for a very large data set.
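A rough Python sketch of one hash algorithm's candidate structure (for clarity this version scans once per salt; the three scans can be fused into the single pass the answer describes, and X = 1000 is an arbitrary choice):

import heapq
from collections import Counter

def lowest_hash_candidates(elements, x, salt):
    # Keep the x distinct values with the lowest salted hash, plus a count of
    # how many times each of those candidate values occurs in the array.
    heap = []                          # max-heap via negated hash: worst candidate on top
    counts = Counter()
    for v in elements:
        if v in counts:                # already a candidate: just bump its count
            counts[v] += 1
            continue
        code = hash((salt, v))
        if len(heap) < x:
            heapq.heappush(heap, (-code, v))
            counts[v] = 1
        elif code < -heap[0][0]:       # beats the current worst candidate
            _, evicted = heapq.heapreplace(heap, (-code, v))
            del counts[evicted]
            counts[v] = 1
    return heap, counts

def pick_three_by_min_hashes(elements, x=1000):
    picked = []
    for salt in (1, 2, 3):                         # three "different" hash functions
        heap, counts = lowest_hash_candidates(elements, x, salt)
        for _, v in sorted(heap, reverse=True):    # lowest hash codes first
            if counts[v] == 1 and v not in picked:
                picked.append(v)
                break
    return picked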

Limited Sort/Filter Algorithm

I have a rather large list of elements (100s of thousands).
I have a filter that can either accept or not accept elements.
I want the top 100 elements that satisfy the filter.
So far, I have sorted the results first and then taken the top 100 that satisfy the filter. The rationale behind this is that the filter is not entirely fast.
But right now, the sorting step is taking way longer than the filtering step, so I would like to combine them in some way.
Is there an algorithm to combine the concerns of sorting/filtering to get the top 100 results satisfying the filter without incurring the cost of sorting all of the elements?
My instinct is to select the top 100 elements from the list (much cheaper than a sort, use your favorite variant of QuickSelect). Run those through the filter, yielding n successes and 100-n failures. If n < 100 then repeat by selecting 100-n elements from the top of the remainder of the list:
k = 100
while k > 0:
    select the top k elements from the list and remove them
    filter them, yielding n successes
    k = k - n
All being well this runs in time proportional to the length of the list, since each selection step runs in that time, and the number of selection steps required depends on the success rate of the filter, but not directly on the size of the list.
I expect this has some bad cases, though. If almost all elements fail the filter then it's considerably slower than just sorting everything, since you'll end up selecting thousands of times. So you might want some criteria to bail out if it's looking bad, and fall back to sorting the whole list.
It also has the problem that it will likely do a largish number of small selects towards the end, since we expect k to decay exponentially if the filter criteria are unrelated to the sort criteria. So you could probably improve it by selecting somewhat more than k elements at each step: say, k divided by the expected success rate of the filter, plus a small constant. The expected rate can be based on past performance if there's no domain knowledge you can use to predict it, and the small constant chosen experimentally to avoid an annoyingly large number of steps to find the last few elements. If you end up at any step with more items that have passed the filter than the number you're still looking for (i.e., n > k), then select the top k from the current batch of successes and you're done.
Since QuickSelect gives you the top k without sorting those k, you'll need to do a final sort of 100 elements if you need the top 100 in order.
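A minimal runnable sketch of that loop, with heapq.nlargest standing in for a proper QuickSelect and items tracked by index so removal stays cheap (the function and parameter names are made up here):

import heapq

def top_k_passing(items, key, accepts, k=100):
    remaining = list(range(len(items)))            # work with indexes so removal is easy
    kept = []
    while len(kept) < k and remaining:
        need = k - len(kept)
        batch = heapq.nlargest(need, remaining, key=lambda i: key(items[i]))
        taken = set(batch)
        remaining = [i for i in remaining if i not in taken]
        kept.extend(i for i in batch if accepts(items[i]))
    kept.sort(key=lambda i: key(items[i]), reverse=True)   # final ordering of the winners
    return [items[i] for i in kept[:k]]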
I've solved this exact problem by using a binary tree for sorting and by keeping count of the elements to the left of the current node during insertion. See http://pub.uni-bielefeld.de/publication/2305936 (Figure 4.4 et al) for details.
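A tiny sketch of just the core idea (an unbalanced BST where each node tracks its left-subtree size, so the rank of each element is known at insertion time; the top-100 bookkeeping described in the paper is omitted):

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.left_size = 0                 # number of elements to the left of this node

def insert_with_rank(root, key):
    # Returns (root, rank) where rank = number of existing keys smaller than key.
    if root is None:
        return Node(key), 0
    if key < root.key:
        root.left, rank = insert_with_rank(root.left, key)
        root.left_size += 1
        return root, rank
    root.right, rank = insert_with_rank(root.right, key)
    return root, root.left_size + 1 + rank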
If I understand right, you have two choices:
Selecting 100 elements first - N filter-check operations, then 100 lg(100) for the sort.
Sorting then selecting 100 elements - at least N lg(N) for the sort, then the selection.
The first sounds faster than sorting and then selecting.
I'd probably filter first, then insert the result of that into a priority queue. Keep track of the number of items in the PQ, and after you do the insert, if it's larger than the number you want to keep (100 in your case), pop off the smallest item and discard it.
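A short sketch of that, assuming an accepts(item) filter and a key(item) sort value (both names made up here); the running counter just breaks ties so items themselves are never compared:

import heapq
from itertools import count

def top_n_filtered(items, accepts, key, n=100):
    heap, tie = [], count()                  # min-heap of the best n items seen so far
    for item in items:
        if not accepts(item):
            continue
        entry = (key(item), next(tie), item)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:          # better than the current smallest kept item
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in sorted(heap, reverse=True)]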
Steve's suggestion to use Quicksort is a good one.
1. Read in the first 1000 or so elements.
2. Sort them and pick the 100th largest element.
3. Run one pass of Quicksort on the whole file with the element from step 2 as the pivot.
4. Select the upper half of the result of the Quicksort pass for further processing.
You are guaranteed at least 100 elements in the upper half of the single pass of Quicksort. Assuming the first 1000 are reasonably representative of the whole file then you should end up with about one tenth of the original elements at step 4.
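A compact sketch of steps 1-4, assuming the data fits in a list with at least 1000 elements (a single partition pass stands in for the Quicksort pass):

def prefilter_by_pivot(values, sample_size=1000, k=100):
    sample = values[:sample_size]                     # step 1
    pivot = sorted(sample, reverse=True)[k - 1]       # step 2: 100th largest of the sample
    return [v for v in values if v >= pivot]          # steps 3-4: keep the upper partition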

Fast random selection algorithm

Given an array of true/false values, what is the most efficient algorithm to select an index with a true value uniformly at random?
A simple sketch of an algorithm is:
import random

def random_true_index(a):
    c = sum(1 for x in a if x)          # first pass: count the true values
    e = random.randrange(c)             # pick which true value to return (0..c-1)
    seen = 0
    for j, x in enumerate(a):           # second pass: find the e-th true value
        if x:
            if seen == e:
                return j
            seen += 1
Can anyone come up with a faster algorithm? Maybe there is a way to only walk through the list once even if the number of true elements is not known at first?
Use the "pick a random element from an infinite list" algorithm.
Keep an index of your current pick, and also a count of how many true values you've seen.
When you see a true value, increment the count and then replace your pick with the current index with probability P = 1/count. (So you always pick the first one you find... then you might switch to the second one with probability 1/2, then you might switch to the third one with probability 1/3, etc.)
This requires only one scan over the list and constant storage. (It does require you to generate more random numbers, however.) In particular, it doesn't ever require you to either buffer the list or go back to the start, so it can work on an unbounded input stream.
See this answer for a sample LINQ implementation of the simple "pick a random element" algorithm; it would just need minor tweaks.
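A minimal Python sketch of the same idea (the linked answer shows it in LINQ):

import random

def random_true_index_one_pass(values):
    # Reservoir sampling with k = 1: the i-th true value seen replaces the
    # current pick with probability 1/i, so every true index is equally likely.
    pick = None
    seen = 0
    for index, value in enumerate(values):
        if value:
            seen += 1
            if random.randrange(seen) == 0:   # true with probability 1/seen
                pick = index
    return pick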
Build a list with indexes that point to true values and select one of those at random. Requires O(n) for list traversal and one try for the random number.
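For example (a minimal sketch):

import random

def random_true_index_indexed(values):
    true_indices = [i for i, v in enumerate(values) if v]   # O(n) traversal
    return random.choice(true_indices) if true_indices else None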
