Best algorithm to find N unique random numbers in VERY large array - performance

I have an array with, for example, 1,000,000,000,000 elements (integers). What is the best approach to pick, for example, only 3 random and unique elements from this array? The elements must be unique in the whole array, not just within the list of N (3 in my example) picked elements.
I read about Reservoir sampling, but it only provides a way to pick random elements, and those elements may not be unique.

If the odds of hitting a non-unique value are low, your best bet will be to select 3 random numbers from the array, then check each against the entire array to ensure it is unique - if not, choose another random sample to replace it and repeat the test.
If the odds of hitting a non-unique value are high, this increases the number of times you'll need to scan the array looking for uniqueness and makes the simple solution non-optimal. In that case you'll want to split the task of ensuring unique numbers from the task of making a random selection.
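A minimal sketch of that simple approach in Python (the `elements` name and the full-scan uniqueness check are illustrative assumptions, not code from the answer):

import random

def pick_unique_values(elements, k=3):
    # Pick k values at random; verify each occurs exactly once in the whole
    # array, and re-draw any value that turns out to be duplicated.
    chosen = []
    while len(chosen) < k:
        candidate = random.choice(elements)
        if candidate in chosen:
            continue
        if elements.count(candidate) == 1:  # one full scan per candidate
            chosen.append(candidate)
    return chosen

This only terminates quickly if, as assumed above, the odds of hitting a non-unique value are low.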
Sorting the array is the easiest way to find duplicates. Most sorting algorithms are O(n log n), but since your keys are integers Radix sort can potentially be faster.
Another possibility is to use a hash table to find duplicates, but that will require significant space. You can use a smaller hash table or Bloom filter to identify potential duplicates, then use another method to go through that smaller list.

import random

# Assumes all values lie in [MININT, MAXINT]; offset by MININT to index counts.
counts = [0] * (MAXINT - MININT + 1)
for value in elements:
    counts[value - MININT] += 1
uniques = [v + MININT for v, c in enumerate(counts) if c == 1]  # values seen exactly once
result = random.sample(uniques, 3)

I assume that you have a reasonable idea what fraction of the array values are likely to be unique. So you would know, for instance, that if you picked 1000 random array values, the odds are good that one is unique.
Step 1. Pick 3 random hash algorithms. They can all be the same algorithm, except that you add different integers to each as a first step.
Step 2. Scan the array. Hash each integer all three ways, and for each hash algorithm, keep track of the X lowest hash codes you get (you can use a priority queue for this), and keep a hash table of how many times each of those integers occurs.
Step 3. For each hash algorithm, look among its X lowest-hash values for an element that is unique in the array. If it has already been picked for another hash algorithm, find another. (Should be a rare boundary case.)
That is your set of three random unique elements. Every unique triple should have even odds of being picked.
(Note: For many purposes it would be fine to just use one hash algorithm and find 3 things from its list...)
This algorithm will succeed with high likelihood in one pass through the array. What is better yet is that the intermediate data structure that it uses is fairly small and is amenable to merging. Therefore this can be parallelized across machines for a very large data set.
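A hedged, single-hash sketch of the one-pass idea (the salt, the choice of X, and all names here are illustrative assumptions; the full scheme above uses three independently salted hashes):

import heapq
from collections import Counter

def sample_unique_one_pass(elements, x=1000):
    # Keep the x values with the smallest salted hashes, plus a count of how
    # often each tracked value has been seen anywhere in the array.
    lowest = []          # heap of (-hash, value): root holds the largest kept hash
    tracked = Counter()  # value -> occurrences seen so far
    for v in elements:
        h = hash((v, 12345))  # arbitrary salt
        if v in tracked:
            tracked[v] += 1
        elif len(lowest) < x:
            heapq.heappush(lowest, (-h, v))
            tracked[v] = 1
        elif -h > lowest[0][0]:
            _, evicted = heapq.heapreplace(lowest, (-h, v))
            del tracked[evicted]
            tracked[v] = 1
    # Any tracked value with count 1 is unique in the whole array.
    return [v for v, c in tracked.items() if c == 1][:3]

Because each value always hashes the same way and the heap's threshold only decreases, every value still tracked at the end was tracked from its first occurrence onward, so its count is exact.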

Related

How to quickly search for a specified element in an ordered array consisting of only two types of elements?

The array mentioned in the question is as follows:
[1,1,...,1,1,-1,-1,...,-1,-1]
How to quickly find the index of the 1 closest to -1?
Note: Both 1 and -1 will exist at the same time, and the number of 1 and -1 is large.
For example, for an array like this:
[1,1,1,1,1,-1,-1,-1]
the result should be 4.
The fastest way I can think of is binary search, is there a faster way?
With the current representation of the data, binary search is the fastest way I can think of. Of course, you can cache and reuse the result in constant time since the answer is always the same.
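For reference, a minimal sketch of that binary search for the last index holding a 1 (it assumes, as the question states, that both values are present):

def last_one_index(a):
    lo, hi = 0, len(a) - 1   # invariant: a[lo] == 1 and a[hi] == -1
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if a[mid] == 1:
            lo = mid
        else:
            hi = mid
    return lo

last_one_index([1, 1, 1, 1, 1, -1, -1, -1])  # 4, matching the example above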
On the other hand, if you change the representation of the array, you can answer the query in constant time. Since the data is effectively binary, the whole array can be reduced to two numbers: the length of the first partition and the length of the second partition (or, equivalently, the length of the whole array and the partitioning point). With that representation you can change the length of either partition in constant time and access the first element of the second partition in constant time.
Of course, changing the representation of the array itself is a logarithmic process since you need to find the partitioning point.
By a simple information-theoretic argument, you can't do better than log(n) using only comparisons: there are n possible outcomes, so you need to collect at least log(n) bits of information to distinguish them.
If you have extra information about the statistical distribution of the values, then maybe you can exploit it. But this is to be discussed on a case-by-case basis.

Efficiently search for pairs of numbers in various rows

Imagine you have N distinct people and a record of where these people are - exactly M such records.
For example
1,50,299
1,2,3,4,5,50,287
1,50,299
So you can see that 'person 1' is at the same place as 'person 50' three times. Here M = 3, since there are only 3 lines. My question is: given M of these lines and a threshold value (i.e. person A and B have been at the same place more than threshold times), what is the most efficient way of returning these co-occurrences?
So far I've built an N by N table and looped through each row, incrementing table(A,B) every time person A co-occurs with person B in a row. Obviously this is an awful approach and takes O(n^2) to O(n^3) depending on how you implement it. Any tips would be appreciated!
There is no need to create the table. Just create a hash/dictionary/whatever your language calls it. Then in pseudocode:
from collections import defaultdict
from itertools import combinations

count = defaultdict(int)
answer = []
for S in sets:
    for (i, j) in combinations(sorted(S), 2):
        count[(i, j)] += 1
        if count[(i, j)] == threshold:
            answer.append((i, j))
If you have M sets each of size K, the running time will be O(M*K^2).
If you want you can actually keep the list of intersecting sets in a data structure parallel to count without changing the big-O.
Furthermore the same algorithm can be readily implemented in a distributed way using a map-reduce. For the count you just have to emit a key of (i, j) and a value of 1. In the reduce you count them. Actually generating the list of sets is similar.
The relevant concept for your case is Market Basket Analysis. In this context there are different algorithms; for example, the Apriori algorithm can be used for your case, specialized to itemsets of size 2.
Moreover, for finding association rules with a specific support and conditions (which in your case is the threshold value), LSH and min-hashing can be used as well.
You could use sampling to speed it up, e.g. only check each pair with 1/50 probability. That will give you a 50x speed-up. Then double-check any pairs whose sampled count gets close enough to 1/50th of the threshold.
To double-check those pairs, you can either go through the whole list again, or you can double-check more efficiently if you build some clever kind of reverse index as you go. E.g. encode each person's row indices into 64-bit integers; you can use binary search / merge-sort style techniques to decide which 64-bit integers to compare, and use bit operations to compare them for matches. Other things to look up: reverse indexing, binary indexed trees / Fenwick trees, range trees.
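One hedged reading of the reverse-index idea, kept deliberately simple (the function and variable names are illustrative): build a person -> row-index map once, and then a candidate pair's true count is just the size of an intersection.

from collections import defaultdict

def verify_pairs(rows, candidate_pairs, threshold):
    # Reverse index: person -> set of row indices they appear in.
    rows_of = defaultdict(set)
    for idx, row in enumerate(rows):
        for person in row:
            rows_of[person].add(idx)
    # A pair's true co-occurrence count is the size of the intersection.
    return [(a, b) for (a, b) in candidate_pairs
            if len(rows_of[a] & rows_of[b]) >= threshold]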

Which sorting algorithm should I use for a list with a lot of replications?

I want to sort an array of 1 million integers. What would be the best algorithm to use, knowing that the universe of the array's integers is 1 to 100? Note that this means there are a lot of replicated items. Furthermore, the array is randomly distributed.
You create an array of 100 elements (with one for each possible value) and simply count how many there are of each. Running time: O(n), with each element of the original array accessed only once, so you're unlikely to find a faster one. :)
Or to give it its proper name, use a counting sort.
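A minimal counting sort sketch for that setup (it assumes values lie in 1..100, as in the question):

def counting_sort_1_to_100(arr):
    counts = [0] * 101          # index 0 unused; values run from 1 to 100
    for v in arr:
        counts[v] += 1
    out = []
    for value in range(1, 101):
        out.extend([value] * counts[value])
    return out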

Find a common element within N arrays

If I have N arrays, what is the best way (in time complexity; space is not important) to find the common elements? You could just find 1 element and stop.
Edit: The elements are all Numbers.
Edit: These are unsorted. Please do not sort and scan.
This is not a homework problem. Somebody asked me this question a long time ago. He was using a hash to solve the problem and asked me if I had a better way.
Create a hash index with elements as keys and counts as values. Loop through all values and update the count in the index. Afterwards, run through the index and check which elements have count = N. Looking up an element in the index should be O(1), so combined with looping through all M elements this should be O(M).
If you want to keep order specific to a certain input array, loop over that array and test the element counts in the index in that order.
Some special cases:
if you know that the elements are (positive) integers with a maximum that is not too high, you could just use a normal array as the "hash" index to keep counts, with the numbers themselves as the array indices.
I've assumed that in each array each number occurs only once. Adapting it for more occurrences should be easy (set the i-th bit in the count for the i-th array, or only update if the current element count == i-1).
EDIT when I answered the question, the question did not have the part of "a better way" than hashing in it.
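A short sketch of that counting approach, with the adaptation hinted at above: deduplicate each array first so within-array repeats don't inflate the counts (the names are illustrative, and it returns a single common element, as the question allows).

from collections import Counter

def find_common_element(arrays):
    counts = Counter()
    for arr in arrays:
        counts.update(set(arr))   # count each value at most once per array
    for value, c in counts.items():
        if c == len(arrays):      # present in every array
            return value
    return None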
The most direct method is to intersect the first 2 arrays and then intersect the result with each of the remaining N-2 arrays.
If 'intersection' is not defined in the language in which you're working or you require a more specific answer (ie you need the answer to 'how do you do the intersection') then modify your question as such.
Without sorting there isn't an optimized way to do this based on the information given (i.e. sorting positions all elements relative to each other, so you can then iterate over the arrays checking for elements present in all of them at once).
The question asks if there is a better way than hashing. There is no better way (i.e. better time complexity) than a hash, since hashing each element typically takes constant time. Empirical performance is also favorable, particularly if the range of values can be mapped one-to-one to an array maintaining counts. The time is then proportional to the number of elements across all the arrays. Sorting will not give better complexity, since it still has to visit each element at least once, and then there is the log n factor for sorting each array.
Back to hashing, from a performance standpoint, you will get the best empirical performance by not processing each array fully, but processing only a block of elements from each array before proceeding onto the next array. This will take advantage of the CPU cache. It also results in fewer elements being hashed in favorable cases when common elements appear in the same regions of the array (e.g. common elements at the start of all arrays.) Worst case behaviour is no worse than hashing each array in full - merely that all elements are hashed.
I don't think the approach suggested by catchmeifyoutry will work.
Let us say you have two arrays
1: {1,1,2,3,4,5}
2: {1,3,6,7}
then the answer should be 1 and 3. But if we use the hashtable approach, 1 will have a count of 3 and we will never find 1 in this situation.
Also, the problem becomes more complex if we have input something like this:
1: {1,1,1,2,3,4}
2: {1,1,5,6}
Here I think we should give the output as 1,1. The suggested approach fails in both cases.
Solution:
Read the first array and put each element into a hashtable; if we see the same key again, don't increment the counter. Read the second array in the same manner. Now the hashtable holds the common elements - those whose count is 2.
But again, this approach will fail on the second input set I gave earlier.
I'd first start with the degenerate case: finding the common elements between 2 arrays (more on this later). From there I'll have a collection of common values which I will use as an array itself and compare against the next array. This check would be performed N-1 times, or until the "carry" array of common elements drops to size 0.
One could speed this up, I'd imagine, by divide-and-conquer, splitting the N arrays into the end nodes of a tree. The next level up the tree is N/2 common element arrays, and so forth and so on until you have an array at the top that is either filled or not. In either case, you'd have your answer.
Without sorting and scanning, the best operational speed you'll get for comparing 2 arrays for common elements is O(N^2).
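The carry-array idea can be sketched with hash sets (in the spirit of the earlier hashing answers, so no sorting), stopping early once the running intersection empties:

def common_elements(arrays):
    common = set(arrays[0])
    for arr in arrays[1:]:
        common &= set(arr)
        if not common:      # the "carry" of common elements is empty
            break
    return common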

Generating ids for a set of integers

Background:
I'm working with permutations of the sequence of integers {0, 1, 2 ... , n}.
I have a local search algorithm that transforms a permutation in some systematic way into another permutation. The point of the algorithm is to produce a permutation that minimises a cost function. I'd like to work with a wide range of problems, from n=5 to n=400.
The problem:
To reduce search effort I need to be able to check if I've processed a particular permutation of integers before. I'm using a hash table for this and I need to be able to generate an id for each permutation which I can use as a key into the table. However, I can't think of any nice hash function that maps a set of integers into a key such that collisions do not occur too frequently.
Stuff I've tried:
I started out by generating a sequence of n prime numbers and multiplying the ith number in my permutation with the ith prime then summing the results. The resulting key however produces collisions even for n=5.
I also thought to concatenate the values of all numbers together and take the integer value of the resulting string as a key but the id quickly becomes too big even for small values of n. Ideally, I'd like to be able to store each key as an integer.
Does stackoverflow have any suggestions for me?
Zobrist hashing might work for you. You need to create an NxN matrix of random integers, where cell (i, j) represents element i being in the jth position of the current permutation.
For a given permutation you pick the N cell values, and xor them one by one to get the permutation's key (note that key uniqueness is not guaranteed).
The point of this algorithm is that if you swap two elements in your permutation, you can easily generate the new key from the current one by simply xor-ing out the old positions and xor-ing in the new ones.
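A hedged sketch of that scheme (the table size, seed, and helper names are illustrative assumptions):

import random

N = 400  # largest permutation length under consideration
random.seed(0)
# zobrist[i][j]: random bitstring for "element i sits at position j"
zobrist = [[random.getrandbits(64) for _ in range(N)] for _ in range(N)]

def permutation_key(perm):
    key = 0
    for pos, elem in enumerate(perm):
        key ^= zobrist[elem][pos]
    return key

def key_after_swap(key, perm, i, j):
    # Call this before actually swapping perm[i] and perm[j]:
    # xor out the old placements, xor in the new ones.
    key ^= zobrist[perm[i]][i] ^ zobrist[perm[j]][j]
    key ^= zobrist[perm[i]][j] ^ zobrist[perm[j]][i]
    return key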
Judging by your question, and the comments you've left, I'd say your problem is not possible to solve.
Let me explain.
You say that you need a unique hash from your combination, so let's make that rule #1:
1: Need a unique number to represent a combination of an arbitrary number of digits/numbers
Ok, then in a comment you've said that since you're using quite a few numbers, storing them as a string or whatnot as a key to the hashtable is not feasible, due to memory constraints. So let's rewrite that into another rule:
2: Cannot use the actual data that were used to produce the hash as they are no longer in memory
Basically, you're trying to take a large number, and store that into a much smaller number range, and still have uniqueness.
Sorry, but you can't do that.
Typical hashing algorithms produce relatively unique hash values, so unless you're willing to accept collisions, in the sense that a new combination might be flagged as "already seen" even though it hasn't, then you're out of luck.
If you were to try a bit-field, where each combination has a bit, which is 0 if it hasn't been seen, you still need large amounts of memory.
For the permutation in n=20 that you left in a comment, you have 20! (2,432,902,008,176,640,000) combinations, which if you tried to simply store each combination as a 1-bit in a bit-field, would require 276,589TB of storage.
You're going to have to limit your scope of what you're trying to do.
As others have suggested, you can use hashing to generate an integer that will be unique with high probability. However, if you need the integer to always be unique, you should rank the permutations, i.e. assign an order to them. For example, a common order of permutations for set {1,2,3} is the lexicographical order:
1,2,3
1,3,2
2,1,3
2,3,1
3,1,2
3,2,1
In this case, the id of a permutation is its index in the lexicographical order. There are other methods of ranking permutations, of course.
Making ids a range of continuous integers makes it possible to implement the storage of processed permutations as a bit field or a boolean array.
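A standard way to compute that lexicographic rank is via the Lehmer code; here is a hedged O(n^2) sketch, fine for n up to a few hundred (Python integers are arbitrary precision, so the rank is representable even for n = 400, though it is a very large number):

from math import factorial

def lex_rank(perm):
    # Rank = sum over positions of (count of smaller elements to the right) * (n - 1 - i)!
    n = len(perm)
    rank = 0
    for i in range(n):
        smaller = sum(1 for j in range(i + 1, n) if perm[j] < perm[i])
        rank += smaller * factorial(n - 1 - i)
    return rank

[lex_rank(p) for p in ([1,2,3], [1,3,2], [2,1,3], [2,3,1], [3,1,2], [3,2,1])]
# [0, 1, 2, 3, 4, 5] - matching the lexicographic listing above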
How fast does it need to be?
You could always gather the integers as a string, then take the hash of that, and then just grab the first 4 bytes.
For a hash you could use any function really, like MD5 or SHA-256.
You could MD5 hash a comma-separated string containing your ints.
In C# it would look something like this (Disclaimer: I have no compiler on the machine I'm using today):
using System;
using System.Security.Cryptography;
using System.Text;

public class SomeClass {
    static Guid GetHash(int[] numbers) {
        string csv = string.Join(",", numbers);
        byte[] hash = new MD5CryptoServiceProvider().ComputeHash(Encoding.ASCII.GetBytes(csv));
        return new Guid(hash);
    }
}
Edit: What was I thinking? As stated by others, you don't need a hash. The CSV should be sufficient as a string Id (unless your numbers array is big).
Convert each number to a String, concatenate the Strings (via a StringBuffer) and take the contents of the StringBuffer as the key.
Not directly related to the question, but as an alternative solution you could use a trie as the lookup structure. Tries are very good for string operations, their implementation is relatively easy, and lookups should be faster (at most O(k), where k is the length of a key) than a hashset for a large number of long strings. And you aren't limited in key size (unlike a regular hashset, where the hash is just an int, not bigger). The key in your case would be a string of all the numbers separated by some character.
Prime powers would work: if p_i is the ith prime and a_i is the ith element of your tuple, then
p_0**a_0 * p_1**a_1 * ... * p_n**a_n
should be unique by the Fundamental Theorem of Arithmetic. Those numbers will get pretty big, though :-)
(e.g. for n=5, (1,2,3,4,5) will map to 870,037,764,750 which is already more than 32 bits)
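A tiny sketch of that encoding (the hard-coded primes list is an illustrative assumption and would need extending for longer tuples):

def prime_power_key(values):
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
    key = 1
    for p, a in zip(primes, values):
        key *= p ** a
    return key

prime_power_key([1, 2, 3, 4, 5])  # 870037764750, the value quoted above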
Similar to Bojan's post, it seems like the best way to go is to have a deterministic order over the permutations. If you process them in that order, then there is no need to do a lookup to see if you have already handled a particular permutation.
Take two permutations of the same series of numbers {1, ..., n} and construct a mapping tuple (id, permutation1[id], permutation2[id]), or (id, f1(id), f2(id)); you get a unique map f3(id): for the tuple (id, f1(id), f2(id)), from id we read off f2(id), then find the id' whose tuple (id', f1(id'), f2(id')) satisfies f1(id') == f2(id).

Resources