We are writing a C# program that will help us remove some unnecessary data repeaters, and we already found some repeaters to remove with the help of Finding overlapping data in arrays. Now we want to check whether we can cancel some repeaters by another criterion. The question is:
We have arrays of numbers
{1, 2, 3, 4, 5, 6, 7, ...}, {4, 5, 10, 100}, {100, 1, 20, 50}
some numbers can be repeated in other arrays, and some numbers can be unique and belong only to a specific array. We want to remove some arrays when we are ready to lose up to N numbers from the arrays.
Explanation:
{1, 2}
{2, 3, 4, 5}
{2, 7}
We are ready to lose up to 3 numbers from these arrays. That means we can remove array 1, because we will lose only the number "1", its only unique number. We can also remove arrays 1 and 3, because we will lose only the numbers "1" and "7", or just array 3, because we will lose only the number "7", which is fewer than 3 numbers.
In our output we want to give the maximum number of arrays that can be removed, with the condition that we lose less than N numbers, where N is the number of items we are ready to lose.
This problem is equivalent to the Set Cover problem (e.g., take N = 0), so efficient, exact solutions that work in general are unlikely. However, in practice, heuristics and approximations are often good enough. Given the similarity of your problem to Set Cover, the greedy heuristic is a natural starting point. Instead of stopping when you've covered all elements, stop when you've covered all but N elements.
You first need to get, for each array, a number that tells you how many of its values are unique to that particular array.
An easy way to do this is O(n²), since for each element you need to check through all the arrays whether it's unique.
You can do this much more efficiently by having sorted arrays, sorting first or using a heap-like data structure.
After that, you only have to find a set of arrays whose numbers sum up to at most N. That's similar to the subset sum problem, but much less complex, because N > 0 and all your numbers are > 0.
So you simply sort these numbers from smallest to greatest, then iterate over the sorted array and take numbers as long as the sum stays below N.
Finally, you can remove every array that corresponds to a number which you were able to fit into N.
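Here's a rough C# sketch of that approach (the method name and 0-based indexing are mine, and it treats N as "up to N" lost values, per the question); it charges each array only for the values that appear in no other array and then removes the cheapest arrays first:

using System;
using System.Collections.Generic;
using System.Linq;

static class RepeaterRemoval
{
    // Returns the indices of the arrays to remove so that the total number of
    // values lost (values unique to a removed array) stays within n.
    public static List<int> ArraysToRemove(int[][] arrays, int n)
    {
        // occurrences[value] = number of arrays that contain the value
        var occurrences = new Dictionary<int, int>();
        foreach (var arr in arrays)
            foreach (var v in arr.Distinct())
                occurrences[v] = occurrences.TryGetValue(v, out var c) ? c + 1 : 1;

        // cost of removing an array = how many of its values occur nowhere else
        var costs = arrays
            .Select((arr, i) => (Index: i, Cost: arr.Distinct().Count(v => occurrences[v] == 1)))
            .OrderBy(t => t.Cost)
            .ToList();

        var removed = new List<int>();
        int lost = 0;
        foreach (var (index, cost) in costs)
        {
            if (lost + cost > n) break;   // taking this array would exceed the allowed loss
            lost += cost;
            removed.Add(index);
        }
        return removed;
    }
}

For the example above, ArraysToRemove(new[] { new[] { 1, 2 }, new[] { 2, 3, 4, 5 }, new[] { 2, 7 } }, 3) returns the indices of the first and third arrays.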
Related
Let's say I have an array of ~20-100 integers, for example [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (actually numbers more like [106511349, 173316561, ...], all nonnegative 64-bit integers under 2^63, but for demonstration purposes let's use these).
And many (~50,000) smaller arrays of usually 1-20 terms to match or not match:
1=[2, 3, 8, 20]
2=[2, 3, NOT 8]
3=[2, 8, NOT 16]
4=[2, 8, NOT 16] (there will be duplicates with different list IDs)
I need to find which of these are subsets of the array being tested. A matching list must have all of the positive matches, and none of the negative ones. So for this small example, I would need to get back something like [3, 4]. List 1 fails to match because it requires 20, and list 2 fails to match because it has NOT 8. The NOT can easily be represented by using the high bit/making the number negative in those cases.
I need to do this quickly, up to 10,000 times per second. The small arrays are "fixed" (they change infrequently, like once every few seconds), while the large array is different for each data item to be scanned (so 10,000 different large arrays per second).
This has become a bit of a bottleneck, so I'm looking into ways to optimize it.
I'm not sure of the best data structures or ways to represent this. One solution would be to turn it around and see which small lists we even need to consider:
2=[1, 2, 3, 4]
3=[1, 2]
8=[1, 2, 3, 4]
16=[3, 4]
20=[1]
Then we'd build up a list of lists to check, and do the full subset matching on these. However, certain terms (often the more frequent ones) are going to end up in many of the lists, so there's not much of an actual win here.
I was wondering if anyone is aware of a better algorithm for solving this sort of problem?
You could try to make a tree with the smaller arrays, since they change less frequently, such that each subtree tries to halve the number of small arrays left.
For example, do frequency analysis on numbers in the smaller arrays. Find which number is found in closest to half of the smaller arrays. Make that the first check in the tree. In your example that would be '3' since it occurs in half the small arrays. Now that's the head node in the tree. Now put all the small lists that contain 3 to the left subtree and all the other lists to the right subtree. Now repeat this process recursively on each subtree. Then when a large array comes in, reverse index it, and then traverse the subtree to get the lists.
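Here's a rough C# sketch of that idea for positive terms only (handling of NOT terms and the choice of leaf size are left out; the names and the leaf threshold are mine):

using System;
using System.Collections.Generic;
using System.Linq;

sealed class Node
{
    public long Pivot;
    public Node Left, Right;                    // Left: lists containing Pivot; Right: the rest
    public List<(int Id, long[] Terms)> Lists;  // only set on leaves
}

static class FilterTree
{
    public static Node Build(List<(int Id, long[] Terms)> lists)
    {
        if (lists.Count <= 8) return new Node { Lists = lists };

        // frequency analysis: how many lists contain each value
        var freq = new Dictionary<long, int>();
        foreach (var (_, terms) in lists)
            foreach (var t in terms.Distinct())
                freq[t] = freq.TryGetValue(t, out var c) ? c + 1 : 1;

        // pick the value found in closest to half of the lists
        long pivot = freq.OrderBy(kv => Math.Abs(kv.Value - lists.Count / 2)).First().Key;
        var with = lists.Where(l => l.Terms.Contains(pivot)).ToList();
        var without = lists.Where(l => !l.Terms.Contains(pivot)).ToList();
        if (with.Count == 0 || without.Count == 0) return new Node { Lists = lists };

        return new Node { Pivot = pivot, Left = Build(with), Right = Build(without) };
    }

    public static void Query(Node node, HashSet<long> input, List<int> results)
    {
        if (node.Lists != null)
        {
            foreach (var (id, terms) in node.Lists)
                if (terms.All(input.Contains)) results.Add(id);
            return;
        }
        // lists in the left subtree all require the pivot, so skip them if it's absent
        if (input.Contains(node.Pivot)) Query(node.Left, input, results);
        Query(node.Right, input, results);
    }
}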
You did not state which of your arrays are sorted - if any.
Since your data is not that big, I would use a hash-map to store the entries of the source set (the one with ~20-100 integers). That would basically let you test whether an integer is present in O(1).
Then, given that 50,000 (arrays) * 20 (terms each) * 8 (bytes per term) = 8 megabytes plus hash-map overhead does not seem large either for most systems, I would use another hash-map to store already-tested arrays. This way you don't have to re-test duplicates.
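A minimal sketch of that membership test in C# (names are mine; NOT terms are assumed to be stored as negated values, as the question suggests, which doesn't work for the value 0):

using System.Collections.Generic;

static class HashSetMatcher
{
    // Returns the indices of the filter lists that match the input array:
    // every non-negative term must be present, every negated term must be absent.
    public static List<int> MatchingFilters(long[] input, IReadOnlyList<long[]> filters)
    {
        var present = new HashSet<long>(input);   // O(1) membership tests
        var matches = new List<int>();
        for (int id = 0; id < filters.Count; id++)
        {
            bool ok = true;
            foreach (var term in filters[id])
            {
                bool required = term >= 0;
                bool found = present.Contains(required ? term : -term);
                if (found != required) { ok = false; break; }
            }
            if (ok) matches.Add(id);
        }
        return matches;
    }
}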
I realize this may be less satisfying from a CS point of view, but if you're doing a huge number of tiny tasks that don't affect each other, you might want to consider parallelizing them (multithreading). 10,000 tasks per second, comparing a different array in each task, should fit the bill; you don't give any details about what else you're doing (e.g., where all these arrays are coming from), but it's conceivable that multithreading could improve your throughput by a large factor.
First, do what you were suggesting; make a hashmap from input integer to the IDs of the filter arrays it exists in. That lets you say "input #27 is in these 400 filters", and toss those 400 into a sorted set. You've then gotta do an intersection of the sorted sets for each one.
Optional: make a second hashmap from each input integer to its frequency in the set of filters. When an input comes in, sort it using the second hashmap. Then take the least common input integer and start with it, so you have less overall work to do on each step. Also compute the frequencies for the "not" cases, so you basically get the most bang for your buck on each step.
Finally: this could be pretty easily made into a parallel programming problem; if it's not fast enough on one machine, it seems you could put more machines on it pretty easily, if whatever it's returning is useful enough.
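Here's one way the inverted-index idea from the first paragraph of this answer could look in C#; instead of literal sorted-set intersections it counts, per filter, how many of its positive terms the input hit, and the rarest-first ordering is left out for brevity (names are mine):

using System.Collections.Generic;
using System.Linq;

static class InvertedIndexMatcher
{
    // term -> IDs of the filters whose positive terms include it
    public static Dictionary<long, List<int>> BuildIndex(IReadOnlyList<long[]> filters)
    {
        var index = new Dictionary<long, List<int>>();
        for (int id = 0; id < filters.Count; id++)
            foreach (var term in filters[id].Where(t => t >= 0))
            {
                if (!index.TryGetValue(term, out var ids)) index[term] = ids = new List<int>();
                ids.Add(id);
            }
        return index;
    }

    public static List<int> Match(long[] input, IReadOnlyList<long[]> filters,
                                  Dictionary<long, List<int>> index)
    {
        var present = new HashSet<long>(input);
        var hits = new Dictionary<int, int>();        // filter ID -> positive terms found
        foreach (var v in present)
            if (index.TryGetValue(v, out var ids))
                foreach (var id in ids)
                    hits[id] = hits.TryGetValue(id, out var c) ? c + 1 : 1;

        var result = new List<int>();
        foreach (var (id, count) in hits)
        {
            var filter = filters[id];
            if (count == filter.Count(t => t >= 0) &&            // all positive terms present
                !filter.Any(t => t < 0 && present.Contains(-t))) // no forbidden term present
                result.Add(id);
        }
        return result;
    }
}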
I have N integers, for example 3, 1, 4, 5, 2, 8, 7. There may be some duplicates. I want to divide this sequence into contiguous subsequences such that we can form a non-decreasing sequence from them. How do I calculate the minimal number of cuts? For the example above, the answer is 6, because we can partition this sequence into {3}, {1}, {4, 5}, {2}, {7}, {8} and then form {1, 2, 3, 4, 5, 7, 8}. What is the fastest way to do this?
Does anyone know how to solve it assuming that some numbers may be equal?
I would cut the array into non-decreasing segments at the points where the values decrease, and then use these segments as input to a (single) merge phase - as in a sort-merge - staying with the same segment, where possible, in the case of ties. Create additional cut locations when you have to switch from one segment to another.
The output is sorted, so this produces enough cuts to do the job. Cuts are produced at points where the sequence decreases, or at points where a gap must be created because the original sequence jumps across a number present elsewhere - so no sequence without all of these cuts can be rearranged into sorted order.
Worst case for the merge overhead is if the initial sequence is decreasing. If you use a heap to keep track of what sequences to pick next then this turns into heapsort with cost n log n. Handle ties by pulling all occurrences of the same value from the heap and only then deciding what to do.
This approach works if the list does not contain duplicates. Perhaps those could be taken care of efficiently first.
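A rough C# sketch of that cut-and-merge bookkeeping (requires .NET 6 for PriorityQueue; tie handling is simplified to breaking ties by segment index, so it assumes no duplicates, as noted above):

using System.Collections.Generic;

static class CutCounter
{
    public static int CountCuts(int[] a)
    {
        // 1. cut into maximal non-decreasing runs at every descent
        var runs = new List<List<int>>();
        var current = new List<int> { a[0] };
        for (int i = 1; i < a.Length; i++)
        {
            if (a[i] < a[i - 1]) { runs.Add(current); current = new List<int>(); }
            current.Add(a[i]);
        }
        runs.Add(current);

        int cuts = runs.Count - 1;

        // 2. k-way merge; every time a run is abandoned before it is exhausted,
        //    an additional cut inside that run is needed
        var pos = new int[runs.Count];
        var pq = new PriorityQueue<int, (int Value, int Run)>();
        for (int r = 0; r < runs.Count; r++) pq.Enqueue(r, (runs[r][0], r));

        int prevRun = -1;
        bool prevHadMore = false;
        while (pq.Count > 0)
        {
            int r = pq.Dequeue();
            if (prevRun != -1 && prevRun != r && prevHadMore) cuts++;
            pos[r]++;
            prevRun = r;
            prevHadMore = pos[r] < runs[r].Count;
            if (prevHadMore) pq.Enqueue(r, (runs[r][pos[r]], r));
        }
        return cuts;   // pieces = cuts + 1
    }
}

For {3, 1, 4, 5, 2, 8, 7} this reports 5 cut points, i.e. the 6 pieces from the question.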
We can compute the permutation inversion vector in O(n log n) time and O(n) space using a Fenwick tree. Continuous segments of the vector with the same number can represent sections that need not be cut. Unfortunately, this can also return false positives, like
Array: {8,1,4,5,7,6,3}
Vector: 0,1,1,1,1,2,5
where the 6 and 3 imply cuts in the sequence [1,4,5,7]. To counter this, we take a second inversion vector counting the number of smaller elements following each element. Continuous segments that are parallel in both vectors need not be cut:

Array: {8,1,4,5,7,6,3,0}
Larger preceding: 0,1,1,1,1,2,5,7
Smaller following: 7,1,2,2,3,2,1,0
Not cut: the segment 4,5 (both vectors stay constant there: 1,1 and 2,2)

Array: {3,1,4,5,2,8,7}
Larger preceding: 0,1,0,0,3,0,1
Smaller following: 2,0,1,1,0,1,0
Not cut: the segment 4,5 (0,0 and 1,1)

Array: {3,1,2,4}
Larger preceding: 0,1,1,0
Smaller following: 2,0,0,0
Not cut: the segment 1,2 (1,1 and 0,0)
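A sketch of computing those two vectors with a Fenwick tree in C# (class and method names are mine; values are compressed to ranks first, so duplicates share a rank):

using System;
using System.Collections.Generic;
using System.Linq;

sealed class Fenwick
{
    private readonly int[] tree;
    public Fenwick(int n) { tree = new int[n + 1]; }
    public void Add(int i) { for (; i < tree.Length; i += i & -i) tree[i]++; }
    public int Sum(int i) { int s = 0; for (; i > 0; i -= i & -i) s += tree[i]; return s; }
}

static class InversionVectors
{
    // rank of each value in 1..m, where m is the number of distinct values
    static int[] Ranks(int[] a, out int m)
    {
        var sorted = a.Distinct().OrderBy(x => x).ToArray();
        m = sorted.Length;
        return a.Select(x => Array.BinarySearch(sorted, x) + 1).ToArray();
    }

    public static (int[] LargerBefore, int[] SmallerAfter) Compute(int[] a)
    {
        var rank = Ranks(a, out int m);
        int n = a.Length;
        var largerBefore = new int[n];
        var smallerAfter = new int[n];

        var bit = new Fenwick(m);
        for (int i = 0; i < n; i++)
        {
            largerBefore[i] = bit.Sum(m) - bit.Sum(rank[i]);  // strictly larger, seen before i
            bit.Add(rank[i]);
        }

        bit = new Fenwick(m);
        for (int i = n - 1; i >= 0; i--)
        {
            smallerAfter[i] = bit.Sum(rank[i] - 1);           // strictly smaller, seen after i
            bit.Add(rank[i]);
        }
        return (largerBefore, smallerAfter);
    }
}

For {8,1,4,5,7,6,3,0} this reproduces the vectors 0,1,1,1,1,2,5,7 and 7,1,2,2,3,2,1,0 shown above.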
I have a set of frequency values and I'd like to find the most likely subsets of values meeting the following conditions:
values in each subset should be harmonically related (approximately multiples of a given value)
the number of subsets should be as small as possible
every subset should have only a few missing harmonics below its highest value
E.g. [1,2,3,4,10,20,30] should return [1,2,3,4] and [10,20,30] (a set with all the values is not optimal because, even if they are harmonically related, there are many missing values)
The brute-force method would be to compute all the possible subsets of the values and compute some cost value for each, but that would take far too long.
Is there any efficient algorithm to perform this task (or something similar)?
I would reduce the problem to minimum set cover, which, although NP-hard, is often efficiently solvable in practice via integer programming. I'm assuming that it would be reasonable to decompose [1, 2, 3, 4, 8, 12, 16] as [1, 2, 3, 4] and [4, 8, 12, 16], with 4 repeating.
To solve set cover (well, to use stock integer-program solvers, anyway), we need to enumerate all of the maximal allowed subsets. If the fundamental (i.e., the given value) must belong to the set, then, for each frequency, we can enumerate its multiples in order until too many in a row are missing. If not, we try all pairs of frequencies, assume that their fundamental is their approximate greatest common divisor, and extend the subset downward and upward until too many frequencies are missing.
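Here's a rough C# sketch of the enumeration step for the case where the fundamental must belong to the set; the tolerance and the "too many missing in a row" threshold are assumptions, and the resulting candidates would then go into the set-cover formulation:

using System;
using System.Collections.Generic;
using System.Linq;

static class HarmonicSubsets
{
    // For each frequency taken as a fundamental, walk through its multiples and
    // stop once too many consecutive harmonics are missing from the input.
    public static List<double[]> CandidateSubsets(double[] freqs,
                                                  double tolerance = 0.05,
                                                  int maxConsecutiveMissing = 2)
    {
        var candidates = new List<double[]>();
        foreach (var fundamental in freqs)
        {
            var subset = new List<double>();
            int missingRun = 0;
            for (int k = 1; missingRun <= maxConsecutiveMissing; k++)
            {
                double target = k * fundamental;
                var hits = freqs.Where(f => Math.Abs(f - target) <= tolerance * target).ToArray();
                if (hits.Length > 0) { subset.AddRange(hits); missingRun = 0; }
                else missingRun++;
            }
            if (subset.Count > 1) candidates.Add(subset.ToArray());
        }
        return candidates;
    }
}

For [1, 2, 3, 4, 10, 20, 30] this produces [1, 2, 3, 4] (fundamental 1) and [10, 20, 30] (fundamental 10), among others.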
Related questions:
Algorithm to Divide a list of numbers into 2 equal sum lists
divide list in two parts that their sum closest to each other
Let's assume I have a list which contains exactly 2k elements. Now I want to split it into two parts, where each part has a length of k, while trying to make the sums of the parts as equal as possible.
Quick example:
[3, 4, 4, 1, 2, 1] might be split into [1, 4, 3] and [1, 2, 4], and the sum difference will be 1
Now - if the parts can have arbitrary lengths, this is a variation of the Partition problem, and we know that it's weakly NP-complete.
But does the restriction of splitting the list into equal parts (let's say it's always k and 2k) make this problem solvable in polynomial time? Any proofs for that (or a proof scheme for the fact that it's still NP-complete)?
It is still NP-complete. Proof by reduction of PP (your full variation of the Partition problem) to QPP (the equal-parts partition problem):
Take an arbitrary list of length k plus an additional k elements, all valued zero.
We need to find the best-performing partition in terms of PP. Let us find one using an algorithm for QPP and then forget about all the additional k zero elements. Shifting zeroes around cannot affect this or any competing partition, so this is still one of the best-performing unrestricted partitions of the arbitrary list of length k.
In one of my projects I encountered a need to generate a set of numbers in a given range that will be:
Exhaustive, which means that it will cover most of the given range without any repetition.
It will guarantee determinism (the sequence will be the same every time). This can probably be achieved with a fixed seed.
It will be random (I am not very versed in Random Number Theory, but I guess there is a bunch of rules that describe randomness; from my perspective, something like 0, 1, 2, ..., N is not random).
Ranges I am talking about can be ranges of integers, or of real numbers.
For example, when I used the standard C# random generator to generate 10 numbers in the range [0, 9], I got this:
0 0 1 2 0 1 5 6 2 6
As you can see, a big part of the given range remains 'unexplored', and there are many repetitions.
Of course, the input space can be very large, so remembering previously chosen values is not an option.
What would be the right way to tackle this problem?
Thanks.
After the comments:
OK, I agree that "random" is not the right word, but I hope you understood what I am trying to achieve. I want to explore a given range that can be big, so an in-memory list is not an option. If the range is (0, 10) and I want three numbers, I want to guarantee that those numbers will be different and that they will 'describe the range' (i.e. they won't all be in the lower half, etc.).
The determinism part means that I would like to use something like a standard RNG with a fixed seed, so I can fully control the sequence.
I hope I made things a bit clearer.
Thanks.
Here are three options with different tradeoffs:
Generate a list of numbers ahead of time and shuffle them using the Fisher-Yates shuffle. Select from the list as needed. O(n) total memory and O(1) time per element. Randomness is as good as the PRNG you used to do the shuffle. The simplest of the three alternatives, too.
Use a Linear Feedback Shift Register (LFSR), which will generate every value in its sequence exactly once before repeating. O(log n) total memory and O(1) time per element. It's easy to determine future values from the present value, however, and LFSRs are most easily constructed for power-of-2 periods (but you can pick the next bigger power of 2 and skip any out-of-range values); a sketch of this option follows below.
Use a secure permutation based on a block cipher. Usable for any power-of-2 period, and with a little extra trickery, any arbitrary period. O(log n) total space and O(1) time per element; randomness is as good as the block cipher. The most complex of the three to implement.
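A minimal C# sketch of the LFSR option for a 16-bit range, using the well-known Galois LFSR with taps 16, 14, 13, 11 (mask 0xB400); it visits every value in 1..65535 exactly once per period, and out-of-range values are simply skipped (names are mine):

using System.Collections.Generic;

static class LfsrSequence
{
    // Deterministic full-cycle sequence over 1..maxInclusive (maxInclusive < 65536).
    // The seed is the fixed starting state, which makes the sequence reproducible.
    public static IEnumerable<int> Values(ushort seed, int maxInclusive)
    {
        ushort start = seed == 0 ? (ushort)1 : seed;        // zero is a fixed point of the LFSR
        ushort state = start;
        do
        {
            if (state <= maxInclusive) yield return state;  // skip values outside the wanted range
            ushort lsb = (ushort)(state & 1);
            state >>= 1;
            if (lsb != 0) state ^= 0xB400;                  // taps 16,14,13,11: maximal period
        } while (state != start);
    }
}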
If you just need something, what about something like this?
maxint = 16
step = 7
sequence = 7, 14, 5, 12, 3, 10, 1, 8, 15, 6, 13, 4, 11, 2, 9, 0
If you pick step right (coprime to maxint), it will generate the entire interval before repeating. You can play around with different values of step to get something that "looks" good. The "seed" here is where you start in the sequence.
Is this random? Of course not. Will it look random according to a statistical test of randomness? It might depend on the step, but likely this will not look very statistically random at all. However, it certainly picks the numbers in the range, not in their original order, and without any memory of the numbers picked so far.
In fact, you could make this look even better by making a list of factors - like [1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15, 16] - and using shuffled versions of those to compute step * factor (mod maxint). Let's say we shuffled the example factor lists into [3, 2, 4, 5, 1], [6, 8, 9, 10, 7], [13, 16, 12, 11, 14, 15]. Then we'd get the sequence
5, 14, 12, 3, 7, 10, 8, 15, 6, 1, 11, 0, 4, 13, 2, 9
The size of the factor lists is completely tunable, so you can use as much memory as you like. Bigger factor lists, more randomness. No repeats regardless of factor-list size. When you exhaust a factor list, generating a new one is as easy as counting and shuffling.
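A minimal C# sketch of the basic stepping idea (without the factor-list twist); the names are mine, and step must be coprime to maxint to get the full cycle:

using System.Collections.Generic;

static class SteppedRange
{
    // Visits every value in [0, maxint) exactly once when gcd(step, maxint) == 1;
    // the seed just picks where in the cycle you start.
    public static IEnumerable<long> Values(long maxint, long step, long seed)
    {
        long value = seed % maxint;
        for (long i = 0; i < maxint; i++)
        {
            yield return value;
            value = (value + step) % maxint;
        }
    }
}

SteppedRange.Values(16, 7, 7) reproduces the sequence 7, 14, 5, 12, ... shown above.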
It is my impression that what you are looking for is a randomly-ordered list of numbers, not a random list of numbers. You should be able to get this with the following pseudocode. Better math-ies may be able to tell me if this is in fact not random:
list = [ 1 .. 100 ]
for item,index in list:
location = random_integer_below(list.length - index)
list.switch(index,location+index)
Basically, go through the list and pick a random item from the rest of the list to use in the position you are at. This should randomly arrange the items in your list. If you need to reproduce the same random order each time, consider saving the array, or ensuring somehow that random_integer_below always returns numbers in the same order given some seed.
Generate an array that contains the range, in order. So the array contains [0, 1, 2, 3, 4, 5, ... N]. Then use a Fisher-Yates Shuffle to scramble the array. You can then iterate over the array to get your random numbers.
If you need repeatability, seed your random number generator with the same value at the start of the shuffle.
Do not use a random number generator to select numbers in a range one at a time. What will eventually happen is that you have one number left to fill, and your random number generator will cycle repeatedly until it selects that number. Depending on the random number generator, there is no guarantee that will ever happen.
What you should do is generate a list of numbers on the desired range, then use a random number generator to shuffle the list. The shuffle is known as the Fisher-Yates shuffle, or sometimes called the Knuth shuffle. Here's pseudocode to shuffle an array x of n elements with indices from 0 to n-1:
for i from n-1 to 1
j = random integer such that 0 ≤ j ≤ i
swap x[i] and x[j]
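A compact C# version of that shuffle (seeding Random with a fixed value makes the order reproducible, which covers the determinism requirement):

using System;
using System.Linq;

static class ShuffledRange
{
    public static void Shuffle(int[] x, int seed)
    {
        var rng = new Random(seed);
        for (int i = x.Length - 1; i >= 1; i--)
        {
            int j = rng.Next(i + 1);        // random integer with 0 <= j <= i
            (x[i], x[j]) = (x[j], x[i]);    // swap x[i] and x[j]
        }
    }
}

Usage: fill an array with the range (e.g. Enumerable.Range(0, n).ToArray()), call Shuffle once with a fixed seed, and then read the values off in order.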