Algorithm to calculate all possible subset

Algorithm to calculate all possible subset - algorithm

This should be a quite simple problem, but I don't have proper algorithmic training and find myself stuck trying to solve this.
I need to calculate the possible combinations to reach a number by adding a limited set of smaller numbers together.
Imagine that we are playing with LEGO and I have a brick that is 12 units long and I need to list the possible substitutions I can make with shorter bricks. For this example we may say that the available bricks are 2, 4, 6 and 12 units long.
What might be a good approach to building an algorithm that can calculate the substitions? There are no bounds on how many bricks I can use at a time, so it could be 6x2 as well as 1x12, the important thing is I need to list all of the options.
So the inputs are the target length (in this case 12) and available bricks (an array of numbers (arbitrary length), in this case [2, 4, 6, 12]).
My approach was to start with the low number and add it up until I reach the target, then take the next lowest and so on. But that way I miss out on the combinations of multiple numbers and when I try to factor that in, it gets really messy.

I suggest a recursive approach: given a function f(target,permissibles) to list all representations of target as a combination of permissibles, you can do this:
def f(target,permissibles):
for x in permissibles:
collect f(target - x, permissibles)
if you do not want to differentiate between 12 = 4+4+2+2 and 12=2+4+2+4, you need to sort permissibles in the descending order and do
def f(target,permissibles):
for x in permissibles:
collect f(target - x, permissibles.remove(larger than x))

Related

Algorithm for seeing if many different arrays are subsets of another one?

Let's say I have an array of ~20-100 integers, for example [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (actually numbers more like [106511349 , 173316561, ...], all nonnegative 64-bit integers under 2^63, but for demonstration purposes let's use these).
And many (~50,000) smaller arrays of usually 1-20 terms to match or not match:
1=[2, 3, 8, 20]
2=[2, 3, NOT 8]
3=[2, 8, NOT 16]
4=[2, 8, NOT 16] (there will be duplicates with different list IDs)
I need to find which of these are subsets of the array being tested. A matching list must have all of the positive matches, and none of the negative ones. So for this small example, I would need to get back something like [3, 4]. List 1 fails to match because it requires 20, and list 2 fails to match because it has NOT 8. The NOT can easily be represented by using the high bit/making the number negative in those cases.
I need to do this quickly up to 10,000 times per second . The small arrays are "fixed" (they change infrequently, like once every few seconds), while the large array is done per data item to be scanned (so 10,000 different large arrays per second).
This has become a bit of a bottleneck, so I'm looking into ways to optimize it.
I'm not sure the best data structures or ways to represent this. One solution would be to turn it around and see what small lists we even need to consider:
2=[1, 2, 3, 4]
3=[1, 2]
8=[1, 2, 3, 4]
16=[3, 4]
20=[1]
Then we'd build up a list of lists to check, and do the full subset matching on these. However, certain terms (often the more frequent ones) are going to end up in many of the lists, so there's not much of an actual win here.
I was wondering if anyone is aware of a better algorithm for solving this sort of problem?

you could try to make a tree with the smaller arrays since they change less frequently, such that each subtree tries to halve the number of small arrays left.
For example, do frequency analysis on numbers in the smaller arrays. Find which number is found in closest to half of the smaller arrays. Make that the first check in the tree. In your example that would be '3' since it occurs in half the small arrays. Now that's the head node in the tree. Now put all the small lists that contain 3 to the left subtree and all the other lists to the right subtree. Now repeat this process recursively on each subtree. Then when a large array comes in, reverse index it, and then traverse the subtree to get the lists.

You did not state which of your arrays are sorted - if any.
Since your data is not that big, I would use a hash-map to store the entries of the source set (the one with ~20-100 integers). That would basically let you test if a integer is present in O(1).
Then, given that 50,000(arrays) * 20(terms each) * 8(bytes per term) = 8 megabytes + (hash map overhead), does not seem large either for most systems, I would use another hash-map to store tested arrays. This way you don't have to re-test duplicates.

I realize this may be less satisfying from a CS point of view, but if you're doing a huge number of tiny tasks that don't affect each other, you might want to consider parallelizing them (multithreading). 10,000 tasks per second, comparing a different array in each task, should fit the bill; you don't give any details about what else you're doing (e.g., where all these arrays are coming from), but it's conceivable that multithreading could improve your throughput by a large factor.

First, do what you were suggesting; make a hashmap from input integer to the IDs of the filter arrays it exists in. That lets you say "input #27 is in these 400 filters", and toss those 400 into a sorted set. You've then gotta do an intersection of the sorted sets for each one.
Optional: make a second hashmap from each input integer to it's frequency in the set of filters. When an input comes in, sort it using the second hashmap. Then take the least common input integer and start with it, so you have less overall work to do on each step. Also compute the frequencies for the "not" cases, so you basically get the most bang for your buck on each step.
Finally: this could be pretty easily made into a parallel programming problem; if it's not fast enough on one machine, it seems you could put more machines on it pretty easily, if whatever it's returning is useful enough.

Unique combination of 10 questions

How to form a combination of say 10 questions so that each student (total students = 10) get unique combination.
I don't want to use factorial.

you can use circular queue data structure
now you can cut this at any point you like , and it then it will give you a unique string
for example , if you cut this at point between 2 and 3 and then iterate your queue, you will get :
3, 4, 5, 6, 7, 8, 9, 10, 1, 2
so you need to implement a circular queue, then cut it from 10 different points (after 1, after 2[shown in picture 2],after 3,....)

There are 3,628,800 different permutations of 10 items taken 10 at a time.
If you only need 10 of them you could start with an array that has the values 1-10 in it. Then shuffle the array. That becomes your first permutation. Shuffle the array again and check to see that you haven't already generated that permutation. Repeat that process: shuffle, check, save, until you have 10 unique permutations.
It's highly unlikely (although possible) that you'll generate a duplicate permutation in only 10 tries.
The likelihood that you generate a duplicate increases as you generate more permutations, increasing to 50% by the time you've generated about 2,000. But if you just want a few hundred or less, then this method will do it for you pretty quickly.
The proposed circular queue technique works, too, and has the benefit of simplicity, but the resulting sequences are simply rotations of the original order, and it can't produce more than 10 without a shuffle. The technique I suggest will produce more "random" looking orderings.

Subset calculation of list of integers

I'm currently implementing an algorithm where one particular step requires me to calculate subsets in the following way.
Imagine I have sets (possibly millions of them) of integers. Where each set could potentially contain around a 1000 elements:
Set1: [1, 3, 7]
Set2: [1, 5, 8, 10]
Set3: [1, 3, 11, 14, 15]
...,
Set1000000: [1, 7, 10, 19]
Imagine a particular input set:
InputSet: [1, 7]
I now want to quickly calculate to which this InputSet is a subset. In this particular case, it should return Set1 and Set1000000.
Now, brute-forcing it takes too much time. I could also parallelise via Map/Reduce, but I'm looking for a more intelligent solution. Also, to a certain extend, it should be memory-efficient. I already optimised the calculation by making use of BloomFilters to quickly eliminate sets to which the input set could never be a subset.
Any smart technique I'm missing out on?
Thanks!

Well - it seems that the bottle neck is the number of sets, so instead of finding a set by iterating all of them, you could enhance performance by mapping from elements to all sets containing them, and return the sets containing all the elements you searched for.
This is very similar to what is done in AND query when searching the inverted index in the field of information retrieval.
In your example, you will have:
1 -> [set1, set2, set3, ..., set1000000]
3 -> [set1, set3]
5 -> [set2]
7 -> [set1, set7]
8 -> [set2]
...
EDIT:
In inverted index in IR, to save space we sometimes use d-gaps - meaning we store the offset between documents and not the actual number. For example, [2,5,10] will become [2,3,5]. Doing so and using delta encoding to represent the numbers tends to help a lot when it comes to space.
(Of course there is also a downside: you need to read the entire list in order to find if a specific set/document is in it, and cannot use binary search, but it sometimes worths it, especially if it is the difference between fitting the index into RAM or not).

How about storing a list of the sets which contain each number?
1 -- 1, 2, 3, 1000000
3 -- 1, 3
5 -- 2
etc.

Extending amit's solution, instead of storing the actual numbers, you could just store intervals and their associated sets.
For example using a interval size of 5:
(1-5): [1,2,3,1000000]
(6-10): [2,1000000]
(11-15): [3]
(16-20): [1000000]
In the case of (1,7) you should consider intervals (1-5) and (5-10) (which can be determined simply by knowing the size of the interval). Intersecting those ranges gives you [2,1000000]. Binary search of the sets shows that indeed, (1,7) exists in both sets.
Though you'll want to check the min and max values for each set to get a better idea of what the interval size should be. For example, 5 is probably a bad choice if the min and max values go from 1 to a million.
You should probably keep it so that a binary search can be used to check for values, so the subset range should be something like (min + max)/N, where 2N is the max number of values that will need to be binary searched in each set. For example, "does set 3 contain any values from 5 to 10?" this is done by finding the closest values to 5 (3) and 10 (11), in this case, no it does not. You would have to go through each set and do binary searches for the interval values that could be within the set. This means ensuring that you don't go searching for 100 when the set only goes up to 10.
You could also just store the range (min and max). However, the issue is that I suspect your numbers are going be be clustered, thus not providing much use. Although as mentioned, it'll probably be useful for determining how to set up the intervals.
It'll still be troublesome to pick what range to use, too large and it'll take a long time to build the data structure (1000 * million * log(N)). Too small, and you'll start to run into space issues. The ideal size of the range is probably such that it ensures that the number of set's related to each range is approximately equal, while also ensuring that the total number of ranges isn't too high.
Edit:
One benefit is that you don't actually need to store all intervals, just the ones you need. Although, if you have too many unused intervals, it might be wise to increase the interval and split the current intervals to ensure that the search is fast. This is especially true if processioning time isn't a major issue.

Start searching from biggest number (7) of input set and
eliminate other subsets (Set1 and Set1000000 will returned).
Search other input elements (1) in remaining sets.

Can I do better than binary search here?

I want to pick the top "range" of cards based upon a percentage. I have all my possible 2 card hands organized in an array in order of the strength of the hand, like so:
AA, KK, AKsuited, QQ, AKoff-suit ...
I had been picking the top 10% of hands by multiplying the length of the card array by the percentage which would give me the index of the last card in the array. Then I would just make a copy of the sub-array:
Arrays.copyOfRange(cardArray, 0, 16);
However, I realize now that this is incorrect because there are more possible combinations of, say, Ace King off-suit - 12 combinations (i.e. an ace of one suit and a king of another suit) than there are combinations of, say, a pair of aces - 6 combinations.
When I pick the top 10% of hands therefore I want it to be based on the top 10% of hands in proportion to the total number of 2 cards combinations - 52 choose 2 = 1326.
I thought I could have an array of integers where each index held the combined total of all the combinations up to that point (each index would correspond to a hand from the original array). So the first few indices of the array would be:
6, 12, 16, 22
because there are 6 combinations of AA, 6 combinations of KK, 4 combinations of AKsuited, 6 combinations of QQ.
Then I could do a binary search which runs in BigOh(log n) time. In other words I could multiply the total number of combinations (1326) by the percentage, search for the first index lower than or equal to this number, and that would be the index of the original array that I need.
I wonder if there a way that I could do this in constant time instead?

As Groo suggested, if precomputation and memory overhead permits, it would be more efficient to create 6 copies of AA, 6 copies of KK, etc and store them into a sorted array. Then you could run your original algorithm on this properly weighted list.
This is best if the number of queries is large.
Otherwise, I don't think you can achieve constant time for each query. This is because the queries depend on the entire frequency distribution. You can't look only at a constant number of elements to and determine if it's the correct percentile.

had a similar discussion here Algorithm for picking thumbed-up items As a comment to my answer (basically what you want to do with your list of cards), someone suggested a particular data structure, http://en.wikipedia.org/wiki/Fenwick_tree
Also, make sure your data structure will be able to provide efficient access to, say, the range between top 5% and 15% (not a coding-related tip though ;).

Tinyurl-style unique code: potential algorithm to prevent collisions

I have a system that requires a unique 6-digit code to represent an object, and I'm trying to think of a good algorithm for generating them. Here are the pre-reqs:
I'm using a base-20 system (no caps, numbers, vowels, or l to prevent confusion and naughty words)
The base-20 allows 64 million combinations
I'll be inserting potentially 5-10 thousand entries at once, so in theory I'd use bulk inserts, which means using a unique key probably won't be efficient or pretty (especially if there starts being lots of collisions)
It's not out of the question to fill up 10% of the combinations so there's a high potential for lots of collisions
I want to make sure the codes are non-consecutive
I had an idea that sounded like it would work, but I'm not good enough at math to figure out how to implement it: if I start at 0 and increment by N, then convert to base-20, it seems like there should be some value for N that lets me count each value from 0-63,999,999 before repeating any.
For example, going from 0 through 9 using N=3 (so 10 mod 3): 0, 3, 6, 9, 2, 5, 8, 1, 4, 7.
Is there some magic math method for figuring out values of N for some larger number that is able to count through the whole range without repeating? Ideally, the number I choose would sort of jump around the set such that it wasn't obvious that there was a pattern, but I'm not sure how possible that is.
Alternatively, a hashing algorithm that guaranteed uniqueness for values 0-64 million would work, but I'm way too dumb to know if that's possible.

All you need is a number that shares no factors with your key space. Easiest value is to use a prime number. You can google for large primes, or use http://primes.utm.edu/lists/small/10000.txt

Any prime number which is not a factor of the length of the sequence should be able to span the sequence without repeating. For 64000000, that means you shouldn't use 2 or 5. Of course, if you don't want them to be generated consecutively, generating them 2 or 5 apart is probably also not very good. I personally like the number 73973!

There is another method to get a similar result (jumping over the entire set of the values without repeating, nonconsequtively), without using the primes - by using maximum length sequences, which you can generate using specially constructed shift registers.

My math is a bit rusty, but I think you just need to ensure that the GCF of N and 64 million is 1. I'd go with a prime number (that doesn't divide evenly into 64 million) just in case though.

#Nick Lewis:
Well, only if the prime number doesn't divide 64 million. So, for the questioner's purposes, numbers like 2 or 5 would probably not be advisable.

Don't reinvent the wheel:
http://en.wikipedia.org/wiki/Universally_Unique_Identifier

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio