Unique combination of 10 questions - data-structures

How can I form combinations of, say, 10 questions so that each student (10 students in total) gets a unique combination?
I don't want to use factorial.

You can use a circular queue data structure.
Cut it at any point you like and iterate from there, and it will give you a unique sequence.
For example, if you cut it between 2 and 3 and then iterate the queue, you get:
3, 4, 5, 6, 7, 8, 9, 10, 1, 2
So you need to implement a circular queue and then cut it at 10 different points (after 1, after 2 [shown in picture 2], after 3, ...).
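As a minimal sketch of the cutting idea, a plain list can stand in for the circular queue: slicing at position k and appending the front gives the sequence you would read after cutting there. The function name `rotations` is made up for this illustration.

```python
def rotations(items):
    """Return every rotation of items, one per possible cut point."""
    items = list(items)
    # Cutting after position k means reading items[k:] first, then items[:k].
    return [items[k:] + items[:k] for k in range(len(items))]


questions = list(range(1, 11))
for student, order in enumerate(rotations(questions), start=1):
    print(student, order)
```

With 10 distinct questions this yields exactly 10 distinct orderings, one per student.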

There are 3,628,800 different permutations of 10 items taken 10 at a time.
If you only need 10 of them you could start with an array that has the values 1-10 in it. Then shuffle the array. That becomes your first permutation. Shuffle the array again and check to see that you haven't already generated that permutation. Repeat that process: shuffle, check, save, until you have 10 unique permutations.
It's highly unlikely (although possible) that you'll generate a duplicate permutation in only 10 tries.
The likelihood that you generate a duplicate increases as you generate more permutations, reaching about 50% by the time you've generated roughly 2,200 (the birthday-problem estimate, about 1.18 times the square root of 3,628,800). But if you just want a few hundred or less, then this method will do it for you pretty quickly.
The proposed circular queue technique works, too, and has the benefit of simplicity, but the resulting sequences are simply rotations of the original order, and it can't produce more than 10 without a shuffle. The technique I suggest will produce more "random" looking orderings.
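The shuffle, check, save loop could be sketched as follows. This is only an illustration: the name `unique_shuffles` and the optional `rng` parameter are assumptions, and the items are assumed to be distinct.

```python
import math
import random


def unique_shuffles(items, count, rng=None):
    """Collect `count` distinct random orderings of `items` (assumed distinct)."""
    items = list(items)
    if count > math.factorial(len(items)):
        raise ValueError("not enough distinct permutations")
    rng = rng or random.Random()
    seen = set()
    result = []
    while len(result) < count:
        perm = items[:]
        rng.shuffle(perm)          # shuffle
        key = tuple(perm)
        if key not in seen:        # check
            seen.add(key)
            result.append(perm)    # save
    return result


for order in unique_shuffles(range(1, 11), 10):
    print(order)
```

The set of already seen tuples makes the duplicate check cheap, so the loop only repeats in the rare case of a collision.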

Related

Sorting with limited stack operations

I am working on a sorting machine, and to minimize complexity, I would like to keep the moving parts to a minimum. I've come to the following design:
1 Input Stack
2+ Output Stacks
When starting, the machine already knows all the items, their current order, and their desired order.
The machine can move one item from the bottom of the input stack to the bottom of an output stack of its choice.
The machine can move all items from an output stack to the top of the input stack. This is called a "return". (In my machine, I plan for this to be done by the user.)
The machine only accesses the bottom of a stack, except by a return. When a stack is returned to the input, the "new" items will be the last items out of the input. This also means that if the machine moves a set of items from the input to one output, the order of those items is reversed.
The goal of the machine is to take all the items from the input stack, and eventually move them all to an output stack in sorted order. A secondary goal is to reduce the number of "stack returns" to a minimum, because in my machine, this is the part that requires user intervention. Ideally, the machine should do as much sorting as it can without the user's help.
The issue I'm encountering is that I can't seem to find an appropriate algorithm for doing the actual sorting. Pretty much all algorithms I can find rely on being able to swap arbitrary elements. Distribution/external sorting seems promising, but all the algorithms I can find seem to rely on accessing multiple inputs at once.
Since the machine already knows all the items, I can take advantage of this and sort all the items "in-memory". I experimented with "path-finding" from the unsorted state to the sorted state, but I'm unable to get it to actually converge on a solution. (It commonly just gets stuck in a loop moving stacks back and forth.)
Preferably, I would like a solution that works with a minimum of 2 output stacks, but is able to use more if available.
Interestingly, this is a "game" you can play with standard playing cards:
Get as many cards as you would like to sort. (I usually get 13 of a suit.)
Shuffle them and put them in your hand. Decide how many output stacks you get.
You have two valid moves:
You may move the front-most card in your hand and put it on top of any output stack.
You may pick up all the cards in an output stack and put them at the back of the cards you have in hand.
You win when the cards are in order in an output stack. Your score is the number of times you picked up a stack. Lower scores are better.
This can be done in O(log(n)) returns of an output to an input. More precisely in no more than 2 ceil(log_2(n)) - 1 returns if 1 < n.
Let's call the output stacks A and B.
First consider the simplest algorithm that works. We run through them, putting the smallest card on B and the rest on A. Then put A on input and repeat. After n passes you've got them in sorted order. Not very efficient, but it works.
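Here is a small simulation of that simplest algorithm, as a sketch only. It assumes the cards are distinct and models a return the way the question describes it: the cards come back out in the reverse of the order in which they were placed on the stack. The name `simple_stack_sort` is invented for this example.

```python
from collections import deque


def simple_stack_sort(cards):
    """Sort with two output stacks A and B; return (B, number of returns)."""
    hand = deque(cards)   # front of the deque is the next card dealt
    a, b = [], []
    returns = 0
    while hand:
        smallest = min(hand)
        while hand:
            card = hand.popleft()
            # The smallest remaining card goes to B, everything else to A.
            (b if card == smallest else a).append(card)
        if a:
            # Return A to the input: the last card placed comes out first.
            hand.extend(reversed(a))
            a.clear()
            returns += 1
    return b, returns


print(simple_stack_sort([5, 2, 4, 1, 3]))
```

Each pass moves exactly one card to B, so n cards need n passes and n - 1 returns.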
Now can we make it so that we pull out 2 cards per pass? Well if we had cards 1, 4, 5, 8, 9, 12, ... in the top half and the rest in the bottom half, then the first pass will find card 1 before card 2, reverse them, the second finds card 3 before card 4, reverses them, and so on. 2 cards per pass. But with 1 pass with 2 returns we can put all the cards we want in the top half on stack A, and the rest on stack B, return stack A, return stack B, and then start extracting. This takes 2 + n/2 passes.
How about 4 cards per pass? Well we want it divided into quarters. With the top quarter having cards 1, 8, 9, 16, .... The second quarter having 2, 7, 10, 15, .... The third having 3, 6, 11, 14, .... And the last having 4, 5, 12, 13, .... Basically if you were dealing them you deal the first 4 in order, the second 4 in reverse, the next four in order, and so on.
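The back-and-forth dealing pattern can be written down directly. The sketch below only computes which cards belong to which group; it does not perform the passes. The name `snake_groups` is made up here.

```python
def snake_groups(n, k):
    """Deal cards 1..n into k groups, reversing direction on every row."""
    groups = [[] for _ in range(k)]
    for card in range(1, n + 1):
        row, col = divmod(card - 1, k)
        # Even rows are dealt left to right, odd rows right to left.
        index = col if row % 2 == 0 else k - 1 - col
        groups[index].append(card)
    return groups


print(snake_groups(16, 4))
print(snake_groups(12, 2))
```

With k = 2 the first group is 1, 4, 5, 8, 9, 12, which is the "top half" used for the 2 cards per pass version.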
We can divide them into quarters in 2 passes. Can we figure out how to get there? Well working backwards, after the second pass we want A to have quarters 2,1. And B to have quarters 4,3. Then we return A, return B, and we're golden. So after the first pass we wanted A to have quarters 2,4 and B to have quarters 1,3, return A return B.
Turning that around to work forwards, in pass 1 we put groups 2,4 on A, 1,3 on B. Return A, return B. Then in pass 2 we put groups 1,2 on A, 3,4 on B, return A, return B. Then we start dealing and we get 4 cards out per pass. So now we're using 4 + n/4 returns.
If you continue the logic forward, in 3 passes (6 returns) you can figure out how to get 8 cards per pass on the extract phase. In 4 passes (8 returns) you can get 16 cards per pass. And so on. The logic is complex, but all you need to do is remember that you want them to wind up in order ... 5, 4, 3, 2, 1. Work backwards from the last pass to the first figuring out how you must have done it. And then you have your forward algorithm.
If you play with the numbers, when n is a power of 2 two options do equally well. You can take log_2(n) - 2 passes with 2 log_2(n) - 4 returns and then 4 extraction passes with 3 returns between them, for 2 log_2(n) - 1 returns in total. Or you can take log_2(n) - 1 passes with 2 log_2(n) - 2 returns and then 2 extraction passes with 1 return between them, again for 2 log_2(n) - 1 returns. (This is assuming, of course, that n is sufficiently large that it can be so divided. Which means "not 1" for the second version of the algorithm.) We'll see shortly a small reason to prefer the former version of the algorithm if 2 < n.
OK, this is great if n is a power of 2. But what if you have, say, 10 cards? Well, insert imaginary cards until you reach the next power of 2. We follow the algorithm for that, and simply don't actually do the operations that we would have done on the imaginary cards, and we get the exact results we would have gotten, except with the imaginary cards not there.
So we have a general solution which takes no more than 2 ceil(log_2(n)) - 1 returns.
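To get a feel for the bound, here is a tiny helper that evaluates 2 ceil(log_2(n)) - 1 with integer arithmetic. The name `return_bound` is made up; it only computes the number from the formula and does not run the algorithm.

```python
def return_bound(n):
    """Upper bound 2 * ceil(log2(n)) - 1 on the number of returns, for n > 1."""
    if n < 2:
        raise ValueError("the bound is stated for 1 < n")
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 2.
    return 2 * (n - 1).bit_length() - 1


for n in (2, 10, 13, 16, 52):
    print(n, return_bound(n), n - 1)   # bound vs. returns of the one-card-per-pass method
```

For 13 cards the bound is 7 returns, compared with 12 for the simplest approach.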
And now we see why to prefer breaking that into 4 groups instead of 2. If we break into 4 groups, it is possible that the 4th group is only imaginary cards and we get to skip one more return. If we break into 2 groups, there always are real cards in each group and we don't get to save a return.
This speeds us up by 1 if n is 3, 5, 6, 9, 10, 11, 12, 17, 18, ....
Calculating the exact rules is going to be complicated, and I won't try to write code to do it. But you should be able to figure it out from here.
I can't prove it, but there is a chance that this algorithm is optimal in the sense that there are permutations of cards which you can't do better than this on. (There are permutations that you can beat this algorithm with, of course. For example if I hand you everything in reverse, just extracting them all is better than this algorithm.) However I expect that finding the optimal strategy for a given permutation is an NP-complete problem.

Algorithm for seeing if many different arrays are subsets of another one?

Let's say I have an array of ~20-100 integers, for example [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (actually numbers more like [106511349 , 173316561, ...], all nonnegative 64-bit integers under 2^63, but for demonstration purposes let's use these).
And many (~50,000) smaller arrays of usually 1-20 terms to match or not match:
1=[2, 3, 8, 20]
2=[2, 3, NOT 8]
3=[2, 8, NOT 16]
4=[2, 8, NOT 16] (there will be duplicates with different list IDs)
I need to find which of these are subsets of the array being tested. A matching list must have all of the positive matches, and none of the negative ones. So for this small example, I would need to get back something like [3, 4]. List 1 fails to match because it requires 20, and list 2 fails to match because it has NOT 8. The NOT can easily be represented by using the high bit/making the number negative in those cases.
I need to do this quickly up to 10,000 times per second . The small arrays are "fixed" (they change infrequently, like once every few seconds), while the large array is done per data item to be scanned (so 10,000 different large arrays per second).
This has become a bit of a bottleneck, so I'm looking into ways to optimize it.
I'm not sure the best data structures or ways to represent this. One solution would be to turn it around and see what small lists we even need to consider:
2=[1, 2, 3, 4]
3=[1, 2]
8=[1, 2, 3, 4]
16=[3, 4]
20=[1]
Then we'd build up a list of lists to check, and do the full subset matching on these. However, certain terms (often the more frequent ones) are going to end up in many of the lists, so there's not much of an actual win here.
I was wondering if anyone is aware of a better algorithm for solving this sort of problem?
You could try to make a tree with the smaller arrays, since they change less frequently, such that each subtree tries to halve the number of small arrays left.
For example, do frequency analysis on numbers in the smaller arrays. Find which number is found in closest to half of the smaller arrays. Make that the first check in the tree. In your example that would be '3' since it occurs in half the small arrays. Now that's the head node in the tree. Now put all the small lists that contain 3 to the left subtree and all the other lists to the right subtree. Now repeat this process recursively on each subtree. Then when a large array comes in, reverse index it, and then traverse the subtree to get the lists.
You did not state which of your arrays are sorted - if any.
Since your data is not that big, I would use a hash-map to store the entries of the source set (the one with ~20-100 integers). That would basically let you test if an integer is present in O(1).
Then, given that 50,000(arrays) * 20(terms each) * 8(bytes per term) = 8 megabytes + (hash map overhead), does not seem large either for most systems, I would use another hash-map to store tested arrays. This way you don't have to re-test duplicates.
I realize this may be less satisfying from a CS point of view, but if you're doing a huge number of tiny tasks that don't affect each other, you might want to consider parallelizing them (multithreading). 10,000 tasks per second, comparing a different array in each task, should fit the bill; you don't give any details about what else you're doing (e.g., where all these arrays are coming from), but it's conceivable that multithreading could improve your throughput by a large factor.
First, do what you were suggesting; make a hashmap from input integer to the IDs of the filter arrays it exists in. That lets you say "input #27 is in these 400 filters", and toss those 400 into a sorted set. You've then gotta do an intersection of the sorted sets for each one.
Optional: make a second hashmap from each input integer to its frequency in the set of filters. When an input comes in, sort it using the second hashmap. Then take the least common input integer and start with it, so you have less overall work to do on each step. Also compute the frequencies for the "not" cases, so you basically get the most bang for your buck on each step.
Finally: this could be pretty easily made into a parallel programming problem; if it's not fast enough on one machine, it seems you could put more machines on it pretty easily, if whatever it's returning is useful enough.
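One way to use the hashmap from integers to filter IDs is sketched below. Note that it swaps the sorted-set intersection for a counting variant: each filter counts how many of its positive terms were seen, and any negative term seen blocks the filter. All names (`build_filter_index`, `matching_filters`, the tuple layout of a filter) are assumptions for this illustration.

```python
from collections import defaultdict


def build_filter_index(filters):
    """filters maps id -> (required values, forbidden values)."""
    positive, negative = defaultdict(list), defaultdict(list)
    required, unconditional = {}, []
    for fid, (must_have, must_not_have) in filters.items():
        required[fid] = len(set(must_have))
        if not required[fid]:
            unconditional.append(fid)       # no positive terms at all
        for v in set(must_have):
            positive[v].append(fid)
        for v in set(must_not_have):
            negative[v].append(fid)
    return positive, negative, required, unconditional


def matching_filters(values, index):
    """Return the sorted ids of all filters matched by the large array."""
    positive, negative, required, unconditional = index
    hits = defaultdict(int)
    blocked = set()
    for v in set(values):
        for fid in positive.get(v, ()):
            hits[fid] += 1                  # one more required term found
        blocked.update(negative.get(v, ())) # a NOT term is present
    candidates = list(hits) + unconditional
    return sorted(f for f in candidates
                  if hits.get(f, 0) == required[f] and f not in blocked)


filters = {1: ([2, 3, 8, 20], []), 2: ([2, 3], [8]),
           3: ([2, 8], [16]), 4: ([2, 8], [16])}
print(matching_filters(range(10), build_filter_index(filters)))
```

The index is rebuilt only when the small arrays change; each large array then costs time proportional to the posting lists of its own values rather than to all 50,000 filters.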

Algorithm to calculate all possible subset

This should be a quite simple problem, but I don't have proper algorithmic training and find myself stuck trying to solve this.
I need to calculate the possible combinations to reach a number by adding a limited set of smaller numbers together.
Imagine that we are playing with LEGO and I have a brick that is 12 units long and I need to list the possible substitutions I can make with shorter bricks. For this example we may say that the available bricks are 2, 4, 6 and 12 units long.
What might be a good approach to building an algorithm that can calculate the substitutions? There are no bounds on how many bricks I can use at a time, so it could be 6x2 as well as 1x12, the important thing is I need to list all of the options.
So the inputs are the target length (in this case 12) and available bricks (an array of numbers (arbitrary length), in this case [2, 4, 6, 12]).
My approach was to start with the low number and add it up until I reach the target, then take the next lowest and so on. But that way I miss out on the combinations of multiple numbers and when I try to factor that in, it gets really messy.
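I suggest a recursive approach: given a function f(target, permissibles) that lists all representations of target as a combination of permissibles (all of them positive), you can do this:
    def f(target, permissibles):
        if target == 0:
            return [[]]  # exactly one way to reach 0: use no more bricks
        return [[x] + rest for x in permissibles if x <= target
                for rest in f(target - x, permissibles)]
If you do not want to differentiate between 12 = 4+4+2+2 and 12 = 2+4+2+4, sort permissibles in descending order and drop everything larger than x in the recursive call:
    def f(target, permissibles):
        if target == 0:
            return [[]]
        # Later bricks may be no larger than x, so each combination appears once.
        return [[x] + rest for x in permissibles if x <= target
                for rest in f(target - x, [p for p in permissibles if p <= x])]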
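For the LEGO example, a self-contained sketch of the order-insensitive variant might look like this. The name `brick_combinations` is made up, and brick lengths are assumed to be positive integers.

```python
def brick_combinations(target, bricks):
    """List every multiset of bricks summing to target, largest brick first."""
    bricks = sorted(set(bricks), reverse=True)

    def extend(remaining, allowed):
        if remaining == 0:
            return [[]]                      # nothing left to fill
        found = []
        for i, brick in enumerate(allowed):
            if brick <= remaining:
                # Later bricks may not be larger than this one.
                for rest in extend(remaining - brick, allowed[i:]):
                    found.append([brick] + rest)
        return found

    return extend(target, bricks)


for combo in brick_combinations(12, [2, 4, 6, 12]):
    print(combo)
```

For a target of 12 with bricks 2, 4, 6 and 12 this lists 8 substitutions, from a single 12 down to six 2s.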

Subset calculation of list of integers

I'm currently implementing an algorithm where one particular step requires me to calculate subsets in the following way.
Imagine I have sets (possibly millions of them) of integers. Where each set could potentially contain around a 1000 elements:
Set1: [1, 3, 7]
Set2: [1, 5, 8, 10]
Set3: [1, 3, 11, 14, 15]
...,
Set1000000: [1, 7, 10, 19]
Imagine a particular input set:
InputSet: [1, 7]
I now want to quickly calculate which sets this InputSet is a subset of. In this particular case, it should return Set1 and Set1000000.
Now, brute-forcing it takes too much time. I could also parallelise via Map/Reduce, but I'm looking for a more intelligent solution. Also, to a certain extent, it should be memory-efficient. I already optimised the calculation by making use of BloomFilters to quickly eliminate sets to which the input set could never be a subset.
Any smart technique I'm missing out on?
Thanks!
Well, it seems that the bottleneck is the number of sets, so instead of finding a set by iterating over all of them, you could improve performance by mapping from elements to all sets containing them, and returning the sets that contain all the elements you searched for.
This is very similar to what is done in AND query when searching the inverted index in the field of information retrieval.
In your example, you will have:
1 -> [set1, set2, set3, ..., set1000000]
3 -> [set1, set3]
5 -> [set2]
7 -> [set1, set1000000]
8 -> [set2]
...
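A minimal sketch of such an inverted index and the AND query over it, using the four sets that are spelled out in the question. The function names are invented for this illustration.

```python
from collections import defaultdict


def build_inverted_index(sets):
    """Map every element to the names of the sets that contain it."""
    index = defaultdict(set)
    for name, members in sets.items():
        for member in members:
            index[member].add(name)
    return index


def supersets_of(query, index, all_names):
    """Return the names of all sets that contain every element of query."""
    postings = sorted((index.get(x, set()) for x in query), key=len)
    if not postings:
        return set(all_names)       # the empty set is a subset of everything
    result = set(postings[0])       # start from the rarest element
    for posting in postings[1:]:
        result &= posting
        if not result:
            break
    return result


sets = {"Set1": [1, 3, 7], "Set2": [1, 5, 8, 10],
        "Set3": [1, 3, 11, 14, 15], "Set1000000": [1, 7, 10, 19]}
index = build_inverted_index(sets)
print(supersets_of([1, 7], index, sets))
```

Starting with the shortest posting list keeps the intermediate result small, which matters when a common element such as 1 appears in almost every set.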
EDIT:
In inverted index in IR, to save space we sometimes use d-gaps - meaning we store the offset between documents and not the actual number. For example, [2,5,10] will become [2,3,5]. Doing so and using delta encoding to represent the numbers tends to help a lot when it comes to space.
(Of course there is also a downside: you need to read the entire list in order to find out whether a specific set/document is in it, and you cannot use binary search, but it is sometimes worth it, especially if it is the difference between fitting the index into RAM or not.)
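The d-gap transformation itself is short. This sketch covers only the gap step, not the variable-length encoding of the gaps; the names `to_gaps` and `from_gaps` are made up.

```python
from itertools import accumulate


def to_gaps(sorted_ids):
    """Replace each id by its distance from the previous one (first stays as is)."""
    previous = [0] + list(sorted_ids[:-1])
    return [current - prev for prev, current in zip(previous, sorted_ids)]


def from_gaps(gaps):
    """Rebuild the original ids with a running sum."""
    return list(accumulate(gaps))


print(to_gaps([2, 5, 10]))
print(from_gaps([2, 3, 5]))
```

Decoding needs the running sum from the start of the list, which is why binary search on the encoded form is not possible.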
How about storing a list of the sets which contain each number?
1 -- 1, 2, 3, 1000000
3 -- 1, 3
5 -- 2
etc.
Extending amit's solution, instead of storing the actual numbers, you could just store intervals and their associated sets.
For example, using an interval size of 5:
(1-5): [1,2,3,1000000]
(6-10): [1,2,1000000]
(11-15): [3]
(16-20): [1000000]
In the case of (1,7) you should consider intervals (1-5) and (6-10) (which can be determined simply by knowing the size of the interval). Intersecting those lists gives you [1,2,1000000]. A binary search of those sets then shows that (1,7) is contained in sets 1 and 1000000, but not in set 2.
Though you'll want to check the min and max values for each set to get a better idea of what the interval size should be. For example, 5 is probably a bad choice if the min and max values go from 1 to a million.
You should probably keep it so that a binary search can be used to check for values, so the subset range should be something like (min + max)/N, where 2N is the max number of values that will need to be binary searched in each set. For example, "does set 3 contain any values from 5 to 10?" this is done by finding the closest values to 5 (3) and 10 (11), in this case, no it does not. You would have to go through each set and do binary searches for the interval values that could be within the set. This means ensuring that you don't go searching for 100 when the set only goes up to 10.
You could also just store the range (min and max). However, the issue is that I suspect your numbers are going to be clustered, thus not providing much use. Although as mentioned, it'll probably be useful for determining how to set up the intervals.
It'll still be troublesome to pick what range to use: too large and it'll take a long time to build the data structure (1000 * million * log(N)). Too small, and you'll start to run into space issues. The ideal size of the range is probably such that it ensures that the number of sets related to each range is approximately equal, while also ensuring that the total number of ranges isn't too high.
Edit:
One benefit is that you don't actually need to store all intervals, just the ones you need. Although, if you have too many unused intervals, it might be wise to increase the interval and split the current intervals to ensure that the search is fast. This is especially true if processing time isn't a major issue.
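The interval idea could be sketched like this: index the sets by fixed-size buckets, intersect the buckets touched by the query to get candidates, then confirm each candidate with a binary search. The bucket size, the function names, and the use of integer set ids are all assumptions of this example, and each set is assumed to be stored as a sorted list.

```python
from bisect import bisect_left
from collections import defaultdict


def bucket_of(value, size):
    return (value - 1) // size      # with size 5: 1-5 -> 0, 6-10 -> 1, ...


def build_bucket_index(sets, size):
    """Map each bucket to the ids of the sets with a value in that bucket."""
    index = defaultdict(set)
    for set_id, members in sets.items():
        for value in members:
            index[bucket_of(value, size)].add(set_id)
    return index


def contains(sorted_members, value):
    i = bisect_left(sorted_members, value)
    return i < len(sorted_members) and sorted_members[i] == value


def supersets_via_buckets(query, sets, index, size):
    candidates = None
    for value in query:
        ids = index.get(bucket_of(value, size), set())
        candidates = set(ids) if candidates is None else candidates & ids
    if candidates is None:
        return set(sets)            # empty query matches every set
    # Buckets only narrow the search; confirm with an exact membership test.
    return {s for s in candidates if all(contains(sets[s], v) for v in query)}


sets = {1: [1, 3, 7], 2: [1, 5, 8, 10], 3: [1, 3, 11, 14, 15], 1000000: [1, 7, 10, 19]}
index = build_bucket_index(sets, 5)
print(supersets_via_buckets([1, 7], sets, index, 5))
```

Only buckets that actually occur are stored, which matches the point about not needing to keep unused intervals.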
Start searching from the biggest number (7) of the input set and eliminate the sets that do not contain it (Set1 and Set1000000 will be returned). Then search for the other input elements (1) in the remaining sets.

Can I do better than binary search here?

I want to pick the top "range" of cards based upon a percentage. I have all my possible 2 card hands organized in an array in order of the strength of the hand, like so:
AA, KK, AKsuited, QQ, AKoff-suit ...
I had been picking the top 10% of hands by multiplying the length of the card array by the percentage which would give me the index of the last card in the array. Then I would just make a copy of the sub-array:
Arrays.copyOfRange(cardArray, 0, 16);
However, I realize now that this is incorrect because there are more possible combinations of, say, Ace King off-suit - 12 combinations (i.e. an ace of one suit and a king of another suit) than there are combinations of, say, a pair of aces - 6 combinations.
When I pick the top 10% of hands therefore I want it to be based on the top 10% of hands in proportion to the total number of 2 cards combinations - 52 choose 2 = 1326.
I thought I could have an array of integers where each index held the combined total of all the combinations up to that point (each index would correspond to a hand from the original array). So the first few indices of the array would be:
6, 12, 16, 22
because there are 6 combinations of AA, 6 combinations of KK, 4 combinations of AKsuited, 6 combinations of QQ.
Then I could do a binary search, which runs in O(log n) time. In other words, I could multiply the total number of combinations (1326) by the percentage, search for the last index whose value is lower than or equal to this number, and that would be the index of the original array that I need.
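A sketch of the cumulative-count lookup using Python's `bisect` module. Only the first five hands from the question are used here, so the total is 34 rather than 1326; the name `top_range` is made up.

```python
from bisect import bisect_right
from itertools import accumulate


def top_range(hands, combos, fraction):
    """Return the strongest hands whose combined combinations fit the fraction."""
    cumulative = list(accumulate(combos))     # e.g. 6, 12, 16, 22, ...
    budget = fraction * cumulative[-1]
    # Number of hands whose running total does not exceed the budget.
    cut = bisect_right(cumulative, budget)
    return hands[:cut]


hands = ["AA", "KK", "AKs", "QQ", "AKo"]
combos = [6, 6, 4, 6, 12]
print(top_range(hands, combos, 0.5))
```

In practice the cumulative array would be computed once and reused, leaving only the O(log n) search per query.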
I wonder if there is a way that I could do this in constant time instead?
As Groo suggested, if precomputation and memory overhead permits, it would be more efficient to create 6 copies of AA, 6 copies of KK, etc and store them into a sorted array. Then you could run your original algorithm on this properly weighted list.
This is best if the number of queries is large.
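The weighted copy can be stored as a lookup table with one entry per combination, so that a query becomes a single index operation. This is a sketch of that idea; it stores hand positions instead of the hands themselves, and the names `build_lookup` and `hands_within` are assumptions.

```python
def build_lookup(combos):
    """One entry per combination, holding the position of the hand it belongs to."""
    lookup = []
    for position, count in enumerate(combos):
        lookup.extend([position] * count)
    return lookup


def hands_within(fraction, lookup):
    """Constant-time count of hands that fit entirely inside the top fraction."""
    k = int(fraction * len(lookup))
    if k >= len(lookup):
        return lookup[-1] + 1       # the whole list qualifies
    # Combination number k belongs to the first hand that does NOT fit entirely.
    return lookup[k]


combos = [6, 6, 4, 6, 12]           # AA, KK, AKs, QQ, AKo
lookup = build_lookup(combos)
print(hands_within(0.5, lookup))
```

For the full 1326 combinations the table has 1326 small entries, so the memory cost is modest compared with the saved searches when queries are frequent.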
Otherwise, I don't think you can achieve constant time for each query. This is because the queries depend on the entire frequency distribution. You can't look at only a constant number of elements and determine whether it's the correct percentile.
I had a similar discussion in "Algorithm for picking thumbed-up items". In a comment on my answer (basically what you want to do with your list of cards), someone suggested a particular data structure: http://en.wikipedia.org/wiki/Fenwick_tree
Also, make sure your data structure will be able to provide efficient access to, say, the range between top 5% and 15% (not a coding-related tip though ;).

Resources