Random Pairings that don't Repeat - algorithm

This little project / problem came out of left field for me. Hoping someone can help me here. I have some rough ideas but I am sure (or at least I hope) a simple, fairly efficient solution exists.
Thanks in advance.... pseudo code is fine. I generally work in .NET / C# if that sheds any light on your solution.
Given:
A pool of n individuals that will be meeting on a regular basis. I need to form pairs that have not previously met. The pool of individuals will slowly change over time. For the purposes of pairing, (A & B) and (B & A) constitute the same pair. The history of previous pairings is maintained. For the purpose of the problem, assume an even number of individuals. For each meeting (a collection of pairs), an individual will only pair up once.
Is there an algorithm that will allow us to form these pairs? Ideally something better than just ordering the pairs in a random order, generating pairings and then checking against the history of previous pairings. In general, randomness within the pairing is ok.
A bit more:
I can figure out a number of ways to create a randomized pool from which to pull pairs of individuals, check those against the history, and either throw them back in the pool or remove them and add them to the list of paired individuals. What I can't get my head around is that at some point I will be left with a list of individuals that cannot be paired up. But... some of those individuals could possibly be paired with members that are already in the paired list. I could throw one of those partners back in the pool of unpaired members, but this seems to lead to a loop that would be difficult to test and that could run on forever.

Interesting idea for converting a standard search into a probability selection:
Load the history in a structure with O(1) "contains" tests, e.g. a HashSet of (A,B) pairs.
Loop through each of the 0.5*n*(n-1) possible pairings:
    check if this pairing is in the history
    if it is, continue to the next iteration of the loop
    increase the "number found" counter
    save this pairing as "result" with probability 1/"number found" (i.e. always for the first unused pairing found)
Finally, if "result" holds an answer then use it; else all possibilities are exhausted.
This will run in O(n^2) + O(size of history), and nicely detects the case when all possibilities are exhausted.
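A rough C# sketch of that loop, assuming individuals are indexed 0..n-1 and the history stores each pair with the smaller index first (the method name and parameters are just illustrative):

using System;
using System.Collections.Generic;

static class ReservoirPick
{
    // Returns one not-yet-used pair chosen uniformly at random,
    // or null if every possible pairing is already in the history.
    public static (int A, int B)? PickUnusedPair(int n, HashSet<(int, int)> history, Random rng)
    {
        (int A, int B)? result = null;
        int numberFound = 0;

        for (int i = 0; i < n; i++)
        {
            for (int j = i + 1; j < n; j++)
            {
                if (history.Contains((i, j)))      // this pair has already met, skip it
                    continue;

                numberFound++;
                if (rng.Next(numberFound) == 0)    // keep with probability 1/numberFound
                    result = (i, j);
            }
        }
        return result;                             // null => all possibilities exhausted
    }
}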

Based on your requirements, I think what you really need is quasi-random numbers that ultimately result in uniform coverage of your data (i.e., everyone pairs up with everyone else exactly once). Quasi-random pairings give you a much less "clumped" result than simple random pairings, with the added benefit that you have much greater control over the resulting data, hence you can enforce the unique-pairings rule without having to detect whether the newly randomized pairings duplicate the historically randomized pairings.
Check this wikipedia entry:
http://en.wikipedia.org/wiki/Low-discrepancy_sequence
More good reading:
http://www.johndcook.com/blog/2009/03/16/quasi-random-sequences-in-art-and-integration/
I tried to find a C# library that would help you generate the sort of quasi-random spreads you're looking for, but the only libraries I could find were in C/C++. I still recommend downloading the source, since the full logic of the quasi-random algorithms (look for quasi-Monte Carlo) is there:
http://www.gnu.org/software/gsl/

I see this as a graph problem where individuals are nodes and edges join individuals that have not yet been paired. With this reformulation, creating new pairs simply means finding a set of independent edges (edges without any common node), i.e. a matching.
That is not yet an answer, but there is a good chance that this is a common graph problem with well-known solutions.
One thing we can say at this point is that in some cases there may be no solution (you would have to redo some previous pairs).
It may also be simpler to consider the dual graph (exchanging the roles of edges and nodes: pairs would become nodes, and a common individual between two pairs would become an edge).

at startup, build a list of all possible pairings.
add all possible new pairings to this list as individuals are added, and remove any expired pairings as individuals are removed from the pool.
select new pairings randomly from this list, and remove them from the list when the pairing is selected.
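A minimal C# sketch of that bookkeeping, assuming individuals are identified by strings (the class and member names are just illustrative):

using System;
using System.Collections.Generic;

class PairingPool
{
    readonly List<(string A, string B)> candidates = new();
    readonly Random rng = new();

    // When a new individual joins, add a candidate pairing with everyone already present.
    public void AddIndividual(string newcomer, IEnumerable<string> existing)
    {
        foreach (var person in existing)
            candidates.Add((newcomer, person));
    }

    // When someone leaves, drop every candidate pairing involving them.
    public void RemoveIndividual(string leaver)
    {
        candidates.RemoveAll(p => p.A == leaver || p.B == leaver);
    }

    // Pick a random unused pairing and consume it; null if none are left.
    // Note: making sure each individual appears only once per meeting is left to the caller.
    public (string A, string B)? NextPairing()
    {
        if (candidates.Count == 0) return null;
        int i = rng.Next(candidates.Count);
        var pair = candidates[i];
        candidates.RemoveAt(i);
        return pair;
    }
}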

Form an upper triangular matrix with your elements:
Individual  A  B  C  D
    A       *
    B       *  *
    C       *  *  *
    D       *  *  *  *
Each blank element will contain True if the pair has been formed and False if not.
Each pairing session consists of looping through the columns of each row until a False is found; form the pair and set the matrix element to True.
When deleting an individual, delete its row and column.
If performance is an issue, you can keep the last pair formed for a row in a counter, updating it carefully when deleting.
When adding an individual, add a last row and column.
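Roughly, in C# (a greedy sketch; it can leave some individuals unpaired in a session even when a full pairing exists, so a fallback is still needed):

using System.Collections.Generic;

class MeetingMatrix
{
    // met[i, j] with i < j is true once individuals i and j have been paired.
    readonly bool[,] met;
    readonly int n;

    public MeetingMatrix(int n)
    {
        this.n = n;
        met = new bool[n, n];
    }

    // For each free row, scan the columns for the first individual
    // it has not met yet and who is still free this session.
    public List<(int, int)> FormPairs()
    {
        var pairs = new List<(int, int)>();
        var busy = new bool[n];                    // already paired in this session

        for (int i = 0; i < n; i++)
        {
            if (busy[i]) continue;
            for (int j = i + 1; j < n; j++)
            {
                if (busy[j] || met[i, j]) continue;
                met[i, j] = true;                  // record the new pairing
                busy[i] = busy[j] = true;
                pairs.Add((i, j));
                break;
            }
        }
        return pairs;
    }
}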

Your best bet is probably:
1. Load the history in a structure with fast access, e.g. a HashSet of (A,B) pairs.
2. Create a completely random set of pairings (e.g. by randomly shuffling the list of individuals and partitioning into adjacent pairs).
3. Check whether each pairing is in the history (both (A,B) and (B,A) should be checked).
4. If none of the pairings are found, you have a completely new pairing set as required; else go to 2.
Note that step 1 can be done once and simply updated when new pairings are created if you need to efficiently create large numbers of new unique pairings.
Also note that you will need to take some extra precautions if there is a chance that all possible pairings will be exhausted (in which case you need to bail out of the loop!)
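A rough C# sketch of steps 1-4, with a simple attempt cap as the bail-out mentioned above (the method name and the cap value are just illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

static class Pairer
{
    // One attempt: shuffle, partition into adjacent pairs, and accept only if
    // no pair appears in the history. Assumes an even number of individuals.
    public static List<(string, string)> NewPairing(
        List<string> people, HashSet<(string, string)> history, int maxAttempts = 10000)
    {
        var rng = new Random();
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            // Quick shuffle; a Fisher-Yates shuffle would be more rigorous.
            var shuffled = people.OrderBy(_ => rng.Next()).ToList();
            var pairs = new List<(string, string)>();
            bool clash = false;

            for (int i = 0; i < shuffled.Count; i += 2)
            {
                var a = shuffled[i];
                var b = shuffled[i + 1];
                if (history.Contains((a, b)) || history.Contains((b, a)))
                {
                    clash = true;                  // this set repeats a previous pairing
                    break;
                }
                pairs.Add((a, b));
            }

            if (!clash) return pairs;              // a completely new pairing set
        }
        return null;                               // bail out: possibilities likely exhausted
    }
}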

Is there any way of ordering two elements? If so, you can save a hash probe (or at least half of one) per iteration by always storing a pair in the same order. So, if you have A, B, C and D, the possible pairings would be [AB, CD], [AC, BD] or [AD, BC].
What I'd do then is something like:
pair_everyone(pool, pairs, history):
    if pool is empty:
        all done, update global history, return pairs
    repeat for pool_size/2:
        pick element1 (randomly from pool)
        pick element2 (randomly from pool)
        set pair = pair(e1, e2)
        until pair not in history or all possible pairs tried:
            pick element1 (randomly from pool)
            pick element2 (randomly from pool)
            set pair = pair(e1, e2)
        if pair is not in history:
            result = pair_everyone(pool - e1 - e2, pairs + pair, history + pair)
            if result != failure:
                return result
        else:
            return failure
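In the same spirit, a C# sketch (not a literal translation of the pseudocode above: it fixes the first person in the pool and backtracks over candidate partners in random order, so no pair is retried):

using System;
using System.Collections.Generic;
using System.Linq;

static class Backtracker
{
    // Returns a complete pairing of the pool avoiding the history, or null if none exists.
    // Pairs are stored with the lexicographically smaller name first so (A,B) == (B,A).
    public static List<(string, string)> PairEveryone(
        List<string> pool, HashSet<(string, string)> history)
    {
        if (pool.Count == 0) return new List<(string, string)>();   // everyone paired

        var first = pool[0];                                        // fix one person
        foreach (var partner in pool.Skip(1).OrderBy(_ => Guid.NewGuid()))
        {
            var pair = Canonical(first, partner);
            if (history.Contains(pair)) continue;                   // they already met

            var rest = pool.Where(p => p != first && p != partner).ToList();
            var tail = PairEveryone(rest, history);
            if (tail != null)                                       // the rest worked out
            {
                tail.Insert(0, pair);
                return tail;
            }
        }
        return null;                                                // dead end, caller backtracks
    }

    static (string, string) Canonical(string a, string b) =>
        string.CompareOrdinal(a, b) <= 0 ? (a, b) : (b, a);
}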

How about:
create a set CI of all current individuals
then:
randomly select one individual A and remove from CI
create a new set of possible partners PP by copying CI and removing all previous partners of A
if PP is empty, scan the list of pairs already found and swap A with an individual C who is paired with someone not in A's history and who still has possible partners in CI (A takes C's place in that pair); then recalculate PP treating C as the new A
if PP is not empty select one individual B from PP to be paired with A
remove B from CI
repeat until no new pair can be found

Related

Efficient way to find all possible combinations of elements in a list which add up to a target

I have a situation where I need to find all possible combinations of numbers in a list which add up to a target. For example, consider the list (4, 6, 10, 20, 30, 40) and target = 20. The possible combinations are (4, 6, 10) and (20).
A few optimizations I've considered:
Ignore all items in the list which are > target.
Add up all the numbers in the list; if the sum <= target, then:
    if the sum = target, return the entire list (as this will be the only combination that adds up to the target),
    else (the sum < target) the target can't be reached from the numbers in the list.
Otherwise, proceed to find all subsets of the list which add up to the target.
The above logic works fine for smaller lists (size not greater than 10). But in the worst case, say the list has up to 100 elements; then I need to look at 2^100 possible combinations (which may also cause integer overflow for the counter keeping track of the number of subsets), and this will slow down the program. Computing all subsets is O(2^n), where n = size of the list, so the growth is exponential.
Is there a better way to tackle this problem for larger lists? I'm looking for a generic approach to solve this problem without using any language specific features. Any help or suggestion is appreciated.
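For reference, a C# sketch of the approach described above (the pruning steps plus a recursive enumeration); it is still exponential in the worst case, which is exactly the problem being asked about:

using System.Collections.Generic;
using System.Linq;

static class TargetCombinations
{
    // Assumes positive numbers; duplicates in the input will produce duplicate combinations.
    public static List<List<int>> Find(List<int> numbers, int target)
    {
        // Optimization 1: drop items larger than the target, then sort ascending.
        var usable = numbers.Where(x => x <= target).OrderBy(x => x).ToList();
        var results = new List<List<int>>();

        // Optimization 2: compare the total sum with the target.
        long sum = usable.Sum(x => (long)x);
        if (sum < target) return results;                     // target is unreachable
        if (sum == target) { results.Add(usable); return results; }

        Search(usable, 0, target, new List<int>(), results);
        return results;
    }

    static void Search(List<int> nums, int start, int remaining,
                       List<int> current, List<List<int>> results)
    {
        if (remaining == 0) { results.Add(new List<int>(current)); return; }
        for (int i = start; i < nums.Count; i++)
        {
            if (nums[i] > remaining) break;                   // sorted, so nothing further fits
            current.Add(nums[i]);
            Search(nums, i + 1, remaining - nums[i], current, results);
            current.RemoveAt(current.Count - 1);
        }
    }
}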

How to perform minimum splits to satisfy special set ordering?

I'm trying to create an algorithm to solve the following problem:
Input is an unsorted list of sets containing pairs (key, value) of ints. The first of each pair is positive and unique within the set.
I want to find an algorithm to split the input sets so the sets can be ordered such that for each key the value is nondecreasing in the set order.
There is a trivial solution, which is to split the sets into their individual elements and sort them; I'd like something more efficient in terms of the number of sets which are split.
Are there any similar problems you have encountered and/or techniques you can suggest?
Does the optimal (minimum number of splits) solution sound like it is possible in polynomial time?
Edit: In the example the "<=" operator indicates a constraint on the sets as a whole whereby for each key value (100, 101, 102) the corresponding values are equal to or greater than the values in previous sets (or omitted from the set). I.e extracting the values for each key using the order from the output sets gives:
Key 100 {0, 1}
Key 101 {2, 3}
Key 102 {10, 15}
A*
I propose using A* to find an optimal solution. Build the order of split sets incrementally from left to right, minimizing the number of sets required to achieve this.
A* visits states based on some heuristic estimate of the total cost. I propose that a state is described by the totality of all the pairs already included in the order as we have it so far. If all values for every key are different, then you can represent this information rather concisely by simply storing the last value for each key. Otherwise you'll have to somehow take care of equal values, so you know which ones were already included and which ones were not. For every state you maintain some representation of the best order leading to it, but that may get updated along the way while the state remains the same.
The heuristic should be an estimate of the total cost of the path from the beginning through the current state to the goal. It may be too low, but must never be too high. In our case, the heuristic should count the number of (possibly split) sets included in the order so far, and add to that the number of (unsplit) sets still waiting for insertion. As the remaining sets may need splitting, this might be too low, but as you can never have fewer sets than those still waiting for insertion, it is a suitable heuristic.
Now you have some priority queue of states, ordered by the value of this heuristic. You extract minimal items from it, and know that the moment you extract a state from the queue, the cost up to that state cannot decrease any more, so the path up to that state is optimal. Now you examine what other states can be reached from this one: which other pairs can be next in the order of split sets? For each remaining set which has pairs that are ready to be included, you create a new subsequent state, taking all the pairs from the set which are ready. The cost so far increases by one. If you manage to take a whole set, without splitting, then the estimate for the remaining cost decreases by one.
For this new state, you check whether it is already present in your priority queue. If it is, and its previous cost was higher than the one just computed, then you update its cost and the optimal path leading to it. Make sure its priority key changes its position in the queue accordingly ("decrease key"). If the state wasn't present in the queue before, then add it to the queue.
Dijkstra
Come to think of it, this is the same as running Dijkstra's algorithm with the number of splits as the cost. And as each edge has either cost zero or cost one, you can implement this even more simply, without any priority queue at all. Instead, you can use two sets, called S₀ and S₁, where all elements of S₀ require the same number of splits and all elements of S₁ require one more split. Roughly sketched in pseudocode:
S₀ = ∅ (empty set)
S₁ = ∅
add initial state (no pairs added yet, all sets remain to be added) to S₀
while True:
    while S₀ ≠ ∅:
        x = take and remove any element from S₀
        if x is the target state (all pairs included in the order) then
            return the path information associated with it
        for (r: those sets which remain to be added in state x):
            if we can take r as a whole then
                let y be the state obtained by taking r as the next set in the order
                if y is in S₁, remove it from S₁
                add y to S₀
            else if we can add only some elements from r then
                let y be the state obtained by taking as many elements from r as possible
                if y is not in S₀, add it to S₁
    S₀ = S₁
    S₁ = ∅

Shuffle bag with repetition constraint

I'm looking for an algorithm to randomize a set of items with length n, where there might be multiples (1 to m) of each item. An additional constraint is that the same item may not appear within k items of the previous one.
You may assume n is well under 100, and there always is a solution, i.e. m as well as k are small. You can also change the input to a list of <item, frequency> pairs if that helps.
To give a bit of context, assume I'm generating missions in a game and have a set of goals to choose from. Some goals may appear multiple times (e.g. "kill the boss"), but should not be close to each other, so simply shuffling the "bag" is no good.
I could shuffle the list, then iterate over it while keeping track of item intervals, starting with a new shuffle if it fails the test, but I'm looking for a more elegant solution that should also be compact, practical and easily implemented with e.g. C, C++ or JavaScript. In other words it should not rely on special language features or standard library functions that I might not understand or could find hard time implementing. However, you may assume the most common list operations such as sorting and shuffling are available.
If you want uniform probabilities over the set of valid outcomes, my hunch is that the rejection scheme you proposed (shuffle and then restart if the arrangement is bad) is going to be the easiest to code correctly, understand, read, and maintain as well as probably fairly close to the fastest, assuming that the numbers are such that most permutations are valid.
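For reference, a minimal C# sketch of that rejection scheme (shuffle, validate the spacing constraint, restart on failure); the names are just illustrative, and the loop relies on the question's guarantee that a valid arrangement exists:

using System;
using System.Collections.Generic;
using System.Linq;

static class ShuffleBag
{
    // items: the multiset to arrange (may contain repeats);
    // k: the same item may not appear within k positions of its previous occurrence.
    public static List<string> ShuffleWithGap(List<string> items, int k, Random rng)
    {
        while (true)   // terminates because a valid arrangement is assumed to exist
        {
            var shuffled = items.OrderBy(_ => rng.Next()).ToList();
            if (IsValid(shuffled, k)) return shuffled;
        }
    }

    static bool IsValid(List<string> arranged, int k)
    {
        var lastSeen = new Dictionary<string, int>();
        for (int i = 0; i < arranged.Count; i++)
        {
            if (lastSeen.TryGetValue(arranged[i], out int prev) && i - prev <= k)
                return false;                      // same item within k positions
            lastSeen[arranged[i]] = i;
        }
        return true;
    }
}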
Here's another simple approach, though, based on greedily choosing valid values and hoping you don't knock yourself out. It's not at all guaranteed to find a solution if there are many invalid permutations (high m and k).
shuffled = list of length n
not_inserted = {0, 1, ..., n-1}
for each item i, frequency m_i, nearness constraint k_i:
    valid = not_inserted
    do m_i times:
        choose an index j from valid
        shuffled[j] = i
        not_inserted.remove(j)
        valid.remove(j-k_i, j-k_i+1, ..., j, ..., j+k_i)
If valid is ever empty, the partial solution you've built up is bad, and you'll probably have to restart. I'm guessing that failures will be less likely if you do the loop in order of decreasing m_i.
I'm not sure about how often this approach fails in comparison to the sorting/rejecting approach (it'd be interesting to implement and run it for some numbers...). I'd guess that it might be faster in situations where k is moderately high, but usually slower, because shuffles are really fast for n << 100.

top-k selection/merge

I have n sorted lists (5 < n < 300). These lists are quite long (300000+ tuples). Selecting the top k of the individual lists is of course trivial - they are right at the head of the lists.
Example for k = 2:
top2 (L1: [ 'a': 10, 'b': 4, 'c':3 ]) = ['a':10 'b':4]
top2 (L2: [ 'c': 5, 'b': 2, 'a':0 ]) = ['c':5 'b':2]
Where it gets more interesting is when I want the combined top k across all the sorted lists.
top2(L1+L2) = ['a':10, 'c':8]
Just combining the top k of the individual lists would not necessarily give the correct result:
top2(top2(L1)+top2(L2)) = ['a':10, 'b':6]
The goal is to reduce the required space and keep the sorted lists small.
top2(topX(L1)+topX(L2)) = ['a':10, 'c':8]
The question is whether there is an algorithm to calculate the combined top k, having the correct order, while cutting off the long tail of the lists at a certain position. And if there is: how does one find the limit X where it is safe to cut?
Note: Correct counts are not important. Only the order is.
top2(magic([L1,L2])) = ['a', 'c']
This algorithm uses O(U) memory, where U is the number of unique keys. I doubt a lower memory bound can be achieved, because it is impossible to tell which keys can be discarded until all the keys have been summed.
Make a master list of (key:total_count) tuples. Simply run through each list one item at a time, keeping a tally of how many times each key has been seen.
Use any top-k selection algorithm on the master list that does not use additional memory. One simple solution is to sort the list in place.
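A compact C# sketch of those two steps (here the top-k step simply sorts the tally with LINQ rather than selecting in place, so it is not the memory-minimal variant):

using System.Collections.Generic;
using System.Linq;

static class MasterList
{
    // Each inner list holds (key, count) tuples. Returns the top k keys by combined count.
    // Uses O(U) memory for the tally, where U is the number of unique keys.
    public static List<(string Key, long Total)> TopK(
        IEnumerable<IEnumerable<(string Key, long Count)>> lists, int k)
    {
        var totals = new Dictionary<string, long>();
        foreach (var list in lists)
            foreach (var (key, count) in list)
                totals[key] = totals.GetValueOrDefault(key) + count;   // tally every key

        return totals.OrderByDescending(kv => kv.Value)                // simple top-k by sorting
                     .Take(k)
                     .Select(kv => (kv.Key, kv.Value))
                     .ToList();
    }
}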
If I understand your question correctly, the correct output is the top 10 items, irrespective of the list from which each came. If that's correct, then starting with the first 10 items in each list will allow you to generate the correct output (if you only want unique items in the output, but the inputs might contain duplicates, then you need 10 unique items in each list).
In the most extreme case, all the top items come from one list, and all items from the other lists are ignored. In this case, having 10 items in the one list will be sufficient to produce the correct result.
Associate an index with each of your n lists. Set it to point to the first element in each case.
Create a list-of-lists, and sort it by the indexed elements.
The indexed item on the top list in your list-of-lists is your first element.
Increment the index for the topmost list and remove that list from the list-of-lists and re-insert it based on the new value of its indexed element.
The indexed item on the top list in your list-of-lists is your next element
Goto 4 and repeat until done.
You didn't specify how many lists you have. If n is small, then step 4 can be done very simply (just re-sort the lists). As n grows, you may want to think about more efficient ways to re-sort an almost-sorted list-of-lists.
I did not realize that if an 'a' appears in two lists, their counts must be combined. Here is a new, memory-efficient algorithm:
(New) Algorithm:
(Re-)sort each list by ID (not by count). To release memory, the list can be written back to disk. Only enough memory for the longest list is required.
Get the next lowest unprocessed ID and find the total count across all lists.
Insert the ID into a priority queue of k nodes. Use the total count as the node's priority (not the ID). This priority queue drops the lowest node if more than k nodes are inserted.
Go to step 2 until all ID's have been exhausted.
Analysis: This algorithm can be implemented using only O(k) additional memory to store the min-heap. It makes several trade-offs to accomplish this:
The lists are sorted by ID in place; the original orderings by counts are lost. Otherwise O(U) additional memory is required to make a master list with ID: total_count tuples where U is number of unique ID's.
The next lowest ID is found in O(n) time by checking the first tuple of each list. This is repeated U times where U is the number of unique ID's. This might be improved by using a min-heap to track the next lowest ID. This would require O(n) additional memory (and may not be faster in all cases).
Note: This algorithm assumes ID's can be quickly compared. String comparisons are not trivial. I suggest hashing string ID's to integers. They do not have to be unique hashes, but collisions must be checked so all ID's are properly sorted/compared. Of course, this would add to the memory/time complexity.
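A C# sketch of steps 1-4, assuming .NET 6's PriorityQueue and that each list has already been re-sorted by ID (ascending):

using System.Collections.Generic;

static class TopKMerge
{
    // lists: each already sorted by ID; entries are (Id, Count).
    // Returns the k IDs with the largest total counts, largest first.
    public static List<string> TopK(List<List<(string Id, long Count)>> lists, int k)
    {
        var positions = new int[lists.Count];          // read cursor per list
        var best = new PriorityQueue<string, long>();  // min-heap of at most k nodes

        while (true)
        {
            // Step 2: find the next lowest unprocessed ID across all list heads.
            string lowest = null;
            for (int i = 0; i < lists.Count; i++)
                if (positions[i] < lists[i].Count)
                {
                    var id = lists[i][positions[i]].Id;
                    if (lowest == null || string.CompareOrdinal(id, lowest) < 0)
                        lowest = id;
                }
            if (lowest == null) break;                 // step 4: every list exhausted

            // Sum this ID's count across all lists and advance those cursors.
            long total = 0;
            for (int i = 0; i < lists.Count; i++)
                if (positions[i] < lists[i].Count && lists[i][positions[i]].Id == lowest)
                {
                    total += lists[i][positions[i]].Count;
                    positions[i]++;
                }

            // Step 3: keep only the k largest totals.
            best.Enqueue(lowest, total);
            if (best.Count > k) best.Dequeue();        // drops the current minimum
        }

        var result = new List<string>();
        while (best.Count > 0) result.Add(best.Dequeue());
        result.Reverse();                              // largest total first
        return result;
    }
}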
The perfect solution requires all tuples to be inspected at least once.
However, it is possible to get close to the perfect solution without inspecting every tuple. Discarding the "long tail" introduces a margin of error. You can use some type of heuristic to calculate when the margin of error is acceptable.
For example, if there are n=100 sorted lists and you have inspected down each list until the count is 2, the most the total count for a key could increase by is 200.
I suggest taking an iterative approach:
Tally each list until a certain lower count threshold L is reached.
Lower L to include more tuples.
Add the new tuples to the counts tallied so far.
Go to step 2 until lowering L does not change the top k counts by more than a certain percentage.
This algorithm assumes the counts for the top k keys will approach a certain value the further the long tail is traversed. You can use other heuristics instead of the certain percentage, such as the number of new keys in the top k, how much the top k keys were shuffled, etc.
There is a sane way to implement this through mapreduce:
http://www.yourdailygeekery.com/2011/05/16/top-k-with-mapreduce.html
In general, I think you are in trouble. Imagine the following lists:
['a':100, 'b':99, ...]
['c':90, 'd':89, ..., 'b':2]
and you have k=1 (i.e. you want only the top one). 'b' is the right answer, but you need to look all the way down to the end of the second list to realize that 'b' beats 'a'.
Edit:
If you have the right distribution (long, low count tails), you might be able to do better. Let's keep with k=1 for now to make our lives easier.
The basic algorithm is to keep a hash map of the keys you've seen so far and their associated totals. Walk down the lists processing elements and updating your map.
The key observation is that a key can gain in count by at most the sum of the counts at the current processing point of each list (call that sum S). So on each step, you can prune from your hash map any keys whose total is more than S below your current maximum count element. (I'm not sure what data structure you would need to prune as you need to look up keys given a range of counts - maybe a priority queue?)
When your hash map has only one element in it, and its count is at least S, then you can stop processing the lists and return that element as the answer. If your count distribution plays nice, this early exit may actually trigger so you don't have to process all of the lists.

How do I pick the most beneficial combination of items from a set of items?

I'm designing a piece of a game where the AI needs to determine which combination of armor will give the best overall stat bonus to the character. Each character will have about 10 stats, of which only 3-4 are important, and of those important ones, a few will be more important than the others.
Armor will also give a boost to one or more stats. For example, a shirt might give +4 to the character's Int and +2 Stamina, while at the same time a pair of pants may have +7 Strength and nothing else.
So let's say that a character has a healthy choice of armor to use (5 pairs of pants, 5 pairs of gloves, etc.) We've designated that Int and Perception are the most important stats for this character. How could I write an algorithm that would determine which combination of armor and items would result in the highest of any given stat (say in this example Int and Perception)?
Targeting one statistic
This is pretty straightforward. First, a few assumptions:
You didn't mention this, but presumably one can only wear at most one kind of armor for a particular slot. That is, you can't wear two pairs of pants, or two shirts.
Presumably, also, the choice of one piece of gear does not affect or conflict with others (other than the constraint of not having more than one piece of clothing in the same slot). That is, if you wear pants, this in no way precludes you from wearing a shirt. But notice, more subtly, that we're assuming you don't get some sort of synergy effect from wearing two related items.
Suppose that you want to target statistic X. Then the algorithm is as follows:
Group all the items by slot.
Within each group, sort the potential items in that group by how much they boost X, in descending order.
Pick the first item in each group and wear it.
The set of items chosen is the optimal loadout.
Proof: The only way to get a higher X stat would be if there were an item A which provided more X than the item chosen from its group. But we already sorted all the items in each group in descending order, so there can be no such A.
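As a C# sketch (the Item record and the Boosts dictionary are just one possible representation; any mapping from item to per-stat bonuses will do):

using System.Collections.Generic;
using System.Linq;

record Item(string Name, string Slot, Dictionary<string, int> Boosts);

static class Loadout
{
    // Greedy per-slot selection for a single target statistic.
    public static List<Item> BestForStat(IEnumerable<Item> inventory, string stat)
    {
        return inventory
            .GroupBy(item => item.Slot)                                   // 1. group by slot
            .Select(slot => slot
                .OrderByDescending(i => i.Boosts.GetValueOrDefault(stat)) // 2. sort by boost to X
                .First())                                                 // 3. take the best in each slot
            .ToList();                                                    // 4. the optimal loadout
    }
}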
What happens if the assumptions are violated?
If assumption one isn't true -- that is, you can wear multiple items in each slot -- then instead of picking the first item from each group, pick the first Q(s) items from each group, where Q(s) is the number of items that can go in slot s.
If assumption two isn't true -- that is, items do affect each other -- then we don't have enough information to solve the problem. We'd need to know specifically how items can affect each other, or else be forced to try every possible combination of items through brute force and see which ones have the best overall results.
Targeting N statistics
If you want to target multiple stats at once, you need a way to tell "how good" something is. This is called a fitness function. You'll need to decide how important the N statistics are, relative to each other. For example, you might decide that every +1 to Perception is worth 10 points, while every +1 to Intelligence is only worth 6 points. You now have a way to evaluate the "goodness" of items relative to each other.
Once you have that, instead of optimizing for X, you instead optimize for F, the fitness function. The process is then the same as the above for one statistic.
If there is no restriction on the number of items by category, the following will work for multiple statistics and multiple items.
Data preparation:
Give each statistic (Int, Perception) a weight, according to how important you determine it is
Store this as a 1-D array statImportance
Give each item-statistic combination a value, according to how much said item boosts said statistic for the player
Store this as a 2-D array itemStatBoost
Algorithm:
In pseudocode. Here assume that itemScore is a sortable Map with Item as the key and a numeric value as the value, and values are initialised to 0.
Assume that the sort method is able to sort this Map by values (not keys).
//Score each item and rank them
for each statistic as S:
    for each item as I:
        score = itemScore.get(I) + (statImportance[S] * itemStatBoost[I, S])
        itemScore.put(I, score)
sort(itemScore)

//Decide which items to use
maxEquippableItems = 10   //use the appropriate value
selectedItems = new array[maxEquippableItems]
for 0 <= idx < maxEquippableItems:
    selectedItems[idx] = itemScore.getByIndex(idx)
