I was out buying groceries the other day and needed to search through my wallet to find my credit card, my customer rewards (loyalty) card, and my photo ID. My wallet has dozens of other cards in it (work ID, other credit cards, etc.), so it took me a while to find everything.
My wallet has six slots in it where I can put cards, with only the first card in each slot initially visible at any one time. If I want to find a specific card, I have to remember which slot it's in, then look at all the cards in that slot one at a time to find it. The closer it is to the front of a slot, the easier it is to find it.
It occurred to me that this is pretty much a data structures question. Suppose that you have a data structure consisting of k linked lists, each of which can store an arbitrary number of elements. You want to distribute elements into the linked lists in a way that minimizes looking up. You can use whatever system you want for distributing elements into the different lists, and can reorder lists whenever you'd like. Given this setup, is there an optimal way to order the lists, under any of the assumptions:
You are given the probabilities of accessing each element in advance and accesses are independent, or
You have no knowledge in advance what elements will be accessed when?
The informal system I use in my wallet is to "hash" cards into different slots based on use case (IDs, credit cards, loyalty cards, etc.), then keep elements within each slot roughly sorted by access frequency. However, maybe there's a better way to do this (for example, storing the k most frequently-used elements at the front of each slot regardless of their use case).
Is there a known system for solving this problem? Is this a well-known problem in data structures? If so, what's the optimal solution?
(In case this doesn't seem programming-related: I could imagine an application in which the user has several drop-down lists of commonly-used items, and wants to keep those items ordered in a way that minimizes the time required to find a particular item.)
Although not a full answer for general k, this 1985 paper by Sleator and Tarjan gives a helpful analysis of the amortised complexity of several dynamic list update algorithms for the case k=1. It turns out that move-to-front is very good: assuming fixed access probabilities for each item, it never requires more than twice the number of steps (moves and swaps) that would be required by the optimal (static) algorithm, in which all elements are listed in nonincreasing order of probability.
Interestingly, a couple of other plausible heuristics -- namely swapping with the previous element after finding the desired element, and maintaining order according to explicit frequency counts -- don't share this desirable property. OTOH, on p. 2 they mention that an earlier paper by Rivest showed that the expected amortised cost of any access under swap-with-previous is <= the corresponding cost under move-to-front.
I've only read the first few pages, but it looks relevant to me. Hope it helps!
You need to look at skip lists. There is a similar problem with arranging stations for a train system where there are express trains and regular trains. An express train stops only at express stations while regular trains stop at regular stations and express stations. Where should the express stops be placed so that one can minimize the average number of stops when travelling from a start station to any station.
The solution is to use stations at ternary numbers (i.e., at 1, 3, 6, 10 etc where T_n = n * (n + 1) / 2).
This is assuming all stops (or cards) are equally likely to be accessed.
If you know the access probabilities of your n cards in advance and you have k wallet slots and accesses are independent, isn't it fairly clear that the greedy solution is optimal? That is, the most frequently-accessed k cards go at the front of the pockets, next-most-frequently accessed k go immediately behind, and so forth? (You never want a lower-probability card ranked before a higher-probability card.)
If you don't know the access probabilities, but you do know they exist and that card accesses are independent, I imagine sorting the cards similarly, but by number-of-accesses-seen-so-far instead is asymptotically optimal. (Move-to-front is cool too, but I don't see an obvious reason to use it here.)
Perhaps you get something interesting if you penalise card moves as well; if I have any known probability distribution on card accesses, independent or not, I just greedily re-sort the cards every time I do an access.
Related
I'm making a matchmaking client that matches 10 people together into two teams:
Each person chooses four people they would like to play with, ranked from highest to lowest.
Two teams are then formed out of the strongest relationships in that set.
How would you create an algorithm that solves this problem?
Example:
Given players [a, b, c, d, e, f, g, h, i, j], '->' meaning a preference pick.
a -> b (weight: 4)
a -> c (weight: 3)
a -> d (weight: 2)
a -> e (weight: 1)
b -> d (weight: 4)
b -> h (weight: 3)
b -> a (weight: 2)
...and so on
This problem seemed simple on the surface (after all it is only just a matchmaking client), but after thinking about it for a while it seems that there needs to be quite a lot of relationships taken into account.
Edit (pasted from a comment):
Ideally, I would avoid a brute-force approach to scale to larger games which require 100 players and 25 teams, where picking your preferred teammates would be done through a search function. I understand that this system may not be the best for its purpose - however, it is an interesting problem and I would like to find an efficient solution while learning something along the way.
A disclaimer first.
If your user suggested this, there are two possibilities.
Either they can provide the exact details of the algorithm, so ask them.
Or they most probably don't know what they are talking about, and just generated a partial idea on the spot, in which case, it's sadly not worth much on average.
So, one option is to search how matchmaking works in other projects, disregarding the idea completely.
Another is to explore the user's idea.
Probably it won't turn into a good system, but there is a chance it will.
In any case, you will have to do some experiments yourself.
Now, to the case where you are going to have fun exploring the idea.
First, for separating ten items into two groups of five, there are just choose(10,5)=252 possibilities, so, unless the system has to do it millions of times per second, you can just calculate some score for all of them, and choose the best one.
The most straightforward way is perhaps to consider all 2^{10} = 1024 ways to form a subset of 10 elements, and then explore the ones where the size of the subset is 5.
But there may be better, more to-the-point, tools readily available, depending on the language or framework.
The 10-choose-5 combination is one group, the items not taken are the other group.
So, what would be the score of a combination?
Now we look at our preferences.
For each preference satisfied, we can add its weight, or its weight squared, or otherwise, to the score.
Which works best would sure need some experimentation.
Similarly, for each preference not satisfied, we can add a penalty depending on its weight.
Next, we can consider all players, and maybe add more penalty for each of the players which has none of their preferences satisfied.
Another thing to consider is team balance.
Since the only data so far are preferences (which may well turn out to be insufficient), an imbalance means that one team has many of their preferences satisfied, and the other has only few, if any at all.
So, we add yet another penalty depending on the absolute difference of (satisfaction sum of the first team) and (satisfaction sum of the second team).
Sure there can be other things to factor in...
Based on all this, construct a system which at least looks plausible on the surface, and then experiment and experiment again, tweaking it so that it better fits the matchmaking goals.
I would think of a way to score proposed teams against the selections from people, such as scoring proposed teams against the weights.
I would try and optimise this by hill-climbing (e.g. swapping a pair of people and looking to see if that improves the score) if only because people could look at the final solution and try this themselves - so you don't want to miss improvements of this sort.
I would hill-climb multiple times, from different starting points, and pick the answer found with the best score, because hill-climbing will probably end at local optima, not global optima.
At least some of the starting points should be based on people's original selections. This would be easiest if you got people's selections to amount to an entire team's worth of choices, but you can probably build up a team from multiple suggestions if you say that you will follow person A's suggestions, and then person B's selection if needed, and then person C's selection if needed, and so on.
If you include as starting points everybody's selections, or selections based on priority ABCDE.. and then priority BCDE... and then priority CDEF... then you have the property that if anybody submits a perfect selection your algorithm will recognise it as such.
If your hill-climbing algorithm tries swapping all pairs of players to improve, and continues until it finds a local optimum and then stops, then you also have the property that if anybody submits a selection which is only one swap away from perfection, your algorithm will recognise it as such.
I have a homework question in my algorithms class that asks the following:
You have a game board and a path to the end. You move one step at a
time. At each 'position' you step to there is a stack of cards (a
subset of the standard 52 card deck). There could be 1 card, 2 cards,
3 cards, etc. No duplicates, and there is at least one card.
The purpose of the game is to pick a card at each position. You
cannot select the same card twice. By the time you reach the end,
you want the total face value of the cards to be minimal.
Devise an algorithm that, given how many positions there are, and what set of cards are at each position, find the minimal combination of cards to pick up.
I don't really know where to start. I could do an exhaustive search but I fear that would not be efficient enough. I know that it's not as simple as just picking the smallest card at each position. Since you cannot pick the same card twice, you might encounter a situation where it's optimal to pick a slightly higher value card initially, then the much cheaper one at a later stage. I considered creating a 'decision tree', but that wouldn't help with time complexity either.
Use backtracking to find all possible paths, and then choose the path that yields the minimum value.
You can pre-process the data with two (possibly more) rules:
If one stack of cards in the path has a card that no other stack has, you can reduce the cards in that stack to the unique card.
If any stack in the path has only one card, you can remove all similar cards from other stacks.
You're still looking at a worst case of roughly 52! if the path is very long.
After looking into this for quite a while, I have not found an algorithm that has a worst case time complexity that is better than O(s!) where s is the number of stacks. For the pure theorist the problem can be solved in constant time as the number of cards and stacks has an upper limit. It is just a very ... big constant. The big-O notation only makes sense if your input size has no upper limit (we could achieve that requirement if we would allow the use of a custom card set, where the face value could go from 1..n, where n is variable).
Still I would like to pass on some thoughts that can help write an algorithm that performs well in many (but not all) configurations:
Definitions
A card's id is the combination of rank (1..13) and suit (clubs, hearts, diamonds, spades). Rank 0 = Ace of clubs, rank 1 = Ace of Hearts, ... rank 4 = Two of clubs, ... rank 7 = Two of Spades, ... rank 51 = King of Spades, the last one.
s is the number of stacks
ni is the number of cards on stack number i.
Algorithm
Mark all cards with the number of the stack they are in;
Sort the cards on each stack by increasing id (so with the lowest id at the top of the stack), and also create a map (by id) so the position of a cards in the sorted stack can be quickly found by its id.
Create a another list which has all cards together, also sorted by increasing id. Also create buckets by id for this overall sorted list, to allow direct lookup by id. As cards are not unique across stacks there can be duplicate id values.
Clean-up 1: For every stack i with ni > s: remove ni - s cards from the bottom of that stack so that the resulting stack size is at most s. This is because there is no way such a card could be part of an optimal solution. There are at most s-1 duplicates in that stack, so the first non-duplicate card will be at position <=s in the sort order of that stack. Whichever way you pick cards from the other stacks, at least one of the first s cards can be picked from this stack. There is no reason to take a card further towards the bottom of that stack. So they can be removed. Make sure to replicate the removal of cards in the overall list of step 3. After this step there are at most s² cards in the configuration.
Clean-up 2: Go through the sorted list of all cards: whenever a unique id X is found (i.e. it is alone in its bucket), get the stack number of that card X, and remove from that stack any card below card X. There is no way that these removed cards contribute to an optimal solution. Imagine such a card would be picked: it could be replaced by the unique card X and provide an equal or better solution (i.e. lower sum of ranks). Make sure to replicate the removal of cards in the overall list of step 3. This step could make some cards unique that were not unique before. As this step progresses through all the cards in order, these newly-become-unique cards will be treated in the same way. Once this step is completed we have stacks that have all their cards being duplicate with a card in another stack, except for maybe one card, which then is the card at the bottom of the stack.
Now we can pick a card.
If there is a stack with no card then this branch in the search tree offered no solution. Backtrack as we should have picked a card from this stack earlier on.
If there is a stack with just one card, pick that card.
Otherwise: It is clear that picking a card with minimal id (i.e. at the start of the overall sorted list), will lead to an optimal solution. Imagine you would create a solution without picking a card with this minimal id, then you would have taken another card from the stack(s) which have these minimal id cards. But then you could simply improve the solution by replacing one of these picked cards by one of the minimal id cards (in the same stack). If there is only one such minimal id card, take that one. Otherwise the algorithm has to branch of in as many branches as there are duplicates of this minimal id card. This represents a node in the typical search tree this algorithm has to walk through.
As you pick a card, remove all other cards of that stack (applying the change to the overal list as well), and the stack itself, thereby reducing the value s with 1.
If at this point s is 0, then we have a "solution", but maybe not the best one. If it is better than the one we had so far, register this as the best solution. Backtrack in order to visit other branches in the search tree, which might still have better solutions. If, on the other hand, s > 0, repeat from step 4 onwards (with this decreased s).
Note that when backtracking (in steps 6 or 7), you need also to restore the data structure (removed cards should be added again to the stacks and in the overall list in their sorted position).
The above algorithm still performs bad when you have a lot of duplicates.
Prune branches in search tree
If you would have a way to find a lower-bound of the value-sum you can potentially reach in the current branch of the search tree, then you could benefit from this knowledge and sometimes backtrack at an earlier stage (i.e. "prune" a branch): as soon as this lower bound is equal or higher than the best solution found so far, there is no use in continuing the search in that branch; it can never lead to a better solution. The better (i.e. higher) you can set this lower bound, the better performance the algorithm will have.
Here are a few ideas for calculating a lower bound for the sum:
First of all, this lower-bound sum would obviously include the sum of the card ranks that have already been taken at this point in the search. To this the minimum values should be added of the cards that still need to be picked. I suggest two ways to do that:
Add the ranks of the cards at the tops of the remaining stacks. Some of these values could belong to duplicate cards, and so they might not actually contribute to the solution, but a true solution would have values which are equal or greater than these values. So this sum will represent a lower bound.
Alternatively, you could identify the s lowest and distinct id values that are still available in the overall sorted list, and take the sum of those. Any lower sum than that would have to use duplicates, which is not allowed, so this represents a lower bound. On the other hand an actual solution might have a greater sum, because this lower bound might have counted values which were on the same stack, which is not allowed either.
The calculation should be done incrementally, i.e. one should avoid to have to calculate it from scratch again and again. Instead, as each step in the algorithm is performed and cards are removed/picked, this lower-bound should be adapted accordingly, which will be more efficient.
A smarter combination of the two above methods could be used. However, the smarter you make it, the more time it will take to calculate it, and the algorithm might or might not get a performance improvement because of it.
So far my conclusions. The most difficult configurations have many duplicates. For instance a configuration with 52 stacks, where all stacks have almost all cards (just a few cards taken out here and there), will be difficult to solve quickly.
I am required to solve a specific problem.
I'm given a representation of a social network.
Each node is a person, each edge is a connection between two persons. The graph is undirected (as you would expect).
Each person has a personal "affinity" for buying a product (to simplify things, let's say there's only one product involved in this whole problem).
In each "step" in time, each person, independently, chooses whether to buy the product or not.
There's probability invovled here. A few parameters are taken into account:
His personal affinity for the product,
The percentage of his friends that already bought the product
The gain for a person buying the product is 1 dollar.
The problem is to point out X persons (let's say, 5 persons) that will receive the product in step 0, and will maximize the total expected value of the gain after Y steps (let's say, 10 steps)
The network is very large. It's not possible to simulate all the options in a naive way.
What tool / library / algorithm should I be using?
Thank you.
P.S.
When investigating this matter in google and wikipedia, a few terms kept popping up:
Dynamic network analysis
Epidemic model
but it didn't help me to find an answer
Generally, people who have the most neighbours have the most influence when they buy something.
So my heuristic would be to order people first by the number of neighbours they have (in decreasing order), then by the number of neighbours that each of those neighbours has (in order from highest to lowest), and so on. You will need at most Y levels of neighbour counts, though fewer may suffice in practice. Then simply take the first X people on this list.
This is only a heuristic, because e.g. if a person has many neighbours but most or all of them are likely to have already bought the product through other connections, then it may give a higher expectation to select a different person having fewer neighbours, but whose neighbours are less likely to already own the product.
You do not need to construct the entire list and then sort it; you can construct the list and then insert each item into a heap, and then just extract the highest-scoring X people. This will be much faster if X is small.
If X and Y are as low as you suggest then this calculation will be pretty fast, so it would be worth doing repeated runs in which instead of starting with the first X people owning the product, for each run you randomly select the initial X owners according to a probability that depends on their position in the list (the further down the list, the lower the probability).
Check out the concept of submodularity, a pretty powerful mathematical concept. In particular, check out slide 19, where submodularity is used to answer the question "Given a social graph, who should get free cell phones?". If you have access, also read the corresponding paper. That should get you started.
Ok this is an abstract algorithmic challenge and it will remain abstract since it is a top secret where I am going to use it.
Suppose we have a set of objects O = {o_1, ..., o_N} and a symmetric similarity matrix S where s_ij is the pairwise correlation of objects o_i and o_j.
Assume also that we have an one-dimensional space with discrete positions where objects may be put (like having N boxes in a row or chairs for people).
Having a certain placement, we may measure the cost of moving from the position of one object to that of another object as the number of boxes we need to pass by until we reach our target multiplied with their pairwise object similarity. Moving from a position to the box right after or before that position has zero cost.
Imagine an example where for three objects we have the following similarity matrix:
1.0 0.5 0.8
S = 0.5 1.0 0.1
0.8 0.1 1.0
Then, the best ordering of objects in the tree boxes is obviously:
[o_3] [o_1] [o_2]
The cost of this ordering is the sum of costs (counting boxes) for moving from one object to all others. So here we have cost only for the distance between o_2 and o_3 equal to 1box * 0.1sim = 0.1, the same as:
[o_3] [o_1] [o_2]
On the other hand:
[o_1] [o_2] [o_3]
would have cost = cost(o_1-->o_3) = 1box * 0.8sim = 0.8.
The target is to determine a placement of the N objects in the available positions in a way that we minimize the above mentioned overall cost for all possible pairs of objects!
An analogue is to imagine that we have a table and chairs side by side in one row only (like the boxes) and you need to put N people to sit on the chairs. Now those ppl have some relations that is -lets say- how probable is one of them to want to speak to another. This is to stand up pass by a number of chairs and speak to the guy there. When the people sit on two successive chairs then they don't need to move in order to talk to each other.
So how can we put those ppl down so that every distance-cost between two ppl are minimized. This means that during the night the overall number of distances walked by the guests are close to minimum.
Greedy search is... ok forget it!
I am interested in hearing if there is a standard formulation of such problem for which I could find some literature, and also different searching approaches (e.g. dynamic programming, tabu search, simulated annealing etc from combinatorial optimization field).
Looking forward to hear your ideas.
PS. My question has something in common with this thread Algorithm for ordering a list of Objects, but I think here it is better posed as problem and probably slightly different.
That sounds like an instance of the Quadratic Assignment Problem. The speciality is due to the fact that the locations are placed on one line only, but I don't think this will make it easier to solve. The QAP in general is NP hard. Unless I misinterpreted your problem you can't find an optimal algorithm that solves the problem in polynomial time without proving P=NP at the same time.
If the instances are small you can use exact methods such as branch and bound. You can also use tabu search or other metaheuristics if the problem is more difficult. We have an implementation of the QAP and some metaheuristics in HeuristicLab. You can configure the problem in the GUI, just paste the similarity and the distance matrix into the appropriate parameters. Try starting with the robust Taboo Search. It's an older, but still quite well working algorithm. Taillard also has the C code for it on his website if you want to implement it for yourself. Our implementation is based on that code.
There has been a lot of publications done on the QAP. More modern algorithms combine genetic search abilities with local search heuristics (e. g. Genetic Local Search from Stützle IIRC).
Here's a variation of the already posted method. I don't think this one is optimal, but it may be a start.
Create a list of all the pairs in descending cost order.
While list not empty:
Pop the head item from the list.
If neither element is in an existing group, create a new group containing
the pair.
If one element is in an existing group, add the other element to whichever
end puts it closer to the group member.
If both elements are in existing groups, combine them so as to minimize
the distance between the pair.
Group combining may require reversal of order in a group, and the data structure should
be designed to support that.
Let me help the thread (of my own) with a simplistic ordering approach.
1. Order the upper half of the similarity matrix.
2. Start with the pair of objects having the highest similarity weight and place them in the center positions.
3. The next object may be put on the left or the right side of them. So each time you may select the object that when put to left or right
has the highest cost to the pre-placed objects. Goto Step 2.
The selection of Step 3 is because if you left this object and place it later this cost will be again the greatest of the remaining, and even more (farther to the pre-placed objects). So the costly placements should be done as earlier as it can be.
This is too simple and of course does not discover a good solution.
Another approach is to
1. start with a complete ordering generated somehow (random or from another algorithm)
2. try to improve it using "swaps" of object pairs.
I believe local minima would be a huge deterrent.
Problem description
There are different categories which contain an arbitrary amount of elements.
There are three different attributes A, B and C. Each element does have an other distribution of these attributes. This distribution is expressed through a positive integer value. For example, element 1 has the attributes A: 42 B: 1337 C: 18. The sum of these attributes is not consistent over the elements. Some elements have more than others.
Now the problem:
We want to choose exactly one element from each category so that
We hit a certain threshold on attributes A and B (going over it is also possible, but not necessary)
while getting a maximum amount of C.
Example: we want to hit at least 80 A and 150 B in sum over all chosen elements and want as many C as possible.
I've thought about this problem and cannot imagine an efficient solution. The sample sizes are about 15 categories from which each contains up to ~30 elements, so bruteforcing doesn't seem to be very effective since there are potentially 30^15 possibilities.
My model is that I think of it as a tree with depth number of categories. Each depth level represents a category and gives us the choice of choosing an element out of this category. When passing over a node, we add the attributes of the represented element to our sum which we want to optimize.
If we hit the same attribute combination multiple times on the same level, we merge them so that we can stripe away the multiple computation of already computed values. If we reach a level where one path has less value in all three attributes, we don't follow it anymore from there.
However, in the worst case this tree still has ~30^15 nodes in it.
Does anybody of you can think of an algorithm which may aid me to solve this problem? Or could you explain why you think that there doesn't exist an algorithm for this?
This question is very similar to a variation of the knapsack problem. I would start by looking at solutions for this problem and see how well you can apply it to your stated problem.
My first inclination to is try branch-and-bound. You can do it breadth-first or depth-first, and I prefer depth-first because I think it's cleaner.
To express it simply, you have a tree-walk procedure walk that can enumerate all possibilities (maybe it just has a 5-level nested loop). It is augmented with two things:
At every step of the way, it keeps track of the cost at that point, where the cost can only increase. (If the cost can also decrease, it becomes more like a minimax game tree search.)
The procedure has an argument budget, and it does not search any branches where the cost can exceed the budget.
Then you have an outer loop:
for (budget = 0; budget < ... ; budget++){
walk(budget);
// if walk finds a solution within the budget, halt
}
The amount of time it takes is exponential in the budget, so easier cases will take less time. The fact that you are re-doing the search doesn't matter much because each level of the budget takes as much or more time than all the previous levels combined.
Combine this with some sort of heuristic about the order in which you consider branches, and it may give you a workable solution for typical problems you give it.
IF that doesn't work, you can fall back on basic heuristic programming. That is, do some cases by hand, and pay attention to how you did it. Then program it the same way.
I hope that helps.