Find "minimum branch" of a tree - solution to warehouse problem - algorithm

I have a warehouse in which I keep goods. Each article occupies given volume of warehouse space and costs give amount of dollars.
Now, because warehouse is full and I'm expecting new delivery, I have to free some space - no less than the new delivery will occupy, but I also have to minimize my loses. In other words, I have to empty at least X cubic meters of warehouse space by throwing out some articles, making sure that value of those articles is the minimum value possible.
Example:
If X=10m3, then I prefer to throw out item that occupies 20m3 and is worth 1000$ than item that occupies exactly 10m3 but is worth 2000$.
Of course in the real calculation it will be necessary to throw out more than one item, probably 5, 10 or maybe even 20.
What I'm thinking of is to represent the above problem as a tree, with nodes representing volume of emptied space and edges representing lost value, then searching for nodes that have value greater or equal to X and then choosing the node that has the lowest lost value along the edges from tree root to that node.
What I'm not sure if this is good approach because for e.g. 100 items in warehouse full tree would have 100 nodes on first depth level, 100*99=9900 nodes on second level etc, so this quickly starts to be prohitibive.
Is there better approach, or maybe even some proven algorithm (that I'm not aware of) for such kind of problems?

This is known as the Knapsack problem.
Change your problem a little to make it compliant with the knapsack: list all the items you have in the warehouse + all the items you wish to add. Search for the best bang for the buck.
Well, in this case it means you would not try to clear space in your warehouse for junk, which I don't think is too stupid.
Ah, in case you didn't follow the link, it NP-complete, so you're in for a word of hurt if you want the best solution :)

Related

Dual knapsack algorithm

Say you have a warehouse with fragile goods (f.e. vegetables or fruits), and you can only take out a container with vegetables once. If you move them twice, they'll rot too fast and cant be sold anymore.
So if you give a value to every container of vegetables (depending on how long they'll still be fresh), you want to sell the lowest value first. And when a client asks a certain weight, you want to deliver a good service, and give the exact weight (so you need to take some extra out of your warehouse, and throw the extra bit away after selling).
I don't know if this problem has a name, but I would consider this the dual form of the knapsack problem. In the knapsack problem, you want to maximise the value and limit the weight to a maximum. While here you want to minimise the value and limit the weight to a minimum.
You can easily see this duality by treating the warehouse as the knapsack, and optimising the warehouse for the maximum value and limited weight to a maximum of the current weight minus what the client asks.
However, many practical algorithms on solving the knapsack problem rely on the assumption that the weight you can carry is small compared to the total weight you can chose from. F.e. the dynamic programming 0/1 solution relies on looping until you reach the maximum weight, and the FPTAS solution guarantees to be correct within a factor of (1-e) of the total weight (but a small factor of a huge value can still make a pretty big difference).
So both have issues when the wanted weight is big.
As such, I wondered if anyone studied the "dual knapsack problem" already (if some literature can be found around it), or if there's some easy modification to the existing algorithms that I'm missing.
The usual pseudopolynomial DP algorithm for solving knapsack asks, for each i and w, "What is the largest total value I can get from the first i items if I use at most w capacity?"
You can instead ask, for each i and w, "What is the smallest total value I can get from the first i items if I use at least w capacity?" The logic is almost identical, except that the direction of the comparison is reversed, and you need a special value to record the possibility that even taking all i of the first i items cannot reach w capacity -- infinity works for this, since you want this value to lose against any finite value when they are compared with min().

sorting algorithm to keep sort position numbers updated

Every once in a while I must deal with a list of elements that the user can sort manually.
In most cases I try to rely on a model using an order sensitive container, however this is not always possible and resort to adding a position field to my data. This position field is a double type, therefore I can always calculate a position between two numbers. However this is not ideal, because I am concerned about reaching an edge case where I do not have enough numerical precision to continue inserting between two numbers.
I am having doubts about the best approach to maintain my position numbers. The first thought is traversing all the rows and give them a round number after every insertion, like:
Right after dropping a row between 2 and 3:
1 2 2.5 3 4 5
After position numbers update:
1 2 3 4 5 6
That of course, might get heavy if I have a high number of entries. Not specially in memory, but to store all new values back to the disk/database. I usually work with some type of ORM and mobile software. Updating all the codes will pull out of disk every object and will set them as dirty, leading to a re-verification of all the related validation rules of my data model.
I could also wait until the precision is not enough to calculate a number between two positions. However the user experience would be bad, since the same operation will no longer require the same amount of time.
I believe that there is an standard algorithm for these cases that regularly and consistently keep the position numbers updated, or just some of them. Ideally it should be O(log n), with no big time differences between the worst and best cases.
Being honest I also think that anything that must be user/sorted, cannot grow as large as to become a real problem in its worst case. The edge case seems also to be extremely rare, even more if I search a solution pushing the border numbers. However I still believe that there is an standard well known solution for this problem which I am not aware of, and I would like to learn about it.
Second try.
Consider the full range of position values, say 0 -> 1000
The first item we insert should have a position of 500. Our list is now :
(0) -> 500 -> (1000).
If you insert another item at first position, we end up with :
(0) -> 250 -> 500 -> (1000).
If we keep inserting items at first position, we gonna have a problem, as our ranges are not equally balanced and... Wait... balanced ? Doesn't it sounds like a binary tree problem !?
Basically, you store your list as a binary tree. When inserting a node, you assign it a position according to surrounding nodes. When your tree become unbalanced, you rotate nodes to make it balanced again and you recompute position for rotated nodes !
So :
Most of the time, adding a node will not require to change position of other nodes.
When balancing is required, only a subset of your items will be changed.
It's O(log n) !
EDIT
If the user is actually sorting the list manually, then is there really any need to worry about taking O(n) to record the new order? It's O(n) in any case just to display the list to the user.
This not really answers the question but...
As you talked about "adding a position field to your data", I suppose that your data store is a relational database and that your data has some kind of identifier.
So maybe you can implement a doubly linked list by adding a previous_data_id and next_data_id to your data. Insert/move/remove operations thus are O(1).
Loading such a collection from a database is rather easy:
Fetch each item and add them to a map with their id as key.
For each item connect it with its previous and next item.
Starting with the first item (previous_data_id is undefined) follow the chain and add them to a list.
After some days with no valid answer. This is my theory:
The real challenge here is a practical solution. Maybe there is a mathematical correct solution, but every day that goes by, it seems that the implementation would be of a great complexity. A good solution should not only be mathematically correct, but also balanced with the nature the problem, the low chances to meet it, and its minor implications. Like how useless it could be killing flies with bullets, although extremely effective.
I am starting to believe that a good answer could be: to the hell with the right solution, leave it like one line calculation and live with the rare case where sorting of two elements might fail. It is not worth to increase complexity and invest time or money in such nity-picky problem, so rare, that causes no data damage, just a temporal UX glitch.

How to solve this linear programing problem?

I'm not so good at linear programing so I'm posting this problem here.
Hope somebody can point me out to the right direction.
It is not homework problem so don't misunderstand.
I have a matrix 5x5 (25 nodes). Distance between each node and its adjacent nodes (or neighbor nodes) is 1 unit. A node can be in 1 of 2 conditions: cache or access. If a node 'i' is a cache node, an access nodes 'j' can be able to access it with a cost of Dij x Aij (Access Cost). Dij is Manhattan distance between node i and j. Aij is access frequency from node i to j.
In order to become a cache node i, it needs to cache from an existing cache node k with a cost of Dik x C where C is a Integer constant. (Cache Cost) . C is called caching frequency.
A is provided as an 25x25 matrix containing all integers that shows access frequency between any pair of node i and j. D is provided as an 25x25 matrix containing all Manhattan distances between any pair of node i and j.
Assume there is 1 cache node in the matrix, find out the set of other cache nodes and access nodes such that the total cost will be minimized.
Total Cost = Total Cache Cost + Total Access Cost .
I've tackled a few problems that are something like this.
First, if you don't need an exact answer, I'd generally suggest looking into something like a genetic algorithm, or doing a greedy algorithm. It won't be right, but it won't generally be bad either. And it will be much faster than an exact algorithm. For instance you can start with all points as cache points, then find the point which reduces your cost most from making it a non-caching point. Continue until removing the next one makes the cost goes up, and use that as your solution. This won't be best. It will generally be reasonably good.
If you do need an exact answer, you will need to brute force search of a lot of data. Assuming that the initial cache point is specified, you'll have 224 = 16,777,216 possible sets of cache points to search. That is expensive.
The trick to doing it more cheaply (note, not cheaply, just more cheaply) is finding ways to prune your search. Take to heart the fact that if doing 100 times as much work on each set you look at lets you remove an average of 10 points from consideration as cache points, then your overall algorithm will visit 0.1% as many sets, and your code will run 10 times faster. Therefore it is worth putting a surprising amount of energy into pruning early and often, even if the pruning step is fairly expensive.
Often you want multiple pruning strategies. One of them is usually "the best we can do from here is worst than the best we have found previously." This works better if you've already found a pretty good best solution. Therefore it is often worth a bit of effort to do some local optimization in your search for solutions.
Typically these optimizations won't change the fact that you are doing a tremendous amount of work. But they do let you do orders of magnitude less work.
My initial try at this would take advantage of the following observations.
Suppose that x is a cache point, and y is its nearest caching neighbor. Then you can always make some path from x to y cache "for free" if you just route the cache update traffic from x to y along that path. Therefore without loss of generality the set of cache points is connected on the grid.
If the minimum cost would could wind up with exceeds the current best cost we have found, we are not on our way to a global solution.
As soon as the sum of the access rate from all points at distance greater than 1 from the cache points plus the highest access frequency of a neighbor to the cache point that you can still use is less than the cache frequency, adding more cache points is always going to be a loss. (This would be an "expensive condition that lets us stop 10 minutes early.")
The highest access neighbor of the current set of cache points is a reasonable candidate for the next cache point to try. (There are several other heuristics you can try, but this one is reasonable.)
Any point whose total access frequency exceeds the cache frequency absolutely must be a caching point.
This might not be the best set of observations to use. But it is likely to be pretty reasonable. To take advantage of this you'll need at least one data structure you might not be familiar with. If you don't know what a priority queue is, then look around for an efficient one in your language of choice. If you can't find one, a heap is pretty easy to implement and works pretty well as a priority queue.
With that in mind, assuming that you have been given the information you've described and an initial cache node P, here is pseudo-code for an algorithm to find the best.
# Data structures to be dynamically maintained:
# AT[x, n] - how many accesses x needs that currently need to go distance n.
# D[x] - The distance from x to the nearest cache node.
# CA[x] - Boolean yes/no for whether x is a cache node.
# B[x] - Boolean yes/no for whether x is blocked from being a cache node.
# cost - Current cost
# distant_accesses - The sum of the total number of accesses made from more than
# distance 1 from the cache nodes.
# best_possible_cost - C * nodes_in_cache + sum(min(total accesses, C) for non-cache nodes)
# *** Sufficient data structures to be able to unwind changes to all of the above before
# returning from recursive calls (I won't specify accesses to them, but they need to
# be there)
# best_cost - The best cost found.
# best_solution - The best solution found.
initialize all of those data structures (including best)
create neighbors priority queue of neighbors of root cache node (ordered by accesses)
call extend_current_solution(neighbors)
do what we want with the best solution
function extend_current_solution (available_neighbors):
if cost < best_cost:
best_cost = cost
best_solution = CA # current set of cache nodes.
if best_cost < best_possible_cost
return # pruning time
neighbors = clone(available_neighbors)
while neighbors:
node = remove best from neighbors
if distant_accesses + accesses(node) < C:
return # this is condition 3 above
make node in cache set
- add it to CA
- update costs
- add its immediate neighbors to neighbors
call extend_current_solution
unwind changes just made
make node in blocked set
call extend_current_solution
unwind changes to blocked set
return
It will take a lot of work to write this, and you'll need to be careful to maintain all data structures. But my bet is that - despite how heavyweight it looks - you'll find that it prunes your search space enough to run more quickly than your existing solution. (It still won't be snappy.)
Good luck!
Update
When I thought about this more, I realized that a better observation is to note that if you can cut the "not a cache node, not a blocked node" set into two pieces, then you can solve those pieces independently. Each of those sub problems is orders of magnitude faster to solve than the whole problem, so seek to do so as fast as possible.
A good heuristic to do that is to follow the following rules:
While no edge has been reached:
Drive towards the closest edge. Distance is measured by how short the shortest path is along the non-cache, non-blocked set.
If two edges are equidistant, break ties according to the following preference order: (1, x), (x, 1), (5, x), (x, 5).
Break any remaining ties according to preferring to drive towards the center of an edge.
Break any remaining ties randomly.
While an edge has been reached and your component still has edges that could become cache pieces:
If you can immediately move into an edge and split the edge pieces into two components, do so. Both for "edge in cache set" and "edge not in cache set" you'll get 2 independent subproblems that are more tractable.
Else move on a shortest path towards the piece in the middle of your section of edge pieces.
If there is a tie, break it in favor of whatever makes the line from the added piece to the added cache element as close to diagonal as possible.
Break any remaining ties randomly.
If you fall through here, choose randomly. (You should have a pretty small subproblem at this point. No need to be clever.)
If you try this starting out with (3, 3) as a cache point, you'll find that in the first few decisions you'll find that 7/16 of the time you manage to cut into two even problems, 1/16 of the time you block in the cache point and finish, 1/4 of the time you manage to cut out a 2x2 block into a separate problem (making the overall solution run 16 times faster for that piece) and 1/4 of the time you wind up well on your way towards a solution that is on its way towards either being boxed in (and quickly exhausted), or else being a candidate for a solution with a lot of cache points that gets pruned for being on track to being a bad solution.
I won't give pseudo-code for this variation. It will have a lot of similarities to what I had above, with a number of important details to handle. But I would be willing to bet money that in practice it will run orders of magnitude faster than your original solution.
The solution is a set, so this is not a linear programming problem. What it is is a special case of connected facility location. Bardossy and Raghavan have a heuristic that looks promising: http://terpconnect.umd.edu/~raghavan/preprints/confl.pdf
is spiral cache an analogy to the solution? http://strumpen.net/ibm-rc24767.pdf

Calculate shortest path through a grocery store

I'm trying to find a way to find the shortest path through a grocery store, visiting a list of locations (shopping list). The path should start at a specified start position and can end at multiple end positions (there are multiple checkout counters).
Also, I have some predefined constraints on the path, such as "item x on the shopping list needs to be the last, second last, or third last item on the path". There is a function that will return true or false for a given path.
Finally, this needs to be calculated with limited CPU power (on a smartphone) and within a second or so. If this isn't possible, then an approximation to the optimal path is also ok.
Is this possible? So far I think I need to start by calculating the distance between every item on the list using something like A* or Dijkstra's. After that, should I treat it like the traveling salesman problem? Because in my problem there is a specified start node, specified end nodes, and some constraints, which are not in the traveling salesman problem.
It seems like viewing it as a TSP problem makes it more difficult. Someone pointed out that grocery stories aren't that complicated. In the grocery stores I am familiar with (in the US), there aren't that many reasonable routes. Especially if you have a given starting point. I think a well thought-out heuristic will probably do the trick.
For example: I typically start at one end-- if it's a big trip I make sure I go through the frozen foods last, but it often doesn't matter and I'll start closest to where I enter the store. I generally walk around the outside, only going down individual aisles if I need something in that one. Once you go into an aisle, pick up everything in that one. With some aisles its better to drop into one end, grab the item, and go back to your starting point, and others you just commit to the whole aisle-- it's a function of the last item you need in that aisle and where you need to be next-- how to get out of the aisle depends on the next item needed-- it may or may not involve a backtrack-- but the computer can easily calculate the shortest path to the next items.
So I agree with the helpful hints of the other problems above, but maybe a less general algorithm will work. And will probably work better with limited resources. TSP tells us, however, you can't prove that it's the optimal approach but I suspect that's not really necessary...
Well, basically it is a traveling salesman problem, but with your requirements you limit the search space pretty well. If there are not too many locations, you could try to calculate all possible options, but if that is not feasible, you need to limit the search space even more.
You can make the search time limited so that you will return the shortest path you can find in 1 second, but that might not be the shortest of them all, but pretty short anyway.
You could also apply Greedy algorithm to find the next closest location, then using Backtracking select the next closest location etc.
Pseudo code for possible solution:
def find_shortest_path(current_path, best_path):
if time_limit()
return
if len(current_path) == NUM_LOCATIONS:
# destionation reached
if calc_len(current_path) < calc_len(best_path):
best_path = current_path
return
# You need to sort the possible locations well to maximize your chances of finding the
# the shortest solution
sort(possible_locations)
for location in possible_locations:
find_shortest_path(current_path + location, best_path)
Well, you could limit the search-space using information about the layout of the store. For instance, a typical store (at least here in Germany) has many shelves in it that can be considered to form lanes. In between there are orthogonal lanes that connect the shelf-lanes. Now you define the crossings to be the nodes and the lanes to be edges in a graph. The edges are labeled with all the items in the shelves of that section of lane. Now, even for a big store this graph would be pretty small. You would now have to find the shortest path that includes all the edge-labels (items) you need. This should be possible by using the greedy/backtracking approach Tuomas Pelkonen suggested.
This is just an idea, and I do not know if it really works, but maybe you can take it from here.
Only breadth first searching will make sure you don't "miss" a path through the store which is better than you're current "best" solution, but you need not search every node in the path. Nodes which are "obviously" longer than the current "best" solution may be expanded later.
This means you approach the problem like a "breath first" search, but alter the expansion of your nodes based on the current distance travelled. Some branches of the search tree will expand faster than others, because the manage to visit more nodes in the same amount of time.
So if node expansion is not truly breath-first, why do I keep using that word? Because after you do find a solution, you must still expand the current set of "considered nodes" until each one of those search paths exceed the solution. This avoids missing a path which has a lot of time consuming initial legs, but finishes faster than the current solution because the last step is lighting fast.
The requirement of a start node is fictitious. Using the TSP you'll end up with a tour where you can chose the start node that you want without altering the solution cost.
It's somewhat trickier when it comes to the counters: what you need is to solve the problem on a directed graph with some arcs missing (or, which is the same, where some arcs have a really high cost).
Starting with the complete directed graph you should modify the costs of the proper arcs in order to:
deny going from the items to the start node
deny going from the counters to the items
deny going from the start node to the counters
allow going from the counters to the start node at zero cost (this are only needed to close the path)
after having drawn the thing down, tell me if I missed something :)
HTH

Looking for a multidimensional optimization algorithm

Problem description
There are different categories which contain an arbitrary amount of elements.
There are three different attributes A, B and C. Each element does have an other distribution of these attributes. This distribution is expressed through a positive integer value. For example, element 1 has the attributes A: 42 B: 1337 C: 18. The sum of these attributes is not consistent over the elements. Some elements have more than others.
Now the problem:
We want to choose exactly one element from each category so that
We hit a certain threshold on attributes A and B (going over it is also possible, but not necessary)
while getting a maximum amount of C.
Example: we want to hit at least 80 A and 150 B in sum over all chosen elements and want as many C as possible.
I've thought about this problem and cannot imagine an efficient solution. The sample sizes are about 15 categories from which each contains up to ~30 elements, so bruteforcing doesn't seem to be very effective since there are potentially 30^15 possibilities.
My model is that I think of it as a tree with depth number of categories. Each depth level represents a category and gives us the choice of choosing an element out of this category. When passing over a node, we add the attributes of the represented element to our sum which we want to optimize.
If we hit the same attribute combination multiple times on the same level, we merge them so that we can stripe away the multiple computation of already computed values. If we reach a level where one path has less value in all three attributes, we don't follow it anymore from there.
However, in the worst case this tree still has ~30^15 nodes in it.
Does anybody of you can think of an algorithm which may aid me to solve this problem? Or could you explain why you think that there doesn't exist an algorithm for this?
This question is very similar to a variation of the knapsack problem. I would start by looking at solutions for this problem and see how well you can apply it to your stated problem.
My first inclination to is try branch-and-bound. You can do it breadth-first or depth-first, and I prefer depth-first because I think it's cleaner.
To express it simply, you have a tree-walk procedure walk that can enumerate all possibilities (maybe it just has a 5-level nested loop). It is augmented with two things:
At every step of the way, it keeps track of the cost at that point, where the cost can only increase. (If the cost can also decrease, it becomes more like a minimax game tree search.)
The procedure has an argument budget, and it does not search any branches where the cost can exceed the budget.
Then you have an outer loop:
for (budget = 0; budget < ... ; budget++){
walk(budget);
// if walk finds a solution within the budget, halt
}
The amount of time it takes is exponential in the budget, so easier cases will take less time. The fact that you are re-doing the search doesn't matter much because each level of the budget takes as much or more time than all the previous levels combined.
Combine this with some sort of heuristic about the order in which you consider branches, and it may give you a workable solution for typical problems you give it.
IF that doesn't work, you can fall back on basic heuristic programming. That is, do some cases by hand, and pay attention to how you did it. Then program it the same way.
I hope that helps.

Resources