Group incoming and outgoing invoices to make their sum 0 - algorithm

I've faced an interesting problem today, and decided to write an algorithm in C# to solve it.
There are incoming invoices with negative totals and outgoing invoices with positive totals. The task is to make groups out of these invoices in which the totals add up to exactly 0. Each group can contain any number of members, so a group of, say, two positive invoices and one negative invoice is fine as long as its total value is 0.
We try to minimize the sum of the remaining invoices' totals, and there are no other constraints at all.
I'm wondering if this problem can be traced back to a known problem, and if not, which would be the most effective way to solve it. The naive approach would be to separate incoming and outgoing invoices into two different groups, sort by total, then try adding invoices one by one until zero is reached or the sign has changed. However, this presumes that the invoices in a group should be of approximately the same magnitude, which is not true (one huge incoming invoice could be put against 10 smaller outgoing ones).
Any ideas?

The problem you are facing is a well-known and well-studied one, called the Subset Sum Problem.
Unfortunately, the problem is NP-complete, so there is no known polynomial-time solution for it (1).
In fact, there is no known polynomial solution to even determine if such a subset (even a single one) exists, let alone find it.
However, if your input consists of relatively small (in absolute value) integers, there is a pretty efficient (pseudo-polynomial) dynamic programming solution that can be utilized to solve the problem.
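For illustration, here is a minimal Python sketch of that dynamic programming idea, assuming the invoice totals are integers (e.g. amounts in cents); the function name and the index-list return format are made up for this example:

def zero_sum_subset(totals):
    """Return the indices of one non-empty subset of `totals` summing to 0, or None."""
    # Shift all sums by -sum(negatives) so every reachable sum is a non-negative key.
    offset = -sum(t for t in totals if t < 0)
    parent = {offset: None}            # shifted sum 0 (the empty subset) is reachable
    hit = None                         # (previous shifted sum, invoice index) of a zero-sum hit
    for i, t in enumerate(totals):
        for s in list(parent):         # iterate over a snapshot: each invoice used at most once
            ns = s + t
            if ns == offset and hit is None:
                hit = (s, i)           # a non-empty subset reached sum 0
            elif ns not in parent:
                parent[ns] = (s, i)    # remember one way of reaching this sum
        if hit:
            break
    if hit is None:
        return None
    s, i = hit
    subset = [i]
    while parent[s] is not None:       # walk back through the recorded parents
        s, i = parent[s]
        subset.append(i)
    return subset

print(zero_sum_subset([-70, 30, 40, 25]))   # -> [2, 1, 0]: 40 + 30 + (-70) == 0

It finds a single zero-sum group; to build several groups you could remove the matched invoices and repeat, though that repetition is itself only a heuristic for the original grouping problem.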
If this is not the case, some other alternatives are:
Using an exponential solution like brute force (you might be able to optimize it using the branch and bound technique)
Heuristic solutions, such as Steepest Ascent Hill Climbing or Genetic Algorithms.
Approximation algorithms
(1) And most computer science researchers believe one does not exist; this is basically the P vs. NP problem.

Related

Travelling Salesman with multiple salesmen with a limit on number of cities per salesman?

Problem: I need to drop (n) employees from the office to their homes (coordinates available). I have (x) 7-seater and (y) 4-seater cabs available.
I have to design an algorithm to drop all the employees at their homes while travelling the minimum distance.
Also, the algorithm must tell me how many 7-seater and/or 4-seater vehicles I must choose so as to travel the minimum distance.
E.g., if I have 15 employees then the algorithm may tell me to use 1 (7-seater) cab and 2 (4-seater) cabs, with the employees in each cab as follows:
[(E2, E4, E6, E8), (E1, E3, E5, E7, E9, E10, E12), (E11, E13, E14, E15)]
Approach: I'm thinking of this as a Travelling Salesman Problem with multiple salesmen, with an upper limit on the number of cities each can visit. Also, the salesmen do not need to come back to the origin. Ant colony optimization came to mind, but I can't really decide which algorithm to choose.
Requirement: I really need the ALGORITHM. Either TSP-based or ant colony, it doesn't matter. I'll welcome opinions, but I really need the ALGORITHM.
This is a cost minimization problem, not a travelling salesman problem. It is related to TSP in the sense that TSP is a very specific cost minimization problem.
The solution consists of three steps:
Generate a list of employee drop-off points (nodes)
Create distinct paths that do not intersect or branch. These will be your routes and help prevent wasteful route overlaps. Use cost(path) = distance(furthest node, origin) + taxi_cost(nodes) + sum(distances between consecutive nodes) to compare paths and/or brute-force all potential networks. Networks are layouts of paths. DO NOT BRANCH THE PATHS!!
Total distance is a line of defense against waste ensuring that routes are not too long.
Sum of distances helps the algorithm converge on neighbourhoods where many employees live (when possible).
Because this variation of the coin problem allows imperfect solutions, it reduces to a variant of the Knapsack Problem. The utility of each taxi is capacity. If you also wish to choose the cheapest way to transport your employees, utility(taxi) = capacity/cost. From this our simplest solution is to be greedy; who cares about empty space? If you really care about filling up taxis perfectly (as opposed to cost efficiently), you'll need a much more complex solution. You only specify the least distance as your metric (with each additional taxi multiplying cost). I assume this is a proxy to say 'I don't want to pay too much'.
Therefore: taxi_cost(nodes) = math.floor(amount(nodes) / max(utility(taxis))) + 1. This equation selects the cheapest, roomiest taxi and figures out how many of them are required to fully service the route.
Be sure to calculate the cost of each network you examine as sum(cost(path)) over its paths.
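For concreteness, here is a hedged Python transcription of the formulas above. The helper names, the (capacity, price) taxi tuples and the (x, y) node coordinates are my assumptions, amount(nodes) is read as the number of drop-off points on the path, and taxi_cost follows the answer in adding a taxi count to distance terms:

import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def utility(taxi):
    capacity, price = taxi
    return capacity                  # or capacity / price if you also want the cheapest option

def taxi_cost(nodes, taxis):
    # floor(amount(nodes) / max(utility(taxis))) + 1
    return math.floor(len(nodes) / max(utility(t) for t in taxis)) + 1

def path_cost(origin, nodes, taxis):
    # distance(furthest node, origin) + taxi_cost(nodes) + sum(distances between consecutive nodes)
    furthest = max(dist(origin, n) for n in nodes)
    legs = sum(dist(a, b) for a, b in zip(nodes, nodes[1:]))
    return furthest + taxi_cost(nodes, taxis) + legs

def network_cost(origin, paths, taxis):
    return sum(path_cost(origin, p, taxis) for p in paths)

# two routes from an office at (0, 0), one 7-seater costing 60 and one 4-seater costing 40
print(network_cost((0, 0), [[(1, 2), (2, 3)], [(-2, 1)]], [(7, 60), (4, 40)]))   # -> roughly 9.26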
Once you've found the cheapest network to service, for each path in the chosen network:
make a list of employees travelling to the furthest node
fill the preferred taxi with those employees
repeat with the next furthest node until you have a full taxi, then add the filled taxi to the list. If you run out of employees, you've finished assigning taxis to the route. (The benefit of furthest-first selection is that you can ask employees in unfilled taxis to walk if that part of the route is within blocks of the office).
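A minimal sketch of that furthest-first filling (the (x, y) node representation, the employees_at mapping and the function name are assumptions of mine, not part of the answer):

import math

def fill_taxis(office, route, employees_at, taxi_capacity):
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    taxis, current = [], []
    # walk the route from the furthest node back towards the office
    for node in sorted(route, key=lambda n: dist(office, n), reverse=True):
        for employee in employees_at.get(node, []):
            current.append(employee)
            if len(current) == taxi_capacity:   # taxi full: add it and start the next one
                taxis.append(current)
                current = []
    if current:                                 # the last taxi may be only partly filled
        taxis.append(current)
    return taxis

# fill_taxis((0, 0), [(5, 5), (2, 1)], {(5, 5): ["E1", "E2"], (2, 1): ["E3"]}, 2)
# -> [["E1", "E2"], ["E3"]]: the furthest node is served by the full taxi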
The algorithm above is not perfect, but it will have many desirable tendencies.
routes will be as short as possible and cover the greatest possible area (by not looping or branching)
routes will tend to service neighbourhoods, rather than trying to overlap responsibilities. This part of the algorithm isn't optimal, but is effective. This makes it really easy to remove service routes without needing to recalculate the transportation network.
the taxis chosen will be cost-efficient, helping to avoid paying more than necessary.
routes will use as few taxis as possible, taking into account the relative cost of upgrading to roomier ones with higher capacity
because the taxis travelling furthest will be full, it has less of an impact on your employees' ability to get to work if you decide to cancel service to emptier taxis.
Every step closer to perfection costs you many times more than the previous step, so diminishing returns are acceptable if the solution provides desirable features. Although the algorithm makes some potentially sub-optimal tradeoffs, they come with huge value; your network of taxi routes becomes much easier to modify.
If you'd like to make an optimal solution, the Knapsack Problem, Coin Problem, and Change-making Problem help determine the cost of taxis and routes.
Spanning Trees are the most effective way to determine routes. Center the spanning tree at the office and calculate the cost of each branch as the maximum distance from the office. Try to keep each branch servicing areas with high density to make it easier to add and remove taxi routes.
Studying pathfinding can help you learn how to determine good cost functions so that you can numerically compare different potential paths. Remember that your network consists of a set of paths, but will require its own cost function so that you can compare different layouts.
I've written an in-depth guide to pathfinding for this answer. Pathfinding articles are few and just don't go into enough depth for a lot of problem spaces. A good cost function can get you a nearly perfect solution if you have multiple priorities. Unfortunately, good cost functions are domain specific so you will need to identify them yourself. Feel free to message me if you aren't sure how to make a path with certain traits and I'll help you figure out a good cost function.
It's a constraint satisfaction problem, not really a TSP. If you're looking for a project that may be able to help you, you could look into cspdb, which is what I wrote some time ago:
https://github.com/gtoonstra/cspdb
You'd be using a database in the backend that maintains the state, and you'd write a couple of scripts in its own grammar that manipulate that state. A couple of examples are included that solve nqueens and classroom scheduling with multiple constraints.
From a list of destinations d you can build an array of pairwise travel costs c, where c[a,b] is the travel cost from a to b...
Now you have a start point p. Add to this the costs from p to each point in d, giving the array c2.
So you now have the concept of groups.
You can look at this as a greedy algorithm. Given the list c2 you can take the cheapest option given your state.
Your state is the vector of all your cab vectors (the costs of getting from wherever each cab is to wherever it could go next), element-wise multiplied (.*) by the assignment vector (with k == 0 for each unassigned entry). You find the minimum option given your state, considering that adding another person to a cab of 4 costs the difference between the 4-person cab and the 7-person cab, and that adding a person to a zero-person cab (i.e. adding a new cab) also has a cost. Once all your people have been assigned to their cabs you have an answer.
The idea of a greedy algorithm is most often illustrated by the knapsack (backpack) problem, but it can also be applied in statistical methods such as feature selection.
Like #Aaron3468's approach, this one is not perfect and does not guarantee the best solution.
If you want the best possible solutions you can iterate through all the combinations but this becomes impractical quickly.
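For what it's worth, one possible concrete reading of that greedy step in Python, ignoring the distance terms entirely; the cab prices and the function name are made up:

COST_4, COST_7 = 40, 60   # hypothetical prices of a 4-seater and a 7-seater

def greedy_assign(employees):
    cabs = []                                  # each cab is a list of employees
    for employee in employees:
        options = [(COST_4, None)]             # opening a new 4-seater is always an option
        for i, cab in enumerate(cabs):
            if len(cab) == 4:                  # full 4-seater: pay the difference to upgrade it
                options.append((COST_7 - COST_4, i))
            elif len(cab) < 7:                 # a seat in this cab is already paid for
                options.append((0, i))
        cost, i = min(options, key=lambda o: o[0])   # cheapest marginal cost wins
        if i is None:
            cabs.append([employee])
        else:
            cabs[i].append(employee)
    return cabs

print([len(c) for c in greedy_assign([f"E{k}" for k in range(1, 16)])])
# -> [7, 7, 1]: two 7-seaters plus a nearly empty 4-seater, illustrating the "not perfect" caveat above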
From my point of view, your algorithm should solve two problems: the number of cars of each type and the shortest distance (how you number your employees is up to you, or you should give more details). Sorry, I'm using a phone and I don't have all the features of the site.
For the number of cars you can use the algorithm below. To solve issues related to distances, you should give more info about the paths and their lengths. A graph algorithm may then be combined with this to do the trick. Here 11 = 7 + 4.
def cab_counts(n):
    # x = number of 7-seater cabs, y = number of 4-seater cabs (as defined in the question); 11 = 7 + 4
    temp = n // 11
    rem = n % 11
    if rem == 0:
        x, y = temp, temp
    elif rem <= 4:
        x, y = temp, temp + 1
    elif rem <= 7:
        x, y = temp + 1, temp
    else:
        x, y = temp + 1, temp + 1
    return x, y

# cab_counts(15) -> (1, 2): one 7-seater and two 4-seaters, matching the example above

Ideas for heuristically solving travelling salesman with extra constraints

I'm trying to come up with a fast and reasonably optimal algorithm to solve the following TSP/hamiltonian-path-like problem:
A delivery vehicle has a number of pickups and dropoffs it needs to perform:
For each delivery, the pickup needs to come before the dropoff.
The vehicle is quite small and the packages vary in size. The total carriage cannot exceed some upper bound (e.g. 1 cubic metre).
Each delivery has a deadline.
The planner can run mid-route, so the vehicle will begin with a number of jobs already picked up and some capacity already taken up.
A near-optimal solution should minimise the total cost (for simplicity, distance) between each waypoint. If a solution does not exist because of the time constraints, I need to find a solution that has the fewest number of late deliveries. Some illustrations of an example problem and a non-optimal, but valid, solution: [figures omitted]
I am currently using a greedy best first search with backtracking bounded to 100 branches. If it fails to find a solution with on-time deliveries, I randomly generate as many as I can in one second (the most computational time I can spare) and pick the one with the fewest number of late deliveries. I have looked into linear programming but can't get my head around it - plus I would think it would be inappropriate given it needs to be run very frequently. I've also tried algorithms that require mutating the tour, but the issue is mutating a tour nearly always makes it invalid due to capacity constraints and precedence. Can anyone think of a better heuristic approach to solving this problem? Many thanks!
Safe Moves
Here are some ideas for safely mutating an existing feasible solution:
Any two consecutive stops can always be swapped if they are both pickups, or both deliveries. This is obviously true for the "both deliveries" case; for the "both pickups" case: if you had room to pick up A, then pick up B without delivering anything in between, then you have room to pick up B first, then pick up A. (In fact a more general rule is possible: In any pure-delivery or pure-pickup sequence of consecutive stops, the stops can be rearranged arbitrarily. But enumerating all the possibilities might become prohibitive for long sequences, and you should be able to get most of the benefit by considering just pairs.)
A pickup of A can be swapped with any later delivery of something else B, provided that A's original pickup comes after B was picked up, and A's own delivery comes after B's original delivery. In the special case where the pickup of A is immediately followed by the delivery of B, they can always be swapped.
If there is a delivery of an item of size d followed by a pickup of an item of size p, then they can be swapped provided that there is enough extra room: specifically, provided that f >= p, where f is the free space available before the delivery. (We already know that f + d >= p, otherwise the original schedule wouldn't be feasible -- this is a hint to look for small deliveries to apply this rule to.)
If you are starting from purely randomly generated schedules, then simply trying all possible moves, greedily choosing the best, applying it and then repeating until no more moves yield an improvement should give you a big quality boost!
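To make the first rule concrete, a small Python sketch; representing a stop as a (kind, package) pair is my assumption, not something from the question:

def same_kind_swaps(route):
    # Yield neighbouring routes obtained by swapping adjacent same-kind stops;
    # by the argument above this never violates precedence or capacity.
    for i in range(len(route) - 1):
        if route[i][0] == route[i + 1][0]:           # both pickups or both dropoffs
            neighbour = list(route)
            neighbour[i], neighbour[i + 1] = neighbour[i + 1], neighbour[i]
            yield neighbour

route = [("pickup", "A"), ("pickup", "B"), ("dropoff", "A"), ("dropoff", "B")]
print(list(same_kind_swaps(route)))   # two neighbours: swap the two pickups, or the two dropoffs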
Scoring Solutions
It's very useful to have a way to score a solution, so that they can be ordered. The nice thing about a score is that it's easy to incorporate levels of importance: just as the first digit of a two-digit number is more important than the second digit, you can design the score so that more important things (e.g. deadline violations) receive a much greater weight than less important things (e.g. total travel time or distance). I would suggest something like 1000 * num_deadline_violations + total_travel_time. (This assumes of course that total_travel_time is in units that will stay beneath 1000.) We would then try to minimise this.
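As a toy example of that weighting (assuming, as above, that total travel time stays below the 1000 weight):

def score(num_deadline_violations, total_travel_time):
    # Deadline violations dominate: each one outweighs any travel time below 1000.
    return 1000 * num_deadline_violations + total_travel_time

print(score(2, 340))   # 2340: worse than any plan with at most one late delivery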
Managing Solutions
Instead of taking one solution and trying all the above possible moves on it, I would instead suggest using a pool of k solutions (say, k = 10000) stored in a min-heap. This allows you to extract the best solution in the pool in O(log k) time, and to insert new solutions in the same time.
You could initially populate the pool with randomly generated feasible solutions; then on each step, you would extract the best solution in the pool, try all possible moves on it to generate child solutions, and insert any child solutions that are better than their parent back into the pool. Whenever the pool doubles in size, pull out the first (i.e. best) k solutions and make a new min-heap with them, discarding the old one. (Performing this step after the heap grows to a constant multiple of its original size like this has the nice property of leaving the amortised time complexity unchanged.)
It can happen that some move on solution X produces a child solution Y that is already in the pool. This wastes memory, which is unfortunate, but one nice property of the min-heap approach is that you can at least handle these duplicates cheaply when they arrive at the front of the heap: all duplicates will have identical scores, so they will all appear consecutively when extracting solutions from the top of the heap. Thus to avoid having duplicate solutions generate duplicate children "down through the generations", it suffices to check that the new top of the heap is different from the just-extracted solution, and keep extracting and discarding solutions until this holds.
A note on keeping worse solutions: It might seem that it could be worthwhile keeping child solutions even if they are slightly worse than their parent, and indeed this may be useful (or even necessary to find the absolute optimal solution), but doing so has a nasty consequence: it means that it's possible to cycle from one solution to its child and back again (or possibly a longer cycle). This wastes CPU time on solutions we have already visited.
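Putting the pool management together, a rough Python sketch; score (lower is better) and children_of (your move generator) are placeholders for functions you would supply, and the pool size and step count are arbitrary:

import heapq
from itertools import count

def improve(initial_solutions, score, children_of, pool_size=10000, steps=1000):
    # assumes at least one initial solution
    tie = count()                                     # tie-breaker so the heap never compares solutions
    pool = [(score(s), next(tie), s) for s in initial_solutions]
    heapq.heapify(pool)                               # min-heap: best score at the top
    best_score, _, best = pool[0]
    for _ in range(steps):
        if not pool:
            break
        parent_score, _, parent = heapq.heappop(pool)
        # discard duplicates of the just-extracted solution sitting at the top
        while pool and pool[0][0] == parent_score and pool[0][2] == parent:
            heapq.heappop(pool)
        for child in children_of(parent):
            child_score = score(child)
            if child_score < parent_score:            # keep only children better than their parent
                heapq.heappush(pool, (child_score, next(tie), child))
                if child_score < best_score:
                    best_score, best = child_score, child
        if len(pool) > 2 * pool_size:                 # pool has doubled: keep only the best half
            pool = heapq.nsmallest(pool_size, pool)   # a sorted list is already a valid min-heap
    return best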
You are basically combining the Knapsack Problem with the Travelling Salesman Problem.
Your main problem here seems to actually be the Knapsack Problem rather than the Travelling Salesman Problem, since it has the one hard restriction (maximum delivery volume). Maybe try to combine the solutions for the Knapsack Problem with those for the Travelling Salesman Problem.
If you really only have one second max for calculations a greedy algorithm with backtracking might actually be one of the best solutions that you can get.

maximize profit with n products satisfying certain constraints

I am given a list of n products with associated profits and costs per unit. The aim is to maximize the profit while keeping the total cost below some threshold. For each product, either one unit or zero units are produced.
Now suppose we have three products, labelled 1, 2 and 3. Then all possible combinations of production can be given as the binary numbers 111, 110, 101, 011, 100, 010, 001 and 000, where a 1 in the i-th position denotes producing one unit of product i, and similarly for a 0. We could then easily check which of these combinations has a production cost under the threshold and the maximum profit. This algorithm is of order O(2^n), because for n products we have to check 2^n binary numbers. We can probably make it a little faster by recognizing that if 100 is already above the threshold we need not check 110 and 111, and so on, but that does not change the order. How can I design a smarter algorithm with a better time complexity? The n can be as large as 100, in which case checking 2^100 numbers is not possible. Thanks in advance.
If your costs are integers that are not too big, you can use the dynamic programming solution for the knapsack problem, which is listed in the link mentioned in David Eisenstat's comment. If your costs are either big integers or fractional, then your best bet is using one of the existing knapsack solvers that e.g. reduce to an integer linear programming problem and then do something like branch and bound in order to solve. At any rate, your problem IS the knapsack problem, with the only slight modification that you don't have to fill the knapsack completely; you can fill it partially as long as you don't overfill it. However, this variant is also studied along with the original formulation, and there are solvers for it. Also, it is easy to modify the dynamic programming solution to handle this; let me know if it's unclear how and I'll update my answer with an explanation.
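For concreteness, here is a hedged sketch of that DP for the "at most the threshold" variant (an illustration of mine, assuming integer costs; not the solver code referred to above):

def max_profit(profits, costs, budget):
    dp = [0] * (budget + 1)                       # dp[c] = best profit with total cost <= c
    for profit, cost in zip(profits, costs):
        for c in range(budget, cost - 1, -1):     # go downwards so each product is used at most once
            dp[c] = max(dp[c], dp[c - cost] + profit)
    return dp[budget]

print(max_profit([10, 7, 12], [4, 3, 5], budget=8))   # -> 19 (products 2 and 3: cost 3 + 5 = 8)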

A detail question when applying genetic algorithm to traveling salesman

I have read various material on this and understand the principles and concepts involved; however, none of the papers mention the details of how to calculate the fitness of a chromosome (which represents a route) when adjacent cities (in the chromosome) are not directly connected by an edge (in the graph).
For example, given a chromosome 1|3|2|8|4|5|6|7, in which each gene represents the index of a city on the graph/map, how do we calculate its fitness (i.e. the total sum of distances traveled) if, say, there is no direct edge/link between city 2 and 8. Do we follow some sort of greedy algorithm to work out a route between 2 and 8, and add the distance of this route to the total?
This problem seems pretty common when applying GA to TSP. Anyone who's done it before please share your experience. Thanks.
If there is no link between 2 and 8 on your graph, then any chromosome with 2|8 or 8|2 in it is invalid for the classical travelling salesman problem. If you find some other route between 2 and 8, you are probably going to violate the "visit each location once" requirement.
One really dodgy-but-pragmatic solution is to include edges between those nodes with incredibly high distances, or even +INF if your language supports it. That way, your standard minimizing fitness function will naturally prune them.
I think the original formulation of the problem includes edges between all nodes, so this is a non-issue.
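A minimal sketch of that penalty-edge idea; the dict-of-city-pairs distance representation is an assumption:

import math

def tour_length(tour, dist):
    total = 0.0
    for a, b in zip(tour, tour[1:] + tour[:1]):    # close the tour back to the starting city
        total += dist.get((a, b), math.inf)        # missing edges count as infinitely long
    return total

def fitness(tour, dist):
    return 1.0 / (1.0 + tour_length(tour, dist))   # higher is fitter; an impossible tour scores 0.0

# e.g. dist = {(1, 3): 2.0, (3, 1): 2.0, (3, 2): 4.5, ...}; fitness([1, 3, 2, 8, 4, 5, 6, 7], dist)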
This is exactly the kind of problem for which specialized crossover and mutation methods have been developed in GA-based solutions to TSP. See this question.
If the chromosome does not represent a valid solution then it is completely unfit to solve the problem. So, depending on how you order fitness (i.e. if a lower number represents more fitness, which is possibly a good idea when fitness represents total cost), you'd assign it a maximum value and break off any further fitness calculation on that chromosome once you reach a gene sequence that is invalid.
(Or vice versa: assign it a fitness of zero if a higher fitness means a chromosome is more fit for the job.)
However, as others have pointed out, it could be better to ensure that invalid chromosomes don't occur in the first place. If that is itself an overly complex process, then allowing them, while ensuring that broken chromosomes are unlikely to make it into successive generations, could be an acceptable approach.

What's the most insidious way to pose this problem?

My best shot so far:
A delivery vehicle needs to make a series of deliveries (d1,d2,...dn), and can do so in any order--in other words, all the possible permutations of the set D = {d1,d2,...dn} are valid solutions--but the particular solution needs to be determined before it leaves the base station at one end of the route (imagine that the packages need to be loaded in the vehicle LIFO, for example).
Further, the cost of the various permutations is not the same. It can be computed as the sum of the squares of the distances traveled between di-1 and di, where d0 is taken to be the base station, with the caveat that any segment that involves a change of direction costs 3 times as much (imagine this is happening on a railroad or in a pneumatic tube, where backing up disrupts other traffic).
Given the set of deliveries D represented as their distance from the base station (so abs(di-dj) is the distance between two deliveries) and an iterator permutations(D) which will produce each permutation in succession, find a permutation which has a cost less than or equal to that of any other permutation.
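For concreteness, a literal (and hedged) Python transcription of that cost; how the very first leg's direction is initialised is my own choice, and note that the pseudocode below leaves Cost(D) as a stub:

def cost(D):
    route = [0] + list(D)                  # d0 is the base station
    total, prev_dir = 0, 0
    for a, b in zip(route, route[1:]):
        leg = (b - a) ** 2                 # squared distance of this segment
        direction = (b > a) - (b < a)      # +1, -1, or 0 for a zero-length segment
        if prev_dir and direction and direction != prev_dir:
            leg *= 3                       # changing direction costs three times as much
        total += leg
        if direction:
            prev_dir = direction
    return total

print(cost([1, 3, 2]))   # 1 + 4 + 3*1 = 8: the leg from 3 back to 2 reverses direction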
Now, a direct implementation from this description might lead to code like this:
function Cost(D) ...
function Best_order(D)
    for D1 in permutations(D)
        Found = true
        for D2 in permutations(D)
            Found = false if Cost(D1) > Cost(D2)
        return D1 if Found
Which is O(n * (n!)^2), i.e. pretty awful--especially compared to the O(n log n) solution someone with insight would find by simply sorting D.
My question: can you come up with a plausible problem description which would naturally lead the unwary into a worse (or differently awful) implementation of a sorting algorithm?
I assume you're using this question for an interview to see if the applicant can notice a simple solution in a seemingly complex question.
[This assumption is incorrect -- MarkusQ]
You give too much information.
The key to solving this is realizing that the points are in one dimension and that a sort is all that is required. To make this question more difficult hide this fact as much as possible.
The biggest clue is the distance formula. It introduces a penalty for changing directions. The first thing that comes to mind is minimizing this penalty. To remove the penalty I have to order them in a certain direction, and this ordering is the natural sort order.
I would remove the penalty for changing directions, it's too much of a give away.
Another major clue is the input values to the algorithm: a list of integers. Give them a list of permutations, or even all permutations. That sets them up to think that an O(n!) algorithm might actually be expected.
I would phrase it as:
Given a list of all possible permutations of n delivery locations, where each permutation of deliveries (d1, d2, ..., dn) has a cost defined by:
Return permutation P such that the cost of P is less than or equal to that of any other permutation.
All that really needs to be done is read in the first permutation and sort it.
If they construct a single loop to compare the costs, ask them what the big-O runtime of their algorithm is, where n is the number of delivery locations (another trap).
This isn't a direct answer, but I think more clarification is needed.
Is di allowed to be negative? If so, sorting alone is not enough, as far as I can see.
For example:
d0 = 0
deliveries = (-1,1,1,2)
It seems the optimal path in this case would be 1 > 2 > 1 > -1.
Edit: This might not actually be the optimal path, but it illustrates the point.
You could rephrase it, having first found the optimal solution, as:
"Give me a proof that the following combination is optimal for the following set of rules, where optimal means the smallest number results from the sum of all stage costs, taking into account that all stages (A..Z) need to be present once and once only.
Combination:
A->C->D->Y->P->...->N
Stage costs:
A->B = 5,
B->A = 3,
A->C = 2,
C->A = 4,
...
...
...
Y->Z = 7,
Z->Y = 24."
That ought to keep someone busy for a while.
This reminds me of the Knapsack problem, more than the Traveling Salesman. But the Knapsack is also an NP-hard problem, so you might be able to fool people into thinking up an overly complex solution using dynamic programming if they correlate your problem with the Knapsack, where the basic problem is:
can a value of at least V be achieved without exceeding the weight W?
Now, a fairly good solution can be found when each value per unit of weight vj is unique (your distances), as quoted below:
The knapsack problem with each type of item j having a distinct value per unit of weight (vj = pj/wj) is considered one of the easiest NP-complete problems. Indeed, empirical complexity is of the order of O((log n)^2) and very large problems can be solved very quickly, e.g. in 2003 the average time required to solve instances with n = 10,000 was below 14 milliseconds using commodity personal computers.
So you might want to state that several stops/packages might share the same vj, inviting people to think about the really hard solution to:
However, in the degenerate case of multiple items sharing the same value vj it becomes much more difficult, with the extreme case where vj = constant being the subset sum problem with a complexity of O(2^(N/2) N).
So if you replace value per unit of weight with value per unit of distance, and state that several distances might actually share the same values (the degenerate case), some folks might fall into this trap.
Isn't this just the (NP-Hard) Travelling Salesman Problem? It doesn't seem likely that you're going to make it much harder.
Maybe phrasing the problem so that the actual algorithm is unclear - e.g. by describing the paths as single-rail railway lines so the person would have to infer from domain knowledge that backtracking is more costly.
What about describing the question in such a way that someone is tempted to do recursive comparisons - e.g. "can you speed up the algorithm by using the optimum max subset of your best (so far) results"?
BTW, what's the purpose of this - it sounds like the intent is to torture interviewees.
You need to be clearer on whether the delivery truck has to return to base (making it a round trip), or not. If the truck does return, then a simple sort does not produce the shortest route, because the square of the return from the furthest point to base costs so much. Missing some hops on the way 'out' and using them on the way back turns out to be cheaper.
If you trick someone into a bad answer (for example, by not giving them all the information) then is it their foolishness or your deception that has caused it?
How great is the wisdom of the wise, if they heed not their ego's lies?
