How do I find the longest path in a weighted graph? - algorithm

If I am given a data structure with currency conversion rates:
a list of currency relationships with exchange values. (INR - USD)
Then how can I find the best exchange rate from currency1 to currency2?
My thought process:
Method 1:
if I take the list of exchange values and convert it to a graph - adjacency list and a weight list ( since this seems to be like a weighted graph problem), I can use DFS to find all possible paths and then keep a track of the path that generates the highest exchange rate (so I will multiply every conversion rate that comes in the path and store it. whenever a path generates a better conversion rate then I update this variable, therefore I have the max)
Please comment on the correctness of this algorithm. Am I thinking correctly? Would this generate the correct result?
A problem I see right away is that this is very inefficient since it would take exponential time.
Method 2: Can I just negate all the conversions and use Bellman Ford? Since Bellman Ford is used to find least costing paths in a weighted graph.
Thanks. Any guidance would be truly appreciated

Your intuition is correct: you could use DFS to enumerate all paths, and it would give you the best exchange rate (the path that maximizes the product of conversion rates), but it would be extremely slow for large graphs.
Your second method (Bellman-Ford) is a much better idea. As you mention, you'll have to multiply the exchange rates / edge weights rather than add them, but this shouldn't pose any issues.
I assume you already worked this out, but for anyone referencing this in the future: you cannot use Dijkstra's algorithm or its descendants like A*, because the graph, in spirit, has negative cycles. You could find a conversion rate less than 1 and potentially exploit this to get an overall lower minimum conversion rate (which you can then turn into a maximum conversion rate in the opposite direction simply by swapping the two currencies).
A mathematical digression:
A way to see this more clearly: imagine we have three conversion rates A, B, and C along a chain of currencies. Assuming the units check out, the overall conversion rate R across these three exchanges would be R = A * B * C. Another way we could write this would be R = e ^ log(A * B * C), where e is Euler's number and log() is the natural logarithm (we could just as well have used 10 and log10(), or any other base). Rewriting this using the rules of logarithms, we get R = e ^ (log(A) + log(B) + log(C)), and finally log(R) = log(A) + log(B) + log(C).
Now, if we don't care about the actual value of R, just which one is largest / smallest (or we're willing to perform some exponentiation to recover it), we can settle for computing log(R), the log of the exchange rate. The benefit is that the weights, once transformed to their logarithms, are added together rather than multiplied. This allows us to use traditional implementations of graph algorithms unchanged (we just give them log(weight) instead of weight). And if an edge weight lies between 0 and 1, its logarithm is negative, exposing the true nature of that edge and the potential negative cycles it may create.
Summary
You'll probably want to use Bellman-Ford, and you should be fine just replacing addition with multiplication. If you have an existing implementation at hand that uses addition to combine edge weights, you can easily cheat by passing it the log() of each edge weight instead, and things will work "automagically".
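For illustration, here is a minimal sketch of that idea (my own code, not the poster's): edges carry -log(rate), so minimizing the sum of weights with standard Bellman-Ford maximizes the product of rates, and a negative cycle corresponds to an arbitrage loop whose rates multiply to more than 1. The function name and the sample rates are made up for the example.

import math

def best_rate(rates, src, dst):
    # rates: list of (from_currency, to_currency, rate) triples
    nodes = {c for a, b, _ in rates for c in (a, b)}
    dist = {c: math.inf for c in nodes}
    dist[src] = 0.0
    edges = [(a, b, -math.log(r)) for a, b, r in rates]

    # Standard Bellman-Ford: |V| - 1 relaxation passes.
    for _ in range(len(nodes) - 1):
        updated = False
        for a, b, w in edges:
            if dist[a] + w < dist[b]:
                dist[b] = dist[a] + w
                updated = True
        if not updated:
            break

    # One extra pass detects a negative cycle, i.e. a loop of rates
    # whose product exceeds 1; the "best rate" is then unbounded.
    for a, b, w in edges:
        if dist[a] + w < dist[b]:
            raise ValueError("arbitrage cycle reachable; no finite answer")

    return math.exp(-dist[dst])  # turn the log-sum back into a rate (0.0 if dst is unreachable)

# usage: the two-hop INR -> USD -> EUR route beats the direct INR -> EUR edge
rates = [("INR", "USD", 0.012), ("USD", "EUR", 0.92), ("INR", "EUR", 0.011)]
print(best_rate(rates, "INR", "EUR"))  # 0.01104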

Related

Weighted sampling without replacement and negative weights

I have an unusual sampling problem that I'm trying to implement for a Monte Carlo technique. I am aware there are related questions and answers regarding the fully-positive problem.
I have a list of n weights w_1,...,w_n and I need to choose k elements, labelled s_1,...,s_k say. The probability distribution that I want to sample from is
p(s_1, ..., s_k) = |w_{s_1} + ... + w_{s_k}| / P_total
where P_total is a normalization factor (the sum of the unnormalized values |w_{s_1} + ... + w_{s_k}| over all possible selections). I don't really care how the elements are ordered for my purposes.
Note that some of the w_i may be less than zero, hence the absolute value signs above. With purely non-negative w_i this distribution is relatively straightforward to sample by sampling without replacement - a tree method being the most efficient as far as I can tell. With some negative weights, though, I feel like I must resort to explicitly writing out each possibility and sampling from this exponentially large set. Any suggestions or insights would be appreciated!
Rejection sampling is worth a try. Compute the maximum possible weight of a sample: the larger of the absolute values of the sum of the k smallest weights and the sum of the k largest weights. Then repeatedly generate a uniform random sample of k elements and accept it with probability equal to its weight divided by that maximum, until a sample is accepted.
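A minimal sketch of that rejection loop, assuming plain Python lists (the function name and example weights are illustrative):

import random

def sample_subset(weights, k, rng=random):
    # Draw k distinct indices with probability proportional to
    # |w_i1 + ... + w_ik|, by rejection against an upper bound.
    n = len(weights)
    s = sorted(weights)
    # Any k-subset sum lies between the sum of the k smallest and the
    # sum of the k largest weights, which bounds the absolute value.
    max_abs = max(abs(sum(s[:k])), abs(sum(s[-k:])))
    if max_abs == 0:
        raise ValueError("every k-subset sums to zero; distribution undefined")
    while True:
        idx = rng.sample(range(n), k)           # uniform k-subset
        w = abs(sum(weights[i] for i in idx))
        if rng.random() < w / max_abs:          # accept with prob |sum| / bound
            return idx

print(sample_subset([0.5, -1.2, 3.0, -0.1, 2.2], k=2))

The acceptance rate degrades if typical |sums| are much smaller than the bound, but for moderate k it is often good enough.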

Suggestions for fragment proposal algorithm

I'm currently trying to solve the following problem, but I am unsure which algorithm I should be using. It's in the area of mass identification.
I have a series of "weights", w_i, which can sum up to a total weight. The as-measured total weight has an error associated with it, so it is inexact.
I need to find, given the total weight T, the closest k possible combinations of weights that can sum up to the total, where k is an input from the user. Each weight can be used multiple times.
Now, this sounds suspiciously like the bounded-integer multiple knapsack problem, however
it is possible to go over the weight, and
I also want all of the ranked solutions in terms of error
I can probably solve it using multiple sweeps of the knapsack problem, from weight-error to weight+error, stepping in small enough increments; however, if the increment is too large, it is possible to miss certain weight combinations that could be used.
The number of weights is usually small (4 to 10 weights) and the ratio of the total weight to the mean weight is usually around 2 or 3.
Does anyone know the name of an algorithm that might be suitable here?
Your problem effectively resembles the knapsack problem, which is an NP-complete problem.
For a really limited number of weights, you could run over every combination with repetition followed by a sort, which gives you quite a high number of operations: (n + k - 1)! / ((n - 1)! · k!) for the combinations and n·log(n) for the sorting part.
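Since the question says the number of weights is small and the total is only a few times the mean weight, that brute force is often affordable. A minimal sketch (the function name and the max_items cap are my own assumptions, not from the question): enumerate multisets of weights, score each by its distance from the measured total T, and keep the k best.

from itertools import combinations_with_replacement

def closest_combinations(weights, T, k, max_items=6):
    scored = []
    for size in range(1, max_items + 1):
        for combo in combinations_with_replacement(weights, size):
            scored.append((abs(sum(combo) - T), combo))
    scored.sort()                  # rank every candidate by its error
    return scored[:k]              # the k best (error, combination) pairs

# usage: 5 candidate weights, an inexact total, and the 3 best matches
print(closest_combinations([12.0, 17.5, 23.1, 31.4, 44.2], T=80.0, k=3))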
Solving this kind of problem in a reasonable amount of time is best done by evolutionary algorithms nowadays.
If you take the following example from deap, an evolutionary algorithm framework in Python:
ga_knapsack.py, you will see that by modifying lines 58-59, which automatically discard an overweight solution, into something smoother (a linear penalty, for instance), it will give you solutions close to the optimal one in a shorter time than brute force. Solutions are already sorted for you at the end, as you requested.
As a first attempt I'd go for constraint programming (but then I almost always do, so take the suggestion with a pinch of salt):
Given W = w_1, ..., w_n for the weights and E = e_1, ..., e_n for the errors (you can also make them asymmetric), and the target T.
Find all sets S (or lists, if the weights are not unique) such that (w_1 + e_1) + ... + (w_k + e_k) ≈ T, where w_1, ..., w_k ∈ W and e_1, ..., e_k ∈ E, within some delta which you derive from k. Or just set the delta to some reasonably large value and decrease it as you are solving the constraints.
I just realised that you also want to parametrise the expression w_n op e_m over op ∈ {+, -} (any combination of weights and error terms), and off the top of my head I don't know which constraint solver would allow you to do that. In any case, you can always fall back to Prolog. It may not fly, especially if you have a lot of weights, but it will give you solutions quickly.
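For a rough feel of that last formulation, here is a tiny brute-force rendering (my own illustration, not a real constraint program): fix a candidate selection of weights and enumerate the ± choice on each error term, keeping the assignments whose adjusted sum lands within delta of T.

from itertools import product

def feasible_sign_assignments(weights, errors, T, delta):
    hits = []
    for signs in product((+1, -1), repeat=len(weights)):
        total = sum(w + s * e for w, s, e in zip(weights, signs, errors))
        if abs(total - T) <= delta:            # constraint: |sum - T| <= delta
            hits.append((signs, total))
    return hits

print(feasible_sign_assignments([12.0, 17.5, 23.1], [0.2, 0.3, 0.1],
                                T=52.5, delta=0.5))

A real CP solver or Prolog would interleave the selection of weights with these sign choices instead of enumerating them exhaustively.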

Finding the fastest path with a cost less than or equal to a specified value

Here's a visualisation of my problem.
I've been trying to use Dijkstra's algorithm on it; however, it hasn't worked.
The complication, as I see it, is that Dijkstra's algorithm throws away information that you need to keep around: if you are trying to get from A to E in
B
/ \
A D - E
\ /
C
and ABD is shorter than ACD, Dijkstra's will forget that ACD was ever a possibility (it uses ABD as the canonical route from A to D). But if ABD has a higher cost than ACD, and ABDE is above the quota while ACDE is below, the now-eliminated ACD was correct. The problem is that Dijkstra's algorithm assumes that if one path is at least as long as another, it is weakly dominated: there is no reason to prefer it. And in one dimension of comparison, paths are weakly ordered: given any two paths, one weakly dominates the other.
But here we have two dimensions of comparison, and so that ordering does not hold: one path can be shorter, the other cheaper. Since we can only discard dominated paths, we must keep all paths that do not already exceed the budget and are not dominated. I have put a bit of work into implementing this approach; it looks doable, but I cannot find an argument for a worst-case bound below exponential complexity (although typical performance should be much better, since in a sane graph most paths are dominated).
You can also, as Billiska notes, use k-th shortest routes algorithms and then proceed through their results until you find one below the budget. That uses time O(m+ K*n*log(m/n)); but unless someone sees an upper bound on K such that K is guaranteed to include a path under the budget (if one exists), we need to set K to be the total number of paths, again yielding exponential complexity (although again a strategy of incrementally increasing K would likely yield a reasonable average runtime, at least if length and cost are reasonably correlated).
EDIT:
Complicating (perhaps fatally) the implementation of my proposed modification is that Dijkstra's algorithm relies on an ordering of the accessibility of nodes, such that we know that if we take the unexplored node to which we have the shortest path, we will never find a better route to it (since all other routes are already known to be longer). If that shortest route is also expensive, that need not hold; even after exploring a node, we must be prepared to update paths out of it on the basis of longer but cheaper routes into it. I suspect that this will prevent it from reaching polynomial time in the worst case.
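For concreteness, a minimal sketch of the label-keeping approach described above (my own code; the graph format, node names, and budget are illustrative). Each node keeps its non-dominated (length, cost) labels, labels are expanded in order of length, and over-budget extensions are pruned.

import heapq

def shortest_within_budget(graph, src, dst, budget):
    # graph: {node: [(neighbor, length, cost), ...]}, lengths and costs >= 0
    labels = {node: [] for node in graph}       # kept non-dominated labels
    heap = [(0, 0, src)]                        # (length, cost, node)
    while heap:
        length, cost, node = heapq.heappop(heap)
        # Discard labels dominated by one already kept at this node.
        if any(l <= length and c <= cost for l, c in labels[node]):
            continue
        labels[node].append((length, cost))
        if node == dst:
            return length, cost                 # first feasible label wins
        for nxt, elen, ecost in graph[node]:
            nlen, ncost = length + elen, cost + ecost
            if ncost <= budget:                 # prune over-budget paths
                heapq.heappush(heap, (nlen, ncost, nxt))
    return None                                 # no path within budget

# usage: A-B-D is short but expensive, A-C-D is longer but cheap
graph = {
    "A": [("B", 1, 5), ("C", 2, 1)],
    "B": [("D", 1, 5)],
    "C": [("D", 2, 1)],
    "D": [("E", 1, 1)],
    "E": [],
}
print(shortest_within_budget(graph, "A", "E", budget=6))  # (5, 3) via A-C-D-E

As the edit above warns, the number of non-dominated labels can grow exponentially in the worst case, even though it is usually small in practice.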
Basically you need to find the first shortest path, check if it works, then find the second shortest path, check if it works, and so on...
Dijkstra's algorithm isn't designed for such a task.
With just a Google search on this new formulation of the problem, I arrived at a Stack Overflow question on finding k-th shortest paths.
I haven't read into it yet, so don't ask me.
I hope this helps.
I think you can do it with Dijkstra's, but you have to change the way you calculate the tentative distance in each step. Instead of taking only the distance into account, consider the cost as well: the new distance should be a 2-d value (dist, cost), and when you choose the minimal distance you should take the one with minimal dist AND cost <= 6, that's it.
I hope this is correct.

Graph Theory: Calculating Clustering Coefficient

I'm doing some research and I've come to a point where I have to calculate the clustering coefficient of a graph.
According to this paper directly related to my research:
The clustering coefficient C(p) is defined as follows. Suppose that a vertex v has k_v neighbours; then at most k_v * (k_v - 1) / 2 edges can exist between them (this occurs when every neighbour of v is connected to every other neighbour of v). Let C_v denote the fraction of these allowable edges that actually exist. Define C as the average of C_v over all v.
But this wikipedia article on the subject says differently:
C = (number of closed triplets) / (number of connected triples)
It seems to me that the latter is more computationally expensive.
So really my question is: are they equivalent?
It should be noted that the paper is cited by the Wikipedia article.
Thanks for your time.
The two formulas are not the same; they are two different ways in which the global clustering coefficient can be calculated.
One way is by averaging the clustering coefficients (C_i [1]) of all nodes (this is the method you quoted from Watts and Strogatz). However, in [2, p. 204] Newman argues that this method is less preferable than the second one (the one you got from Wikipedia). He justifies this by pointing out how the value of the global clustering coefficient can be dominated by nodes of low degree, due to C_i's denominator [1]. So, in a network with many nodes of low degree, you end up with a large value for the global clustering coefficient, which Newman argues would be unrepresentative.
However, many network studies (or, in my experience, at least many studies concerned with online social networks) seem to have used this method, so in order to compare your results with theirs, you would need to use the same method. Furthermore, the critique raised by Newman does not affect the extent to which comparisons of global clustering coefficients can be made, provided the same method was used in measuring them.
The two formulae are different and were proposed at different moments in time. The one you quoted from Watts and Strogatz is older, which is perhaps why it seems to have been more commonly used. Newman also explains that the two formulae are far from equivalent and shouldn't be used as such. He says they can give substantially different numbers for a given network, but doesn't explain why.
[1] C_i = (number of pairs of neighbours of i that are connected) / (number of pairs of neighbours of i)
[2] Newman, M.E.J. Networks: An Introduction. Oxford; New York: Oxford University Press, 2010. Print.
Edit:
I am including here a series of calculations for the same ER random graph. You can see how the two methods give different results, even for undirected graphs. (done using Mathematica)
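If you want to reproduce that kind of comparison without Mathematica, here is a small pure-Python sketch (my own illustration; the graph and function names are made up) contrasting the two definitions on an undirected graph given as adjacency sets:

from itertools import combinations

def average_local_cc(adj):
    # Watts-Strogatz style: mean of the per-vertex coefficients C_v
    # (vertices with fewer than 2 neighbours contribute 0 here).
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)

def global_cc(adj):
    # Transitivity: (3 * triangles) / (connected triples), both counted per centre vertex.
    triples = sum(len(n) * (len(n) - 1) / 2 for n in adj.values())
    closed = sum(1 for v, n in adj.items()
                 for a, b in combinations(n, 2) if b in adj[a])
    return closed / triples if triples else 0.0

# a triangle (1-2-3) with a pendant vertex 4 attached to 3
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(average_local_cc(adj), global_cc(adj))   # ~0.583 vs 0.6: they differ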
I think they're equivalent. The wiki page you link to gives a proof that the triples formulation is equivalent to the fraction of possible edges formulation when calculating the local clustering coefficient, i.e. calculated just at a vertex. From there it seems that you just need to show that
sum_v lambda(v)/tau(v) = 3 x # triangles / # connected triples
where lambda(v) is the number of triangles containing v, and tau(v) is the number of connected triples for which v is the middle vertex, i.e. adjacent to each of the other 2 edges.
Now each triangle gets counted three times in the numerator of the LHS. However, each connected triple is only counted once for the middle vertex on the LHS, so the denominators are the same.
I partially disagree with Whatang. These methods are only equivalent for undirected graphs; for directed graphs they return different results. In my opinion the local clustering coefficient method is the correct one, not to mention it's less computationally expensive. For example:
(a directed graph on nodes 4, 5, 6, 7 with edges 4 -> 5, 4 -> 6, 5 -> 4, 6 -> 4, 6 -> 5, 6 -> 7, i.e. the adjacency listed below)
4(IN [5,6], OUT [5,6])
5(IN [4,6], OUT [4])
6(IN [4], OUT [4,5,7])
7(IN [6], OUT [])
central = 6
localCC = 2 / (4*3) = 1/6
globalCC = 1 / 3
I wouldn't trust that Wikipedia article. The first formula you cited is currently defined as the mean clustering coefficient, i.e. it is the mean of all local clustering coefficients for a graph g. This is in no way the same as the global clustering coefficient, as xk_id aptly put it.
There is a great page to learn the basics from:
http://www.learner.org/courses/mathilluminated/interactives/network/
All about clustering coefficients, small worlds and so on...

What's the most insidious way to pose this problem?

My best shot so far:
A delivery vehicle needs to make a series of deliveries (d1,d2,...dn), and can do so in any order--in other words, all the possible permutations of the set D = {d1,d2,...dn} are valid solutions--but the particular solution needs to be determined before it leaves the base station at one end of the route (imagine that the packages need to be loaded in the vehicle LIFO, for example).
Further, the cost of the various permutations is not the same. It can be computed as the sum of the squares of the distances traveled between d(i-1) and d(i), where d0 is taken to be the base station, with the caveat that any segment that involves a change of direction costs 3 times as much (imagine this is happening on a railroad or in a pneumatic tube, where backing up disrupts other traffic).
Given the set of deliveries D represented as their distance from the base station (so abs(di-dj) is the distance between two deliveries) and an iterator permutations(D) which will produce each permutation in succession, find a permutation which has a cost less than or equal to that of any other permutation.
Now, a direct implementation from this description might lead to code like this:
function Cost(D) ...
function Best_order(D)
    for D1 in permutations(D)
        Found = true
        for D2 in permutations(D)
            Found = false if cost(D2) < cost(D1)
        return D1 if Found
Which is O(n * (n!)^2), i.e. pretty awful--especially compared to the O(n log(n)) someone with insight would find, by simply sorting D.
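For concreteness, here is a small sketch of my reading of the cost rules (illustrative code, not part of the original question): squared segment lengths, tripled whenever the vehicle changes direction, starting from the base station at position 0, with a brute-force check against plain sorting.

from itertools import permutations

def cost(order, base=0):
    total, prev, prev_dir = 0, base, None
    for d in order:
        seg = d - prev
        direction = 1 if seg > 0 else (-1 if seg < 0 else prev_dir)
        penalty = 3 if prev_dir is not None and direction != prev_dir else 1
        total += penalty * seg * seg
        prev, prev_dir = d, direction
    return total

D = [7, 2, 5, 3]
best = min(permutations(D), key=cost)      # the O(n * n!) brute force
print(best, cost(best), cost(sorted(D)))   # sorting matches the optimum here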
My question: can you come up with a plausible problem description which would naturally lead the unwary into a worse (or differently awful) implementation of a sorting algorithm?
I assume you're using this question for an interview to see if the applicant can notice a simple solution in a seemingly complex question.
[This assumption is incorrect -- MarkusQ]
You give too much information.
The key to solving this is realizing that the points are in one dimension and that a sort is all that is required. To make this question more difficult hide this fact as much as possible.
The biggest clue is the distance formula. It introduces a penalty for changing directions. The first thing that comes to my mind is minimizing this penalty. To remove the penalty I have to order the deliveries in a single direction, and this ordering is the natural sort order.
I would remove the penalty for changing directions, it's too much of a give away.
Another major clue is the input values to the algorithm: a list of integers. Give them a list of permutations, or even all permutations. That sets them up to think that an O(n!) algorithm might actually be expected.
I would phrase it as:
Given a list of all possible permutations of n delivery locations, where each permutation of deliveries (d1, d2, ..., dn) has a cost defined by: Return the permutation P such that the cost of P is less than or equal to that of any other permutation.
All that really needs to be done is read in the first permutation and sort it.
If they construct a single loop to compare the costs ask them what the big-o runtime of their algorithm is where n is the number of delivery locations (Another trap).
This isn't a direct answer, but I think more clarification is needed.
Is di allowed to be negative? If so, sorting alone is not enough, as far as I can see.
For example:
d0 = 0
deliveries = (-1,1,1,2)
It seems the optimal path in this case would be 1 > 2 > 1 > -1.
Edit: This might not actually be the optimal path, but it illustrates the point.
You could rephrase it, having first found the optimal solution, as:
"Give me a proof that the following combination is the most optimal for the following set of rules, where optimal means the smallest number results from the sum of all stage costs, taking into account that all stages (A..Z) need to be present once and once only.
Combination:
A->C->D->Y->P->...->N
Stage costs:
A->B = 5,
B->A = 3,
A->C = 2,
C->A = 4,
...
...
...
Y->Z = 7,
Z->Y = 24."
That ought to keep someone busy for a while.
This reminds me of the knapsack problem more than the travelling salesman. But the knapsack is also an NP-hard problem, so you might be able to fool people into thinking up an over-complex solution using dynamic programming if they correlate your problem with the knapsack. The basic problem there is:
can a value of at least V be achieved without exceeding the weight W?
Now, the problem is that a fairly good solution can be found when the values (your distances) are distinct, as such:
The knapsack problem with each type of item j having a distinct value per unit of weight (vj = pj/wj) is considered one of the easiest NP-complete problems. Indeed, empirical complexity is of the order of O((log n)^2) and very large problems can be solved very quickly, e.g. in 2003 the average time required to solve instances with n = 10,000 was below 14 milliseconds using commodity personal computers.
So you might want to state that several stops/packages might share the same vj, inviting people to think about the really hard solution to:
However, in the degenerate case of multiple items sharing the same value vj it becomes much more difficult, with the extreme case where vj = constant being the subset sum problem with a complexity of O(2^(N/2) * N).
So if you replace value per unit of weight with value per unit of distance, and state that several distances might actually share the same values (the degenerate case), some folks might fall into this trap.
Isn't this just the (NP-Hard) Travelling Salesman Problem? It doesn't seem likely that you're going to make it much harder.
Maybe phrasing the problem so that the actual algorithm is unclear - e.g. by describing the paths as single-rail railway lines so the person would have to infer from domain knowledge that backtracking is more costly.
What about describing the question in such a way that someone is tempted to do recursive comparisons - e.g. "can you speed up the algorithm by using the optimum max subset of your best (so far) results"?
BTW, what's the purpose of this - it sounds like the intent is to torture interviewees.
You need to be clearer on whether the delivery truck has to return to base (making it a round trip), or not. If the truck does return, then a simple sort does not produce the shortest route, because the square of the return from the furthest point to base costs so much. Missing some hops on the way 'out' and using them on the way back turns out to be cheaper.
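A quick way to check that round-trip claim (illustrative sketch, same cost rules as the question, with the return to base appended):

from itertools import permutations

def round_trip_cost(order, base=0):
    total, prev, prev_dir = 0, base, None
    for d in list(order) + [base]:             # come back to the base station
        seg = d - prev
        direction = 1 if seg > 0 else (-1 if seg < 0 else prev_dir)
        penalty = 3 if prev_dir is not None and direction != prev_dir else 1
        total += penalty * seg * seg
        prev, prev_dir = d, direction
    return total

D = [1, 2, 3]
best = min(permutations(D), key=round_trip_cost)
print(best, round_trip_cost(best), round_trip_cost(sorted(D)))
# (1, 3, 2) 12 30 -- skipping 2 on the way out and hitting it on the way back wins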
If you trick someone into a bad answer (for example, by not giving them all the information) then is it their foolishness or your deception that has caused it?
How great is the wisdom of the wise, if they heed not their ego's lies?
