I have an algorithm that takes a DAG with n nodes and, for every node, does a binary search over that node's adjacency list. To the best of my knowledge this would be an O(n log n) algorithm; however, since the n inside the log corresponds only to a node's adjacency list, I was wondering whether it should instead be O(n log m), where m is the number of nodes adjacent to a given node (which would intuitively, and often in practice, be much smaller than n).
Why not O(n log m)? My own answer would be that O(n log m) doesn't make sense, because m is not technically the size of the input; n is. Besides, in the worst case m can be n, since a node could easily be connected to all the others. Correct?
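To make this concrete, here is roughly what I mean (a toy sketch; the adjacency lists, the targets, and the function name are made up for illustration, and each adjacency list is assumed to be kept sorted so it can be binary-searched):

```python
from bisect import bisect_left

def search_neighbors(adj, targets):
    """For each node u, binary-search its (sorted) adjacency list for a target value.

    adj: dict mapping node -> sorted list of neighbor ids
    targets: dict mapping node -> value to look for in adj[node]
    Returns the set of nodes whose target was found among their neighbors.
    """
    found = set()
    for u, neighbors in adj.items():          # n iterations
        t = targets[u]
        i = bisect_left(neighbors, t)         # O(log deg(u)) per node
        if i < len(neighbors) and neighbors[i] == t:
            found.add(u)
    return found                              # total work: sum over u of O(log deg(u))

# Example: node 0 is adjacent to 1, 2, 4; we ask whether 2 is among them.
adj = {0: [1, 2, 4], 1: [2, 3], 2: [3], 3: [4], 4: []}
targets = {0: 2, 1: 4, 2: 3, 3: 4, 4: 0}
print(search_neighbors(adj, targets))         # {0, 2, 3}
```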
There are two cases here:
m, the number of adjacent nodes, is bounded by a constant C, and
m, the number of adjacent nodes, is bounded only by n, the number of nodes.
In the first case the complexity is O(n), because log(C) is a constant. In the second case it's O(n log n), for the reason you explained in your question (i.e., "m can be n").
Big O notation provides an upper bound on an algorithm's complexity, so since m can be as large as n in the worst case (n - 1 to be precise), the correct complexity would be O(n log n).
There are certainly DAGs where one node is connected to every other node. Another example would be a DAG with nodes numbered 0, 1, 2, ..., n, where every node has an edge leading to every higher-numbered node.
There is precedent for giving a complexity estimate that depends on more than one parameter: http://en.wikipedia.org/wiki/Dijkstra%27s_algorithm quotes a cost of O(|E| + |V| log |V|). In some cases this might be useful information.
It is correct that in the worst case a node has n - 1 neighbours, meaning it is connected to every other node, but if that were true of every node then the graph would not be acyclic.
Therefore the average number of neighbours per node is less than n.
The maximum number of edges in a DAG is n(n-1)/2.
If we look at each node, it will have an average of (n-1)/2 neighbours.
So your complexity would still remain O(n log n) in the worst case.
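As a quick, purely illustrative sanity check of the n(n-1)/2 figure: the densest possible DAG orients an edge from every lower-numbered node to every higher-numbered one.

```python
def complete_dag_edges(n):
    """Edges of the densest possible DAG on nodes 0..n-1: i -> j whenever i < j."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]

n = 6
edges = complete_dag_edges(n)
print(len(edges), n * (n - 1) // 2)   # both print 15; average out-degree is (n - 1) / 2
```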
Related
Does Kruskal's algorithm have a lower bound? Since we sort the edges ...
Everywhere I see O(m log n).
Kruskal's algorithm proceeds in two stages:
Sort the edges by weight from lowest to highest.
Go through the edges in that order, adding each edge as long as it doesn't close a cycle.
The runtime cost of step (1) depends on what sorting algorithm is used. For example, if you use quicksort, then step (1) will take Ω(m log n) time and O(m^2) time. If you use mergesort, then step (1) will take Ω(m) time and O(m log n) time. If you use a radix sort, and the edge weights range from 0 to U, then step (1) will take Θ(m log U) time. But because this depends on the sorting algorithm used and the particulars of the data fed into the algorithm, we can't give a strong lower bound. (The best lower bound we could give would be Ω(m), since you have to process each edge at least once.)
The runtime cost of step (2) is O(mα(m, n)), where α(m, n) is the inverse Ackermann function, and there is a matching lower bound of Ω(mα(m, n)) here.
So overall the cost of Kruskal's algorithm is "the cost of sorting, plus Θ(mα(m, n))."
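If it helps to see the two stages in code, here is a minimal sketch (the edge format and names are my own, and the union-find uses only path halving for brevity):

```python
def kruskal(n, edges):
    """Minimum spanning forest by Kruskal's algorithm.

    n: number of nodes (labelled 0..n-1)
    edges: list of (weight, u, v) tuples
    Returns the list of edges chosen for the forest.
    """
    parent = list(range(n))

    def find(x):                      # path halving, a simple form of path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):     # stage (1): sort by weight, O(m log m)
        ru, rv = find(u), find(v)
        if ru != rv:                  # stage (2): keep the edge iff it joins two components
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

print(kruskal(4, [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 2, 3)]))
# [(1, 0, 1), (2, 2, 3), (3, 1, 2)]
```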
What is the difference in runtime complexity between the following, and why?
(1) DIJKSTRA's algorithm using regular Priority Queue (Heap)
(2) DIJKSTRA's algorithm using a doubly linked list
(Unless there isn't a difference)
The most general version of Dijkstra's algorithm assumes that you have access to some sort of priority queue structure that supports the following operations:
make-heap(s, n): build a heap of n nodes at initial distance ∞, except for the start node s, which has distance 0.
dequeue-min(): remove and return the element with the lowest priority.
decrease-key(obj, key): given an existing object obj in the priority queue, reduce its priority to the level given by key.
Dijkstra's algorithm requires one call to make-heap, O(n) calls to dequeue-min, and O(m) calls to decrease-key, where n is the number of nodes and m is the number of edges. The overall runtime can actually be given as O(Tm-h + nTdeq + mTd-k), where Tm-h, Tdeq, and Td-k are the average (amortized) costs of doing a make-heap, a dequeue, and a decrease-key, respectively.
Now, let's suppose that your priority queue is a doubly-linked list. There are actually several ways you could use a doubly-linked list as a priority queue: you could keep the nodes sorted by distance, or you could keep them in unsorted order. Let's consider each of these.
In a sorted doubly-linked list, the cost of doing a make-heap is O(n): just insert the start node followed by n - 1 other nodes at distance infinity. The cost of doing a dequeue-min is O(1): just delete the first element. However, the cost of doing a decrease-key is O(n), since if you need to change a node's priority, you may have to move it, and you can't find where to move it without (in the worst case) doing a linear scan over the nodes. This means that the runtime will be O(n + n + nm) = O(mn).
In an unsorted doubly-linked list, the cost of doing a make-heap is still O(n) because you need to create n different nodes. The cost of a dequeue-min is now O(n) because you have to do a linear scan over all the nodes in the list to find the minimum value. However, the cost of a decrease-key is now O(1), since you can just update the node's key in-place. This means that the runtime is O(n + n^2 + m) = O(n^2 + m) = O(n^2), since the number of edges is never more than O(n^2). This is an improvement from before.
With a binary heap, the cost of doing a make-heap is O(n) if you use the standard linear-time heapify algorithm. The cost of doing a dequeue is O(log n), and the cost of doing a decrease-key is O(log n) as well (just bubble the element up until it's in the right place). This means that the runtime of Dijkstra's algorithm with a binary heap is O(n + n log n + m log n) = O(m log n), since if the graph is connected we'll have that m ≥ n.
You can do even better with a Fibonacci heap, in an asymptotic sense. It's a specialized priority queue invented specifically to make Dijkstra's algorithm fast. It can do a make-heap in time O(n), a dequeue-min in time O(log n), and a decrease-key in (amortized) O(1) time. This makes the runtime of Dijkstra's algorithm O(n + n log n + m) = O(m + n log n), though in practice the constant factors make Fibonacci heaps slower than binary heaps.
So there you have it! The different priority queues really do make a difference. It's interesting to see how "Dijkstra's algorithm" is more of a family of algorithms than a single algorithm, since the choice of data structure is so critical to the algorithm running quickly.
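If you want something runnable, here is a minimal sketch of the binary-heap version in Python (illustrative only). Python's heapq has no decrease-key, so this variant pushes duplicate entries and skips stale ones instead; that keeps the O(m log n) bound but is not literally the decrease-key formulation described above.

```python
import heapq

def dijkstra(adj, source):
    """Shortest-path distances from source.

    adj: dict mapping node -> list of (neighbor, weight) pairs
    Uses a binary heap; instead of decrease-key, stale heap entries are skipped.
    """
    dist = {u: float('inf') for u in adj}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)        # dequeue-min: O(log n)
        if d > dist[u]:                   # stale entry; u was already finalized with a smaller distance
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))   # stands in for decrease-key: O(log n)
    return dist

adj = {'a': [('b', 2), ('c', 5)], 'b': [('c', 1)], 'c': []}
print(dijkstra(adj, 'a'))   # {'a': 0, 'b': 2, 'c': 3}
```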
So I'm teaching myself some graph algorithms, now on Kruskal's, and understand that it's recommended to use union-find so checking whether adding an edge creates a cycle only takes O(log V) time. For practical purposes, I see why you'd want to, but strictly looking through Big O notation, does doing so actually affect the worst-case complexity?
My reasoning: If instead of union-find we did a DFS to check for cycles, the runtime for that would be O(E + V), and you have to perform that V times for a runtime of O(V^2 + VE). It's more than with union-find, which would be O(V log V), but the bulk of the complexity of Kruskal's comes from deleting the minimum element of the priority queue E times, which is O(E log E), the Big O answer. I don't really see a space advantage either, since the union-find takes O(V) space and so too do the data structures you need to maintain to find a cycle using DFS.
So a probably overly long explanation for a simple question: Does using union-find in Kruskal's algorithm actually affect worst-case runtime?
and understand that it's recommended to use union-find so checking whether adding an edge creates a cycle only takes O(Log V) time
This isn't right. Using union-find, m operations take O(alpha(n) * m) time in total, where alpha(n) is the inverse of the Ackermann function, which, for all intents and purposes, can be considered constant. So it's much faster than logarithmic:
Since alpha(n) is the inverse of this function, alpha(n) is less than 5 for all remotely practical values of n. Thus, the amortized running time per operation is effectively a small constant.
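For concreteness, here is a standard union-find sketch with both union by rank and path compression, the combination the quoted bound refers to (illustrative code, not from your question):

```python
class DisjointSet:
    """Union-find with union by rank and path compression."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Path compression: point every node on the path at the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False                 # same component: this edge would close a cycle
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx             # union by rank: attach the shallower tree
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

ds = DisjointSet(4)
print(ds.union(0, 1), ds.union(1, 2), ds.union(0, 2))   # True True False
```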
but the bulk of the complexity of Kruskal's comes from deleting the minimum element of the priority queue E times
This is also wrong. Kruskal's algorithm does not involve using any priority queues; it sorts the edges by cost at the beginning. The complexity of that step is still the one you mention, but sorting might be faster in practice than a priority queue (using a priority queue will, at best, be equivalent to a heap sort, which is not the fastest sorting algorithm).
Bottom line, if m is the number of edges and n the number of nodes:
Sorting the edges: O(m log m).
For each edge, calling union-find: O(m * alpha(n)), or basically just O(m).
Total complexity: O(m log m + m * alpha(n)).
If you don't use union-find, the total complexity will be O(m log m + m * (n + m)), if we use your O(n + m) cycle-finding algorithm. And O(n + m) for this step is probably an understatement, since you must also somehow update your structure when an edge is inserted. The naive disjoint-set algorithm is actually O(n log n), so even worse.
Note: in this case, you can write log n instead of log m if you prefer, because m = O(n^2) and log(n^2) = 2log n.
In conclusion: yes, union-find helps a lot.
Even if you use the O(log n) variant of union-find, which would lead to O(m log m + m log n) total complexity (which you could simplify to O(m log m)), in practice you'd still rather keep the second part as fast as you can. Since union-find is very easy to implement, there's really no reason not to use it.
The background
According to Wikipedia and other sources I've found, building a binary heap of n elements by starting with an empty binary heap and inserting the n elements into it is O(n log n), since binary heap insertion is O(log n) and you're doing it n times. Let's call this the insertion algorithm.
It also presents an alternate approach in which you sink/trickle down/percolate down/cascade down/heapify down/bubble down the first/top half of the elements, starting with the middle element and ending with the first element, and that this is O(n), a much better complexity. The proof of this complexity rests on the insight that the sink complexity for each element depends on its height in the binary heap: if it's near the bottom, it will be small, maybe zero; if it's near the top, it can be large, maybe log n. The point is that the complexity isn't log n for every element sunk in this process, so the overall complexity is much less than O(n log n), and is in fact O(n). Let's call this the sink algorithm.
The question
Why isn't the complexity for the insertion algorithm the same as that of the sink algorithm, for the same reasons?
Consider the actual work done for the first few elements in the insertion algorithm. The cost of the first insertion isn't log n, it's zero, because the binary heap is empty! The cost of the second insertion is at worst one swap, and the cost of the fourth is at worst two swaps, and so on. The actual complexity of inserting an element depends on the current depth of the binary heap, so the complexity for most insertions is less than O(log n). The insertion cost doesn't even technically reach O(log n) until after all n elements have been inserted [it's O(log (n - 1)) for the last element]!
These savings sound just like the savings gotten by the sink algorithm, so why aren't they counted the same for both algorithms?
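To make the comparison concrete, here is a toy experiment (illustrative only, not part of the argument) that counts swaps for both build strategies on a reverse-sorted input, which is close to worst case for building a min-heap:

```python
def build_by_insertion(values):
    """Build a min-heap by repeated insertion; return (heap, number of swaps)."""
    heap, swaps = [], 0
    for v in values:
        heap.append(v)
        i = len(heap) - 1
        while i > 0 and heap[(i - 1) // 2] > heap[i]:    # sift up
            heap[(i - 1) // 2], heap[i] = heap[i], heap[(i - 1) // 2]
            i = (i - 1) // 2
            swaps += 1
    return heap, swaps

def build_by_sinking(values):
    """Build a min-heap by sinking the first half (heapify); return (heap, number of swaps)."""
    heap, swaps, n = list(values), 0, len(values)
    for i in range(n // 2 - 1, -1, -1):                  # only the non-leaf nodes get sunk
        while True:                                      # sift down
            smallest, left, right = i, 2 * i + 1, 2 * i + 2
            if left < n and heap[left] < heap[smallest]:
                smallest = left
            if right < n and heap[right] < heap[smallest]:
                smallest = right
            if smallest == i:
                break
            heap[i], heap[smallest] = heap[smallest], heap[i]
            i = smallest
            swaps += 1
    return heap, swaps

worst = list(range(2 ** 12, 0, -1))       # reverse-sorted input, n = 4096
print(build_by_insertion(worst)[1])       # on the order of n log n swaps (tens of thousands)
print(build_by_sinking(worst)[1])         # on the order of n swaps (a few thousand)
```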
Actually, when n = 2^x - 1 (the lowest level is full), n/2 elements may require log(n) swaps each in the insertion algorithm (to end up as leaf nodes). So you'll need (n/2) * log(n) swaps for the leaves alone, which already makes it O(n log n).
In the other algorithm, only one element needs log(n) swaps, 2 need log(n) - 1 swaps, 4 need log(n) - 2 swaps, etc. Wikipedia gives a proof that the resulting series converges to a constant rather than a logarithm.
The intuition is that the sink algorithm moves only a few things (those in the small layers at the top of the heap/tree) distance log(n), while the insertion algorithm moves many things (those in the big layers at the bottom of the heap) distance log(n).
The intuition for why the sink algorithm can get away with this is that the insertion algorithm is also meeting an additional (nice) requirement: if we stop the insertion at any point, the partially formed heap has to be (and is) a valid heap. For the sink algorithm, all we get is a weird malformed bottom portion of a heap. Sort of like a pine tree with the top cut off.
Also, you can make this precise with summations, but it's best to think asymptotically about what happens when inserting, say, the last half of the elements of an arbitrarily large set of size n.
While it's true that log(n-1) is less than log(n), it's not smaller by enough to make a difference.
Mathematically: the worst-case cost of inserting the i'th element is ceil(log i). Therefore the worst-case cost of inserting elements 1 through n is sum(i = 1..n, ceil(log i)) ≥ sum(i = 1..n, log i) = log 1 + log 2 + ... + log n = log(1 × 2 × ... × n) = log(n!) = Θ(n log n).
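To fill in the step that is easy to gloss over (why log(n!) really is of order n log n): the largest n/2 factors of n! are each at least n/2, so

$$\sum_{i=1}^{n} \lceil \log_2 i \rceil \;\ge\; \sum_{i=1}^{n} \log_2 i \;=\; \log_2(n!) \;\ge\; \log_2\!\left(\left(\tfrac{n}{2}\right)^{n/2}\right) \;=\; \tfrac{n}{2}\log_2\tfrac{n}{2} \;=\; \Omega(n \log n).$$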
Ran into the same problem yesterday. I tried coming up with some form of proof to satisfy myself. Does this make any sense?
If you start inserting from the bottom, the leaves take constant time to insert: you just copy them into the array.
The worst case running time for a level above the leaves is:
k * (n/2^h) * h
where h is the height (leaves being 0, the top being log(n)) and k is a constant (just for good measure). So n/2^h is the number of nodes per level and h is the MAXIMUM number of 'sinking' operations per insert.
There are log(n) levels.
Hence, the total running time will be
Sum for h from 1 to log(n) of: n * k * (h/2^h)
which is k * n * SUM over h = 1..log(n) of (h/2^h).
The sum is a simple arithmetico-geometric progression, which comes out to 2.
So you get a running time of k * n * 2, which is O(n).
The running time per level isn't strictly what I said it was, but it is strictly less than that. Any pitfalls?
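For completeness, the sum in question, written out (a standard arithmetico-geometric identity):

$$\sum_{h=1}^{\infty} \frac{h}{2^h} \;=\; \frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \cdots \;=\; 2,$$

so the total number of swaps is at most k * n * 2 = O(n).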
I am comparing two algorithms, Prim's and Kruskal's.
I understand the basic concept of time complexity and when the two work best (sparse/dense graphs)
I found this on the Internet, but I am struggling to convert it to English.
dense graph: Prim = O(N^2)
Kruskal = O(N^2 * log(N))
sparse graph: Prim = O(N^2)
Kruskal = O(N * log(N))
It's a bit of a long shot, but could anyone explain what is going on here?
Prim is O(N^2), where N is the number of vertices.
Kruskal is O(E log E), where E is the number of edges. The "E log E" comes from a good algorithm sorting the edges. You can then process it in linear E time.
In a dense graph, E ~ N^2. So Kruskal would be O( N^2 log N^2 ), which is simply O( N^2 log N ).
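If it helps, here is a minimal sketch of the array-based O(N^2) Prim mentioned above, using an adjacency-matrix representation (all names are illustrative):

```python
INF = float('inf')

def prim(matrix):
    """O(N^2) Prim: matrix[u][v] is the edge weight, or INF if there is no edge.

    Returns the total weight of a minimum spanning tree (graph assumed connected).
    """
    n = len(matrix)
    in_tree = [False] * n
    best = [INF] * n          # cheapest known edge connecting each vertex to the tree
    best[0] = 0
    total = 0
    for _ in range(n):
        # O(N) scan for the closest vertex not yet in the tree (no priority queue needed)
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        total += best[u]
        for v in range(n):    # O(N) relaxation of u's row
            if not in_tree[v] and matrix[u][v] < best[v]:
                best[v] = matrix[u][v]
    return total

m = [[INF, 1, 4],
     [1, INF, 2],
     [4, 2, INF]]
print(prim(m))   # 3 (edges 0-1 and 1-2)
```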
OK, here goes. O(N^2) means that the running time of the algorithm for large N varies as the square of N, so twice the size of graph will result in four times the time to compute.
The Kruskal rows are merely simplified, and assume that E = c * N^2. c here is presumably a constant that we can assume to be significantly smaller than N as N gets large. You need to know the following laws of logarithms: log(a*b) = log a + log b and log(a^n) = n * log a. These two, combined with the fact that log c << log N (it is much smaller and can be ignored), should let you understand the simplifications there.
Now, as for the original expressions and where they were derived from, you'd need to check the page you got these from. But I'm assuming that if you're looking at Prim's and Kruskal's then you will be able to understand the derivation, or at least that if you can't my explaining it to you is not actually going to help you in the long run...
Kruskal is sensitive to the number of edges (E) in a graph, not the number of nodes.
Prim, however, is only affected by the number of nodes (N), evaluating to O(N^2).
This means that in dense graphs, where the number of edges approaches N^2 (all nodes connected), Kruskal's complexity of O(E * log(E)) is roughly equivalent to O(N^2 * log(N)).
The c is a constant that accounts for the 'almost' (E is close to, but not exactly, N^2) and is irrelevant in O notation. Also, log(N^2) is of the same order of magnitude as log(N), since the logarithm turns the square into a constant factor: log(N^2) = 2 * log(N), which in O notation is just O(log(N)).
In a sparse graph E is closer to N, giving you O(N * log(N)).
The thought is that in a dense graph, the number of edges is O(N^2) while in sparse graphs, the number of edges is O(N). So they're taking the O(E log E) and expanding it with this approximation of E in order to compare it directly to the running time of Prim's O(N^2).
Basically, it's showing that Kruskal's is better for sparse graphs and Prim's is better for dense graphs.
The two algorithms have their big-O expressed in terms of different parameters (nodes and edges), so they are converting one to the other in order to compare them.
N is the number of nodes in the graph; E is the number of edges.
For a dense graph there are O(N^2) edges,
and for a sparse graph there are O(N) edges.
Constants are of course irrelevant for big-O, hence the c drops out.
First: n is the number of vertices.
Prim is O(n^2); that part is easy enough.
Kruskal is O(E log E), where E is the number of edges. In a dense graph there are as many as n choose 2 edges, which is roughly n^2 (actually it's n(n-1)/2, but who's counting?). So it's roughly n^2 log(n^2), which is 2 n^2 log n, which is O(n^2 log n), which is bigger than O(n^2).
In a sparse graph, there are as few as n edges, so we have n log n which is less than O(n^2).