I have seen in many books that the worst-case memory requirement for graphs is O(V). But, if I am not mistaken, graphs are usually represented as an adjacency matrix and not by creating node objects (as in linked lists / trees). So, for a graph containing 5 vertices, I need a 5x5 matrix, which is O(V^2). Then why do they say it is O(V)?
Am I missing something somewhere? Sorry if the question is too naive.
The three main ways of representing a graph are:
Adjacency matrix - Θ(|V|²) space.
Adjacency list - Θ(|V| + |E|) space.
Collection of node objects/structs with pointers to one another - This is basically just another way of representing an adjacency list. Θ(|V| + |E|). (Remember that pointers require memory too.)
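For concreteness, here is a minimal Java sketch of the first two representations (all names are illustrative, vertices are assumed to be numbered 0..V-1):

import java.util.ArrayList;
import java.util.List;

// Illustrative only: two ways to store an undirected graph with V vertices.
class GraphStorage {
    // Adjacency matrix: always Θ(V^2) booleans, no matter how many edges exist.
    boolean[][] matrix;

    // Adjacency list: one list per vertex; total entries are Θ(V + E),
    // since each undirected edge appears in exactly two lists.
    List<List<Integer>> lists;

    GraphStorage(int v) {
        matrix = new boolean[v][v];
        lists = new ArrayList<>();
        for (int i = 0; i < v; i++) lists.add(new ArrayList<>());
    }

    void addUndirectedEdge(int a, int b) {
        matrix[a][b] = matrix[b][a] = true;
        lists.get(a).add(b);
        lists.get(b).add(a);
    }
}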
Since we're talking worst case, all of these reduce to Θ(|V|²), since Θ(|V|²) is the maximum number of edges a graph can have.
I'm guessing you misread the book. They probably weren't talking about the space required to store the graph structure itself, but rather the amount of extra space required for some graph algorithm.
If what you say is true, they are possibly referring to a representation other than an adjacency matrix, combined with an edge-density assumption. One option is, for each vertex, to just store a list of pointers / references to its neighbors (called an adjacency list). This is O(|V| + |E|). If we assume |E| ~ |V|, an assumption we do sometimes see, then we have O(|V|) space. But note that in the worst case |E| ~ |V|^2, and so even this way of representing a graph is O(|V|^2) in the worst case.
Look, it's quite simple: there's no escaping the fact that in the worst case |E| ~ |V|^2. In general, there cannot possibly be a representation of E that uses less than Θ(|V|^2) space in the worst case.
But, it'd be nice to have an exact quote to work with. This is important. We don't want to find ourselves tearing apart your misunderstanding of a correct statement.
Related
I read here that for an undirected graph the space complexity is O(V + E) when it is represented as an adjacency list, where V and E are the numbers of vertices and edges respectively.
My analysis is: for a completely connected graph, each entry of the list will contain |V|-1 nodes, and we have a total of |V| vertices, hence the space complexity seems to be O(|V| * (|V|-1)), which is O(|V|^2). What am I missing here?
Your analysis is correct for a completely connected graph. However, note that for a completely connected graph the number of edges E is itself O(V^2), so the notation O(V+E) for the space complexity is still correct too.
However, the real advantage of adjacency lists is that they save space for graphs that are not densely connected. If the number of edges is much smaller than V^2, then adjacency lists take O(V+E), not O(V^2), space.
Note that when you talk about O-notation, you usually have three kinds of variables (or, well, input data in general). First are the variables whose influence you are studying; second are the variables that are considered constant; and third are a kind of "free" variable, which you usually assume takes its worst-case value. For example, if you talk about sorting an array of N integers, you usually want to study how the sorting time depends on N, so N is of the first kind. You usually consider the size of the integers to be constant (that is, you assume that a comparison is done in O(1), etc.), and you usually treat the particular array elements as "free", that is, you study the runtime for the worst possible combination of particular array elements. However, you might want to study the same algorithm from a different point of view, and that will lead to a different expression of complexity.
For graph algorithms, you can, of course, consider the number of vertices V to be of the first kind and the number of edges E to be of the third kind, and study the space complexity for a given V with the worst-case number of edges. Then you indeed get O(V^2). But it is also often useful to treat both V and E as variables of the first kind, which gives the complexity expression O(V+E).
The size of the array is |V| (where |V| is the number of nodes). Each of these |V| lists has length equal to the degree of its vertex, denoted deg(v). Adding all of those up and applying the Handshaking Lemma gives
∑ deg(v) = 2|E|.
So you have |V| references (to the |V| lists) plus the number of entries in the lists, which never exceeds 2|E|. Therefore, the worst-case space (storage) complexity of an adjacency list is O(|V| + 2|E|) = O(|V| + |E|).
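As a quick sanity check of that count, here is a small hedged sketch (the vertex numbering and names are just for illustration) that builds an undirected adjacency list and confirms the total number of stored entries is exactly 2|E|:

import java.util.ArrayList;
import java.util.List;

class HandshakeCheck {
    public static void main(String[] args) {
        int v = 5;
        int[][] edges = { {0, 1}, {0, 2}, {1, 2}, {3, 4} };   // |E| = 4

        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < v; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) {                  // each edge goes into two lists
            adj.get(e[0]).add(e[1]);
            adj.get(e[1]).add(e[0]);
        }

        int stored = 0;                          // this is Σ deg(v)
        for (List<Integer> list : adj) stored += list.size();
        System.out.println(stored == 2 * edges.length);       // prints true: Σ deg(v) = 2|E|
    }
}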
Hope this helps
According to your logic, total space = O(V^2 - V), since the total number of connections is E = V^2 - V, so space = O(E).

I used to think the same. But suppose the total number of connections E is much smaller than the number of vertices V: say, out of 10 people (V = 10), only 2 know each other (E = 2). Then, by that logic, space = O(E) = O(2), but in reality we have to allocate much more space, namely space = O(V + E) = O(V + 2) = O(V). That's why we write space = O(V + E): it is correct whether V > E or E > V.
I'm doing interview prep and reviewing graph implementations. The big ones I keep seeing are adjacency lists and adjacency matrices. When we consider the runtime of basic operations, why do I never see data structures with hashing used?
In Java, for instance, an adjacency list is typically ArrayList<LinkedList<Node>>, but why don't people use HashMap<Node, HashSet<Node>>?
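For concreteness, I mean something roughly like this (just a sketch; Node is assumed to have sensible hashCode/equals):

import java.util.*;

// Rough sketch of an "adjacency set" for an undirected graph.
class AdjacencySetGraph<Node> {
    private final Map<Node, Set<Node>> adj = new HashMap<>();

    void addNode(Node v) { adj.putIfAbsent(v, new HashSet<>()); }

    void addEdge(Node u, Node v) {
        addNode(u);
        addNode(v);
        adj.get(u).add(v);
        adj.get(v).add(u);
    }

    void removeEdge(Node u, Node v) {          // expected O(1)
        Set<Node> su = adj.get(u), sv = adj.get(v);
        if (su != null) su.remove(v);
        if (sv != null) sv.remove(u);
    }

    void removeNode(Node v) {                  // expected O(n): one O(1) removal per neighbour
        Set<Node> neighbours = adj.remove(v);
        if (neighbours != null)
            for (Node u : neighbours) adj.get(u).remove(v);
    }
}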
Let n = number of nodes and m = number of edges.
In both implementations, removing a node v involves searching through all of the collections and removing v. In the adjacency list, that's O(n^2), but in the "adjacency set", it's O(n). Likewise, removing an edge involves removing node u from v's list and node v from u's list. In the adjacency list, that's O(n), while in the adjacency set, it's O(1). Other operations, such as finding a node's successors or finding whether there exists a path between two nodes, are the same with both implementations. The space complexities are also both O(n + m).
The only downside to the adjacency set I can think of is that adding nodes/edges is amortized O(1), while doing so in the adjacency list is truly O(1).
Perhaps I'm not seeing something or I forgot to consider things when calculating the runtimes, so please let me know.
In the same vein of thought as DavidEisenstat's answer, graph implementations vary a lot. That's one of the things that doesn't come across well in lecture. There are two conceptual designs:
1) Adjacency list
2) Adjacency matrix
But you can easily augment either design to gain properties like faster insertion/removal/search. The price is often just storing extra data! Consider implementing a relatively simple graph algorithm (like... Euler's) and see how your choice of graph implementation has a huge effect on the run-time complexity.
To make my point a bit clearer, I'm saying that an "adjacency list" doesn't really require you to use a LinkedList. For instance, Wikipedia cites this on its page:
An implementation suggested by Guido van Rossum uses a hash table to associate each vertex in a graph with an array of adjacent vertices. In this representation, a vertex may be represented by any hashable object. There is no explicit representation of edges as objects.
We probably don't usually see this representation because checking if an arbitrary edge is in a graph is rarely needed (I can't think of any everyday graph algorithm that relies on that), and where it is needed, we can use just one hash map for the whole graph, storing pairs (v1, v2) to represent the edges. This seems more efficient.
(Most of the common graph algorithms say something like "for every neighbour of vertex v, do ...", and then an adjacency list is perfect.)
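For illustration, a hedged sketch of that single-hash-table-of-edges idea (the Edge record and the normalisation are my own assumptions, with int vertices for simplicity):

import java.util.*;

// Sketch: store the edge set itself in one hash set, so "is (u, v) an edge?"
// becomes an expected O(1) lookup.
class EdgeSetGraph {
    record Edge(int u, int v) {}

    private final Set<Edge> edges = new HashSet<>();

    // Normalise so an undirected edge is stored once, regardless of order.
    private static Edge key(int a, int b) {
        return a <= b ? new Edge(a, b) : new Edge(b, a);
    }

    void addEdge(int a, int b)      { edges.add(key(a, b)); }
    boolean hasEdge(int a, int b)   { return edges.contains(key(a, b)); }
}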
why don't people use HashMap<Node, HashSet<Node>>?
Unless there are multiple graphs on the same set of nodes, the HashMap can be replaced by a member variable of Node.
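Roughly like this, for example (the field and method names are just illustrative):

import java.util.HashSet;
import java.util.Set;

// Sketch: when there is only one graph over these nodes, each Node can simply
// own its adjacency set instead of living as a key in an external HashMap.
class Node {
    final Set<Node> neighbours = new HashSet<>();

    void connect(Node other) {        // undirected edge
        neighbours.add(other);
        other.neighbours.add(this);
    }
}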
The question of HashSet versus LinkedList is more interesting. I would guess that, for sparse graphs, LinkedList would be more efficient both in time (for operations of equivalent asymptotic complexity) and in space. I don't have much experience with either representation, because depending on the algorithm's requirements I usually prefer either (i) to store the adjacency lists as consecutive subarrays, or (ii) to have for each edge an explicit object or pair of objects that stores information about the edge (e.g., its weight) and participates in two circular doubly linked lists (my own implementation, because the Java and C++ standard libraries do not support intrusive data structures), which makes node deletion proportional to the degree of the node and edge deletion O(1).
The running times you quote for the hashes are not worst-case, only high-probability against an oblivious adversary, though they can be unamortized at the cost of further degrading the constant factors.
Many theory problems involve a fixed set of vertices and edges - there's no removal.
Many / most graph algorithms involve either simply iterating through all edges in the adjacency list or something more complex (for which an additional data structure is required).
Given the above, you get all of the advantages of an array (e.g. O(1) random access, space efficient) to represent vertices with none of the disadvantages (e.g. fixed size, O(n) search / index insert / remove), and all the advantages of a linked-list (e.g. O(1) insert, space efficient for unknown number of elements) to represent edges with none of the disadvantages (O(n) search / random access).
But... what about hashing?
Sure, hashing has comparable efficiency for the required operations, but the constant factors are worse and there's some unpredictability, since the performance depends on a good hash function and well-distributed data.
That's not to say you should never use hashing: if your problem calls for it, go for it.
Check if 2 tree nodes are related (i.e. ancestor-descendant)
solve it in O(1) time, with O(N) space (N = # of nodes)
pre-processing is allowed
That's it. My solution (approach) follows below. Please stop here if you want to think about it yourself first.
For the pre-processing I decided to do a pre-order traversal (recursively visit the root first, then the children) and give a label to each node.
Let me explain the labels in detail. Each label consists of comma-separated natural numbers like "1,2,1,4,5"; the length of this sequence equals (the depth of the node + 1). E.g. the label of the root is "1", the root's children have labels "1,1", "1,2", "1,3", etc. Next-level nodes have labels like "1,1,1", "1,1,2", ..., "1,2,1", "1,2,2", ...
Assume that "the order number" of a node is the "1-based index of this node" in the children list of its parent.
Common rule: node's label consists of its parent label followed by comma and "the order number" of the node.
Thus, to answer if two nodes are related (i.e. ancestor-descendant) in O(1), I'll be checking if the label of one of them is "a prefix" of the other's label. Though I'm not sure if such labels can be considered to occupy O(N) space.
Any criticism with fixes, or an alternative approach, is welcome.
You can do it in O(n) preprocessing time, and O(n) space, with O(1) query time, if you store the preorder number and postorder number for each vertex and use this fact:
For two given nodes x and y of a tree T, x is an ancestor of y if and only if x occurs before y in the preorder traversal of T and after y in the post-order traversal.
(From this page: http://www.cs.arizona.edu/xiss/numbering.htm)
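In case it helps, here is a hedged sketch of that numbering (iterative DFS to avoid deep recursion; the child-list input format and all names are just assumptions of this sketch):

import java.util.*;

class AncestorQuery {
    private final int[] pre, post;

    AncestorQuery(List<List<Integer>> children, int root) {
        int n = children.size();
        pre = new int[n];
        post = new int[n];
        int[] counters = {0, 0};                     // {next preorder, next postorder}
        Deque<int[]> stack = new ArrayDeque<>();     // {node, next child index}
        stack.push(new int[]{root, 0});
        pre[root] = counters[0]++;
        while (!stack.isEmpty()) {
            int[] top = stack.peek();
            if (top[1] < children.get(top[0]).size()) {
                int child = children.get(top[0]).get(top[1]++);
                pre[child] = counters[0]++;          // numbered when first entered
                stack.push(new int[]{child, 0});
            } else {
                post[top[0]] = counters[1]++;        // numbered when fully processed
                stack.pop();
            }
        }
    }

    // x is a proper ancestor of y iff x is entered before y and left after y.
    boolean isAncestor(int x, int y) {
        return pre[x] < pre[y] && post[x] > post[y];
    }
}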
What you did in the worst case is Theta(d) where d is the depth of the higher node, and so is not O(1). Space is also not O(n).
If you consider a tree where a node at depth n/2 has n/2 children (say), the total size of the labels, and hence the running time of setting them, will be as high as O(n*n). So this labeling scheme won't work ....
There are linear-time lowest common ancestor algorithms (at least offline). For instance, have a look here. You can also have a look at Tarjan's offline LCA algorithm. Please note that these articles require that you know in advance the pairs for which you will be performing the LCA. I think there are also online algorithms with linear precomputation time, but they are very complex. For instance, there is a linear-precomputation-time algorithm for the range minimum query problem. As far as I remember, that solution passed through the LCA problem twice. The problem with the algorithm is that it has such a large constant that it requires enormous input to actually be faster than the O(n*log(n)) algorithm.
There is a much simpler approach that requires O(n*log(n)) additional memory and again answers queries in constant time.
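For reference, a hedged sketch of that simpler approach: take an Euler tour of the tree and answer LCA queries with a sparse table of range minima over the depths (an O(n log n) table, O(1) per query). The child-list input format and all names are assumptions of this sketch.

import java.util.*;

class EulerTourLCA {
    private final int[] first;       // first position of each node in the tour
    private final int[] tourNode;    // node at each tour position
    private final int[] tourDepth;   // depth at each tour position
    private final int[][] argmin;    // argmin[k][i] = tour index of the shallowest
                                     // node in tour[i .. i + 2^k)

    EulerTourLCA(List<List<Integer>> children, int root) {
        int n = children.size();
        first = new int[n];
        List<Integer> nodes = new ArrayList<>(), depths = new ArrayList<>();

        // Iterative DFS: a node is recorded on entry and again after each child,
        // so the tour has 2n - 1 entries.
        Deque<int[]> stack = new ArrayDeque<>();     // {node, depth, next child index}
        stack.push(new int[]{root, 0, 0});
        while (!stack.isEmpty()) {
            int[] top = stack.peek();
            if (top[2] == 0) first[top[0]] = nodes.size();
            nodes.add(top[0]);
            depths.add(top[1]);
            if (top[2] < children.get(top[0]).size()) {
                int child = children.get(top[0]).get(top[2]++);
                stack.push(new int[]{child, top[1] + 1, 0});
            } else {
                stack.pop();
            }
        }

        int m = nodes.size();
        tourNode = new int[m];
        tourDepth = new int[m];
        for (int i = 0; i < m; i++) {
            tourNode[i] = nodes.get(i);
            tourDepth[i] = depths.get(i);
        }

        // Sparse table of argmins over depths: O(m log m) = O(n log n) memory.
        int levels = 1;
        while ((1 << levels) <= m) levels++;
        argmin = new int[levels][m];
        for (int i = 0; i < m; i++) argmin[0][i] = i;
        for (int k = 1; k < levels; k++)
            for (int i = 0; i + (1 << k) <= m; i++) {
                int a = argmin[k - 1][i], b = argmin[k - 1][i + (1 << (k - 1))];
                argmin[k][i] = tourDepth[a] <= tourDepth[b] ? a : b;
            }
    }

    // LCA(u, v) = shallowest node between the first occurrences of u and v in the tour.
    int lca(int u, int v) {
        int lo = Math.min(first[u], first[v]), hi = Math.max(first[u], first[v]);
        int k = 31 - Integer.numberOfLeadingZeros(hi - lo + 1);  // floor(log2(length))
        int a = argmin[k][lo], b = argmin[k][hi - (1 << k) + 1];
        return tourNode[tourDepth[a] <= tourDepth[b] ? a : b];
    }
}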
Hope this helps.
The original problem was discussed here: Algorithm to find special point k in O(n log n) time
Simply put, we have an algorithm that decides whether a set of points in the plane has a center of symmetry or not.
I wonder, is there a way to prove an Ω(n log n) lower bound for this problem? I guess we need to use this algorithm to solve a simpler problem, such as sorting, element uniqueness, or set uniqueness; then we could conclude that if element uniqueness, say, can be solved using this algorithm, the problem must require at least Ω(n log n).
It seems like the solution has something to do with element uniqueness, but I couldn't figure out a way to turn that into an instance of the center-of-symmetry problem.
Check this paper
The idea is that if we can reduce problem B to problem A, then A is at least as hard as B.
So if problem B has a lower bound of Ω(n log n), then problem A is guaranteed the same lower bound (assuming the reduction itself runs in o(n log n) time).
In the paper, the author picked the following relatively approachable problem to be B: given two sets of n real numbers, we wish to decide whether or not they are identical.
It's obvious that this introduced problem has an Ω(n log n) lower bound. Here's how the author reduced the introduced problem to our problem at hand (in the paper, A and B denote the two real sets):
First observe that your magical point k must be the centroid of the point set.
build a lookup data structure indexed by vector position (O(n log n))
calculate the centroid of the set of points (O(n))
for each point, calculate the vector position of its opposite about the centroid and check for its existence in the lookup structure (n * O(log n))
Appropriate lookup data structures can include basically anything that allows you to look something up efficiently by content, including balanced trees, oct-trees, hash tables, etc.
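For concreteness, here is a hedged sketch of that recipe (all names are illustrative). It assumes distinct integer coordinates and scales every point by n, so the candidate centre and each mirror position stay integral and can be looked up exactly; with a HashSet the lookups are expected O(1), while a balanced tree would give the O(n log n) bound above.

import java.util.*;

// Sketch only: decide whether a set of distinct integer points is centrally
// symmetric. Scaling by n keeps all arithmetic in longs (beware overflow for
// very large coordinates).
class CentralSymmetry {
    record Point(long x, long y) {}

    static boolean hasCentreOfSymmetry(List<Point> pts) {
        int n = pts.size();
        long sx = 0, sy = 0;
        for (Point p : pts) { sx += p.x(); sy += p.y(); }   // (sx, sy) = n * centroid

        Set<Point> scaled = new HashSet<>();                // lookup structure
        for (Point p : pts) scaled.add(new Point(n * p.x(), n * p.y()));

        for (Point p : scaled) {                            // mirror of p about the centroid
            Point mirror = new Point(2 * sx - p.x(), 2 * sy - p.y());
            if (!scaled.contains(mirror)) return false;
        }
        return true;
    }
}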
I was talking with a student the other day about the common complexity classes of algorithms, like O(n), O(n^k), O(n lg n), O(2^n), O(n!), etc. I was trying to come up with an example of a problem whose best known solution has a super-exponential runtime, such as O(2^(2^n)), but which is still decidable (e.g. not the halting problem!). The only example I know of is satisfiability of Presburger arithmetic, which I don't think any intro CS students would really understand or be able to relate to.
My question is whether there is a well-known problem whose best known solution has a runtime that is superexponential; at least ω(n!) or ω(n^n). I would really hope that there is some "reasonable" problem meeting this description, but I'm not aware of any.
Maximum Parsimony is the problem of finding an evolutionary tree connecting n DNA sequences (representing species) that requires the fewest single-nucleotide mutations. The n given sequences are constrained to appear at the leaves; the tree topology and the sequences at internal nodes are what we get to choose.
In more CS terms: We are given a bunch of length-k strings that must appear at the leaves of some tree, and we have to choose a tree, plus a length-k string for each internal node in the tree, so as to minimise the sum of Hamming distances across all edges.
When a fixed tree is also given, the optimal assignment of sequences to internal nodes can be determined very efficiently using the Fitch algorithm. But in the usual case, a tree is not given (i.e. we are asked to find the optimal tree), and this makes the problem NP-hard, meaning that every tree must in principle be tried. Even though an evolutionary tree has a root (representing the hypothetical ancestor), we only need to consider distinct unrooted trees, since the minimum number of mutations required is not affected by the position of the root. For n species there are 3 * 5 * 7 * ... * (2n-5) leaf-labelled unrooted binary trees. (There is just one such tree with 3 species, which has a single internal vertex and 3 edges; the 4th species can be inserted at any of the 3 edges to produce a distinct 5-edge tree; the 5th species can be inserted at any of these 5 edges, and so on -- this process generates all trees exactly once.) This is sometimes written (2n-5)!!, with !! meaning "double factorial".
In practice, branch and bound is used, and on most real datasets this manages to avoid evaluating most trees. But highly "non-treelike" random data requires all, or almost all (2n-5)!! trees to be examined -- since in this case many trees have nearly equal minimum mutation counts.
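To get a feel for how quickly (2n-5)!! grows, here is a small illustrative sketch (names and sample values are mine; the printed approximations are rounded):

import java.math.BigInteger;

// Illustrative only: count leaf-labelled unrooted binary trees on n >= 3 species,
// i.e. (2n-5)!! = 3 * 5 * 7 * ... * (2n-5).
class TreeCount {
    static BigInteger unrootedTrees(int n) {
        BigInteger count = BigInteger.ONE;
        for (int k = 3; k <= 2 * n - 5; k += 2)
            count = count.multiply(BigInteger.valueOf(k));
        return count;
    }

    public static void main(String[] args) {
        System.out.println(unrootedTrees(10));   // 2027025
        System.out.println(unrootedTrees(20));   // roughly 2.2e20
        System.out.println(unrootedTrees(50));   // roughly 2.8e74
    }
}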
Listing all permutations of a string of length n is n!; brute-force search for a Hamiltonian cycle is n!; minimum graph coloring; ....
Edit: the Ackermann function grows even faster. In fact, it is not bounded by any primitive recursive function.
A(x,y) = y+1 (if x = 0)
A(x,y) = A(x-1,1) (if y=0)
A(x,y) = A(x-1, A(x,y-1)) otherwise.
from wiki:
A(4, 3) = 2^(2^65536), ...
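A direct transcription of the recurrence above (only feasible for tiny arguments; both the values and the recursion depth explode almost immediately):

class Ackermann {
    // Naive transcription of the recurrence; only usable for very small x
    // (say x <= 3), since the result and the recursion depth blow up quickly.
    static long a(long x, long y) {
        if (x == 0) return y + 1;
        if (y == 0) return a(x - 1, 1);
        return a(x - 1, a(x, y - 1));
    }

    public static void main(String[] args) {
        System.out.println(a(2, 3));   // 9
        System.out.println(a(3, 3));   // 61
    }
}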
Do algorithms to compute real numbers to a certain precision count? The formula for the area of the Mandelbrot set converges extremely slowly; 10^118 terms for two digits, 10^1181 terms for three.
This is not a practical everyday problem, but it's a way to construct relatively straightforward problems of increasing complexity.
The Kolmogorov complexity K(x) is the size of the smallest program that outputs the string x on a predetermined universal computer U. It's easy to show that most strings cannot be compressed at all (since there are more strings of length n than programs of length less than n).
If we give U a maximum running time (say some polynomial function P), we get a time-bounded Kolmogorov complexity. The same counting argument holds: there are some strings that are incompressible under this time-bounded Kolmogorov complexity. Let's call the first such string (of some length n) x_P.
Since the time-bounded Kolmogorov complexity is computable, we can test all strings and find x_P.
Finding x_P can't be done in polynomial time, or we could use that algorithm to compress it, so finding it must be a super-polynomial problem. We do know we can find it in exp(P) time, though. (I'm jumping over some technical details here.)
So now we have a time bound E = exp(P). We can repeat the procedure to find x_E, and so on.
This approach gives us a decidable super-F problem for every time-constructible function F: find the first string of length n (some large constant) that is incompressible under time-bound F.