I'm doing interview prep and reviewing graph implementations. The big ones I keep seeing are adjacency lists and adjacency matrices. When we consider the runtime of basic operations, why do I never see hash-based data structures used?
In Java, for instance, an adjacency list is typically ArrayList<LinkedList<Node>>, but why don't people use HashMap<Node, HashSet<Node>>?
Let n = number of nodes and m = number of edges.
In both implementations, removing a node v involves searching through all of the collections and removing v. In the adjacency list, that's O(n^2); in the "adjacency set", it's O(n). Likewise, removing an edge involves removing node u from v's collection and node v from u's collection. In the adjacency list, that's O(n), while in the adjacency set it's O(1). Other operations, such as finding a node's successors or checking whether a path exists between two nodes, cost the same in both implementations. The space complexity is O(n + m) for either.
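For concreteness, here is a minimal sketch of the "adjacency set" I'm describing, assuming an undirected graph with both directions of each edge stored and a Node type with sensible equals/hashCode (the class and method names are just made up for illustration):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    final class AdjacencySetGraph<Node> {
        private final Map<Node, Set<Node>> adj = new HashMap<>();

        void addNode(Node v) { adj.putIfAbsent(v, new HashSet<>()); }

        void addEdge(Node u, Node v) {        // amortized expected O(1)
            addNode(u); addNode(v);
            adj.get(u).add(v);
            adj.get(v).add(u);
        }

        void removeEdge(Node u, Node v) {     // expected O(1): two hash deletions
            Set<Node> su = adj.get(u), sv = adj.get(v);
            if (su != null) su.remove(v);
            if (sv != null) sv.remove(u);
        }

        void removeNode(Node v) {             // touches only v's neighbours when both directions are stored
            Set<Node> neighbours = adj.remove(v);
            if (neighbours == null) return;
            for (Node u : neighbours) adj.get(u).remove(v);
        }

        Set<Node> successors(Node v) {        // iterate the collection, as with a list
            return adj.getOrDefault(v, Set.of());
        }
    }

With both directions stored, removing an edge is two expected-O(1) hash deletions, which is the O(1) claimed above.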
The only downside to the adjacency set I can think of is that adding nodes/edges is amortized O(1), while doing so in the adjacency list is truly O(1).
Perhaps I'm missing something or forgot to account for something when calculating the runtimes, so please let me know.
In the same vein of thought as DavidEisenstat's answer, graph implementations vary a lot. That's one of the things that doesn't come across well in lecture. There are two conceptual designs:
1) Adjacency list
2) Adjacency matrix
But you can easily augment either design to gain properties like faster insertion/removal/searches. The price is often just storing extra data! Consider implementing a relatively simple graph algorithm (like... Euler's) and see how your choice of graph implementation has a huge effect on the run-time complexity.
To make my point a bit clearer, I'm saying that an "adjacency list" doesn't really require you to use a LinkedList. For instance, Wikipedia cites this on its page:
An implementation suggested by Guido van Rossum uses a hash table to associate each vertex in a graph with an array of adjacent vertices. In this representation, a vertex may be represented by any hashable object. There is no explicit representation of edges as objects.
We probably don't usually see this representation because checking if an arbitrary edge is in a graph is rarely needed (I can't think of any everyday graph algorithm that relies on that), and where it is needed, we can use just one hash map for the whole graph, storing pairs (v1, v2) to represent the edges. This seems more efficient.
(Most of the common graph algorithms say something like "for every neighbour of vertex v, do ...", and then an adjacency list is perfect.)
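Where that edge-membership check is needed, a sketch of the single-set idea might look like this, assuming undirected edges over int vertex ids (the Edge record and class name are hypothetical):

    import java.util.HashSet;
    import java.util.Set;

    final class EdgeSet {
        record Edge(int v1, int v2) {}             // value semantics: equals/hashCode are generated

        private final Set<Edge> edges = new HashSet<>();

        void addEdge(int v1, int v2) { edges.add(canonical(v1, v2)); }

        boolean hasEdge(int v1, int v2) {          // expected O(1)
            return edges.contains(canonical(v1, v2));
        }

        private static Edge canonical(int v1, int v2) {   // one canonical order per undirected edge
            return v1 <= v2 ? new Edge(v1, v2) : new Edge(v2, v1);
        }
    }

This would sit alongside an ordinary adjacency list, which still serves the "for every neighbour of v" loops.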
why don't people use HashMap<Node, HashSet<Node>>?
Unless there are multiple graphs on the same set of nodes, the HashMap can be replaced by a member variable of Node.
The question of HashSet versus LinkedList is more interesting. I would guess that, for sparse graphs, LinkedList would be more efficient both in time (for operations of equivalent asymptotic complexity) and in space. I don't have much experience with either representation, because, depending on the algorithm's requirements, I usually prefer either (i) to store the adjacency lists as consecutive subarrays, or (ii) to give each edge an explicit object (or pair of objects) that stores information about the edge (e.g., weight) and participates in two circular doubly linked lists (my own implementation, because the Java and C++ standard libraries do not support intrusive data structures). The latter makes node deletion proportional to the degree of the node and edge deletion O(1).
The running times you quote for the hashes are not worst-case, only high-probability against an oblivious adversary, though they can be unamortized at the cost of further degrading the constant factors.
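As an illustration of option (ii) above, here is one possible Java sketch of such an intrusive structure (not the answerer's actual code, just an assumption of how it could look): each undirected edge is split into two half-edges, each threaded into a circular doubly linked list at its source node, so deleting an edge is O(1) and deleting a node costs its degree.

    import java.util.ArrayList;
    import java.util.List;

    final class Node {
        final String name;
        HalfEdge head;                    // circular list of outgoing half-edges (null if isolated)
        Node(String name) { this.name = name; }
    }

    final class HalfEdge {
        final Node from, to;
        HalfEdge twin;                    // the half-edge in the opposite direction
        HalfEdge prev, next;              // neighbours in from's circular list
        HalfEdge(Node from, Node to) { this.from = from; this.to = to; }
    }

    final class IntrusiveGraph {
        HalfEdge addEdge(Node u, Node v) {            // O(1)
            HalfEdge uv = new HalfEdge(u, v), vu = new HalfEdge(v, u);
            uv.twin = vu; vu.twin = uv;
            link(uv); link(vu);
            return uv;
        }

        void removeEdge(HalfEdge uv) {                // O(1)
            unlink(uv); unlink(uv.twin);
        }

        void removeNode(Node u) {                     // O(degree(u))
            while (u.head != null) removeEdge(u.head);
        }

        List<Node> neighbours(Node u) {               // O(degree(u))
            List<Node> out = new ArrayList<>();
            HalfEdge h = u.head;
            if (h == null) return out;
            do { out.add(h.to); h = h.next; } while (h != u.head);
            return out;
        }

        private static void link(HalfEdge h) {        // insert into h.from's circular list
            Node n = h.from;
            if (n.head == null) { n.head = h; h.prev = h; h.next = h; }
            else { h.next = n.head; h.prev = n.head.prev; h.prev.next = h; h.next.prev = h; }
        }

        private static void unlink(HalfEdge h) {      // remove from h.from's circular list
            Node n = h.from;
            if (h.next == h) { n.head = null; return; }
            h.prev.next = h.next; h.next.prev = h.prev;
            if (n.head == h) n.head = h.next;
        }
    }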
Many theory problems involve a fixed set of vertices and edges - there's no removal.
Many / most graph algorithms involve either simply iterating through all edges in the adjacency list or something more complex (for which an additional data structure is required).
Given the above, you get all of the advantages of an array (e.g. O(1) random access, space efficient) to represent vertices with none of the disadvantages (e.g. fixed size, O(n) search / index insert / remove), and all the advantages of a linked-list (e.g. O(1) insert, space efficient for unknown number of elements) to represent edges with none of the disadvantages (O(n) search / random access).
But... what about hashing?
Sure, hashing has comparable efficiency for the required operations, but the constant factors are worse, and there's some unpredictability, since performance depends on a good hash function and well-distributed data.
That's not to say you should never use hashing: if your problem calls for it, go for it.
Skiena's Algorithm Design Manual (3rd ed., p. 204) refers to adjacency lists as opposed to general adjacency representations, defining them as assigning to each vertex a a singly linked list L_a with underlying set set(L_a) = {b | (a, b) ∈ edges}.
I'm surprised that Skiena presents the singly linked list as the definitive data structure implementing the collections L_a. My impression is that linked lists are generally losing favor compared with arrays and hash tables, because:
They are not cache-friendly to iterate over (like arrays are), and the gap between processor speed and main memory access has become more important. (For instance this video (7m) by Stroustrup.)
They don't bring much to the table, particularly when order isn't important. The advantage of linked lists over arrays is that they admit constant-time add and delete. But in the case where we don't care about order, these can be constant-time operations on arrays as well, using "swap and pop" for deletes. A hash table would have the additional advantage of constant-time search. My understanding is that hash tables cost more memory than a linked list or an array, but this consideration has become relatively less important. (Perhaps this claim isn't meaningful in the absence of a specific application.)
Other sources treat adjacency lists differently. For instance Wikipedia presents an implementation where the L_a are arrays. And in Stone's Algorithms for Functional Programming the L_a are unordered sets, implemented ultimately as Scheme lists (which in turn struck me as strange).
My Question: Is there a consideration I'm missing which gives singly linked lists a significant advantage in adjacency representations?
I don't think there's any general agreement on singly-linked lists as the default representation of adjacency lists in most real-world use cases.
A singly-linked list is, however, pretty much the most restrictive implementation of an adjacency list you could have, so in a book about "Algorithm Design", it makes sense to think of adjacency lists in this representation unless you need something special from them, like random access, bidirectional iteration, binary search, etc.
When it comes to practical implementations of algorithms on explicit graphs (most implementations are on implicit graphs), I don't think singly-linked lists are usually a good choice.
My go-to adjacency list graph representation is a pair of parallel arrays:
Vertices are numbered from 0 to n-1
There is an edge array that contains all of the edges, sorted by their source vertex number. For an undirected graph, each edge appears in here twice. The source vertices themselves often don't need to be stored in here.
There is a vertex array that stores, for each vertex, the end position of its edges in the edge array.
This is a nice, compact, cache-friendly representation that is easy to work with and requires only two allocations.
I can usually find an easy way to construct a graph like this in linear time by first filling the vertex array with counts, then changing the counts to start positions (shifted cumulative sum), and then populating the edge array, advancing those positions as edges are added.
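A sketch of that construction, assuming a directed graph given as an array of {source, target} pairs (for an undirected graph you would add each edge in both directions first). It stores start offsets in an array of length n+1, a close cousin of the end-position variant described above; the names are made up for the example.

    final class CsrGraph {
        final int[] start;    // start[v] .. start[v+1]-1 index v's targets in `targets`
        final int[] targets;  // all edge targets, grouped by source vertex

        CsrGraph(int n, int[][] edges) {
            start = new int[n + 1];
            targets = new int[edges.length];
            for (int[] e : edges) start[e[0] + 1]++;              // 1) count out-degrees
            for (int v = 0; v < n; v++) start[v + 1] += start[v]; // 2) shifted cumulative sum
            int[] next = start.clone();                           // 3) fill, advancing positions as edges are added
            for (int[] e : edges) targets[next[e[0]]++] = e[1];
        }

        void forEachNeighbour(int v, java.util.function.IntConsumer f) {
            for (int i = start[v]; i < start[v + 1]; i++) f.accept(targets[i]);
        }
    }

For example, new CsrGraph(3, new int[][]{{0, 1}, {0, 2}, {1, 2}}).forEachNeighbour(0, System.out::println) prints vertex 0's two neighbours.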
I'm preparing to attend technical interviews and have faced mostly questions which are situation-based. Often the situation is a big dataset, and I'm asked to decide which data structure would be the most appropriate to use.
I'm familiar with most data structures, their implementations and their performance. But I get stuck in a dilemma when given a situation and asked to be decisive about which structure to choose.
I'm looking for steps or an algorithm to follow in a given situation that can help me arrive at a suitable data structure within the time limits of the interview.
It depends on what operations you need to support efficiently.
Let's start from the simplest example: you have a large list of elements and you have to find a given element. Let's consider various candidates.
You can use a sorted array to find an element in O(log N) time using binary search. What if you want to support insertion and deletion along with that? Inserting an element into a sorted array takes O(n) time in the worst case. (Think of adding an element at the beginning: you have to shift all the elements one place to the right.) This is where binary search trees (BSTs) come in. A balanced BST can support insertion, deletion and searching for an element in O(log N) time.
Now suppose you also need to support two operations, namely finding the minimum and the maximum. In the sorted array, that's just returning the first and the last element respectively, so the complexity is O(1). Assuming the BST is a balanced one, like a red-black tree or an AVL tree, finding the min and max needs O(log N) time. Consider another situation where you need to return the kth order statistic: again, the sorted array wins. As you can see, there is a tradeoff, and it really depends on the problem you are given.
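A small illustration of that tradeoff using only JDK classes: a sorted array queried with Arrays.binarySearch versus a TreeSet (a red-black tree).

    import java.util.Arrays;
    import java.util.TreeSet;

    final class SearchStructures {
        public static void main(String[] args) {
            int[] sorted = {2, 3, 5, 8, 13};
            // O(log N) search; insertion would require shifting elements, O(N).
            System.out.println(Arrays.binarySearch(sorted, 8) >= 0);    // true
            // kth order statistic is a plain index access, O(1).
            System.out.println(sorted[2]);                              // 5

            TreeSet<Integer> bst = new TreeSet<>(Arrays.asList(2, 3, 5, 8, 13));
            bst.add(7);                                 // O(log N) insertion
            bst.remove(3);                              // O(log N) deletion
            System.out.println(bst.contains(8));        // O(log N) search
            System.out.println(bst.first() + " " + bst.last());   // min and max, O(log N)
        }
    }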
Let's take another example. You are given a graph of V vertices and E edges and you have to find the number of connected components in the graph. It can be done in O(V+E) time using Depth first search (assuming adjacency list representation). Consider another situation where edges are added incrementally and the number of connected components can be asked at any point of time in the process. In that situation, Disjoint Set Union data structure with rank and path compression heuristics can be used and it is extremely fast for this situation.
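A minimal sketch of such a union-find structure with union by rank and path compression, tracking the number of components as edges arrive (the class and method names are just for the example):

    final class DisjointSetUnion {
        private final int[] parent, rank;
        private int components;

        DisjointSetUnion(int n) {
            parent = new int[n];
            rank = new int[n];
            components = n;
            for (int i = 0; i < n; i++) parent[i] = i;
        }

        int find(int x) {                       // path compression
            if (parent[x] != x) parent[x] = find(parent[x]);
            return parent[x];
        }

        boolean union(int a, int b) {           // union by rank; returns true if two components merged
            int ra = find(a), rb = find(b);
            if (ra == rb) return false;
            if (rank[ra] < rank[rb]) { int t = ra; ra = rb; rb = t; }
            parent[rb] = ra;
            if (rank[ra] == rank[rb]) rank[ra]++;
            components--;
            return true;
        }

        int componentCount() { return components; }   // can be asked at any point in the process
    }

Each union or find is effectively constant time (inverse-Ackermann amortized), so the component count can be reported after every inserted edge.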
One more example: you need to support range updates and finding the sum of a subarray efficiently, and no new elements are inserted into the array. If you have an array of N elements and Q queries, there are two cases. If the range-sum queries come only after all of the update operations, which are Q' in number, then you can preprocess the array in O(N + Q') time and answer any query in O(1) time (store prefix sums). What if there is no such order enforced? You can use a segment tree with lazy propagation for that. It can be built in O(N log N) time and each query can be performed in O(log N) time, so you need O((N + Q) log N) time in total. Again, what if insertion and deletion must be supported along with all these operations? You can use a data structure called a treap, which is a probabilistic data structure, and all these operations can be performed in O(log N) time (using an implicit treap).
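For the first case above (all range updates before any range-sum query), here is a sketch using a difference array followed by prefix sums; the class and method names are hypothetical.

    final class OfflineRangeSums {
        private final long[] diff;   // difference array accumulating range updates
        private long[] prefix;       // prefix sums of the final array, built once

        OfflineRangeSums(long[] a) {
            diff = new long[a.length + 1];
            for (int i = 0; i < a.length; i++) { diff[i] += a[i]; diff[i + 1] -= a[i]; }
        }

        // Add val to every element in [l, r] (0-based, inclusive). O(1).
        void rangeAdd(int l, int r, long val) { diff[l] += val; diff[r + 1] -= val; }

        // Call once after all updates. O(N).
        void finish() {
            prefix = new long[diff.length];
            long value = 0;
            for (int i = 0; i + 1 < diff.length; i++) {
                value += diff[i];                    // materialize a[i]
                prefix[i + 1] = prefix[i] + value;   // running sum of a[0..i]
            }
        }

        // Sum of [l, r] (inclusive). O(1).
        long rangeSum(int l, int r) { return prefix[r + 1] - prefix[l]; }
    }

Each rangeAdd is O(1), finish is O(N), and every rangeSum afterwards is O(1), matching the O(N + Q') preprocessing bound above.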
Note: Constants are omitted when using Big-O notation. Some of these structures have large constants hidden in their complexities.
Start with common data structures. Can the problem be solved efficiently with arrays, hashtables, lists or trees (or a simple combination of them, e.g. an array of hashtables or similar)?
If there are multiple options, just iterate the runtimes for common operations. Typically one data structure is a clear winner in the scenario set up for the interview. If not, just tell the interviewer your findings, e.g. "A takes O(n^2) to build but then queries can be handled in O(1), whereas for B build and query time are both O(n). So for one-time usage, I'd use B, otherwise A". Space consumption might be relevant in some cases, too.
Highly specialized data structures (e.g. prefix trees, aka "tries") are often just that: highly specialized for one particular case. The interviewer should usually be more interested in your ability to build useful stuff out of an existing general-purpose library -- as opposed to knowing all kinds of exotic data structures that may not have much real-world usage. That said, extra knowledge never hurts; just be prepared to discuss the pros and cons of what you mention (the interviewer may probe whether you are just "name dropping").
I'm learning Kruskal's algorithm and I came across a couple of different implementations and was wondering what the tradeoffs might be between them. The two implementations are as follows:
Implementation One
- put all edges in the graph into a priority queue PQ
- remove smallest edge e from PQ
- if e connects 2 previously unconnected graph components (tested using a Union Find data structure) then add it to the MST
- repeat until the number of edges in the MST equals total number of vertices in graph - 1
Implementation Two
- perform merge sort or quick sort on all the edges in the graph
- remove smallest edge from sorted edge array
then do same as above algorithm
So the only real difference is whether to use a priority queue or perform an up-front sort in O(E log E) time.
What are the trade-offs here? Both implementations seem to have the same runtime to me: O(E log V). I say log V and not log E because the maximum number of edges in a connected undirected graph is O(V^2), and log(V^2) = 2 log V, so after removing constant factors O(log E) reduces to O(log V).
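For reference, a compact sketch of Implementation One (priority queue plus union-find); the Edge record and class name are assumptions made for the example.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    final class Kruskal {
        record Edge(int u, int v, int weight) {}

        static List<Edge> minimumSpanningTree(int n, List<Edge> edges) {
            PriorityQueue<Edge> pq = new PriorityQueue<>(Comparator.comparingInt(Edge::weight));
            pq.addAll(edges);                                 // put all edges into the heap

            int[] parent = new int[n];
            for (int i = 0; i < n; i++) parent[i] = i;

            List<Edge> mst = new ArrayList<>();
            while (mst.size() < n - 1 && !pq.isEmpty()) {
                Edge e = pq.poll();                           // smallest remaining edge
                int ru = find(parent, e.u()), rv = find(parent, e.v());
                if (ru != rv) {                               // connects two previously unconnected components
                    parent[ru] = rv;                          // union
                    mst.add(e);
                }
            }
            return mst;
        }

        private static int find(int[] parent, int x) {        // union-find with path halving
            while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
            return x;
        }
    }

Implementation Two would replace the PriorityQueue with edges.sort(Comparator.comparingInt(Edge::weight)) and then iterate the sorted list; the loop body is identical.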
Both variants have the same asymptotic complexity. The implementation with a priority queue may perform slightly better if there are a lot of edges since actually sorting them all by weight may not be necessary. One needs only the smallest edges until a spanning tree is found. The exact order of the remaining edges is irrelevant.
However, whether this results in any savings at all depends a lot on the input data. For example, if the edge with the highest weight is part of the minimum spanning tree, all edges must be considered. In practice I would not expect much difference.
Further to Henry's answer, in practice the performance of the sorting approach would also depend on what particular variant of which sorting algorithm was used. E.g. quicksort can be O(n^2) unless an expensive median-finding algorithm is used (which it generally isn't -- and this usually doesn't cause problems).
Mergesort is worst-case O(n log n). Heapsort is too, but its memory accesses are very dispersed, so it tends to benefit much less from caching and RAM burst modes than basically all other sorting algorithms -- which on small-to-medium-size inputs, or nearly-sorted inputs, can make it even slower than fast O(n^2) sorts like insertion sort. About the most I could say with any confidence is that the PQ approach should beat a heapsort-based sorting approach, since it does a strict subset of the latter's work.
I see in many books that the worst-case memory requirement for graphs is O(V). But, if I'm not mistaken, graphs are usually represented as an adjacency matrix and not by the creation of nodes (as in linked lists / trees). So, for a graph containing 5 vertices, I need a 5x5 matrix, which is O(V^2). Then why do they say it is O(V)?
Am I missing something somewhere? Sorry if the question is too naive.
The three main ways of representing a graph are:
Adjacency matrix - Θ(|V|²) space.
Adjacency list - Θ(|V| + |E|) space.
Collection of node objects/structs with pointers to one another - This is basically just another way of representing an adjacency list. Θ(|V| + |E|). (Remember that pointers require memory too.)
Since we're talking about the worst case, all of these reduce to Θ(|V|²), because that's the maximum number of edges a graph can have.
I'm guessing you misread the book. They probably weren't talking about the space required to store the graph structure itself, but rather the amount of extra space required for some graph algorithm.
If what you say is true, it's possible they are referring to other ways to represent a graph, other than using an adjacency matrix and are possibly making an edge-density assumption. One way is, for each vertex, just store a list of pointers / references to its neighbors (called an adjacency list). This would be O(|V| + |E|). If we assume |E| ~ |V|, which is an assumption we do sometimes see, then we have O(|V|) space. But note that in the worst-case, |E| ~ |V|^2 and so even this approach to representing a graph is O(|V|^2) in the worst case.
Look, it's quite simple; there's no escaping the fact that in the worst case |E| ~ |V|^2. There cannot possibly be, in general, a representation of E that uses less than Θ(|V|^2) space in the worst case.
But, it'd be nice to have an exact quote to work with. This is important. We don't want to find ourselves tearing apart your misunderstanding of a correct statement.
In CLRS exercise 22.1-8 (I am self-learning, not at any university):
Suppose that instead of a linked list, each array entry Adj[u] is a hash table containing the vertices v for which (u,v) ∈ E. If all edge lookups are equally likely, what is the expected time to determine whether an edge is in the graph? What disadvantages does this scheme have? Suggest an alternate data structure for each edge list that solves these problems. Does your alternative have disadvantages compared to the hash table?
So, if I replace each linked list with a hash table, there are the following questions:
what is the expected time to determine whether an edge is in the graph?
What are the disadvantages?
Suggest an alternate data structure for each edge list that solves these problems
Does your alternative have disadvantages compared to the hash table?
I have the following partial answers:
I think the expected time is O(1), because I just go Hashtable t = Adj[u], then return t.get(v);
I think the disadvantage is that the hash table will take more space than a linked list.
For the other two questions, I don't have a clue.
Can anyone give me a hint?
The answer to question 3 could be a binary search tree.
In an adjacency matrix, each vertex has a row of V entries. This O(V)-per-vertex space cost buys fast (O(1)-time) searching for edges.
In an adjacency list, each vertex keeps a list containing only its n adjacent vertices. This space-efficient choice leads to slower searching (O(n)).
A hash table is a compromise between the array and the list. It uses less space than a row of V entries, but requires handling collisions when searching.
A binary search tree is another compromise: the space cost is as low as that of lists, and the average search time is O(lg n).
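A minimal sketch of that binary-search-tree compromise, using the JDK's TreeSet (a red-black tree) for each edge list; the class name and the int-vertex assumption are mine.

    import java.util.TreeSet;

    final class TreeAdjacency {
        private final TreeSet<Integer>[] adj;   // one red-black tree per vertex

        @SuppressWarnings("unchecked")
        TreeAdjacency(int n) {
            adj = new TreeSet[n];
            for (int i = 0; i < n; i++) adj[i] = new TreeSet<>();
        }

        void addEdge(int u, int v) { adj[u].add(v); adj[v].add(u); }    // undirected

        boolean hasEdge(int u, int v) { return adj[u].contains(v); }     // O(lg degree(u)) worst case

        Iterable<Integer> neighbours(int u) { return adj[u]; }           // iterates in ascending order
    }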
It depends on the hash table and how it handles collisions. For example, assume that in our hash table each slot points to a list of the elements that hash to it (chaining).
If the distribution of elements is sufficiently uniform, the average cost of a lookup depends only on the average number of elements per list (the load factor). So the average number of elements per list is n/m, where m is the number of slots in our hash table.
The expected time to determine whether an edge is in the graph is O(n/m).
More space than a linked list and more query time than an adjacency matrix. If our hash table supports dynamic resizing, then we need extra time to move elements between the old and new hash tables; if not, we need O(n) space for each hash table in order to have O(1) query time, which results in O(n^2) space overall. Also, we have only checked the expected query time; in the worst case the query time may be just like a linked list's (O(degree(u))), so it seems better to use an adjacency matrix in order to have deterministic O(1) query time at O(n^2) space.
read above
Yes. For example, if we know that every vertex of our graph has at most d adjacent vertices, with d less than n, then using hash tables would need O(nd) space instead of O(n^2) and would still have expected O(1) query time.
Questions 3 and 4 are very open. Besides the thoughts from the other two answers, one problem with a hash table is that it's not an efficient data structure for scanning elements from beginning to end. In the real world, it's pretty common to enumerate all the neighbours of a given vertex (e.g., in BFS or DFS), and that somewhat undermines the use of a plain hash table.
One possible solution for this is to chain the occupied buckets of the hash table together so that they form a doubly linked list. Every time a new element is added, connect it to the end of the list; whenever an element is removed, remove it from the list and fix the links accordingly. When you want to do an overall scan, just walk this list.
The drawback of this strategy, of course, is more space: there is a two-pointer overhead per element. Also, adding or removing an element takes a little more time to maintain the links.
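In Java this is essentially what LinkedHashSet already does (a hash table whose entries are additionally threaded on a doubly linked list), so a sketch of the idea can lean on it; the wrapper class is made up for the example.

    import java.util.HashMap;
    import java.util.LinkedHashSet;
    import java.util.Map;
    import java.util.Set;

    final class HashAdjacency<V> {
        private final Map<V, Set<V>> adj = new HashMap<>();

        void addEdge(V u, V v) {                       // undirected
            adj.computeIfAbsent(u, k -> new LinkedHashSet<>()).add(v);
            adj.computeIfAbsent(v, k -> new LinkedHashSet<>()).add(u);
        }

        boolean hasEdge(V u, V v) {                    // expected O(1)
            return adj.getOrDefault(u, Set.of()).contains(v);
        }

        Iterable<V> neighbours(V u) {                  // scans in insertion order via the linked list
            return adj.getOrDefault(u, Set.of());
        }
    }

Lookups stay expected O(1), while iterating over a vertex's neighbours follows predictable insertion order rather than bucket order.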
I'm not too worried about collisions here. The hash table of a vertex stores its neighbours, each of which is unique, so there are no duplicate keys; distinct keys can still land in the same bucket, but with a decent hash function that is uncommon.
I wanted to add another option that no one mentioned in any of the other answers. If the graph is static, i.e. the vertices and edges don't change once you have created the graph, you can use a hash table with perfect hashing instead of an adjacency list for each vertex. This lets you look up in worst-case O(1) time whether there is an edge between two vertices, and it uses only O(V + E) memory, asymptotically the same as a normal adjacency list. The advantage is that the O(1) time to check whether an edge exists is worst case, not just expected.