Which data structures are optimal in the following cases?

My answers:
a) Queue
b) Directed graph
c) Hash table
d) 2D array
e) Stack
f) Hash table
g) Queue
h) Array
i) Stack
j) No idea

I'd often choose the same data structures as you did, with a few exceptions:
a) A heap (priority queue) keeps the programs ordered by priority, giving O(1) access to the program with the greatest priority and O(log n) insertion and removal; see the sketch after this list
b) An undirected graph could be used, since telecommunication is symmetrical
e) & i) A doubly linked list would probably be used in a real scenario since it allows for redo/forward functionality, but a stack isn't wrong either
h) A linked list would be better since the number of entries grows constantly
j) A directed graph could be used (allowing for directory loops etc.)
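To make (a) concrete, here is a minimal Java sketch assuming the programs are scheduled by a single numeric priority; the Program record and its fields are hypothetical, since the original case descriptions are not reproduced here:

import java.util.Comparator;
import java.util.PriorityQueue;

public class Scheduler {
    // Hypothetical representation of a "program" with a numeric priority.
    record Program(String name, int priority) {}

    public static void main(String[] args) {
        // Max-heap ordered by priority: peek() on the highest-priority
        // program is O(1); offer() and poll() are O(log n).
        PriorityQueue<Program> ready = new PriorityQueue<>(
                Comparator.comparingInt(Program::priority).reversed());

        ready.offer(new Program("backup", 1));
        ready.offer(new Program("compiler", 5));
        ready.offer(new Program("editor", 3));

        System.out.println(ready.peek().name());  // prints "compiler"
    }
}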


Recursion vs iteration with regards to memory usage

Suppose I have a recursive as well as an iterative solution (using a stack) to some problem e.g. preorder traversal of a binary tree. With current computers, memory-wise, is there an advantage to using the recursive solution over the iterative version or vice versa for very large trees?
I'm aware that for certain recursive solutions where sub-problems repeat, there are additional time and memory costs if recursion is used. Assume that this is not the case here. For example,
void preOrder(Node n) {
    if (n == null) return;
    print(n);
    preOrder(n.left);
    preOrder(n.right);
}
vs
void preOrder(Node n) {
    stack s;
    if (n != null) s.push(n);
    while (!s.empty()) {
        Node node = s.pop();
        print(node);
        if (node.right != null) s.push(node.right);
        if (node.left != null) s.push(node.left);
    }
}
If there is a risk of stack overflow (in this case, because the trees are not guaranteed to be even semi-balanced), then a robust program will avoid recursion and use an explicit stack.
The explicit stack may use less memory, because stack frames tend to be larger than is strictly necessary to maintain the context of recursive calls. (For example, the stack frame will contain at least a return address as well as the local variables.)
However, if the recursion depth is known to be limited, then not having to dynamically allocate can save space and time, as well as programmer time. For example, walking a balanced binary tree only requires recursion to the depth of the tree, which is log2 of the number of nodes (about 30 even for a billion nodes); that cannot be a very large number.
As suggested by a commentator, one possible scenario is that the tree is known to be right-skewed. In that case, you can recurse down the left branches without worrying about stack overflow (as long as you are absolutely certain that the tree is right-skewed). Since the second recursive call is in the tail position, it can just be rewritten as a loop:
void preOrder(Node n) {
    while (n != null) {
        print(n);
        preOrder(n.left);
        n = n.right;
    }
}
A similar technique is often (and should always be) applied to quicksort: after partitioning, the function recurses on the smaller partition, and then loops to handle the larger partition. Since the smaller partition must be less than half the size of the original array, that guarantees the recursion depth to be less than log2 of the original array size, which is certainly less than 50 stack frames for any array that fits in memory, and probably a lot less.
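To make that quicksort pattern concrete, here is a small Java sketch (the names are mine, and the Lomuto-style partition is just one reasonable choice): recursing only into the smaller partition and looping over the larger one keeps the recursion depth at O(log n).

// Recurse on the smaller partition, loop on the larger one, so the
// recursion depth is bounded by log2 of the array size.
static void quicksort(int[] a, int lo, int hi) {
    while (lo < hi) {
        int p = partition(a, lo, hi);      // pivot ends up at index p
        if (p - lo < hi - p) {
            quicksort(a, lo, p - 1);       // smaller side: recurse
            lo = p + 1;                    // larger side: loop
        } else {
            quicksort(a, p + 1, hi);
            hi = p - 1;
        }
    }
}

// Lomuto partition: uses a[hi] as the pivot and returns its final index.
static int partition(int[] a, int lo, int hi) {
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; j++) {
        if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
    }
    int t = a[i]; a[i] = a[hi]; a[hi] = t;
    return i;
}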

Why does the Clojure zipper implementation use different types and data structures from Huet's zipper?

I'm comparing Huet's original paper with Clojure's implementation and trying to figure out why the changes were made. I'm a Clojure novice, so if I'm wrong on my interpretation of the Clojure code, please correct me.
In Huet's paper, the type of a path is (in OCaml) Top | Node of tree list * path * tree list;;. In Clojure, there are two additional fields, pnodes and changed?. What's the purpose of these fields? Am I right in believing that l and r correspond to the first and third entries in Huet's type, and that ppath is the second?
Huet's zipper uses linked lists throughout (note I'm talking about the Loc type itself, not the data structure the zipper operates on), while in some places, for instance l, the Clojure implementation uses vectors. Why the change, and what's the implication for the Clojure implementation's time complexity?
First, your understanding of l, r and ppath is correct.
pnodes and changed? work together as an optimization: when you go up, if changed? is false, you pop the node from pnodes rather than rebuilding it from the current node and the left and right sibling lists.
As for the use of a vector for l and a list for r: again, it's about the cost of rebuilding a node. In Huet's paper there's (rev left) @ (t::right), which is O(n_left) where n_left is the length of left. In Clojure we have (concat l (cons node r)), which is O(1) [1] because l, being a vector, does not need to be reversed (vectors in Clojure can be efficiently traversed in either direction but are appendable only on the right).
[1] OK, it's O(1) only at creation time: n_left conses will be lazily allocated as the resulting sequence is consumed by further computation.

How to update element priorities in a heap for Prim's Algorithm?

I am studying Prim's algorithm. There is a part of the code where the next vertex across the cut is added to the set of vertices belonging to the MST. While doing that, we also have to 'update all vertices in the other set which are adjacent to the departing vertex'. This is a snapshot from CLRS:
The interesting part lies in line no. 11. But since we are using a heap here, we have access to only the minimum element, right (heap[0])? So how do we search for and update vertices in the heap even though they are not the minimum one, and thus we have no way of knowing where they are except by linear search?
It is possible to build priority queues that support an operation called decrease-key, which takes the priority of an existing object in a priority queue and lowers it. Most versions of priority queues that ship with existing libraries don't support this operation, but it's possible to build it in several ways.
For example, given a binary heap, you can maintain an auxiliary data structure that maps from elements to their positions in the binary heap. You would then update the binary heap implementation so that whenever a swap is performed, this auxiliary data structure is updated. Then, to implement decrease-key, you access the table, find the position of the node in the binary heap, and then perform a bubble-up step.
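As a rough illustration of that design (not taken from any particular library; the class and method names are mine), a binary min-heap can carry a HashMap from element to array index, updated on every swap:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Binary min-heap with an auxiliary element-to-index map, so decreaseKey
// can locate an arbitrary element in O(1) and then bubble it up.
class IndexedMinHeap<E> {
    private final List<E> elems = new ArrayList<>();
    private final List<Double> keys = new ArrayList<>();
    private final Map<E, Integer> pos = new HashMap<>();

    public boolean isEmpty() { return elems.isEmpty(); }
    public boolean contains(E e) { return pos.containsKey(e); }

    public void insert(E e, double key) {
        elems.add(e); keys.add(key); pos.put(e, elems.size() - 1);
        bubbleUp(elems.size() - 1);
    }

    public E extractMin() {
        E min = elems.get(0);
        swap(0, elems.size() - 1);
        elems.remove(elems.size() - 1); keys.remove(keys.size() - 1); pos.remove(min);
        if (!elems.isEmpty()) bubbleDown(0);
        return min;
    }

    // decrease-key: O(1) lookup via the position map, then O(log n) bubble-up.
    public void decreaseKey(E e, double newKey) {
        int i = pos.get(e);
        if (newKey < keys.get(i)) { keys.set(i, newKey); bubbleUp(i); }
    }

    private void bubbleUp(int i) {
        while (i > 0 && keys.get(i) < keys.get((i - 1) / 2)) { swap(i, (i - 1) / 2); i = (i - 1) / 2; }
    }

    private void bubbleDown(int i) {
        for (;;) {
            int l = 2 * i + 1, r = 2 * i + 2, smallest = i;
            if (l < elems.size() && keys.get(l) < keys.get(smallest)) smallest = l;
            if (r < elems.size() && keys.get(r) < keys.get(smallest)) smallest = r;
            if (smallest == i) return;
            swap(i, smallest); i = smallest;
        }
    }

    // Every swap also updates the position map, keeping it consistent.
    private void swap(int i, int j) {
        E te = elems.get(i); elems.set(i, elems.get(j)); elems.set(j, te);
        double tk = keys.get(i); keys.set(i, keys.get(j)); keys.set(j, tk);
        pos.put(elems.get(i), i); pos.put(elems.get(j), j);
    }
}

In Prim's algorithm, each vertex whose candidate edge weight improves would then get a decreaseKey call, as in the pseudocode further down.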
Other pointer-based heaps like binomial heaps or Fibonacci heaps explicitly support this operation (the Fibonacci heap was specifically designed to make it cheap, at O(1) amortized). You usually have an auxiliary map from objects to the node they occupy in the heap and can then rewire the pointers to move the node around in the heap.
Hope this helps!
Pointers enable efficient composite data structures
You have something like this (using pseudocode C++):
class Node
    bool visited
    double key
    Node* pi
    vector<pair<Node*, double>> adjacent  // adjacent nodes and edge weights
    // and extra fields needed for the PriorityQueue data structure
    //  - a clean way to do this is to use CRTP for defining the base
    //    PriorityQueue node class, then inherit your graph node from that

class Graph
    vector<Node*> vertices
CRTP: http://en.wikipedia.org/wiki/Curiously_recurring_template_pattern
The priority queue Q in the algorithm contains items of type Node*, where ExtractMin gets you the Node* with minimum key.
The reason you don't have to do any linear search is that, when you get u = ExtractMin(Q), you have a Node*. So u->adjacent gets you both the v's in G.Adj[u] and the w(u,v)'s in constant time per adjacent node. Since v is itself a pointer to the priority-queue node, you can update its position in the priority queue in logarithmic time per adjacent node (with most implementations of a priority queue).
To name some specific data structures, the DecreaseKey(Q, v) function used below has O(log n) amortized complexity for pairing heaps and O(1) amortized complexity for Fibonacci heaps.
Pairing heap: http://en.wikipedia.org/wiki/Pairing_heap
Not too hard to code yourself - Wikipedia has most of the source code
Fibonacci heap: http://en.wikipedia.org/wiki/Fibonacci_heap
More-concrete pseudocode for the algorithm
MstPrim(Graph* G)
    for each u in G->vertices
        u->visited = false
        u->key = infinity
        u->pi = NULL
    Q = PriorityQueue(G->vertices)
    while Q not empty
        u = ExtractMin(Q)
        u->visited = true
        for each (v, w) in u->adjacent
            if not v->visited and w < v->key
                v->pi = u
                v->key = w
                DecreaseKey(Q, v)  // O(log n)

Three-way set disjointness

This is a question from my practice problems for an upcoming test.
I was hoping to get help in finding a more efficient solution to this problem. Right now, I know I can solve this type of problem just by using 3 simple for loops, but that would be O(N^3).
Furthermore, I believe that somehow incorporating binary search will be the best way, giving me the O(log n) factor in the answer that I'm looking for. Unfortunately, I'm kind of stuck.
The three-way set disjointness problem is defined as follows: Given three sets of items, A, B, and C, they are three-way disjoint if there is no element common to all three sets, ie, there exists no x such that x is in A, B, and C.
Assume that A, B, and C are sets of items that can be ordered (integers); furthermore, assume that it is possible to sort n integers in O(n log n) time. Give an O(n log n) algorithm to decide whether the sets are three-way set disjoint.
Thanks for any assistance
The question statement gives an obvious hint on how to solve the problem. Assuming the 3 sets are mathematical sets (elements are unique within each set), just merge the 3 sets into one list and sort it, then traverse the list linearly and check whether any item appears 3 times in a row. The time complexity is dominated by the sorting, which is O(n log n). The auxiliary space complexity is at most O(n).
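A minimal Java sketch of that sort-based check (the method name is mine):

import java.util.Arrays;

// Merge the three sets into one array, sort it, and look for three equal
// neighbours; since elements are unique within each set, a value that
// appears three times must occur in A, B and C.
static boolean threeWayDisjoint(int[] a, int[] b, int[] c) {
    int[] all = new int[a.length + b.length + c.length];
    System.arraycopy(a, 0, all, 0, a.length);
    System.arraycopy(b, 0, all, a.length, b.length);
    System.arraycopy(c, 0, all, a.length + b.length, c.length);
    Arrays.sort(all);                                    // O(n log n)
    for (int i = 2; i < all.length; i++) {
        if (all[i] == all[i - 1] && all[i] == all[i - 2]) {
            return false;                                // common element found
        }
    }
    return true;
}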
Another solution is to use a hash-based map/dictionary. Just count the frequency of each item across the 3 sets. If any item reaches a frequency of 3 (this can be checked as the count is updated), the 3 sets are not 3-way disjoint. Insertion, access and modification take O(1) amortized time, so the time complexity is O(n). The space complexity is also O(n).
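And a short Java sketch of the counting variant (again, the method name is mine):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Count how many of the three sets contain each value; a count of 3 means
// the value is common to A, B and C.
static boolean threeWayDisjointByCounting(Set<Integer> a, Set<Integer> b, Set<Integer> c) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (Set<Integer> set : List.of(a, b, c)) {
        for (int x : set) {
            if (counts.merge(x, 1, Integer::sum) == 3) {
                return false;   // x is in all three sets
            }
        }
    }
    return true;
}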
If time complexity is the constraint (and neither space nor the constant term is), this can be solved in O(n). Create two bitmaps, mapping integers from A into the first and integers from B into the second. Then traverse the third set (C) until you either exhaust it or find an entry where bitmapA(testInt) and bitmapB(testInt) are both set.
We can solve this problem in O(n). This is possible if we use a Set data structure and take initial capacity and load factor into consideration.
public static boolean checkThreeWaySetDisjointness(Set<Integer> a, Set<Integer> b, Set<Integer> c)
{
    int capacity = Math.round((a.size() + b.size()) / 0.75f) + 1;
    Set<Integer> container = new HashSet<Integer>(capacity);
    container.addAll(a);
    for (int element : b)
    {
        if (!container.add(element))
        {
            if (c.contains(element))
            {
                return false;
            }
        }
    }
    return true;
}
We create a new Set container because if we start adding directly into any of the existing sets a/b/c, once 75% of its capacity is reached, Java will internally allocate a new hash table and copy the entire existing set into it. That copying overhead is O(n). Hence we create a new HashSet with the computed capacity, which ensures there is no copying overhead. Then we copy the entire set a and add the elements of set b one by one. In Java, if add() returns false, the element already exists in the collection; in that case, we just check for the same element in the third set c. The add and contains methods of HashSet have O(1) expected complexity, so this entire code runs in O(n).

Pairwise priority queue

I have a set of A's and a set of B's, each with an associated numerical priority, where each A may match some or all B's and vice versa, and my main loop basically consists of:
Take the best A and B in priority order, and do stuff with A and B.
The most obvious way to do this is with a single priority queue of (A,B) pairs, but if there are 100,000 A's and 100,000 B's then the set of O(N^2) pairs won't fit in memory (and disk is too slow).
Another possibility is for each A, loop through every B. However this means that global priority ordering is by A only, and I really need to take priority of both components into account.
(The application is theorem proving, where the above options are called the pair algorithm and the given clause algorithm respectively; the shortcomings of each are known, but I haven't found any reference to a good solution.)
Some kind of two layer priority queue would seem indicated, but it's not clear how to do this without using either O(N^2) memory or O(N^2) time in the worst case.
Is there a known method of doing this?
Clarification: each A must be processed with all corresponding B's, not just one.
Maybe there's something I'm not understanding but,
Why not keep the A's and B's in separate heaps, get_Max on each of the heaps, do your work, remove each max from its associated heap and continue?
You could handle the best pairs first, and if nothing good comes up mop up the rest with the given clause algorithm for completeness' sake. This may lead to some double work, but I'd bet that this is insignificant.
Have you considered ordered paramodulation or superposition?
It appears that the items in A have an individual priority, the items in B have an individual priority, and the (A,B) pairs have a combined priority. Only the combined priority matters, but hopefully we can use the individual priorities along the way. However, there is also a matching relation between items in A and items in B that is independent of priority.
I assume that, for all a in A and b1, b2 in B such that Match(a,b1) and Match(a,b2), Priority(b1) >= Priority(b2) implies CombinedPriority(a,b1) >= CombinedPriority(a,b2).
Now, begin by sorting B in decreasing order of priority. Let B(j) denote the jth element in this sorted order. Also, let A(i) denote the ith element of A (which may or may not be in sorted order).
Let nextb(i,j) be a function that finds the smallest j' >= j such that Match(A(i),B(j')). If no such j' exists, the function returns null (or some other suitable error value). Searching for j' may just involve looping upward from j, or we may be able to do something faster if we know more about the structure of the Match relation.
Create a priority queue Q containing (i,nextb(i,0)) for all indices i in A such that nextb(i,0) != null. The pairs (i,j) in Q are ordered by CombinedPriority(A(i),B(j)).
Now just loop until Q is empty. Pull out the highest-priority pair (i,j) and process (A(i),B(j)) appropriately. Then re-insert (i,nextb(i,j+1)) into Q (unless nextb(i,j+1) is null).
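A rough Java sketch of this scheme, assuming match(a, b) and combinedPriority(a, b) are supplied by the caller and that B is already sorted by decreasing priority (all names here are mine):

import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.BiPredicate;
import java.util.function.ToDoubleBiFunction;

// One live queue entry per A item: the A index i, the index j of its next
// matching B item, and the combined priority of that (A(i), B(j)) pair.
record Entry(int i, int j, double priority) {}

static <A, B> void processAllMatches(List<A> as,
                                     List<B> bs,   // sorted by decreasing priority
                                     BiPredicate<A, B> match,
                                     ToDoubleBiFunction<A, B> combinedPriority) {
    PriorityQueue<Entry> q = new PriorityQueue<>(
            Comparator.comparingDouble(Entry::priority).reversed());

    // Seed the queue with (i, nextb(i, 0)) for every A item that matches something.
    for (int i = 0; i < as.size(); i++) {
        int j = nextB(as.get(i), bs, 0, match);
        if (j >= 0) q.add(new Entry(i, j, combinedPriority.applyAsDouble(as.get(i), bs.get(j))));
    }

    while (!q.isEmpty()) {
        Entry e = q.poll();                       // highest combined priority
        A a = as.get(e.i());
        process(a, bs.get(e.j()));                // application-specific work on the pair
        int j = nextB(a, bs, e.j() + 1, match);   // advance to nextb(i, j + 1)
        if (j >= 0) q.add(new Entry(e.i(), j, combinedPriority.applyAsDouble(a, bs.get(j))));
    }
}

// Linear scan for the next matching B at or after position j; returns -1 if none.
static <A, B> int nextB(A a, List<B> bs, int j, BiPredicate<A, B> match) {
    for (; j < bs.size(); j++) if (match.test(a, bs.get(j))) return j;
    return -1;
}

static <A, B> void process(A a, B b) { /* placeholder for "do stuff with A and B" */ }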
Altogether, this takes O(N^2 log N) time in the worst case where all pairs match. In general, it takes O(N^2 + M log N), where M is the number of matches. The N^2 component can be reduced if there is a faster way of calculating nextb(i,j) than just looping upward, but that depends on knowledge of the Match relation.
(In the above analysis, I assumed both A and B were of size N. The formulas could easily be modified if they are different sizes.)
You seemed to want something better than O(N^2) time in the worst case, but if you need to process every match, then you have a lower bound of M, which can be N^2 itself. I don't think you're going to be able to do better than O(N^2 log N) time unless there is some special structure to the combined priority that lets you use a better-than-log-N priority queue.
So you have a set of A's and a set of B's, and you need to pick an (A, B) pair such that some f(a, b) is higher than that of any other (A, B) pair.
This means you can either store all possible (A, B) pairs and order them, and just pick the highest each time through the loop (O(1) per iteration but O(N*M) memory).
Or you could loop through all possible pairs and keep track of the current maximum and use that (O(N*M) per iteration, but only O(N+M) memory).
If I am understanding you correctly this is what you are asking.
I think it very much depends on f() to determine if there is a better way to do it.
If f(a, b) = a + b, then it is obviously very simple, the highest A, and the highest B are what you want.
I think your original idea will work, you just need to keep your As and Bs in separate collections and just stick references to them in your priority queue. If each reference takes 16 bytes (just to pick a number), then 10,000,000 A/B references will only take ~300M. Assuming your As and Bs themselves aren't too big, it should be workable.
