storing the graph with unknown nodes order

What is the best way to store a graph in a vector when the nodes arrive in an unknown order? For example, I have nodes coming in an order like 35, 23, 89, 200, 12, 89, 569, etc. I want to store them so that no memory is wasted and the nodes can be accessed efficiently; constant-time access would be great. Maybe some hash function would work; if there is one that can distinguish the nodes, please tell me, or suggest any other approach.
Thanks

The simplest solution I can think of is to insert them into your vector in arrival order, and create a map<int,int> that maps each node's value to its index in the vector.
In your example:
map[35] == 0
map[23] == 1
map[89] == 2
map[200] == 3
map[12] == 4
...
Now access node i as vector[map[i]].
EDIT:
A second possibility is to use a set instead of a vector to hold the elements, but that might not always be desirable [a set has no duplicates and will not keep elements in the order you inserted them], so consider whether it suits you.
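The value-to-index idea can be sketched in Python as follows. This is only an illustration under assumptions: `NodeStore`, `nodes` and `index_of` are made-up names, and a `dict` stands in for the map. Because a `dict` is hash-based, lookups take constant time on average, which is what the question asks for.

```python
class NodeStore:
    """Stores nodes densely in arrival order and finds them by id via a hash map."""

    def __init__(self):
        self.nodes = []      # node ids in arrival order, no gaps
        self.index_of = {}   # node id -> position in self.nodes

    def add(self, node_id):
        # A repeated id (89 appears twice in the example) reuses its slot.
        if node_id not in self.index_of:
            self.index_of[node_id] = len(self.nodes)
            self.nodes.append(node_id)
        return self.index_of[node_id]


store = NodeStore()
for node_id in [35, 23, 89, 200, 12, 89, 569]:
    store.add(node_id)
```

Node 200 is then reached as `store.nodes[store.index_of[200]]`, and the list holds exactly one slot per distinct node.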

Using Linked List to represent a Matrix Class

I'm having trouble initializing the linked list for the matrix based on the parameters I input. If I input the parameters (3,3), it should actually make a 4x4 grid so that I can use the first column and first row for indexing, with the top-left corner node as the entry point.
def __init__(self, m, n, default=0):
    self._head = MatrixNode(None)
    for node in range(m - 1):
        node = MatrixNode(0)
        node._right = node
    for node in range(n - 1):
        node = MatrixNode(0)
        node._down = node
This is what I have so far, but I'm sure it's horrible.

First, it would be useful to know what a MatrixNode is. I guess you just want to store a value there?
Then I see two linear loops, while a matrix is an n*m data structure. Are you sure your loops do not need to be nested to initialize your structure correctly?
For linked lists I would expect something like row.next = nextrow and row.startnode.next = nextnode; I do not see anything like this here.
Having said that, I want to ask whether you really want to implement a matrix yourself, and in such an object-oriented (inefficient!) way.
You can use two-dimensional arrays (a=[[1,2], [3,4]];a[0][0]==1) or a good implementation from a numerics library like numpy/scipy.
There you have numpy.array for storing n-dimensional data (with nice addressing like matrix[1,2] and syntax similar to MATLAB), or numpy.matrix, which is like an array with some methods overloaded for matrix operations (i.e. multiplying two arrays with * is pointwise, while for matrices it is the usual matrix multiplication).

You are right, it's horrible :)
First things first, a linked list is a very bad way of representing a matrix.
If you want to represent a matrix, start with a list of lists, and work from there if that's not enough (see the other answer mentioning numpy, for example).
If you want to learn to use linked lists, choose a better example.
Then: you are re-using the variable name "node" for different things:
Your loop index. The code for node in range(...) will assign an integer from the range to node in every iteration.
Then you assign a new MatrixNode to node, and then you set the node's neighbor (_right or _down) to be not the actual neighbor, but itself (node._right = node).
You also never save your nodes that you create inside the loops anywhere, so they will be garbage-collected.
And you never use the optional argument default.
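For anyone who still wants the linked version as an exercise, here is a minimal sketch that avoids the problems listed above. It rests on assumptions: the question does not show MatrixNode, so the class below (with `_value`, `_right` and `_down`) is hypothetical, as is the `get` helper. The temporary list of lists is only scaffolding used while wiring up the links.

```python
class MatrixNode:
    """Hypothetical node; the question does not show the real class."""

    def __init__(self, value):
        self._value = value
        self._right = None
        self._down = None


class LinkedMatrix:
    def __init__(self, m, n, default=0):
        # (m + 1) x (n + 1) nodes: row 0 and column 0 act as index headers.
        grid = [[MatrixNode(default) for _ in range(n + 1)]
                for _ in range(m + 1)]
        grid[0][0]._value = None              # entry point
        for c in range(1, n + 1):
            grid[0][c]._value = c             # column headers
        for r in range(1, m + 1):
            grid[r][0]._value = r             # row headers
        # Nested loops: each node is linked to its real neighbours.
        for r in range(m + 1):
            for c in range(n + 1):
                if c < n:
                    grid[r][c]._right = grid[r][c + 1]
                if r < m:
                    grid[r][c]._down = grid[r + 1][c]
        self._head = grid[0][0]               # the links keep every node reachable

    def get(self, row, col):
        # Walk down, then right, starting from the entry point.
        node = self._head
        for _ in range(row):
            node = node._down
        for _ in range(col):
            node = node._right
        return node._value


mat = LinkedMatrix(3, 3, default=7)
```

Here the loop variables are plain integers, each new node is stored before it is linked, and the default argument is actually used.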

Binary Tree Structure

I am trying to solve this problem where a new joining peer will be given an index [0,1,2, ... n-1] based on how many peer objects already exist (e.g. 8 exist -> new peer will get index 8).
I want to add these peer objects into a binary tree based on their index. For example, peer 0 joins and it will be the root, then peer 1 & peer 2 will be peer 0's left and right children.
I only need the binary tree to follow the rule that each node should have two children.
Here's an example:
      0
    /   \
   1     2
  / \   / \
 3   4 5   6
My problem is that I am unsure of how to actually do this insertion while keeping the two-children rule. At first I assumed a normal BST insertion rule would work, but once I actually coded that up, I realized the problem with the pivot/key: I am inserting based on the index, so everything would just become a right child.
I am really stuck on this but I think the solution should be a trivial one that I am just unable to see. Any advice?
Edit:
Thank you for the help!
I think I figured out something that will meet my needs, so I'll leave it here. I will have an implicit binary tree structure. Peers that join will be put into a priority queue keyed by their index. Being in the queue signifies that a peer can still have children assigned to it, and a peer is removed from the queue once it has 2 children.
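A rough sketch of that queue idea, under stated assumptions: `Peer`, `PeerTree` and `join` are illustrative names, and Python's `heapq` stands in for the priority queue. The peer with the smallest index that still has room becomes the parent of each newcomer.

```python
import heapq


class Peer:
    def __init__(self, index):
        self.index = index
        self.children = []


class PeerTree:
    def __init__(self):
        self.root = None
        self.peers = []   # peers[i] is the peer with index i
        self._open = []   # min-heap of (index, peer) with fewer than 2 children

    def join(self):
        peer = Peer(len(self.peers))
        self.peers.append(peer)
        if self.root is None:
            self.root = peer
        else:
            # The lowest-index peer with a free slot adopts the newcomer.
            _, parent = self._open[0]
            parent.children.append(peer)
            if len(parent.children) == 2:
                heapq.heappop(self._open)
        heapq.heappush(self._open, (peer.index, peer))
        return peer


tree = PeerTree()
for _ in range(7):
    tree.join()
```

Since indices only ever increase, a plain FIFO queue would produce the same tree; the heap is used here only to mirror the wording of the edit.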

A few things that you want to consider:
Why do you want a BST?
BSTs are, as the name implies, primarily for searching. But if you are assigning every new user that joins a unique identifier, then you don't need to use a BST to search for them, because you can access them from an array by index.
A BST would be more useful if, say, each user played a game and earned a particular score. To organize the users in a data structure that would render them easily searchable/organizable by score, you might insert a player into the BST with their score as the key when they finished the game. But for unique identifiers like this, there is no reason to use a BST. In fact, the data structure that you show there is not a BST. The BST would look like this:
      3
    /   \
   1     5
  / \   / \
 0   2 4   6
Is another data structure more appropriate?
If you have gotten a better sense of why a BST is not a useful structure for organizing user IDs, then you should next think about what it is you were actually trying to do. If you were just trying to store all the users in a data structure, a list (array) is totally fine, where the index of the list corresponds to the user id.
If you are instead looking to add some sort of grouping to these users, consider using a hash table. For example, if you wanted to be able to look up a user's friends, you would create a hash table where a user id (key) maps to a list of friends' user ids (value).
Hopefully this has been helpful. If there is any more that I can do to help, or if I have not fully understood what you are trying to accomplish, just let me know.
UPDATE
So based on the comments above, it seems your confusion is about the distinction between a binary tree and a BST. A binary tree is any tree where each node has <= 2 children, whereas a binary search tree imposes additional constraints on the values of the nodes' keys. The binary tree structure is what you want, but you don't need it for searching, nor do you want to compare those values.

For any given index i, the parent node has the index ((i + 1) >> 1) - 1, and the child nodes have the indices (i << 1) + 1 and (i + 1) << 1. I don't know if this is of any help, as I'm unsure of the purpose of your endeavour. But this at least means that you could save all your peers in a plain array and use that plain array as a binary tree structure, reaching a node's children just by using the node's index.
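The index arithmetic can be checked with a small sketch. The function names `parent` and `children` are illustrative, and returning `None` for the root's parent is an assumption, since the formula itself would give -1 there.

```python
def parent(i):
    """Index of the parent of node i in the implicit array layout."""
    if i == 0:
        return None          # the root has no parent
    return ((i + 1) >> 1) - 1


def children(i):
    """Indices of the left and right child of node i."""
    return (i << 1) + 1, (i + 1) << 1


# With peers kept in a plain list, children(i) that fall beyond the end
# of the list simply do not exist yet.
peers = ["peer%d" % k for k in range(7)]
left, right = children(2)
```

For the seven-node tree in the question, `children(2)` gives 5 and 6, and `parent(5)` gives back 2.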

A red black tree with the same key multiple times: store collections in the nodes or store them as multiple nodes?

Apparently you could do either, but the former is more common.
Why would you choose the latter and how does it work?
I read this: http://www.drdobbs.com/cpp/stls-red-black-trees/184410531; which made me think that they did it. It says:
insert_always is a status variable that tells rb_tree whether multiple instances of the same key value are allowed. This variable is set by the constructor and is used by the STL to distinguish between set and multiset and between map and multimap. set and map can only have one occurrence of a particular key, whereas multiset and multimap can have multiple occurrences.
Although now I think it doesn't necessarily mean that. They might still be using containers.
I'm thinking all the nodes with the same key would have to be in a row, because you have to store all nodes with the same key either on the right side or on the left side. So if you store equal nodes to the right and insert 1000 1s and one 2, you'd basically have a linked list, which would ruin the properties of the red-black tree.
Is the reason I can't find much on it that it's just a bad idea?

Downsides of storing duplicates as multiple nodes:
It expands the tree size, which makes searching slower.
If you want to retrieve all values for key K, you need M*log(N) time, where N is the total number of nodes and M is the number of values for key K, unless you introduce extra code (which complicates the data structure) to implement a linked list for these values. (If you store a collection in the node, the lookup takes only log(N) time, and it is simple to implement.)
It is more costly to delete. With the multi-node method you need to remove a node on every delete, but with collection storage you only need to remove node K when the last value of key K is deleted.
I can't think of any upside to the multi-node method.

Binary search trees by definition cannot contain duplicates. If you use them to produce a sorted list, throwing out the duplicates would produce an incorrect result.
I was working on an implementation of red-black trees in PHP when I ran into the duplicate issue. We are going to use the tree for sorting and searching.
I am considering adding an occurrence value to the node data type. When a duplicate is encountered, just increment the occurrence count. When walking the tree to produce output, just repeat the value by the number of occurrences. I think I would still have a valid BST and avoid having a whole chain of duplicate values, which preserves the optimal search time.
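The occurrence-count idea can be sketched as below. To keep it short, this is a plain unbalanced BST rather than a red-black tree (the recolouring and rotations are omitted), and `CountedNode`, `insert` and `in_order` are illustrative names rather than anything from the PHP implementation mentioned above.

```python
class CountedNode:
    def __init__(self, key):
        self.key = key
        self.count = 1       # how many times this key was inserted
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key; a duplicate only bumps the counter of its node."""
    if root is None:
        return CountedNode(key)
    node = root
    while True:
        if key == node.key:
            node.count += 1
            return root
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, CountedNode(key))
            return root
        node = child


def in_order(node):
    """Yield keys in sorted order, repeating each one count times."""
    if node is not None:
        yield from in_order(node.left)
        for _ in range(node.count):
            yield node.key
        yield from in_order(node.right)


root = None
for key in [2, 1, 1, 3, 1]:
    root = insert(root, key)
```

Here `list(in_order(root))` gives `[1, 1, 1, 2, 3]` from only three nodes, so inserting the same key many times never lengthens any path in the tree.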

Finding the width of a directed acyclic graph... with only the ability to find parents

I'm trying to find the width of a directed acyclic graph... as represented by an arbitrarily ordered list of nodes, without even an adjacency list.
The graph/list is for a parallel GNU Make-like workflow manager that uses files as its criteria for execution order. Each node has a list of source files and target files. We have a hash table in place so that, given a file name, the node which produces it can be determined. In this way, we can figure out a node's parents by examining the nodes which generate each of its source files using this table.
That is the ONLY ability I have at this point, without changing the code severely. The code has been in public use for a while, and the last thing we want to do is to change the structure significantly and have a bad release. And no, we don't have time to test rigorously (I am in an academic environment). Ideally we're hoping we can do this without doing anything more dangerous than adding fields to the node.
I'll be posting a community-wiki answer outlining my current approach and its flaws. If anyone wants to edit that, or use it as a starting point, feel free. If there's anything I can do to clarify things, I can answer questions or post code if needed.
Thanks!
EDIT: For anyone who cares, this will be in C. Yes, I know my pseudocode is in some horribly botched Python look-alike. I'm sort of hoping the language doesn't really matter.

I think the "width" you're considering here isn't really what you want: the width depends on how you assign levels to each node, and there you have some choice. You noticed this when you were deciding whether to assign all sources to level 0 or all sinks to the max level.
Instead, you just want to count the number of nodes and divide by the "critical path length", which is the longest path in the dag. This gives the average parallelism for the graph. It depends only on the graph itself, and it still gives you an indication of how wide the graph is.
To compute the critical path length, just do what you're doing - the critical path length is the maximum level you end up assigning.
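A small sketch of that measure, written in Python for brevity even though the real code is in C. It rests on two assumptions: the `parents` dict is a hypothetical stand-in for the file-to-producer hash table lookup, and the critical path is measured in nodes (maximum level plus one) so that a graph with a single level does not divide by zero.

```python
def average_parallelism(parents):
    """parents maps each node to the list of nodes it depends on."""
    depth = {}   # node -> number of nodes on the longest path ending at it

    def longest_to(node):
        if node not in depth:
            depth[node] = 1 + max(
                (longest_to(p) for p in parents.get(node, ())), default=0)
        return depth[node]

    # Include nodes that appear only as somebody's parent.
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    critical_path = max(longest_to(n) for n in nodes)
    return len(nodes) / critical_path


example = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
ratio = average_parallelism(example)
```

The recursion is fine for a sketch, but a very deep graph would need an iterative version to stay within Python's recursion limit.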

In my opinion, when you're doing this type of last-minute development, it's best to keep the new structures separate from the ones you are already using. At this point, if I were pressed for time, I would go for a simpler solution.
Create an adjacency matrix for the graph using the parent data (this should be easy).
Perform a topological sort using this matrix (or even use tsort if pressed for time).
Now that you have a topological order, create an array level with one element for each node.
For each node, in topological order:
If the node has no parents, set its level to 0.
Otherwise, set it to the maximum of its parents' levels, plus 1.
Count the nodes at each level and take the largest count as the width.
The question, as Keith Randall asked, is whether this is the measurement you actually need.
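These steps could be sketched as follows, with some liberties that should be read as assumptions: the standard-library `graphlib.TopologicalSorter` (Python 3.9+) replaces the adjacency matrix and tsort, a node's level is one more than the highest level among its parents, and `parents` is a hypothetical dict from each node to the nodes it depends on.

```python
from collections import Counter
from graphlib import TopologicalSorter


def levels_and_width(parents):
    """Return (level of each node, size of the most populated level)."""
    level = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(parents).static_order():
        ps = parents.get(node, ())
        level[node] = 1 + max(level[p] for p in ps) if ps else 0
    per_level = Counter(level.values())
    return level, max(per_level.values(), default=0)


example = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["a"]}
level, width = levels_and_width(example)
```

In this example the sources sit at level 0, so `b`, `c` and `e` share level 1 and the width comes out as 3.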

Here's what I (Platinum Azure, the original author) have so far.
Preparations/augmentations:
Add "children" field to linked list ("DAG") node
Add "level" field to "DAG" node
Add "children_left" field to "DAG" node. This is used to make sure that all children are examined before a parent is examined (in a later stage of the algorithm).
Algorithm:
Find the number of immediate children for all nodes; also, determine leaves by adding nodes with children==0 to list.
for l in L:
    l.children = 0
for l in L:
    l.level = 0
    for p in l.parents:
        ++p.children
Leaves = []
for l in L:
    l.children_left = l.children
    if l.children == 0:
        Leaves.append(l)
Assign every node a "reverse depth" level. Normally by depth, I mean topologically sort and assign depth=0 to nodes with no parents. However, I'm thinking I need to reverse this, with depth=0 corresponding to leaves. Also, we want to make sure that no node is added to the queue without all its children "looking at it" first (to determine its proper "depth level").
max_level = 0
while !Leaves.empty():
    l = Leaves.pop()
    for p in l.parents:
        --p.children_left
        if p.children_left == 0:
            /* we only want to append parents with for sure correct level */
            Leaves.append(p)
        p.level = Max(p.level, l.level + 1)
        if p.level > max_level:
            max_level = p.level
Now that every node has a level, simply create an array and then go through the list once more to count the number of nodes in each level.
level_count = new int[max_level+1]
for l in L:
    ++level_count[l.level]
width = Max(level_count)
So that's what I'm thinking so far. Is there a way to improve on it? It's linear time all the way, but it's got like five or six linear scans and there will probably be a lot of cache misses and the like. I have to wonder if there isn't a way to exploit some locality with a better data structure-- without actually changing the underlying code beyond node augmentation.
Any thoughts?
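For reference, the same three stages can be written as one runnable Python function. This is a sketch under assumptions, not the project's C code: `parents` is a hypothetical dict in which every node appears as a key, mapped to the list of its parents, and dictionaries stand in for the extra fields on the node.

```python
from collections import Counter


def width_from_parents(parents):
    # Stage 1: count the immediate children of every node.
    children_left = {node: 0 for node in parents}
    for ps in parents.values():
        for p in ps:
            children_left[p] += 1

    # Stage 2: start from the leaves and assign reverse depth levels.
    level = {node: 0 for node in parents}
    ready = [node for node, c in children_left.items() if c == 0]
    while ready:
        node = ready.pop()
        for p in parents[node]:
            level[p] = max(level[p], level[node] + 1)
            children_left[p] -= 1
            if children_left[p] == 0:
                ready.append(p)   # every child has now reported its level

    # Stage 3: the width is the size of the most populated level.
    return max(Counter(level.values()).values(), default=0)


example = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["a"]}
width = width_from_parents(example)
```

With leaves at level 0, `d` and `e` share a level, as do `b` and `c`, so the width of this example is 2.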

A data structure based on the R-Tree: creating new child nodes when a node is full, but what if I have a lot of objects at the exact same position?

I realize my title is not very clear, but I am having trouble thinking of a better one. If anyone wants to correct it, please do.
I'm developing a data structure for my 2-dimensional game with an infinite universe. The data structure is based on a simple (!) node/leaf system, like the R-Tree.
This is the basic concept: you set the maximum number of children you want a node (a container) to have. If you want to add a leaf, but the node the leaf should be in is full, then it will create a new set of nodes within this node and move all current leaves to their new (more exact) node. This way, very populated areas will have a lot more subdivisions than a very big but rarely visited area.
This works for normal objects. The only problem arises when I have more than maxChildsPerNode objects with the exact same X,Y location: because the node is full, it will create more exact subnodes, but the old leaves will all be put in the exact same node again because they have the exact same position, resulting in an infinite loop of creating more and more nodes.
So, what should I do when I want to add more than maxChildsPerNode leaves with the exact same position to my tree?
PS. If I failed to explain my problem, please tell me, so I can try to improve the explanation.
Update: this is how I check if all leaves in a full node have identical positions:
//returns whether all leafs in the given leaf list are identical
private function allLeafsHaveSamePos(leafArr:Array<RTLeaf<T>>):Bool {
    if (leafArr.length > 1) {
        var lastLeafTopX:Float = leafArr[0].topX;
        var lastLeafTopY:Float = leafArr[0].topY;
        for (i in 1...leafArr.length) {
            if ((leafArr[i].topX != lastLeafTopX) || (leafArr[i].topY != lastLeafTopY)) return false;
        }
    }
    return true;
}

I would like to ask a question...
Is it that important that the maxChildsPerNode constraint be respected?
I would rather think of this maximum as a hint to the structure for when to split, and simply ignore it when there is no meaningful way to perform the split.
Of course you'd better rethink the name then, otherwise it would be odd for the next maintainer.
In pseudocode I would use something like this:
def AddToNode(node, item):
    node.items.append(item)
    if len(node.items) > node.splitHint:
        leftNode = Node(node.splitHint)
        rightNode = Node(node.splitHint)
        node.split(leftNode, rightNode)
        if len(leftNode.items) == 0 or len(rightNode.items) == 0:
            node.splitHint *= 1.5 # famous golden ratio ;)
        else:
            node.items = [leftNode, rightNode]
The important step is to modify the hint when it is detected that we can't abide by it, so that this check is not performed at each insertion (this way we obtain a constant amortized cost).

It looks like a bit of a mismatch between your data and your structure, since you have a structure that assumes N objects within an arbitrarily large area when you're supplying it with >N objects on an infinitely small point. It might be worth using a different structure for this data?
Hack fix: apply a tiny random displacement to your newly created objects. This should allow the space to be subdivided by the existing algorithm.
Better fix: ensure that your algorithm for splitting a leaf node always generates at least 2 new leaf nodes to begin with. When reassigning objects to the new leaf nodes, or when performing normal insertions, iterate over all the candidates, and if more than one is equally suitable then you can tie-break based on how full they are. This should result in your co-located players ending up split evenly across the otherwise identical nodes.

From common sense I'd not assume having two objects in the same position ever, but if this is a part of the idea, then I would introduce one more axis, say 'spin', an integer number, and impose a restriction that all your objects are fermions (cannot have the same location and spin at the same time).

If you have a set of objects on the exact same spot, any query for a region that contains one should return all - so there's no reason to split them, as it doesn't gain you anything. Simply either count the number of distinct locations when deciding to split, or have each element on the leaf node be an object that encapsulates (coordinates, [list of objects at those coordinates]).
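Both variants of that suggestion fit in one small sketch. Everything named here is an assumption made for illustration: `LeafBucket`, `max_positions` and `needs_split` do not come from the question's code, and a dict keyed by the (x, y) pair plays the role of the "(coordinates, [list of objects])" entry.

```python
from collections import defaultdict


class LeafBucket:
    """Leaf storage that groups objects by their exact position."""

    def __init__(self, max_positions):
        self.max_positions = max_positions
        self.by_pos = defaultdict(list)   # (x, y) -> objects at that spot

    def add(self, x, y, obj):
        self.by_pos[(x, y)].append(obj)

    def needs_split(self):
        # Only distinct positions count; co-located objects can never be
        # separated by subdividing space, so they must not trigger a split.
        return len(self.by_pos) > self.max_positions


bucket = LeafBucket(max_positions=4)
for n in range(10):
    bucket.add(5.0, 5.0, "player%d" % n)
crowded_single_spot = bucket.needs_split()

for n in range(4):
    bucket.add(float(n), 0.0, "rock%d" % n)
several_spots = bucket.needs_split()
```

Ten objects on one spot leave `needs_split()` false, while four further distinct positions push the count to five and make it true, so the infinite subdivision described in the question cannot start.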
