Visiting every connected node without visiting more than once

I have a set of objects. Each object contains a list of other objects to which it is connected, but not all objects are connected to every other object. I'm trying to visit each object connected to a specific starting object. The most obvious way to accomplish this is:
1. Put each object connected to the starting point into a queue.
2. For each object in the queue:
   a. Perform whatever operation on this object.
   b. Add this object to a list of visited objects.
   c. For each object connected to this object, check whether it is in the visited list; if not, add it to the queue.
Is there a better way that doesn't involve storing a list of every visited object?

Given the data structure you describe (any object can connect to any other), I don't think you have a choice other than to keep an already-visited list. If your objects were in a hierarchical tree structure, then a recursive tree-walking algorithm could be implemented to do what you want.
In your structure of peer-connected objects, any algorithm that tries to do away with an already-visited list will run the risk of going into infinite loops for circular references. I suppose you could create a 'visited' flag in the objects themselves, clearing them all before some algorithm runs, but this seems more clunky than the list method (and is inherently less thread-able).
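As a rough illustration of the queue-based approach from the question, here is a minimal Python sketch. The names `visit_connected`, `neighbors`, and `operation` are made up for this example, and the graph is hypothetical. It differs from the question in two small ways: it uses a set rather than a list for the visited collection (membership tests are then constant time on average), and it also visits the starting object itself.

```python
from collections import deque

def visit_connected(start, neighbors, operation):
    """Breadth-first visit of every object reachable from `start`, each exactly once."""
    visited = {start}              # already-visited collection
    queue = deque([start])
    while queue:
        obj = queue.popleft()
        operation(obj)
        for other in neighbors(obj):
            if other not in visited:
                visited.add(other)  # mark when enqueued so nothing is queued twice
                queue.append(other)

# Hypothetical graph with a cycle (a -> b -> c -> a) and a node "d" that is not reachable from "a".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
order = []
visit_connected("a", lambda name: graph[name], order.append)
```

Despite the circular reference, the loop terminates because no object enters the queue a second time.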

Related

In breadth-first search and depth-first search, why is the visited array initialized globally?

By "visited array" I mean the array in which we record whether or not each node has been visited.

Tree traversal is easy, provided the tree is well-formed. You do not have to keep track of whether you're potentially about to repeat some work in a possibly endless loop because there's exactly one path to reach each node.
But for generalised graph traversal where the graph may contain loops, many trivial algorithms will end up looping endlessly when they encounter a loop. To guarantee that such algorithms do terminate, we usually want to keep track of which nodes we've visited during the entire traversal and not re-explore a node we've already visited.
That's the purpose of the visited collection (whether that is an array, a set, etc. is irrelevant). An alternative is for each node to have a visited flag which is unset before the traversal and set during it. That avoids the need for a global collection but imposes its own limitation (only a single traversal can occur at a time).
The visited collection doesn't need to be "global" [1] but does need to be in a common scope shared by all parts of each traversal.
[1] If it is "global" then, in whatever scope that has meaning, again only a single traversal at a time is possible.
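To make the scoping point concrete, here is a small Python sketch in which each traversal owns its visited set, so two traversals of the same graph can be interleaved without disturbing each other. The function name `dfs` and the example graph are assumptions for illustration only.

```python
def dfs(graph, start):
    """Yield the nodes reachable from `start` in depth-first order."""
    visited = set()        # owned by this traversal only, shared by its recursive calls

    def explore(node):
        visited.add(node)
        yield node
        for nxt in graph[node]:
            if nxt not in visited:
                yield from explore(nxt)

    yield from explore(start)

# Hypothetical graph containing the loop 1 -> 2 -> 1.
graph = {1: [2, 3], 2: [1], 3: [1, 4], 4: []}

# Two traversals advanced alternately; neither sees the other's bookkeeping.
first, second = dfs(graph, 1), dfs(graph, 3)
interleaved = [next(first), next(second), next(first), next(second)]
```

With a single global collection, or a visited flag stored on the nodes, the second traversal would see nodes marked by the first and skip them.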

Why the visited array is initialized globally?
Since the array is used for keeping track of the entire graph, it is better to have a global/class level initialization.
Otherwise, in a method level initialization you would need to pass the tracking information (aka the visited[] array) by reference or make a new copy of it for every call to explore a node.
Further, if:
you were tracking something local to the current node; or
the algorithm's implementation was not recursive;
you could get away with a local initialization too.
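The pass-by-reference option mentioned above might look like the following Python sketch. The names `explore` and `reachable` are hypothetical. The set is created once at method level and the same object is handed to every recursive call, so no global or class-level state is needed.

```python
def explore(graph, node, visited):
    """Recursive DFS step; `visited` is the same set object in every call."""
    visited.add(node)
    for nxt in graph[node]:
        if nxt not in visited:
            explore(graph, nxt, visited)

def reachable(graph, start):
    visited = set()                 # method-level initialization, once per traversal
    explore(graph, start, visited)  # Python passes the reference, not a copy
    return visited

# Hypothetical graph with a cycle a -> b -> c -> a and an isolated node d.
graph = {"a": ["b"], "b": ["c", "a"], "c": ["a"], "d": []}
found = reachable(graph, "a")
```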

When to use parent pointers in trees?

There are many problems in which we need to find the parents or ancestors of a node in a tree repeatedly. In those scenarios, instead of finding the parent node at run-time, a less complicated approach seems to be using parent pointers. This is time efficient but increases space. Can anyone suggest in which kinds of problems or scenarios it is advisable to use parent pointers in a tree?
For example, the distance between two nodes of a tree?

"...using parent pointers. This is time efficient but increases space."
A classic trade-off in computer science.
"In which kinds of problems or scenarios is it advisable to use parent pointers in a tree?"
In cases where finding the parents at run-time would cost much more than storing pointers to the parents.
Now, one has to understand what "cost" means. You mentioned the trade-off yourself: you should consider whether it is worth spending some extra memory to store the pointers in order to speed up your program.
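For the example raised in the question, the distance between two nodes, a parent-pointer version could be sketched in Python as follows. The `Node` class and the functions `depth` and `distance` are assumptions made for this illustration, and both nodes are assumed to belong to the same tree.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent        # None for the root

def depth(node):
    """Number of edges from `node` up to the root."""
    d = 0
    while node.parent is not None:
        node = node.parent
        d += 1
    return d

def distance(a, b):
    """Number of edges on the path between a and b (same tree assumed)."""
    da, db = depth(a), depth(b)
    steps = 0
    # Lift the deeper node until both are at the same depth.
    while da > db:
        a, da, steps = a.parent, da - 1, steps + 1
    while db > da:
        b, db, steps = b.parent, db - 1, steps + 1
    # Climb together until the two paths meet at the lowest common ancestor.
    while a is not b:
        a, b, steps = a.parent, b.parent, steps + 2
    return steps

root = Node("root")
left = Node("left", root)
right = Node("right", root)
leaf = Node("leaf", left)
```

The extra memory is one reference per node; in exchange, the query only walks the two root paths instead of searching the tree from the top.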

Here are some scenarios I can think of where having a parent pointer saved in each node could help improve our time complexity:
-> Ancestors of a given node in a binary tree
-> Union Find Algorithm
-> Maintain collection of disjoint sets
-> Merge two sets together
In my opinion, having a parent pointer generally makes top-down or bottom-up traversal easier in any kind of tree or trie problem.
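As an illustration of the union-find scenario listed above, here is a compact Python sketch in which the only structure is a parent reference per item. The class name `DisjointSets` is hypothetical, and this version uses path compression but omits union by rank for brevity.

```python
class DisjointSets:
    def __init__(self, items):
        self.parent = {x: x for x in items}   # every item starts as its own root

    def find(self, x):
        """Return the representative of x's set, compressing the path on the way."""
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        """Merge the sets containing a and b."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

sets = DisjointSets(["a", "b", "c", "d"])
sets.union("a", "b")
sets.union("c", "d")
```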
Hope this helps!

As a generalized answer: cases where you need efficient bottom-up traversal outside the context of a top-to-bottom traversal.
As a concrete example, say you have graphics software that uses a quad-tree to efficiently draw only the elements on screen and to let users efficiently select elements by clicking on them or by marquee selection.
However, after the users select some elements, they can then delete them. Deleting those elements would require the quad-tree to be updated in a bottom-up sort of fashion, updating parent nodes in response to leaf nodes becoming empty. But the elements we want to delete are stored in a different selection list data structure. We didn't arrive at the elements to delete through a top-to-bottom tree traversal.
In that case it might not only be a lot simpler to implement but also computationally efficient to store pointers/indices from child to parent, and possibly even element to leaf, since we're updating the tree in response to activity that occurred at the leaves in a bottom-up fashion. Otherwise you'd have to work from top to bottom and then back up again somehow, and the removal of such elements would have to be done centrally through the tree working in a top-to-bottom-and-back-up-again fashion.
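A simplified Python sketch of that bottom-up update follows. It is not a real quad-tree (there is no spatial subdivision); the classes `TreeNode` and `Element` and the function `remove_element` are invented here only to show the child-to-parent and element-to-leaf back-pointers at work.

```python
class TreeNode:
    def __init__(self, parent=None):
        self.parent = parent       # child -> parent back-pointer
        self.children = []
        self.elements = []         # only leaves hold elements in this sketch
        if parent is not None:
            parent.children.append(self)

class Element:
    def __init__(self, name, leaf):
        self.name = name
        self.leaf = leaf           # element -> leaf back-pointer
        leaf.elements.append(self)

def remove_element(element):
    """Delete an element, then prune nodes that became empty, walking upward."""
    node = element.leaf
    node.elements.remove(element)
    # Stop at the root or at the first node that still holds something.
    while node.parent is not None and not node.elements and not node.children:
        parent = node.parent
        parent.children.remove(node)
        node = parent

root = TreeNode()
branch = TreeNode(root)
leaf_a, leaf_b = TreeNode(branch), TreeNode(root)
circle, square = Element("circle", leaf_a), Element("square", leaf_b)

selection = [circle]               # selection list kept outside the tree
for selected in selection:
    remove_element(selected)
```

The deletion starts from the selection list and never descends from the root; the cost is proportional to the height of the tree above the affected leaf.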
To me the most useful cases I've found would be cases where the tree needs to update as a result of activity occurring in the leaves from the "outside world", so to speak, not in the middle of descending down the tree, and often involving two or more data structures, not just that one tree itself.
As another example, say you have a GUI widget which, upon being clicked, minimizes its parent widget. We don't descend the tree to determine which widget was clicked; we use another data structure for that, such as a spatial hash. So we want to get from the child widget to its parent, but because we arrived at the child through a spatial query into a different data structure rather than through a top-down traversal of the GUI hierarchy, we don't have the parent readily available (in a stack, for example). In that case we can avoid working our way down from the root to the child's parent if the child simply stores its parent.

Algorithms - Graph Depth-First Search

I'm learning about graphs and DFS, and trying to do something similar to how ANT resolves dependencies. I'm confused about something, and all the articles I've read seem to assume everyone knows it.
I'm thinking of having a Map with key = file and value = the Set of files that the key depends on.
The DFS algorithm shows that I have to change the color of a node once it has been visited. That means the fileNode used as a key and the one stored in a Set<> must be the same reference, right?
Therefore, I'm thinking that each time a Node is created (including neighbor nodes), I would add it to one more Collection (maybe another Map?). Then, whenever a new Node is to be added to the graph (as a key), I would search that Collection and use that reference instead. Am I wasting too much space? How is it usually done? Is there a better way?

During my studies, the DFS algorithm was implemented like this:
Put the starting node on a stack (a structure from which you can only retrieve and remove the most recently added element).
Pop the top element and mark it as seen; this can be done either through coloring or by setting an attribute, let's call it isSeen, to true.
Then look at all the neighbors of that node and push those that have not been seen yet onto the stack.
Once you have looked at all the neighbors, pop the next element from the stack and do the same as for the first, skipping any element that has already been seen.
The result is that every node that can be reached from the starting node has its attribute set to seen.
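One way to sidestep the reference-identity worry from the question is to keep the colors in a separate map keyed by file name, so it does not matter which node object you hold. The Python sketch below assumes files are identified by strings; `build_order`, the color constants, and the sample dependency map are all hypothetical. Note that it uses recursion rather than the explicit stack described above, so that a file is only emitted after everything it depends on.

```python
WHITE, GRAY, BLACK = "unvisited", "in progress", "done"

def build_order(depends_on):
    """Return the files ordered so each one comes after the files it depends on."""
    color = {}        # file name -> color; a missing entry means WHITE
    order = []

    def visit(name):
        color[name] = GRAY
        for dep in sorted(depends_on.get(name, ())):
            state = color.get(dep, WHITE)
            if state == GRAY:       # reached a file that is still being processed
                raise ValueError(f"dependency cycle through {dep!r}")
            if state == WHITE:
                visit(dep)
        color[name] = BLACK
        order.append(name)

    for name in sorted(depends_on):
        if color.get(name, WHITE) == WHITE:
            visit(name)
    return order

# Hypothetical dependency map: key = file, value = set of files the key depends on.
deps = {"app": {"lib", "util"}, "lib": {"util"}, "util": set()}
order = build_order(deps)
```

Because lookups go through the name, the key in the map and the entry in a dependency set only need to be equal, not the same object.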
Hope this helped.

Can a graph node maintain a list of references to its parents?

I have a DAG implementation that works perfectly for my needs. I'm using it as an internal structure for one of my projects. Recently, I came across a use case where if I modify the attribute of a node, I need to propagate that attribute up to its parents and all the way up to the root. Each node in my DAG currently has an adjacency list that is basically just a list of references to the node's children. However, if I need to propagate changes to the parents of this node (and this node can have multiple parents), I will need a list of references to parent nodes.
Is this acceptable? Or is there a better way of doing this? Does it make sense to maintain two lists (one for parents and one for children)? I thought of adding the parents to the same adjacency list but this will give me cycles (i.e., parent->child and child->parent) for every parent-child relationship.

It's never necessary to store parent pointers in each node, but doing so can make things run a lot faster because you know exactly where to look in order to find the parents. In your case it's perfectly reasonable.
As an analogy, many implementations of binary search trees store parent pointers so that they can more easily support rotations (which need access to the parent) or deletions (where the parent node may need to be known). Similarly, some more complex data structures like Fibonacci heaps use parent pointers in each node in order to implement the decrease-key operation more efficiently.
The memory overhead for storing a list of parents isn't going to be too bad - you're essentially now double-counting each edge: each parent stores a pointer to its child and each child stores a pointer to its parent.
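A minimal Python sketch of keeping two lists per node, under the assumption that edges are only ever added through one helper so the lists cannot drift apart. The names `DagNode`, `add_edge`, and `propagate_up` are made up for this example.

```python
class DagNode:
    def __init__(self, name):
        self.name = name
        self.children = []     # existing adjacency list
        self.parents = []      # reverse edges, kept separately from children
        self.attrs = {}

def add_edge(parent, child):
    """Record the edge in both directions so the two lists stay in sync."""
    parent.children.append(child)
    child.parents.append(parent)

def propagate_up(node, key, value):
    """Set attrs[key] = value on `node` and on every ancestor, each exactly once."""
    seen = {node}
    stack = [node]
    while stack:
        current = stack.pop()
        current.attrs[key] = value
        for parent in current.parents:
            if parent not in seen:   # a shared ancestor is reachable via several paths
                seen.add(parent)
                stack.append(parent)

root, left, right, leaf, other = (DagNode(n) for n in ["root", "left", "right", "leaf", "other"])
add_edge(root, left)
add_edge(root, right)
add_edge(root, other)
add_edge(left, leaf)
add_edge(right, leaf)
propagate_up(leaf, "dirty", True)
```

Keeping parents out of the children list means ordinary downward traversals never see the reverse edges, so the structure stays acyclic from their point of view.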
Hope this helps!

What are good ways of organizing directed graph data?

Here's my situation. I have a graph that has different sets of data being added at different times. For example, set1 might have a few thousand nodes, and then set2 comes in later and we apply business logic to create edges from set1 to set2 (and discard any vertices from set1 that do not have edges to set2). Then at a later point we get set3, set4, and so on, and the same process applies between each set and its previous set.
Question: what's the best way to organize this? What I did before was name the nodes set1-xx, set2-xx, etc. The problem I faced was that when I was trying to run analytics between the current set and the previous set, I would have to loop through the entire graph and look for all the nodes whose names started with 'setx'. It took a long time as the graph grew, so I thought of another solution, which was to create a node called 'set1' and have it connected to all nodes of that particular set. I am testing it, but I was wondering if there is a more efficient or built-in way of handling data structures like this. Is there a way to somehow segment data like this?
I think a general solution would be applicable, but if it helps, I'm using neo4j (so any solution specific to that database would be good as well).

You have a very special type of a directed graph, called a layered graph.
The choice of the data structure depends primarily on the expected graph density (how many nodes from a previous set/layer are typically connected to a node in the current set/layer) and on the operations that you need to perform on it most of the time. It is definitely a good idea to have each layer directly represented by a numeric index (that is, the outermost structure will be an array of sets/layers), and presumably you can also use one array of vertices per layer. However, the list of edges per vertex (out only, or in and out sets of edges depending on whether you ever traverse the layers backward) may be any of the following:
Linked list of vertex identifiers; this is good if the graph is very sparse and edges are often added/removed.
Sorted array of vertex identifiers; this is good if the graph is quite sparse and immutable.
Array of booleans, indexed by vertex identifiers, determining whether a given vertex is or is not linked by an edge from the current vertex; this is good if the graph is dense.
The "vertex identifier" can take many forms. For example, it can be an index into the array of vertices on the next layer.
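The layout described above could be sketched in Python roughly as follows, using the sorted-array option for the edge lists. The `Layer` class, the module-level `layers` list, and the helper names are assumptions for illustration; here the vertex identifier is an index into the next layer's vertex array.

```python
from bisect import bisect_left

class Layer:
    def __init__(self, vertices):
        self.vertices = list(vertices)                 # one array of vertices per layer
        self.out_edges = [[] for _ in self.vertices]   # sorted indices into the next layer

layers = []    # outermost structure: layers[0] is set1, layers[1] is set2, ...

def add_edge(layer_index, src, dst):
    """Connect vertex `src` of layers[layer_index] to vertex `dst` of the next layer."""
    edges = layers[layer_index].out_edges[src]
    pos = bisect_left(edges, dst)
    if pos == len(edges) or edges[pos] != dst:
        edges.insert(pos, dst)     # keep the list sorted and free of duplicates

def has_edge(layer_index, src, dst):
    """Binary search in the sorted edge list."""
    edges = layers[layer_index].out_edges[src]
    pos = bisect_left(edges, dst)
    return pos < len(edges) and edges[pos] == dst

layers.append(Layer(["a", "b", "c"]))   # hypothetical set1
layers.append(Layer(["x", "y"]))        # hypothetical set2
add_edge(0, 0, 1)    # a -> y
add_edge(0, 2, 1)    # c -> y
add_edge(0, 2, 0)    # c -> x
```

Analytics between the current set and the previous one then only touch `layers[-1]` and `layers[-2]` instead of scanning the whole graph. For a dense graph, each inner list could instead be a list of booleans indexed by `dst`.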

Your second solution is what I would do: create a setX node and connect all nodes belonging to that set to setX. That way your data is partitioned and easier to query.
