Could not find a suitable index to answer graph query and graph scans are disabled: [(~label = user)]:VERTEX - janusgraph

I want to count how many vertices (nodes) are in the graph:
gremlin> g.V().count().next()
Could not find a suitable index to answer graph query and graph scans are disabled: [(~label = user)]:VERTEX
Gremlin.version() is 3.2.9
I use JanusGraph with HBase as the storage backend and Elasticsearch as the index backend.
But some queries work fine, like this one:
gremlin> g.V().has('user_id','47357061').values('real_name')
==>jack
I don't understand why I can't query the count.

JanusGraph simply cannot answer that traversal without iterating over all vertices. It does not store the count as a separate value in the backend, so it actually has to compute the count for this traversal, which means iterating over all vertices.
The warning you see just informs you of this fact and advises you to execute only traversals that can be answered by an index, as all other traversals will run into scalability issues as your graph grows.
If you really need to execute traversals that cannot make use of an index because they have to visit all vertices in your graph (or a large number of them), then you should look into Hadoop-Gremlin, which uses Spark to execute such a traversal in parallel across multiple Spark workers.

Related

Disjoint Set Union with directed graph

I understand that DSU strictly works with undirected graphs, from this Stack Overflow question: Can we detect cycles in a directed graph using the Union-Find data structure?
Nevertheless, I am currently working on a problem that involves 400,000 queries and a graph with at most 400,000 nodes, in which there are two possible queries:
Connect nodes a and b (directed, of course)
Output "true" if node x is reachable from node 1; otherwise, print "false."
My original instinct was to use DSU, but that obviously would not work. Any suggestions? Thank you.
As long as you don't have to delete connections, and you're only interested in reachability from a single source, then you can do this in amortized constant time per operation pretty easily.
Maintain a set of nodes that are known to be reachable from node 1. Initially this would just contain node 1. We'll call these nodes 'marked'.
When you connect a->b, if a is marked but b is not, then mark all the newly reachable nodes. This is a transitive closure operation:
put b in a queue.
while the queue is not empty, remove a vertex, mark it reachable, and put all of its unmarked neighbors in the queue.
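The two steps above can be sketched in Python as follows (the class and method names are my own, chosen for illustration):

```python
from collections import defaultdict, deque

class IncrementalReachability:
    """Maintains the set of nodes reachable from a single source
    under edge insertions, in amortized constant time per operation."""

    def __init__(self, source):
        self.adj = defaultdict(list)   # outgoing adjacency lists
        self.marked = {source}         # nodes known reachable from source

    def connect(self, a, b):
        self.adj[a].append(b)
        # If a is reachable but b is not yet, propagate the marking
        # to everything newly reachable through b.
        if a in self.marked and b not in self.marked:
            queue = deque([b])
            while queue:
                v = queue.popleft()
                if v in self.marked:
                    continue
                self.marked.add(v)
                queue.extend(w for w in self.adj[v] if w not in self.marked)

    def reachable(self, x):
        return x in self.marked
```

Each node is marked at most once and each edge is scanned at most once during the closure, which is where the amortized constant time per operation comes from.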
What you want is a data structure for a problem called 'incremental reachability'.
There are multiple ways to construct such a data structure, each with different update/query time tradeoffs. A very simple way to achieve the goal is to use an adjacency list and run BFS every time a user queries whether node x is reachable from node 1.
This gets you O(1) update time and O(m) query time.
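As a minimal Python sketch of this simple variant (the function names are my own):

```python
from collections import defaultdict, deque

adj = defaultdict(list)

def connect(a, b):
    """O(1) update: just record the directed edge a -> b."""
    adj[a].append(b)

def reachable(x, source=1):
    """O(m) query: BFS from the source until x is found (or not)."""
    seen = {source}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == x:
            return True
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False
```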
A more complicated idea is Even-Shiloach trees [1], in which a BFS tree is maintained efficiently.
Total update time: O(nm); query time: O(1).
An experimental analysis of similar algorithms can be found in [2].
[1] Shimon Even, Yossi Shiloach: An On-Line Edge-Deletion Problem. J. ACM 28(1): 1-4 (1981) https://dl.acm.org/doi/10.1145/322234.322235
[2] Kathrin Hanauer, Monika Henzinger, Christian Schulz: Fully Dynamic Single-Source Reachability in Practice: An Experimental Study. https://arxiv.org/abs/1905.01216

Given a query containing two integers as nodes, find all the children of those two nodes in tree?

This is my interview question which has the following problem statement
You are given M queries (1 <= M <= 100000), where every query has 2 integers that behave as nodes of some tree. How will you give all the children (the subtree) of each of these 2 nodes?
Well, my approach was naive. I used DFS from both integers (nodes) for every query, but the interviewer wanted a more optimized approach.
Put simply, we have to print the subtree of each node given in the queries. There could be many queries, so we can't run a fresh DFS for every node in every query.
Any hints on how I can optimize this?
You could optimize an algorithm that performs DFS on both nodes when one of the nodes is a child of the other.
Suppose node 2 is a child of node 1. In this case, the DFS from node 1 already visits all of the children of node 2, so running DFS again from node 2 is redundant. You can avoid this by storing intermediate results to skip the recalculation (see dynamic programming, specifically the Fibonacci example, for how recursive calls can avoid recomputing values).
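The caching idea can be sketched in Python like this (names are illustrative, and note the honest caveat that storing an explicit node list per subtree can use quadratic memory on path-like trees):

```python
def collect_subtrees(root, children, queries):
    """Answer subtree queries with a single DFS, caching each node's
    subtree list so that no subtree is ever computed twice."""
    cache = {}

    def dfs(u):
        if u in cache:                 # already computed: reuse it
            return cache[u]
        nodes = [u]
        for c in children.get(u, []):
            nodes.extend(dfs(c))
        cache[u] = nodes
        return nodes

    dfs(root)                          # fills the cache for every node
    return [cache[q] for q in queries]
```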
For a single query, DFS should be the optimal way. For a larger number of queries, here are a few things you could do:
Cache your results. When a number shows up frequently (say 100 times), save that printed subtree to memory and just return the result when the same number appears again.
When caching, also mark all the nodes contained in the cached subtree on your original tree. When a query contains such a node, refer to the cached subtree instead of the original tree, since you have already done DFS on these nodes as well.
As noted by @K. Dackow, if a query contains A and B and B is a child of A, you can reuse the DFS results for B when traversing the tree for A. If permitted, you can even look ahead at multiple queries (say 10) and see whether any of their nodes belong to the subtree you are currently traversing: set up a queue of queries and, during one DFS traversal, check the top items in your queue to see if you have met any of their nodes.
Hope this helps!

Graph Partitioning

Suppose I have a graph containing 10,000 nodes, and I want to look for a certain number of nodes in the graph. I want to achieve this with a graph-partitioning technique, so that if a reasonable number of the desired nodes is found in some partition, I can stop searching. How should I do the partitioning? What are suitable algorithms or tools to use?
My graph is in matrix format, where mat[i][j] gives the weight of the edge between nodes i and j.
After finding a partition, I want to have a list of all the nodes present in that partition.
There are many algorithms for graph partitioning. The main goal of most of them is to decrease the cut between partitions, but if you have a different goal, you must model it as a cut objective.
You can use the "Graph Voronoi Diagram Partitioner" for partitioning.
Other software for partitioning includes METIS (http://glaros.dtc.umn.edu/gkhome/metis/metis/overview) and KaFFPa (http://algo2.iti.kit.edu/documents/kahip/).
If I understand you correctly, you want to search for a set of nodes that match some objective -- while parallelizing this search on the partition level. In this case, a good partitioning strategy would be to balance the number of "matching nodes" across partitions.
The reason is that if those matching nodes are balanced, all parallel workers working independently on a partition have the same chance of finding/matching the vertices.
In my experience, you can achieve very solid balancing with respect to different objectives by using random assignment of vertices to partitions, even if the edge-cut size is not optimal.
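A quick Python sketch of this random-assignment strategy (the function name and the partition counts are my own, for illustration): with 10,000 vertices spread uniformly over 8 partitions, each partition ends up close to 10,000 / 8 = 1,250 vertices, and any uniformly spread subset of "matching" vertices is balanced the same way.

```python
import random

def random_partition(num_nodes, num_parts, seed=0):
    """Assign each vertex to a partition uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(num_parts) for _ in range(num_nodes)]

part = random_partition(10_000, 8)
sizes = [part.count(p) for p in range(8)]   # each size is near 1,250
```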
However, it is difficult to answer your question without knowing more about your objective, so maybe you can update your question with more details.

How to Partition a graph into possibly overlapping parts such that any vertex contained in a part has at least distance k from the Boundary?

How do you partition a graph into possibly overlapping parts such that every vertex is contained in some part in which it has distance at least k from that part's boundary?
The problem arises in cases where the whole graph cannot be loaded onto a single machine because there is not sufficient memory. So another requirement is that the parts have roughly equal numbers of vertices.
Are there any algorithms that try to minimize the common vertices between parts?
The use case here is this: you want to perform a query starting from an initial vertex that you know will require at most k traversals. Having a part that contains all the vertices of this query results in zero network utilization.
The problem, then, is to reduce the memory overhead of such a partition.
Any books I should read?
I found this which looks promising:
http://grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf
Final edit: It is no coincidence that Google decided to use a hash partition. Finding a good partition is difficult. I'll go with a hash partition as well and hope that the data center has good network bandwidth.
You can use a breadth-first search to get all the nodes that are at most distance k from the node in question, starting with the node itself. When you reach distance k from the origin, you can end the search.
Edit:
Use a breadth-first search from the boundary vertices to assign a minimum-distance-from-boundary property to each node (BFS, unlike DFS, computes shortest distances). Once that search is complete, a simple iteration through the nodes provides the partitions. For example, you can create a table whose key is the minimum distance from the boundary and whose value is a vector of nodes representing the partition.
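The depth-limited search from a start vertex can be sketched in Python as follows (the function name and adjacency format are my own):

```python
from collections import deque

def within_distance_k(adj, start, k):
    """Collect all vertices at distance <= k from `start` using BFS,
    stopping the expansion once depth k is reached."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if dist[v] == k:
            continue            # do not expand past depth k
        for w in adj.get(v, []):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)
```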

What are good ways of organizing directed graph data?

Here's my situation. I have a graph that has different sets of data added at different times. For example, set1 might have a few thousand nodes; then set2 comes in later, and we apply business logic to create edges from set1 to set2 (and discard any vertices of set1 that have no edges to set2). Then at a later point we get set3, set4, and so on, and the same process applies between each set and its predecessor.
Question: what's the best way to organize this? What I did before was name the nodes set1-xx, set2-xx, etc. The problem I faced was that when I tried to run analytics between the current set and the previous set, I had to loop through the entire graph looking for all the nodes whose names started with 'setx'. That took a long time as the graph grew, so I thought of another solution: create a node called 'set1' and connect it to all nodes of that particular set. I am testing this, but I was wondering whether there is a more efficient way, or a built-in way, of handling data structures like this. Is there a way to segment data like this?
I think a general solution would be applicable, but if it helps, I'm using Neo4j (so any solution specific to that database would be good as well).
You have a very special type of a directed graph, called a layered graph.
The choice of the data structure depends primarily on the expected graph density (how many nodes from a previous set/layer are typically connected to a node in the current set/layer) and on the operations that you need to perform on it most of the time. It is definitely a good idea to have each layer directly represented by a numeric index (that is, the outermost structure will be an array of sets/layers), and presumably you can also use one array of vertices per layer. However, the list of edges per vertex (out only, or in and out sets of edges depending on whether you ever traverse the layers backward) may be any of the following:
Linked list of vertex identifiers; this is good if the graph is very sparse and edges are often added/removed.
Sorted array of vertex identifiers; this is good if the graph is quite sparse and immutable.
Array of booleans, indexed by vertex identifiers, determining whether a given vertex is or is not linked by an edge from the current vertex; this is good if the graph is dense.
The "vertex identifier" can take many forms. For example, it can be an index into the array of vertices on the next layer.
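As a minimal Python sketch of the sorted-array option, with layers as arrays of vertices and next-layer indices as vertex identifiers (the class and method names are my own):

```python
import bisect

class LayeredGraph:
    """Layers are numeric indices; each layer holds an array of vertex
    payloads, and each vertex keeps a sorted array of indices of its
    out-neighbors on the next layer."""

    def __init__(self):
        self.layers = []    # layers[i] = list of vertex payloads
        self.out = []       # out[i][v] = sorted next-layer indices

    def add_layer(self, vertices):
        self.layers.append(list(vertices))
        self.out.append([[] for _ in vertices])
        return len(self.layers) - 1    # the new layer's index

    def add_edge(self, layer, v, next_v):
        # Keep each edge list sorted so membership tests are O(log d).
        bisect.insort(self.out[layer][v], next_v)

    def has_edge(self, layer, v, next_v):
        edges = self.out[layer][v]
        i = bisect.bisect_left(edges, next_v)
        return i < len(edges) and edges[i] == next_v
```

The sorted arrays suit a sparse, mostly immutable graph; for a dense graph, the boolean-array option above would replace each edge list with a bit vector indexed by next-layer vertex.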
Your second solution is what I would do: create a setX node and connect all nodes belonging to that set to it. That way your data is partitioned and easier to query.
