OrientDB Delete network BigData - performance

I'm using OrientDB Community Edition 2.2.20.
I have imported a large dataset of about 15M edges and 30K vertices.
What is the best way to delete a graph: delete the edges and then the vertices? Or is there a way to delete the whole graph at once?
Reading the documentation, I only find the DELETE VERTEX and DELETE EDGE commands.

There is no command that deletes the entire graph, but you can run:
DELETE EDGE E
DELETE VERTEX V
These delete all the edges and all the vertices.
Hope it helps.

If you're deleting everything, you can also use the TRUNCATE CLASS command with the UNSAFE keyword, which is much faster. For example, to delete all vertices and edges in your database:
TRUNCATE CLASS V POLYMORPHIC UNSAFE
TRUNCATE CLASS E POLYMORPHIC UNSAFE

Related

Does this graph reduction operation already exist?

I have an application that uses a directed acyclic graph (DAG) to represent events ordered by time. My goal is to create or find an algorithm to simplify the graph by removing certain edges with specific properties. I'll try to define what I mean:
In the example below, a is the first node and f is the last. In the first picture, there are four unique paths to use to go from a to f. If we isolate the paths between b and e, we have two alternative paths. The path that is a single edge, namely the edge between b and e is the type of path that I want to remove, leaving the graph in the second picture as a result.
Therefore, all the edges I want to remove are defined as: single edges between two nodes that have at least one other path with >1 edges.
I realize this might be a very specific kind of graph operation, but hoping this algorithm already exists out there, my question to Stack Overflow is: Is this a known graph operation, or should I get my hiney to the algorithm drawing board?
As Matt Timmermans said in the comments, that operation is called a transitive reduction.
Thanks Matt!
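As a minimal sketch of what a transitive reduction does, the snippet below uses `networkx.transitive_reduction` on a small DAG. The node names and edges are hypothetical (the original pictures are not available here); the graph is built so that the single edge from b to e is redundant because the longer path b, c, e also exists.

```python
import networkx as nx

# Hypothetical DAG: the edge b -> e is a shortcut for the path b -> c -> e.
G = nx.DiGraph()
G.add_edges_from([
    ("a", "b"),
    ("b", "c"),
    ("c", "e"),
    ("b", "e"),  # redundant single edge that should disappear
    ("e", "f"),
])

# transitive_reduction requires a DAG and returns a new DiGraph
# that keeps reachability but drops every redundant edge.
reduced = nx.transitive_reduction(G)

print(sorted(reduced.edges()))
```

Note that the returned graph is a new object containing the same nodes; node and edge attributes are not copied over automatically.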

Find all critical node sets in a graph

Given a graph, I want to find the sets S1, S2, ... of nodes whose removal disconnects the network. Each of these sets may contain one or more nodes.
Also, none of these sets should contain another; i.e., we do not consider S3 = S1 U S2, even though it also disconnects the network.
We don't want to find:
Just one critical node set (we want all of them)
The single set of nodes that disconnects the network to the maximum extent
Any suggestions on any of these:
Hardness of the problem
Some direction/paper reference to the solution
Any proofs that I may have to give
Vertex sets whose removal disconnects a graph are called separators. See e.g. A. Berry, J-P Bordat, O. Cogis: Generating all the Minimal Separators of a Graph.
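To make the question's definition concrete, here is a brute-force sketch that enumerates every inclusion-minimal set of nodes whose removal disconnects a small, connected, undirected graph. This is not the algorithm from the cited paper; it is exponential in the number of nodes and only meant for checking tiny examples. The function names and the adjacency-dictionary input format are assumptions made for illustration.

```python
from itertools import combinations


def stays_connected(adj, removed):
    """Return True if the graph is still connected after dropping `removed`."""
    remaining = [n for n in adj if n not in removed]
    if not remaining:
        return True
    seen = {remaining[0]}
    stack = [remaining[0]]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(remaining)


def minimal_disconnecting_sets(adj):
    """Enumerate node sets whose removal disconnects the graph,
    skipping any set that contains an already-found smaller set."""
    nodes = list(adj)
    found = []
    # At least two nodes must remain for the rest to be disconnected.
    for size in range(1, len(nodes) - 1):
        for candidate in combinations(nodes, size):
            s = set(candidate)
            if any(prev <= s for prev in found):
                continue  # a subset already disconnects the graph
            if not stays_connected(adj, s):
                found.append(s)
    return found


# Hypothetical example: a 4-cycle 0-1-2-3-0.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(minimal_disconnecting_sets(cycle))
```

Because candidates are visited in order of increasing size, every reported set is minimal with respect to inclusion, which matches the "no set contains another" requirement in the question.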
What you are trying to find is graph partitioning.
It is an NP-hard problem.
See this for more understanding of graph partitioning. There are nearly-linear-time algorithms for graph partitioning available if you search for them.

What are good ways of organizing directed graph data?

Here's my situation. I have a graph that has different sets of data being added at different times. For example, set1 might have a few thousand nodes, and then set2 comes in later and we apply business logic to create edges from set1 to set2 (and discard any vertices from set1 that do not have edges to set2). Then at a later point we get set3, set4, and so on, and the same process applies between each set and its previous set.
Question: what's the best way to organize this? What I did before was name the nodes set1-xx, set2-xx, etc. The problem I faced was that when I tried to run analytics between the current set and the previous set, I had to loop through the entire graph and look for all the nodes whose names started with 'setx'. It took a long time as the graph grew, so I thought of another solution, which was to create a node called 'set1' and have it connected to all nodes for that particular set. I am testing it, but I was wondering if there was a more efficient way or a built-in way of handling data structures like this. Is there a way to somehow segment data like this?
I think a general solution would be applicable, but if it helps, I'm using neo4j (so any solution specific to that database would be good as well).
You have a very special type of a directed graph, called a layered graph.
The choice of the data structure depends primarily on the expected graph density (how many nodes from a previous set/layer are typically connected to a node in the current set/layer) and on the operations that you need to perform on it most of the time. It is definitely a good idea to have each layer directly represented by a numeric index (that is, the outermost structure will be an array of sets/layers), and presumably you can also use one array of vertices per layer. However, the list of edges per vertex (out only, or in and out sets of edges depending on whether you ever traverse the layers backward) may be any of the following:
Linked list of vertex identifiers; this is good if the graph is very sparse and edges are often added/removed.
Sorted array of vertex identifiers; this is good if the graph is quite sparse and immutable.
Array of booleans, indexed by vertex identifiers, determining whether a given vertex is or is not linked by an edge from the current vertex; this is good if the graph is dense.
The "vertex identifier" can take many forms. For example, it can be an index into the array of vertices on the next layer.
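The layout described above can be sketched roughly as follows, using the "sorted array of vertex identifiers" option for the per-vertex edge lists. The class and method names are hypothetical, and a vertex identifier here is simply an index into the next layer's vertex array.

```python
from bisect import bisect_left


class LayeredGraph:
    """Layers are indexed numerically; edges only go from layer i to layer i + 1."""

    def __init__(self):
        self.layers = []     # layers[i] is the list of vertex payloads in layer i
        self.out_edges = []  # out_edges[i][v] is a sorted list of indices into layer i + 1

    def add_layer(self, vertices):
        self.layers.append(list(vertices))
        self.out_edges.append([[] for _ in self.layers[-1]])
        return len(self.layers) - 1

    def add_edge(self, layer, src, dst):
        if not 0 <= dst < len(self.layers[layer + 1]):
            raise IndexError("dst is not a vertex of the next layer")
        targets = self.out_edges[layer][src]
        pos = bisect_left(targets, dst)
        if pos == len(targets) or targets[pos] != dst:
            targets.insert(pos, dst)  # keep the list sorted and duplicate-free

    def has_edge(self, layer, src, dst):
        targets = self.out_edges[layer][src]
        pos = bisect_left(targets, dst)
        return pos < len(targets) and targets[pos] == dst

    def vertices_with_edges(self, layer):
        """Indices of vertices in `layer` that link to the next layer."""
        return [v for v, targets in enumerate(self.out_edges[layer]) if targets]


g = LayeredGraph()
set1 = g.add_layer(["n0", "n1", "n2"])
set2 = g.add_layer(["m0", "m1"])
g.add_edge(set1, 0, 1)
g.add_edge(set1, 2, 0)
print(g.vertices_with_edges(set1))
```

With this layout, analytics between the current set and the previous one only touch two layers, so no scan over the whole graph is needed.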
Your second solution is what I would do: create a setX node and connect all nodes belonging to that set to it. That way your data is partitioned and easier to query.

How to delete all related nodes in a directed graph using networkx?

I'm not exactly sure what the correct terminology is for my question, so I'll just explain what I want to do. I have a directed graph, and after I delete a node I want all nodes that were connected to the graph only through it to be removed as well.
Here's an example:
Say I delete node '11'. I want node '2' to be deleted as well (and in my own example, there will be nodes under '2' that will then have to be deleted too) because it's not connected to the main graph anymore. Note that nodes '9' and '10' should not be deleted, because nodes '8' and '3' still connect to them.
I'm using the Python library networkx. I searched the documentation, but since I don't know the terminology I'm not sure what this is called. If possible, I would rather use a function provided by the library than write my own recursion through the graph (as my graph is quite large).
Any help or suggestions on how to do this would be great.
Thanks!
I am assuming that the following are true:
The graph is acyclic. You mentioned this in your comment, but I'd like to make explicit that this is a starting assumption.
There is a known set of root nodes. We need to have some way of knowing what nodes are always considered reachable, and I assume that (somehow) this information is known.
The initial graph does not contain any superfluous nodes. That is, if the initial graph contains any nodes that should be deleted, they've already been deleted. This allows the algorithm to work by maintaining the invariant that every node should be there.
If this is the case, then given an initial setup, the only reason that a node is in the graph would be either
The node is in the root reachable set of nodes, or
The node has a parent that is in the root reachable set of nodes.
Consequently, any time you delete a node from the graph, the only nodes that might need to be deleted are that node's descendants. If the node that you remove is in the root set, you may need to prune a lot of the graph, and if the node that you remove is a descendant node with few of its own descendants, then you might need to do very little.
Given this setup, once a node is deleted, you would need to scan all of that node's children to see if any of them have no other parents that would keep them in the graph. Since we assume that the only nodes in the graph are nodes that need to be there, if the child of a deleted node has at least one other parent, then it should still be in the graph. Otherwise, that node needs to be removed. One way to do the deletion step, therefore, would be the following recursive algorithm:
For each child of the node to delete:
    If that child has exactly one parent (it must be the node that we're about to delete):
        Recursively remove that child from the graph.
Delete the specified node from the graph.
This is probably not a good algorithm to implement directly, though, since the recursion involved might get pretty deep if you have a large graph. Thus you might want to implement it using a worklist algorithm like this one:
Create a worklist W.
Add v, the node to delete, to W.
While W is not empty:
    Remove the first entry from W; call it w.
    For each of w's children:
        If that child has just one parent, add it to W.
    Remove w from the graph.
This ends up being worst-case O(m) time, where m is the number of edges in the graph, since in theory every edge would have to be scanned. However, it could be much faster, assuming that your graph has some redundancies in it.
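The worklist steps above might look like this in networkx. This is a sketch under the same assumptions as the answer (an acyclic `DiGraph` with no superfluous nodes to begin with); the function name `remove_with_orphans` is made up for illustration, and the edges are the ones from the question's example.

```python
from collections import deque

import networkx as nx


def remove_with_orphans(G, v):
    """Delete v and every descendant that is left without any parent."""
    worklist = deque([v])
    while worklist:
        w = worklist.popleft()
        # Copy the successors first because w is removed right after the scan.
        for child in list(G.successors(w)):
            # in_degree == 1 means w is the child's only remaining parent.
            if G.in_degree(child) == 1:
                worklist.append(child)
        G.remove_node(w)


DG = nx.DiGraph()
DG.add_edges_from([(3, 8), (3, 10), (5, 11), (7, 11), (7, 8),
                   (11, 2), (11, 9), (11, 10), (8, 9)])
remove_with_orphans(DG, 11)
print(sorted(DG.nodes()))
```

A child can only enter the worklist once, since it is added when its sole parent is being processed and that parent is removed immediately afterwards.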
Hope this helps!
Here is Python networkx code that solves your task:
import networkx as nx
import matplotlib.pyplot as plt  # only needed for drawing the graph
DG = nx.DiGraph()
DG.add_edges_from([(3, 8), (3, 10), (5, 11), (7, 11), (7, 8), (11, 2), (11, 9), (11, 10), (8, 9)])
DG.remove_node(11)
The connected_components function is not implemented for directed graphs, so we convert the graph to an undirected one, find the nodes that are no longer connected to anything, and then delete them from the directed graph:
UG = DG.to_undirected()
not_connected_nodes = []
for component in nx.connected_components(UG):
    if len(component) == 1:  # an isolated node forms a component of size one
        # components are sets in current networkx, so they cannot be indexed
        not_connected_nodes.extend(component)
for node in not_connected_nodes:
    DG.remove_node(node)  # delete the isolated nodes
If you want to see the result, add to the script the following two lines:
nx.draw(DG)
plt.show()

How to find all paths in a graph between two nodes up to a given number of intermediate nodes?

I have a huge directed graph with about a million nodes and more than ten million edges. The edges are not weighted. The graph is a small-world like graph. In fact I see that every node is (on average) connected to another node over three intermediate nodes.
Given this graph can you think of a fast algorithm that returns all paths (without cycles) between a start and a destination node, but only up to a given maximum number N of intermediate nodes (and in my case N most of the time will be between 0 and 3)?
If your graph were undirected, you would certainly want to do a bidirectional breadth-first search. For length-2 paths, enumerate edges from the start node and the end node and see where they intersect. For length-3 paths, go two deep from the endpoint with the smaller degree, and one deep from the endpoint with the greater degree.
Since your graph is directed, you might want to also keep reverse edges so you can do the same trick.
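A rough sketch of that idea for a directed graph: grow simple paths forward from the start along outgoing edges and backward from the destination along reverse edges, then join the two halves wherever they meet. The function name, its parameters, and the edge-list input are assumptions for illustration; it assumes the start and destination differ, and it always splits a path at its midpoint rather than choosing the side with the smaller degree.

```python
from collections import defaultdict


def bounded_simple_paths(edges, source, target, max_intermediate):
    """All cycle-free paths source -> target with at most max_intermediate inner nodes."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)  # reverse edges, kept so we can search backward

    def extend(paths, adj):
        # Grow every path by one hop without revisiting a node.
        return [p + (n,) for p in paths for n in adj.get(p[-1], ()) if n not in p]

    max_edges = max_intermediate + 1
    forward = [[(source,)]]   # forward[k]: paths of k edges leaving source
    for _ in range((max_edges + 1) // 2):
        forward.append(extend(forward[-1], succ))
    backward = [[(target,)]]  # backward[k]: paths of k edges into target, stored target-first
    for _ in range(max_edges // 2):
        backward.append(extend(backward[-1], pred))

    results = []
    for length in range(1, max_edges + 1):
        f = (length + 1) // 2  # edges taken from the forward side
        b = length - f         # edges taken from the backward side
        by_meeting_node = defaultdict(list)
        for q in backward[b]:
            by_meeting_node[q[-1]].append(q)
        for p in forward[f]:
            for q in by_meeting_node.get(p[-1], ()):
                # The halves may share only the meeting node.
                if set(p[:-1]).isdisjoint(q):
                    results.append(list(p) + list(reversed(q[:-1])))
    return results


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("b", "d"), ("a", "c"), ("c", "a")]
for path in bounded_simple_paths(edges, "a", "d", 2):
    print(path)
```

Each path of a given length is split at exactly one position, so no path is reported twice. Each side only explores about half the depth, which is where the saving over a one-sided search comes from on graphs with high branching.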
Perhaps breadth-first from both directions at once? Take the neighbours of A and the neighbours of B. If you haven't found a link yet, add A to "neighbours of A" and B to "neighbours of B", then find any link between the two sets.
To extend it a bit further than three links, the "neighbours of A/B" lists need to contain a bit more. You will not be able to do it in-memory - you'll need a scratch table with
whatever TRANSACTION_ID; (or use an ORACLE 1-per-session temp table)
boolean MY_BFS_WAS_ROOTED_AT_A;
int NODE_ID;
int previous_node_id;
(you don't need to track depth if you check for loops in your insert statement)
You have found a path when the following query returns any row:
select * from pathfinder a, pathfinder b
where a.transaction_id = foo and b.transaction_id = foo
and a.MY_BFS_WAS_ROOTED_AT_A = false
and b.MY_BFS_WAS_ROOTED_AT_A = true
and a.node_id = b.node_id
Don't forget to clean out the table when you are done! Doing it all as one transaction and rolling it back might be the easiest way.

Resources