Neo4j cypher query takes an infinite time to execute - performance

I have loaded 147 nodes connected by 1718 relationships into a local Docker instance of Neo4j 3.3.1 Community. This forms a highly cyclic graph.
All the nodes have the same label :EClass and two attributes, class and package.
The following query counts the number of classes reachable from the package modelQueryLanguage by following any number of steps.
MATCH (a:EClass {package: 'modelQueryLanguage'})-[*1..]->(b)
RETURN count(DISTINCT b)
The problem is that this query never finishes.
My instinct tells me that the DISTINCT clause is supposed to define a stop condition for the potentially infinite traversal of the graph.
How can I write an equivalent Cypher query that executes quickly?

Cypher's mode of expansion will attempt to find all possible paths matching the pattern, with the only restriction that a relationship cannot occur more than once per path. With highly connected graphs (and inadequate restrictions on relationship type/direction), this becomes an infeasible means of expansion, as the number of possible unique paths in the graph to every other node in the graph can become huge. This is not ideal for a reachability query.
APOC Procedures has some path expander procedures that are made specifically for use cases like this, where only a single path per node is needed, not all possible paths. And if you just need the nodes and not the paths, there's a procedure for that too.
Here's an example of usage for your query:
MATCH (a:EClass {package: 'modelQueryLanguage'})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'>'}) YIELD node
RETURN count(DISTINCT node) AS count
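To make the difference concrete, here is a minimal Python sketch (not Neo4j or APOC code) that contrasts the two expansion strategies on a small, fully connected directed graph. The helper names count_trails, reachable, and complete_digraph are made up for this illustration. The first mimics the rule described above (a relationship may be used only once per path), and the second visits each node at most once, which is roughly the idea behind a subgraph-style expansion.

```python
from collections import deque


def count_trails(graph, start):
    """Count every path of length >= 1 from start in which no edge repeats."""
    def walk(node, used):
        total = 0
        for nxt in graph[node]:
            edge = (node, nxt)
            if edge not in used:
                # One path ends here, plus every longer path that extends it.
                total += 1 + walk(nxt, used | {edge})
        return total

    return walk(start, frozenset())


def reachable(graph, start):
    """Breadth-first search: each node is recorded and expanded at most once."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def complete_digraph(n):
    """Hypothetical test graph: every node points to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


g = complete_digraph(4)          # only 4 nodes and 12 relationships
trail_count = count_trails(g, 0)
reached = reachable(g, 0)
print("distinct paths:", trail_count)
print("distinct nodes:", len(reached))
```

Even on this tiny graph the number of distinct paths is far larger than the number of distinct nodes, and the gap widens quickly as nodes and relationships are added. That growth is why the unbounded pattern match on a 147-node, 1718-relationship graph does not return in any reasonable time.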

Related

Cypher query for finding out content and connectedness as security measure

I have a node that represents a computer infected with malware. I want to see if other computers (based on log files) have had some interaction with the infected computer. I have already transferred and mapped log files into the Memgraph database.
What would a Cypher query for this scenario look like?
A basic Cypher query that you can use for this scenario would be:
MATCH p1=(n:Node1)-[*]->(m:Node2), p2=(n)-[*]->(m), (n)-[r]->(f:FraudulantActivity)
WHERE p1!=p2
RETURN nodes(p1)+nodes(p2)
This Cypher query looks for two different paths, p1 and p2, between node n and node m, where n is also linked to a FraudulantActivity node, and returns the nodes on both paths. Those nodes could be part of some malicious activity.

In Neo4j, on what basis are paths from apoc.path.spanningTree sorted (default sort)?

I am using apoc.path.spanningTree with some relationship filters, some label filters, and maxLevel:-1.
As a result, I get 5 paths as output in some order, and I cannot work out the basis of the sorting.
What I have noticed is that the paths appear to be sorted by the Neo4j id of the last node in the path.
But if I update any intermediate node in any of the paths, this order changes.
The procedure is not documented to return the paths in any particular order, so you should not assume a particular ordering is used. And the algorithm can change at any time anyway.
If your query needs the paths in a specific order, it should sort the returned paths itself.

Why do these two seemingly identical Cypher queries differ so greatly in speed?

I have a query like this as a key component of my application:
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)
WHERE (node)-[:MEMBER_OF]->(group)
RETURN node
There is an index on :GroupType(Name)
In a database of roughly 10,000 elements this query uses nearly 1 million database hits. Here is the PROFILE of the query:
However, this slight variation of the query which performs an identical search is MUCH faster:
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)-[:MEMBER_OF]->(group)
RETURN node
The only difference is that the node:NodeType match and the relationship match are merged into a single MATCH instead of a MATCH ... WHERE. This query uses 1/70th of the database hits of the previous query and is more than 10 times faster, despite performing an identical search:
I thought Cypher treated MATCH ... WHERE statements as single search expressions, so the two queries should compile to identical operations, but these two queries seem to be performing vastly different operations. Why is this?
I would like to start by saying that this is not actually a Cypher problem. Cypher describes what you want, not how to get it, so the performance of this query can vary widely between, say, Neo4j 3.1.1 and Neo4j 3.2.3.
Since whatever executes the Cypher decides how to do it, the real question is "Why doesn't the Neo4j Cypher planner treat these the same?"
Ideally, both of these queries should be equivalent to
MATCH (node:NodeType)-[:MEMBER_OF]->(group:GroupType {Name: "String"})
RETURN node
because they should all produce the same results.
In reality, dynamically parsing a query that has many 'equivalent' expressions involves a lot of subtle nuances, and a small shift in context can break that equivalence. Say you made this adjustment:
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)
WHERE (node)-[:MEMBER_OF]->(group) OR SIZE(group.members) = 1
RETURN node
Now the two queries are almost nothing alike in their results. In order to scale, the query planner must take decision shortcuts to come up with an efficient plan as quickly as possible.
In short, the performance depends on what the server you are throwing the query at is running, because coming up with an actionable lookup strategy for a language that lets you ask for ANYTHING/EVERYTHING is hard!
RELATED READING
Optimizing performance
What is Cypher?
MATCH ... WHERE <pattern> isn't the same as MATCH <pattern>.
The first query performs the match, then applies the pattern as a filter to every row built up so far.
You can see in the query plan that what's happening is a cartesian product between your first match results and all :NodeType nodes. Then, for each row of the cartesian product, the WHERE checks whether the :GroupType node on that row is connected to the :NodeType node on that row by the given pattern (this is the Expand(Into) operation).
The second query, by contrast, expands the pattern from the previously matched group nodes, so the nodes considered by the expansion are far fewer and almost immediately relevant, requiring only a final filter to ensure that those nodes are :NodeType nodes.
EDIT
As Tezra points out, Cypher operates by having you define what you want, not how to get it, as the "how" is the planner's job. In the current versions of Neo4j (3.2.3), my explanation stands, in that the planner interprets each of the queries differently and generates different plans for each, but that may be subject to change as Cypher evolves and the planner improves.
In these cases, you should be running PROFILEs on your queries and tuning accordingly.
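As a rough illustration of why the two plans cost so differently, here is a small Python sketch. It is only an analogy: the data, the function names, and the "checks" counter are invented for this example and do not correspond to how Neo4j actually counts database hits. The first function pairs the group with every candidate node and tests each pair, like the cartesian product followed by a filter. The second starts from the group and follows only its own memberships, like expanding the pattern from the anchored group node.

```python
# Hypothetical toy data: 3 groups and 1000 nodes, each node in exactly one group.
members_of = {f"g{i}": set() for i in range(3)}      # group -> its member nodes
nodes = [f"n{i}" for i in range(1000)]
for i, n in enumerate(nodes):
    members_of[f"g{i % 3}"].add(n)


def product_then_filter(group):
    """MATCH ... WHERE <pattern> shape: test every (group, node) pair."""
    checks = 0
    result = []
    for n in nodes:                      # every :NodeType candidate
        checks += 1                      # one connectivity test per pair
        if n in members_of[group]:
            result.append(n)
    return result, checks


def expand_from_group(group):
    """MATCH <pattern> shape: start at the group, follow its relationships."""
    result = sorted(members_of[group])
    return result, len(result)           # only the actual members are touched


slow, slow_checks = product_then_filter("g0")
fast, fast_checks = expand_from_group("g0")
print(sorted(slow) == fast, slow_checks, fast_checks)
```

Both functions return the same members, but the first does work proportional to the number of all candidate nodes, while the second does work proportional to the size of the one group.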

Neo4j optimization: Query for all graphs from selected to selected nodes

I am not very experienced with Neo4j and need to find all graphs (that is, paths) from a selection A of nodes to a selection B of nodes.
The database holds around 600 nodes with a few relationships per node.
Node properties:
riskId
de_DE_description
en_GB_description
en_US_description
impact
Selection:
Selection A is determined by a property match (property: 'riskId')
Selection B is a known constant list of nodes (label: 'Core')
The following query returns the result I want, but it seems a bit slow to me:
match p=(node)-[*]->(:Core)
where node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN extract (n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description] )
as `risks`, length(p)
This query results in 7 rows with between 1 and 4 nodes per row, so not much.
I get around 270ms or more response time in my local environment.
I have not created any indices or attempted any other performance tuning.
Any hints on how I can craft the query in a more intelligent way or apply any performance tuning tricks?
Thank you very much,
Manuel
If there is not yet a single label that is shared by all the nodes that have the riskId property, you should add such a label (say, :Risk) to all those nodes. For example:
MATCH (n)
WHERE EXISTS(n.riskId)
SET n:Risk;
A node can have multiple labels. This alone can make your query faster, as long as you specify that node label in your query, since it would restrict scanning to only Risk nodes instead of all nodes.
However, you can do much better by first creating an index, like this:
CREATE INDEX ON :Risk(riskId);
After that, this slightly altered version of your query should be much faster, as it would use the index to quickly get the desired Risk nodes instead of scanning:
MATCH p=(node:Risk)-[*]->(:Core)
WHERE node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN
EXTRACT(n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description]) AS risks,
LENGTH(p);
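The effect of the index can be pictured with a small Python sketch. This is an analogy only, with made-up data and function names: a list scan stands in for checking every node's riskId, and a dictionary keyed on riskId stands in for the index on :Risk(riskId).

```python
# Hypothetical data: 600 risk records with ids R1..R600.
records = [{"riskId": f"R{i}", "impact": i % 5} for i in range(1, 601)]
wanted = ["R47", "R48", "R49", "R50", "R51", "R14", "R3"]


def scan(records, ids):
    """No index: every record is examined once."""
    ids = set(ids)
    examined = 0
    hits = []
    for record in records:
        examined += 1
        if record["riskId"] in ids:
            hits.append(record)
    return hits, examined


# Built once, like creating the index; afterwards each id is a direct lookup.
index = {record["riskId"]: record for record in records}


def lookup(index, ids):
    """With an index: one lookup per requested id."""
    hits = [index[i] for i in ids if i in index]
    return hits, len(ids)


scan_hits, scan_cost = scan(records, wanted)
index_hits, index_cost = lookup(index, wanted)
print(scan_cost, index_cost)
```

The scan touches all 600 records to find 7 of them, while the indexed lookup touches only the 7 requested ids. Note that the index only speeds up finding the start nodes; the cost of the variable-length expansion that follows is unchanged.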

Neo4j - slow cypher query - big graph with hierarchies

Using Neo4j 2.1.4. I have a graph with 'IS A' relationships (and other types of relationships) between nodes. There are some hierarchies inside the graph (built from IS A relationships), and I need to find the descendants (via IS A) of one hierarchy that have a particular, known relationship with some descendant of a second hierarchy. If that relationship exists, I return the descendant(s) of the first hierarchy.
INPUTS: 'ID_parentnode_hierarchy_01', 'ID_relationship', 'ID_parentnode_hierarchy_02'.
OUTPUT: Descendants (IS A relationship) of 'ID_parentnode_hierarchy_01' that have 'ID_relationship' with some descendant of 'ID_parentnode_hierarchy_02'.
Note: The graph has 500,000 nodes and 2 million relationships.
I am using this Cypher query, but it is very slow (approx. 40 s on a PC with 4 GB RAM and a 3 GHz 64-bit Pentium Dual Core). Is it possible to build a faster query?
MATCH (parentnode_hierarchy_01: Node{nodeid : {ID_parentnode_hierarchy_01}})
WITH parentnode_hierarchy_01
MATCH (parentnode_hierarchy_01) <- [:REL* {reltype: {isA}}] - (descendants01: Node)
WITH descendants01
MATCH (descendants01) - [:REL {reltype: {ID_relationship}}] -> (descendants02: Node)
WITH descendants02, descendants01
MATCH (parentnode_hierarchy_02: Node {nodeid: {ID_parentnode_hierarchy_02} })
<- [:REL* {reltype: {isA}}] - (descendants02)
RETURN DISTINCT descendants01;
Thank you very much.
Well, I can slightly clean up your query - this might help us understand the issues better. I doubt this one will run faster, but using the cleaned up version we can discuss what's going on: (mostly eliminating unneeded uses of MATCH/WITH)
MATCH (parent:Node {nodeid: {ID_parentnode_hierarchy_01}})<-[:REL* {reltype:{isA}}]-
(descendants01:Node)-[:REL {reltype:{ID_relationship}}]->(descendants02:Node),
(parent2:Node {nodeid: {ID_parentnode_hierarchy_02}})<-[:REL* {reltype:{isA}}]-
(descendants02)
RETURN distinct descendants01;
This looks like you're searching two (probably large) trees, starting from the root, for two nodes somewhere in the tree that are linked by an {ID_relationship}.
Unless you can provide some query hints about which node in the tree might have an ID_relationship or something like that, at worst, this looks like you could end up comparing every two nodes in the two trees. So this looks like it could take n * k time, where n is the number of nodes in the first tree, k the number of nodes in the second tree.
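One way to picture an alternative to the pairwise comparison is the Python sketch below. It is not what the Cypher planner does and is not taken from the question; it is a different technique, shown on made-up data, in which each hierarchy is collected into a set once and the bridging links are then tested against the second set. The names descendants, is_a_children, and links are hypothetical.

```python
from collections import deque


def descendants(children, root):
    """All nodes below root, following child links (root itself excluded)."""
    seen = set()
    queue = deque([root])
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical toy data: two small IS A hierarchies rooted at "A" and "B".
is_a_children = {"A": ["a1", "a2"], "a1": ["a3"], "B": ["b1"], "b1": ["b2"]}
# Hypothetical bridging links (the 'ID_relationship' edges), source -> targets.
links = {"a3": ["b2"], "a2": ["x"]}

d1 = descendants(is_a_children, "A")
d2 = descendants(is_a_children, "B")

# Keep descendants of A that link to at least one descendant of B.
matches = {n for n in d1 if any(target in d2 for target in links.get(n, ()))}
print(matches)
```

With set membership tests, the work grows with the sizes of the two hierarchies plus the number of bridging links, rather than with n * k.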
Here are some strategy things to think about - which you should use depends on your data:
Is there some depth in the tree where these links are likely to be found? Can you put a range on the depth of [:REL* {reltype:{isA}}]?
What other criteria can you add to descendants01 and descendants02? Is there anything that can help make the query more selective so that you're not comparing every node in one tree to every node in the other?
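The depth question above can be sketched in Python as a bounded traversal. In Cypher, the equivalent is a range on the variable-length relationship, for example *1..3 instead of *. The function and data below are hypothetical and only show how a depth limit caps the amount of the tree that is explored.

```python
def descendants_within(children, root, max_depth):
    """Collect nodes at most max_depth child links below root."""
    frontier = {root}
    found = set()
    for _ in range(max_depth):
        # Step one level down, skipping nodes that were already collected.
        frontier = {c for n in frontier for c in children.get(n, ())} - found
        if not frontier:
            break
        found |= frontier
    return found


# Hypothetical chain: r -> a -> b -> c
children = {"r": ["a"], "a": ["b"], "b": ["c"]}
print(descendants_within(children, "r", 2))
print(descendants_within(children, "r", 10))
```

If the links you care about are known to sit within a few levels of the root, a bound like this avoids walking the deep parts of each hierarchy at all.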
Another strategy you might try (this might be a horrible idea, but it's worth trying) is to look for a path from one root to the other over any number of undirected edges of either the isA type or the other type. Your data model has :REL relationships with a reltype attribute. This is probably an antipattern; instead of a reltype attribute, why is the relationship type not just that? The current model prevents the query that I want to write, below:
MATCH p=shortestPath((p1:Node {nodeid: {first_parent_id}})-[:isA|ID_relationship*]-(p2:Node {nodeid: {second_parent_id}}))
return p;
This would return the path from one "root" to the other, via the bridge you want. You could then use path functions to extract whatever nodes you wanted. Note that this query isn't possible currently because of your data model.
