Neo4j optimization: Query for all graphs from selected to selected nodes - performance

I am not so experienced in neo4j and have the requirement of searching for all graphs from a selection A of nodes to a selection B of nodes.
Around 600 nodes in the db with some relationships per node.
Node properties:
riskId
de_DE_description
en_GB_description
en_US_description
impact
Selection:
Selection A is determined by a property match (property: 'riskId')
Selection B is a known constant list of nodes (label: 'Core')
The following query returns the result I want, but it seems a bit slow to me:
match p=(node)-[*]->(:Core)
where node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN extract (n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description] )
as `risks`, length(p)
This query results in 7 rows with between 1 and 4 nodes per row, so not much.
I get around 270ms or more response time in my local environment.
I have not created any indices or done any other performance attempts.
Any hints on how I can craft the query in a more intelligent way or apply any performance tuning tricks?
Thank you very much,
Manuel

If there is not yet a single label that is shared by all the nodes that have the riskId property, you should add such a label (say, :Risk) to all those nodes. For example:
MATCH (n)
WHERE EXISTS(n.riskId)
SET n:Risk;
A node can have multiple labels. This alone can make your query faster, as long as you specify that node label in your query, since it would restrict scanning to only Risk nodes instead of all nodes.
However, you can do much better by first creating an index, like this:
CREATE INDEX ON :Risk(riskId);
After that, this slightly altered version of your query should be much faster, as it would use the index to quickly get the desired Risk nodes instead of scanning:
MATCH p=(node:Risk)-[*]->(:Core)
WHERE node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN
EXTRACT(n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description]) AS risks,
LENGTH(p);
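To see why the label and the index help, here is a rough in-memory analogy in Python. It is not Neo4j code, and the data and variable names are made up for illustration: a scan inspects every node, while an index is a lookup built once that maps each riskId directly to its node.

```python
# Toy analogy only: plain dicts stand in for nodes; this is not Neo4j code.
nodes = [
    {"labels": {"Risk"}, "riskId": "R3", "impact": 2},
    {"labels": {"Risk"}, "riskId": "R14", "impact": 5},
    {"labels": {"Core"}, "name": "core-1"},
    {"labels": {"Risk"}, "riskId": "R47", "impact": 1},
]
wanted = ["R47", "R14"]

# Without a label or index: every node in the store is inspected.
scanned = [n for n in nodes if n.get("riskId") in wanted]

# With an "index" on Risk.riskId: built once, then each lookup is direct.
risk_index = {n["riskId"]: n for n in nodes if "Risk" in n["labels"]}
indexed = [risk_index[r] for r in wanted if r in risk_index]

print(sorted(n["riskId"] for n in scanned))
print(sorted(n["riskId"] for n in indexed))
```

Both approaches find the same nodes; the difference is only in how many nodes have to be touched to get there.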

Related

Matching edge in Neo4j Cypher is really slow

I have a database with 500K nodes and 700K relationships. I created 500 additional relationships with a new type DummyEdge with edge_id attributes from "1" to "500". Now I want to query and modify these relationships. Running a query
MATCH ()-[e:DummyEdge {edge_id:"123"}]->() SET e.property="value" is really slow, it takes around 300ms, so if I run 500 such queries, it takes around 2-3 minutes. I also called CREATE INDEX ON :DummyEdge(edge_id) but it didn't speed up the query execution.
Is there any way to make such bulk relationship modification faster?
CREATE INDEX creates an index for nodes, so such an index would make no difference in the performance of your query.
Since your MATCH pattern, ()-[e:DummyEdge {edge_id:"123"}]->(), provided no information about the end nodes, neo4j has to scan every relationship in the DB to find the ones you want. That is why your query is so slow.
It would be much more efficient if (as @MichaelHunger stated) your query provided useful information (like a label, or an indexed label/property pair) for either of the nodes in your MATCH pattern. That would help neo4j narrow down the number of relationships that need to be scanned. As an example, let's state that the start node must have the Foo label:
MATCH (:Foo)-[e:DummyEdge {edge_id:"123"}]->()
SET e.property="value"
With the above query, neo4j would only have to look at the outgoing relationships of Foo nodes, which is much faster since neo4j can quickly find nodes with a given label (or index).
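As a rough analogy in Python (not Neo4j code; the tuples and the `foo_nodes` set are hypothetical stand-ins), anchoring the pattern on a labelled start node means only that node's outgoing relationships are examined, rather than every relationship in the store:

```python
# Toy analogy only: tuples stand in for relationships (start, type, edge_id, end).
rels = [
    ("a", "DummyEdge", "123", "b"),
    ("c", "KNOWS", None, "d"),
    ("e", "DummyEdge", "124", "f"),
]
foo_nodes = {"a"}  # hypothetical: the nodes carrying the Foo label

# Unanchored pattern: every relationship is examined.
full_scan = [r for r in rels if r[1] == "DummyEdge" and r[2] == "123"]

# Anchored pattern: group relationships by start node once, then look only
# at the outgoing relationships of Foo nodes.
outgoing = {}
for r in rels:
    outgoing.setdefault(r[0], []).append(r)
anchored = [
    r
    for n in foo_nodes
    for r in outgoing.get(n, [])
    if r[1] == "DummyEdge" and r[2] == "123"
]

print(anchored)
```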
Now, neo4j also supports full-text schema indexes, which do support relationships. However, those kinds of indexes require much more effort on your part, and may be overkill for your use case.
There are now relationship indexes that should speed up your operation massively.
https://neo4j.com/docs/cypher-manual/current/indexes-for-search-performance/#administration-indexes-create-a-single-property-b-tree-index-for-relationships

Neo4j: Fast query for getting relationships between a set of nodes

I'm looking for a fast Cypher statement that returns all relationships between a known set of nodes (I have their Neo4j ID's), so that I can assemble the subgraph for that particular set of nodes. I'm working within a label called label which has around 50K nodes and 800K edges between these nodes.
I have several working approaches for this, but none are fast enough for my application, even at small set sizes (less than 1000 nodes).
For example, the following statement does the trick:
MATCH (u:label)-[r]->(v:label)
WHERE (ID(u) IN {ids}) AND (ID(v) IN {ids})
RETURN collect(r)
Where {ids} is a list of numeric Neo4j ids given as parameter to the Py2Neo cypher.execute(statement, parameters) method. The problem is that it takes around 34 seconds for a set of 838 nodes, which returns all 19K relationships between them. I realize the graph is kind of dense, but it takes 1.76 seconds for every 1000 edges returned. I just don't think that's acceptable.
If I use the START clause instead (shown below), the time is actually a little worse.
START u=node({ids}), v=node({ids})
MATCH (u:label)-[r]->(v:label)
RETURN collect(r)
I've found many similar questions/answers, however they all fall short in some aspect. Is there a better statement for doing this, or even a better graph schema, so that it can scale to sets of thousands of nodes?
UPDATE
Thanks for the fast replies. First, to run my current query for 528 nodes as input (len(ids)=528) it takes 32.1 seconds and the query plan is below.
NodeByIdSeek: 528 hits
Filter : 528 hits
Expand(All) : 73,773 hits
Filter : 73,245 hits
Projection : 0 hits
Filter : 0 hits
Brian Underwood's query, with the same input, takes 27.8 seconds. The query plan is identical, except for the last 2 steps (Projection and Filter), which don't exist in his query. However, the sum of db hits is the same.
Michael Hunger's query takes 26.9 seconds and the query plan is identical to Brian's query.
I've restarted the server between experiments to avoid cache effects (there's probably a smarter way to do it). I'm also querying straight from the web interface to bypass possible bottlenecks in my code and the libs I'm using.
Bottom line, Neo4j seems smart enough to optimize my query, however it's still pretty slow even with fairly small sets. Any suggestions?
I think the problem is that the query is doing a Cartesian product to get all combinations of the 838 nodes, so you end up searching 838*838=702,244 combinations.
I'm curious how this would perform:
MATCH (u:label)-[r]->(v:label)
WHERE (ID(u) IN {ids})
WITH r, v
WHERE (ID(v) IN {ids})
RETURN collect(r)
Also, why do the collect at the end?
How big are your id-lists?
Try this:
MATCH (u) WHERE ID(u) IN {ids}
WITH u
MATCH (u)-[r]->(v)
WHERE ID(v) IN {ids}
RETURN count(*)
Or, to return the relationships themselves:
MATCH (u) WHERE ID(u) IN {ids}
WITH u
MATCH (u)-[r]->(v)
WHERE ID(v) IN {ids}
RETURN r
Also try to create a query plan by prefixing your query with PROFILE then you see where the cost is.
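The idea behind these suggestions, expanding only from the selected start nodes and then checking each end node against the same id list, can be sketched in plain Python. This is an in-memory analogy with made-up data, not Neo4j code:

```python
# Toy analogy only: a dict of outgoing edges stands in for the graph.
outgoing = {
    1: [(1, 2), (1, 3)],
    2: [(2, 3), (2, 4)],
    3: [(3, 1)],
    4: [(4, 1)],
}
ids = {1, 2, 3}  # the known node ids; a set gives constant-time membership tests

# Expand from each selected start node and keep the edges whose end node is
# also selected. No pairing of every u with every v is ever built.
subgraph_edges = [
    (u, v)
    for u in ids
    for (_, v) in outgoing.get(u, [])
    if v in ids
]

print(sorted(subgraph_edges))
```

The work here is proportional to the number of edges leaving the selected nodes, which matches the Expand(All) step in the query plan quoted in the question.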

An effective way to lookup duplicate nodes in Neo4j 1.8?

I'm trying to programmatically locate all duplicate nodes in a Neo4j 1.8 database.
The nodes that need examination all have a (non-indexed) property externalId for which I want to find duplicates. This is the Cypher query I've got:
START n=node(*), dup=node(*) WHERE
HAS(n.externalId) AND HAS(dup.externalId) AND
n.externalId=dup.externalId AND
ID(n) < ID(dup)
RETURN dup
There are fewer than 10K nodes in the data and fewer than 1K nodes with an externalId.
The query above works but seems to perform badly. Is there a less memory-intensive way to do this?
Try this query:
START n=node(*)
WHERE HAS(n.externalId)
WITH n.externalId AS extId, COLLECT(n) AS cn
WHERE LENGTH(cn) > 1
RETURN extId, cn;
It avoids taking the Cartesian product of your nodes. It finds the distinct externalId values, collects all the nodes with the same id, and then filters out the non-duplicated ids. Each row in the result will contain an externalId and a collection of the duplicate nodes with that id.
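The same group-then-filter idea, written as a small Python sketch over made-up data (an analogy for the query above, not Neo4j code), shows that a single pass over the nodes is enough:

```python
from collections import defaultdict

# Toy analogy only: (node_id, externalId) pairs; None means the property is absent.
nodes = [(1, "A"), (2, "B"), (3, "A"), (4, None), (5, "A"), (6, "C")]

# One pass: group node ids by externalId (the COLLECT step).
groups = defaultdict(list)
for node_id, ext_id in nodes:
    if ext_id is not None:  # corresponds to WHERE HAS(n.externalId)
        groups[ext_id].append(node_id)

# Keep only the groups with more than one member (WHERE LENGTH(cn) > 1).
duplicates = {ext: ids for ext, ids in groups.items() if len(ids) > 1}

print(duplicates)
```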
The start clause consists of a full graph scan, then assembling a cartesian product of the entire set of nodes (10k * 10k = 100m pairs to start from), and then narrows that very large list down based on criteria in the where clause. (Maybe there are cypher optimizations here? I'm not sure)
I think adding an index on externalId would be a clear win and may provide enough of a performance gain for now, but you could also look at finding duplicates in a different way, perhaps something like this:
START n=node(*)
WHERE HAS(n.externalId)
WITH n
ORDER BY ID(n) ASC
WITH count(*) AS occurrences, n.externalId AS externalId, collect(ID(n)) AS ids
WHERE occurrences > 1
RETURN externalId, TAIL(ids)

Neo4j - slow cypher query - big graph with hierarchies

Using Neo4j 2.1.4. I have a graph with 'IS A' relationships (and other types of relationships) between nodes. The graph contains some hierarchies (built from IS A relationships), and I need to find the descendants (via IS A) of one hierarchy that have a particular known relationship with some descendant of a second hierarchy. If that relationship exists, I return the descendant(s) of the first hierarchy.
INPUTS: 'ID_parentnode_hierarchy_01', 'ID_relationship', 'ID_parentnode_hierarchy_02'.
OUTPUT: Descendants (IS A relationship) of 'ID_parentnode_hierarchy_01' that have 'ID_relationship' with some descendant of 'ID_parentnode_hierarchy_02'.
Note: The graph has 500.000 nodes and 2 million relationships.
I am using this cypher query but it is very slow (approx. 40s on a PC with 4GB RAM and a 3GHz 64-bit Pentium Dual Core). Is it possible to build a faster query?
MATCH (parentnode_hierarchy_01: Node{nodeid : {ID_parentnode_hierarchy_01}})
WITH parentnode_hierarchy_01
MATCH (parentnode_hierarchy_01) <- [:REL* {reltype: {isA}}] - (descendants01: Node)
WITH descendants01
MATCH (descendants01) - [:REL {reltype: {ID_relationship}}] -> (descendants02: Node)
WITH descendants02, descendants01
MATCH (parentnode_hierarchy_02: Node {nodeid: {ID_parentnode_hierarchy_02} })
<- [:REL* {reltype: {isA}}] - (descendants02)
RETURN DISTINCT descendants01;
Thank you very much.
Well, I can slightly clean up your query - this might help us understand the issues better. I doubt this one will run faster, but using the cleaned up version we can discuss what's going on: (mostly eliminating unneeded uses of MATCH/WITH)
MATCH (parent:Node {nodeid: {ID_parentnode_hierarchy_01}})<-[:REL* {reltype:{isA}}]-
(descendants01:Node)-[:REL {reltype:{ID_relationship}}]->(descendants02:Node),
(parent2:Node {nodeid: {ID_parentnode_hierarchy_02}})<-[:REL* {reltype:{isA}}]-
(descendants02)
RETURN distinct descendants01;
This looks like you're searching two (probably large) trees, starting from the root, for two nodes somewhere in the tree that are linked by an {ID_relationship}.
Unless you can provide some query hints about which node in the tree might have an ID_relationship or something like that, at worst, this looks like you could end up comparing every two nodes in the two trees. So this looks like it could take n * k time, where n is the number of nodes in the first tree, k the number of nodes in the second tree.
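To make the shape of the problem concrete, here is a small in-memory Python sketch of what the query asks for. The data, the `is_a` and `bridges` structures, and the `descendants` helper are all hypothetical; this only illustrates the semantics and says nothing about how Neo4j evaluates the pattern:

```python
# Toy analogy only: child -> parent "isA" links and (source, target) bridge links.
is_a = {"b": "a", "c": "a", "d": "b", "y": "x", "z": "y"}
bridges = [("d", "z"), ("c", "q")]


def descendants(root):
    """Return all nodes that reach root by following isA links upward."""
    children = {}
    for child, parent in is_a.items():
        children.setdefault(parent, []).append(child)
    found, stack = set(), [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


tree1, tree2 = descendants("a"), descendants("x")

# Keep descendants of the first root that bridge to a descendant of the second.
result = sorted(s for s, t in bridges if s in tree1 and t in tree2)

print(result)
```

Both descendant sets have to be known before the bridge can be checked, which is why the size of the two trees dominates the cost.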
Here are some strategy things to think about - which you should use depends on your data:
Is there some depth in the tree where these links are likely to be found? Can you put a range on the depth of [:REL* {reltype:{isA}}]?
What other criteria can you add to descendants01 and descendants02? Is there anything that can help make the query more selective so that you're not comparing every node in one tree to every node in the other?
Another strategy you might try is this: (this might be a horrible idea, but it's worth trying) -- basically look for a path from one root to the other, over any number of undirected edges of either isa type, or the other. Your data model has :REL relationships with a reltype attribute. This is probably an antipattern; instead of a reltype attribute, why is the relationship type not just that? This prevents the query that I want to write, below:
MATCH p=shortestPath((p1:Node {nodeid: {first_parent_id}})-[:isA|ID_relationship*]-(p2:Node {nodeid: {second_parent_id}}))
return p;
This would return the path from one "root" to the other, via the bridge you want. You could then use path functions to extract whatever nodes you wanted. Note that this query isn't possible currently because of your data model.

What are labels and indices in Neo4j?

I am using the neo4j-core gem (Neo4j::Node API). It is the only MRI-compatible Ruby binding for neo4j that I could find, and hence is valuable, but its documentation is crap (it has missing links, lots of typographical errors, and is difficult to comprehend). In the Label and Index Support section of the first link, it says:
Create a node with an [sic] label person and one property
Neo4j::Node.create({name: 'kalle'}, :person)
Add index on a label
person = Label.create(:person)
person.create_index(:name)
drop index
person.drop_index(:name)
(whose second code line I believe is a typographical error of the following)
person = Neo4j::Label.create(:person)
What is a label, is it the name of a database table, or is it an attribute peculiar to a node?
If it is the name of a node, I don't understand the fact that (according to the API in the second link) the methods Neo4j::Node.create and Neo4j::Node#add_label can take multiple arguments for the label. What does it mean to have multiple labels on a node?
Furthermore, if I repeat the create command with the same label argument, it creates a different node object each time. What does it mean to have multiple nodes with the same name? Isn't a label something to identify a node?
What is index? How are labels and indices different?
Labels are a way of grouping nodes. You can give the label to many nodes or just one node. Think of it as a collection of nodes that are grouped together. They allow you to assign indexes and other constraints.
An index allows quick lookup of nodes or edges without having to traverse the entire graph to find them. Think of it as a table of direct pointers to the particular nodes/edges indexed.
As I read what you pasted from the docs (and without, admittedly, knowing the slightest thing about neo4j):
It's a graph database, where every piece of data is a node with a certain amount of properties.
Each node can have a label (or more, presumably?). Think of it as a type -- or perhaps more appropriately, in Ruby parlance, a Module.
It's a database, so nodes can be part of an index for quicker access. So can subsets of nodes, and therefore nodes with a certain label.
Put another way: Think of the label as the table in a DB. Nodes as DB rows, which can belong to one or more labels/tables, or no label/table at all for that matter. And indexes as DB indexes on sets of rows.
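A toy model in Python may make the distinction easier to see. This is not the neo4j-core API; the structures and names are invented for illustration. A label is a tag that groups nodes, and an index is a separate lookup built over a label/property pair:

```python
# Toy model only: not the neo4j-core API.
nodes = {
    1: {"labels": {"person"}, "props": {"name": "kalle"}},
    2: {"labels": {"person", "admin"}, "props": {"name": "anna"}},
    3: {"labels": set(), "props": {"name": "kalle"}},
}

# A label groups nodes; it does not identify a single node.
people = {nid for nid, n in nodes.items() if "person" in n["labels"]}

# An index on (label, property) maps a value straight to the matching node ids.
person_name_index = {}
for nid in people:
    person_name_index.setdefault(nodes[nid]["props"]["name"], set()).add(nid)

print(sorted(people))
print(person_name_index.get("kalle"))
```

Node 2 carries two labels and node 3 carries none, so only nodes 1 and 2 are in the person group, and only node 1 is found through the index lookup for "kalle".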
