Matching edge in Neo4j Cypher is really slow - performance

I have a database with 500K nodes and 700K relationships. I created 500 additional relationships with a new type DummyEdge, with edge_id attributes from "1" to "500". Now I want to query and modify these relationships. Running a query like
MATCH ()-[e:DummyEdge {edge_id:"123"}]->() SET e.property="value"
is really slow: it takes around 300ms, so running 500 such queries takes around 2-3 minutes. I also called CREATE INDEX ON :DummyEdge(edge_id), but it didn't speed up the query execution.
Is there any way to make such bulk relationship modification faster?

CREATE INDEX creates an index for nodes, so such an index would make no difference in the performance of your query.
Since your MATCH pattern, ()-[e:DummyEdge {edge_id:"123"}]->(), provided no information about the end nodes, neo4j has to scan every relationship in the DB to find the ones you want. That is why your query is so slow.
It would be much more efficient if (as @MichaelHunger stated) your query provided useful information (like a label, or an indexed label/property pair) for either of the nodes in your MATCH pattern. That would help neo4j narrow down the number of relationships that need to be scanned. As an example, let's state that the start node must have the Foo label:
MATCH (:Foo)-[e:DummyEdge {edge_id:"123"}]->()
SET e.property="value"
With the above query, neo4j would only have to look at the outgoing relationships of Foo nodes, which is much faster since neo4j can quickly find nodes with a given label (or index).
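Also, rather than issuing 500 separate queries, you can batch all the updates into one parameterized query. A minimal sketch, assuming you pass a parameter named updates containing a list of maps with id and value keys (the parameter name and shape are made up for illustration):
UNWIND $updates AS u
MATCH (:Foo)-[e:DummyEdge {edge_id: u.id}]->()
SET e.property = u.value
This runs all 500 updates in a single transaction instead of 500 client round trips.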
Now, neo4j also supports full-text schema indexes, which can be created on relationships. However, those kinds of indexes require much more effort on your part, and may be overkill for your use case.
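For completeness, a minimal sketch of such a full-text relationship index (the index name dummyEdges is made up; these procedures exist from Neo4j 3.5 on):
CALL db.index.fulltext.createRelationshipIndex("dummyEdges", ["DummyEdge"], ["edge_id"]);
CALL db.index.fulltext.queryRelationships("dummyEdges", "123")
YIELD relationship, score
SET relationship.property = "value";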

There are now relationship indexes that should speed up your operation massively:
https://neo4j.com/docs/cypher-manual/current/indexes-for-search-performance/#administration-indexes-create-a-single-property-b-tree-index-for-relationships
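For example (the index name is arbitrary; this syntax requires Neo4j 4.3 or later, per the linked manual page):
CREATE INDEX dummy_edge_id FOR ()-[e:DummyEdge]-() ON (e.edge_id);
With that index in place, the original MATCH ()-[e:DummyEdge {edge_id:"123"}]->() query can locate the relationship directly instead of scanning.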

Related

Use parent-child relation when I have more than 100,000 children in Elasticsearch

Can I use this approach:
https://www.elastic.co/guide/en/elasticsearch/reference/7.5/parent-join.html
when I have more than 100,000 children for each parent?
I could not find information about that specific limit, but I think you could run into problems for several reasons:
Elasticsearch warns about the low performance of this type of mapping.
There is a max_children limit on the number of children to return in a query, which defaults to 10; you would probably hit it long before 100k.
For nested objects the maximum is 10,000, which suggests the maximum for child documents will be similar.
According to the docs, all the children must be indexed on the same shard as the parent.
I would recommend re-checking your requirements and considering a flat schema. I can't say much more without knowing the data.

Why do these two seemingly identical Cypher queries differ so greatly in speed?

I have a query like this as a key component of my application:
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)
WHERE (node)-[:MEMBER_OF]->(group)
RETURN node
There is an index on :GroupType(Name)
In a database of roughly 10,000 elements this query uses nearly 1 million database hits, as its PROFILE output shows.
However, this slight variation of the query which performs an identical search is MUCH faster:
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)-[:MEMBER_OF]->(group)
RETURN node
The only difference is that the node:NodeType match and the relationship match are merged into a single MATCH instead of a MATCH ... WHERE. This query uses 1/70th of the database hits of the previous query and is more than 10 times faster, despite performing an identical search.
I thought Cypher treated MATCH ... WHERE statements as single search expressions, so the two queries should compile to identical operations, but these two queries seem to be performing vastly different operations. Why is this?
I would like to start by saying that this is not actually a Cypher problem. Cypher describes what you want, not how to get it, so the performance of this query will vary vastly between, say, Neo4j 3.1.1 and Neo4j 3.2.3.
Since the engine executing the Cypher is the one that decides how to run it, the real question is "Why doesn't the Neo4j Cypher planner treat these the same?"
Ideally, both of these queries should be equivalent to
MATCH (node:NodeType)-[:MEMBER_OF]->(group:GroupType{Name:"String"})
RETURN node
because they should all produce the same results.
In reality, there are a lot of subtle nuances in dynamically parsing a query that has very many 'equivalent' expressions, and a subtle shift in context can break that equivalence. Say you made this adjustment:
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)
WHERE (node)-[:MEMBER_OF]->(group) OR SIZE(group.members) = 1
RETURN node
Now the two queries are almost nothing alike in their results. In order to scale, the query planner must take shortcuts in its decisions to come up with an efficient plan as quickly as possible.
In short, the performance depends on which server version you are throwing the query at, because coming up with an actionable lookup strategy for a language that lets you ask for ANYTHING/EVERYTHING is hard!
RELATED READING
Optimizing performance
What is Cypher?
MATCH ... WHERE <pattern> isn't the same as MATCH <pattern>.
The first query performs the match, then uses the pattern in the WHERE clause as a filter applied to all of the rows built up so far.
You can see in the query plan that what's happening is a cartesian product between your first match results and all :NodeType nodes. Then, for each row of the cartesian product, the WHERE checks whether the :GroupType node on that row is connected to the :NodeType node on that row by the given pattern (this is the Expand(Into) operation).
The second query, by contrast, expands the pattern from the previously matched group nodes, so far fewer nodes are considered in the expansion, and they are almost immediately relevant, only requiring a final filter to ensure that those nodes are :NodeType nodes.
EDIT
As Tezra points out, Cypher operates by having you define what you want, not how to get it, as the "how" is the planner's job. In the current versions of Neo4j (3.2.3), my explanation stands, in that the planner interprets each of the queries differently and generates different plans for each, but that may be subject to change as Cypher evolves and the planner improves.
In these cases, you should be running PROFILEs on your queries and tuning accordingly.
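For example, prefixing the query with PROFILE runs it and returns the executed plan with db-hit counts per operator, which makes regressions like the cartesian product above easy to spot:
PROFILE
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)-[:MEMBER_OF]->(group)
RETURN node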

Neo4j optimization: Query for all graphs from selected to selected nodes

I am not so experienced with neo4j and have the requirement of searching for all paths from a selection A of nodes to a selection B of nodes.
Around 600 nodes in the db with some relationships per node.
Node properties:
riskId
de_DE_description
en_GB_description
en_US_description
impact
Selection:
Selection A is determined by a property match (property: 'riskId')
Selection B is a known constant list of nodes (label: 'Core')
The following query returns the result I want, but it seems a bit slow to me:
match p=(node)-[*]->(:Core)
where node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN extract(n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description]) as `risks`, length(p)
This query results in 7 rows with between 1 and 4 nodes per row, so not much.
I get around 270ms or more response time in my local environment.
I have not created any indices or done any other performance attempts.
Any hints on how I can craft the query in a more intelligent way, or apply any performance-tuning tricks?
Thank you very much,
Manuel
If there is not yet a single label that is shared by all the nodes that have the riskId property, you should add such a label (say, :Risk) to all those nodes. For example:
MATCH (n)
WHERE EXISTS(n.riskId)
SET n:Risk;
A node can have multiple labels. This alone can make your query faster, as long as you specify that node label in your query, since it would restrict scanning to only Risk nodes instead of all nodes.
However, you can do much better by first creating an index, like this:
CREATE INDEX ON :Risk(riskId);
After that, this slightly altered version of your query should be much faster, as it would use the index to quickly get the desired Risk nodes instead of scanning:
MATCH p=(node:Risk)-[*]->(:Core)
WHERE node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN
EXTRACT(n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description]) AS risks,
LENGTH(p);
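Since your longest result path has 4 nodes (3 relationships), you could additionally bound the variable-length expansion so the traversal cannot wander deeper than necessary. The bound of 3 below is an assumption based on your sample results, not a general rule:
MATCH p=(node:Risk)-[*..3]->(:Core)
WHERE node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN
EXTRACT(n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description]) AS risks,
LENGTH(p);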

neo4j performance with 4M nodes and 29M relations

I have a tree with 80,000 nodes and 4M leaves. The leaves are assigned to the tree nodes by 29M relationships. In fact I have around 4 trees where the leaves are assigned to different nodes, but that does not matter.
After about 6 days of work I figured out how to import this amount of data into neo4j in acceptable time; with the CSV import in neo4j 2.1 I hit a lot of cases where the neo4j process got stuck at 100% and did not seem to do anything. I'm now creating the database with this tool:
https://github.com/jexp/batch-import/tree/20
which is VERY fast!
Now I finally have my database and started with a simple query like "how many leaves does a specific node have":
MATCH (n:Node {id:123})-[:ASSIGNED]-(l:Leaf) RETURN COUNT(l);
I created an index on the id property, but this query still takes 52 seconds.
It seems like the relationships (which have no properties) are not indexed at all...
Is there a way to make this faster?
The relationships don't have to be indexed.
Did you create an index like this:
create index on :Node(id);
I recommend that you add a direction to your relationship pattern, otherwise you will follow all relationships both up and down the tree:
MATCH (n:Node {id:123})<-[:ASSIGNED]-(l:Leaf) RETURN COUNT(l);
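If you only need the count, Neo4j 3.x can answer it from the node's degree store without expanding to the leaves at all, provided every incoming ASSIGNED relationship really does come from a Leaf node (an assumption based on your data model):
MATCH (n:Node {id:123})
RETURN SIZE((n)<-[:ASSIGNED]-()) AS leafCount;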

Mongo db query subset

I currently have a MongoDB setup with a fairly large database (about 250m documents). At present, I have one main collection that has the majority of the data, which has a single index (time). This results in acceptable query times when only the time is in the where part of the query (the index is used).
The problem is when I need to use a compound key - the time index uses about 2.5GB of memory, and I only have 4GB on the server, so I don't want to create a compound key index since that will prevent all indexes from fitting in memory and thus slow things down a lot.
So my question is this: can I query first for time, and then query that subset for the other variables?
I should point out that I am using the Ruby driver.
At the moment, my query looks like this (this is very slow):
# find_one returns a plain document, so sort a find() cursor and take the first result
trade_stop_loss_time = ticks.find(
  "time" => { "$gt" => trade_time_open, "$lte" => trade_time_close },
  "bid"  => { "$lte" => stop_loss_price }
).sort("time" => 1).first
Thanks!
If you simply perform the query you present, the database should be smart enough to do exactly that.
The query you have should basically filter down the candidate set using the time index, then scan the remaining objects for the bid parameter. This should be a lot more efficient than doing the scan on the client.
You should definitely run explain() on your query to find out what it's doing. If it uses an index (BtreeCursor) and the number of scanned objects is just the number of items in the given time frame, it's doing fine. I don't think there's a better way than that, given your constraints. Doing the same operation on the client will definitely be slower.
Of course, a limit and a small time frame will help to make your query faster, but these might be external factors. mongostat might also help to find the problem.
However, if your documents and/or time spans are large, it might still be better to add the compound index: loading a lot of large documents from disk (since your RAM is already full) will take some time. Paging the index from disk is also slow, but it's much less data.
A good answer can be given only by experiment.
You could return the results using just the time index, then filter them further client-side? Other than that, I think you're pretty much out of luck.
