I'm looking for a fast Cypher statement that returns all relationships between a known set of nodes (I have their Neo4j IDs), so that I can assemble the subgraph for that particular set. I'm working within a label called label, which has around 50K nodes and 800K edges between them.
I have several working approaches for this, but none are fast enough for my application, even at small set sizes (less than 1000 nodes).
For example, the following statement does the trick:
MATCH (u:label)-[r]->(v:label)
WHERE (ID(u) IN {ids}) AND (ID(v) IN {ids})
RETURN collect(r)
Here {ids} is a list of numeric Neo4j IDs passed as a parameter to the Py2Neo cypher.execute(statement, parameters) method. The problem is that it takes around 34 seconds for a set of 838 nodes, which returns all 19K relationships between them. I realize the graph is fairly dense, but that works out to about 1.76 seconds per 1000 edges returned, which I just don't think is acceptable.
If I use the START clause instead (shown below), the time is actually a little worse.
START u=node({ids}), v=node({ids})
MATCH (u:label)-[r]->(v:label)
RETURN collect(r)
I've found many similar questions/answers, however they all fall short in some aspect. Is there a better statement for doing this, or even a better graph schema, so that it can scale to sets of thousands of nodes?
UPDATE
Thanks for the fast replies. First, running my current query with 528 nodes as input (len(ids) = 528) takes 32.1 seconds; the query plan is below.
NodeByIdSeek: 528 hits
Filter: 528 hits
Expand(All): 73,773 hits
Filter: 73,245 hits
Projection: 0 hits
Filter: 0 hits
Brian Underwood's query, with the same input, takes 27.8 seconds. The query plan is identical except for the last two steps (Projection and Filter), which don't exist for his query. However, the sum of db hits is the same.
Michael Hunger's query takes 26.9 seconds and the query plan is identical to Brian's query.
I've restarted the server between experiments to avoid cache effects (there's probably a smarter way to do it). I'm also querying straight from the web interface to bypass possible bottlenecks in my code and the libraries I'm using.
Bottom line: Neo4j seems smart enough to optimize my query, but it's still pretty slow even with fairly small sets. Any suggestions?
I think the problem is that the query is doing a Cartesian product to get all combinations of the 838 nodes, so you end up searching 838 * 838 = 702,244 combinations.
I'm curious how this would perform:
MATCH (u:label)-[r]->(v:label)
WHERE (ID(u) IN {ids})
WITH r, v
WHERE (ID(v) IN {ids})
RETURN collect(r)
Also, why do the collect at the end?
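If you only need to rebuild the subgraph on the client, a variant that skips the collect() and returns one relationship per row might be worth trying (just a sketch; it assumes the endpoint IDs and relationship type are enough to reassemble the edges on your side):
MATCH (u:label)-[r]->(v:label)
WHERE ID(u) IN {ids}
WITH r, v
WHERE ID(v) IN {ids}
RETURN ID(startNode(r)) AS source, ID(endNode(r)) AS target, type(r) AS rel_type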
How big are your id-lists?
Try this:
MATCH (u) WHERE ID(u) IN {ids}
WITH u
MATCH (u)-[r]->(v)
WHERE ID(v) IN {ids}
RETURN count(*)
And to return the relationships themselves:
MATCH (u) WHERE (ID(u) IN {ids})
WITH u
MATCH (u)-[r]->(v)
WHERE ID(v) IN {ids}
RETURN r
Also try creating a query plan by prefixing your query with PROFILE; then you can see where the cost is.
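For example, with your original query:
PROFILE
MATCH (u:label)-[r]->(v:label)
WHERE (ID(u) IN {ids}) AND (ID(v) IN {ids})
RETURN collect(r)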
I want to save a large graph in Redis and was trying to accomplish this using RedisGraph. To test this, I first created a test graph to check the performance characteristics.
The graph is rather small for the purposes we need.
Vertices: about 3.5 million
Edges: about 18 million
And this is very limited for our purposes; we would need to be able to increase this to hundreds of millions of edges in a single database.
In any case, I was checking space and performance requirements but stopped after loading in only the vertices and seeing that the performance of:
GRAPH.QUERY gid 'MATCH (t:token {token: "some-string"}) RETURN t'
is over 300 milliseconds for just this retrieval, which is absolutely unacceptable.
Am I missing an obvious way to improve the retrieval performance, or is that currently the limit of RedisGraph?
Thanks
Adding an index will speed things up a lot when matching.
CREATE INDEX ON :token(token)
From my investigations, I think that at least one instance of the item must exist before an index can be created, but I've not done any numbers on the extra overhead of creating the index early and then adding most of the new nodes, rather than creating it after all items are in the graph so they can be indexed en masse.
If all nodes are labeled as "token", then RedisGraph will have to scan 3.5 million entities, comparing each entity's "token" attribute against the value you've provided ("some-string").
To speed this up, I would recommend either adding an index or limiting the number of results you would like to receive using LIMIT.
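For example, reusing the graph key gid and the token label from your query (the LIMIT value is just illustrative):
GRAPH.QUERY gid 'CREATE INDEX ON :token(token)'
GRAPH.QUERY gid 'MATCH (t:token {token: "some-string"}) RETURN t LIMIT 10'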
Also worth mentioning is that the first query to be served might take a while longer than subsequent queries due to internal memory management.
I have a query like this as a key component of my application:
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)
WHERE (node)-[:MEMBER_OF]->(group)
RETURN node
There is an index on :GroupType(Name)
In a database of roughly 10,000 elements, this query uses nearly 1 million database hits, according to its PROFILE output.
However, this slight variation of the query which performs an identical search is MUCH faster:
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)-[:MEMBER_OF]->(group)
RETURN node
The only difference is that the node:NodeType match and the relationship match are merged into a single MATCH instead of a MATCH ... WHERE. This query uses 1/70th of the database hits of the previous query and is more than 10 times faster, despite performing an identical search.
I thought Cypher treated MATCH ... WHERE statements as single search expressions, so the two queries should compile to identical operations, but these two queries seem to be performing vastly different operations. Why is this?
I would like to start by saying that this is not actually a Cypher problem. Cypher describes what you want, not how to get it, so the performance of this query will vary vastly between, say, Neo4j 3.1.1 and Neo4j 3.2.3.
As the one executing the Cypher is the one that decides how to do this, the real question is "Why doesn't the Neo4j Cypher planner treat these the same?"
Ideally, both of these Cyphers should be equivalent to
MATCH (node:NodeType)-[:MEMBER_OF]->(group:GroupType {Name: "String"})
RETURN node
because they should all produce the same results.
In reality, there are a lot of subtle nuances to dynamically parsing a query that has very many 'equivalent' expressions, and a subtle shift in context can change that equivalence, say if you made this adjustment:
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)
WHERE (node)-[:MEMBER_OF]->(group) OR SIZE(group.members) = 1
RETURN node
Now the two queries are almost nothing alike in their results. In order to scale, the query planner must make decision shortcuts to come up with an efficient plan as quickly as possible.
In short, the performance depends on what version the server you are throwing the query at is running, because coming up with an actionable lookup strategy for a language that lets you ask for ANYTHING/EVERYTHING is hard!
RELATED READING
Optimizing performance
What is Cypher?
MATCH ... WHERE <pattern> isn't the same as MATCH <pattern>.
The first query performs the match, then uses the pattern in the WHERE clause as a filter applied to all of the built-up rows.
You can see in the query plan that what's happening is a cartesian product between your first match results and all :NodeType nodes. Then, for each row of the cartesian product, the WHERE checks whether the :GroupType node on that row is connected to the :NodeType node on that row by the given pattern (this is the Expand(Into) operation).
The second query, by contrast, expands the pattern from the previously matched group nodes, so the nodes considered from the expansion are far fewer in number and almost immediately relevant, only requiring a final filter to ensure that those nodes are :NodeType nodes.
EDIT
As Tezra points out, Cypher operates by having you define what you want, not how to get it, as the "how" is the planner's job. In the current versions of Neo4j (3.2.3), my explanation stands, in that the planner interprets each of the queries differently and generates different plans for each, but that may be subject to change as Cypher evolves and the planner improves.
In these cases, you should be running PROFILEs on your queries and tuning accordingly.
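For instance, profiling the slower form should surface the cartesian product and the Expand(Into)/Filter steps described above, so you can compare its db hits directly against the single-MATCH version:
PROFILE
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)
WHERE (node)-[:MEMBER_OF]->(group)
RETURN node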
I am not very experienced with Neo4j and need to search for all paths from a selection A of nodes to a selection B of nodes.
Around 600 nodes in the db with some relationships per node.
Node properties:
riskId
de_DE_description
en_GB_description
en_US_description
impact
Selection:
Selection A is determined by a property match (property: 'riskId')
Selection B is a known constant list of nodes (label: 'Core')
The following query returns the result I want, but it seems a bit slow to me:
match p=(node)-[*]->(:Core)
where node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN extract (n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description] )
as `risks`, length(p)
This query results in 7 rows with between 1 and 4 nodes per row, so not much.
I get around 270ms or more response time in my local environment.
I have not created any indices or done any other performance attempts.
Any hints on how I can craft the query in a more intelligent way, or apply any performance tuning tricks?
Thank you very much,
Manuel
If there is not yet a single label that is shared by all the nodes that have the riskId property, you should add such a label (say, :Risk) to all those nodes. For example:
MATCH (n)
WHERE EXISTS(n.riskId)
SET n:Risk;
A node can have multiple labels. This alone can make your query faster, as long as you specify that node label in your query, since it would restrict scanning to only Risk nodes instead of all nodes.
However, you can do much better by first creating an index, like this:
CREATE INDEX ON :Risk(riskId);
After that, this slightly altered version of your query should be much faster, as it would use the index to quickly get the desired Risk nodes instead of scanning:
MATCH p=(node:Risk)-[*]->(:Core)
WHERE node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN
EXTRACT(n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description]) AS risks,
LENGTH(p);
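To check that the index is actually picked up once it exists, you can prefix the query with PROFILE; the plan should then start from an index seek on :Risk(riskId) rather than a scan (same query as above, just profiled):
PROFILE
MATCH p=(node:Risk)-[*]->(:Core)
WHERE node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN
EXTRACT(n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description]) AS risks,
LENGTH(p);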
I have a tree with 80,000 nodes and 4M leaves. The leaves are assigned to the tree nodes by 29M relationships. In fact, I have around 4 trees where the leaves are assigned to different nodes, but that does not matter.
After about 6 days of work I figured out how to import this amount of data into Neo4j in acceptable time; in a lot of cases (CSV import, Neo4j 2.1) the Neo4j process got stuck at 100% and did not seem to do anything. I'm now creating the database with this tool:
https://github.com/jexp/batch-import/tree/20
which is VERY fast!
Now I finally have my database and started with a simple query like "how many leaves does a specific node have":
MATCH (n:Node {id:123})-[:ASSIGNED]-(l:Leaf) RETURN COUNT(l);
I created an index on the "id" property, but this query still takes 52 seconds.
It seems like the relationships (which have no properties) are not indexed at all...
Is there a way to make this faster?
The relationships don't have to be indexed.
Did you create an index like this:
create index on :Node(id);
I recommend that you add a direction to your pattern; otherwise you will follow all relationships both up and down the tree.
MATCH (n:Node {id:123})<-[:ASSIGNED]-(l:Leaf) RETURN COUNT(l);
I have a table (natomr) with 200 records that define different areas. I want to find out which area(s) contain an arbitrary point. This is my SQL:
SELECT *
FROM natomr
WHERE ST_DWithin(the_geom4326,
ST_geomfromtext('POINT(13.614807 59.684035)', 4326)::geography, 1)
This query takes about 1200 ms, which I assume is way too long for such a small table.
I have created an index for the_geom4326, like this:
CREATE INDEX natomr_the_geom4326_gist
ON natomr
USING gist
(the_geom4326 );
I have also run the VACUUM FULL command, but that did not have any effect.
What should I do to speed up the query?
Hard to tell if this is unexpected or not from what you have here; 1200 ms might be expected.
Autovacuum prevents transaction ID wraparound; it shouldn't have a speed effect on a table this small.
The table is almost too small for the index to really be effective.
Some potentials:
ST_DWithin has a certain amount of overhead associated with it; it is composed of 3 calls to two other functions that live entirely in the contrib library (in C). So your run time is going to look something like overhead + x seconds per record processed. Try scaling your data up a bit: try 10 points in a single query. This will give you a better idea of the overhead associated with ST_DWithin.
How big are the polygons in the shapefiles? As an interesting test, try defining a 5-point polygon and attempt the query to find a point in that polygon. Now define a 2000-point polygon and try the same test. If your 200 polygons here are 2000 points and larger, 1200 ms doesn't sound too unreasonable, depending on the power of your machine.