I have a tree with 80,000 nodes and 4M leafs. The leafs are assigned to the tree nodes by 29M relations. In fact i have around 4 trees where the leafs are assigned to different nodes but that does not matter.
After about 6 days of work i figured out how to import such amount of data into neo4j within acceptable time and a lot of cases (csv import neo4j 2.1) where the neo4j process stuck at 100% and does not seem to do anything. I'm now creating the database with this tool:
https://github.com/jexp/batch-import/tree/20
which is VERY fast!
Now i finally got my database and started with a simple query like "how many leafs has a specific node":
MATCH (n:Node {id:123})-[:ASSIGNED]-(l:Leaf) RETURN COUNT(l);
i created an index on the "id" property but still this query takes 52 seconds.
It seems like the relation (without propertys) is not indexed at all...
Is there a way to make this faster?
The relationships don't have to be indexed.
Did you create an index like this:
create index on :Node(id);
I recommend that you add a direction to your arrow otherwise you will follow all relationship up and down the tree.
MATCH (n:Node {id:123})<-[:ASSIGNED]-(l:Leaf) RETURN COUNT(l);
Related
I have a database with 500K nodes and 700K relationships. I created 500 additional relationships with a new typeDummyEdge with edge_id attributes from "1" to "500". Now I want to query and modify these relationships. Running a query
MATCH ()-[e:DummyEdge {edge_id:"123"}]->() SET e.property="value" is really slow, it takes around 300ms, so if I run 500 such queries, it takes around 2-3 minutes. I also called CREATE INDEX ON :DummyEdge(edge_id) but it didn't speed up the query execution.
Is there any way to make such bulk relationship modification faster?
CREATE INDEX creates an index for nodes, so such an index would make no difference in the performance of your query.
Since your MATCH pattern, ()-[e:DummyEdge {edge_id:"123"}]->(), provided no information about the end nodes, neo4j has to scan every relationship in the DB to find the ones you want. That is why your query is so slow.
It would be much more efficient if (as # MichaelHunger stated) your query provided useful information (like a label, or an indexed label/property pair) for either of the nodes in your MATCH pattern. That would help neo4j narrow down the number of relationships that need to be scanned. As an example, let's state that the start node must have the Foo label:
MATCH (:Foo)-[e:DummyEdge {edge_id:"123"}]->()
SET e.property="value"
With the above query, neo4j would only have to look at the outgoing relationships of Foo nodes, which is much faster since neo4j can quickly find nodes with a given label (or index).
Now, neo4j also supports full-text schema indexes, which do support relationship indexes. However, those kinds of indexes require much more effort on your part, and may be overkill for your use case.
There are now relationship - indexes that should spee up your operation massively.
https://neo4j.com/docs/cypher-manual/current/indexes-for-search-performance/#administration-indexes-create-a-single-property-b-tree-index-for-relationships
I want to save a large graph in Redis and was trying to accomplish this using RedisGraph. To test this I was creating a test-graph first to check the performance characteristics.
The graph is rather small for the purposes we need.
Vertices: about 3.5 million
Edges: about 18 million
And this is very limited for our purposes, we would need to be able to increase this to 100's of millions of edges in a single database.
In any case, I was checking space and performance requirements buit stopped after only loading in the vertices and seeing that the performance for a:
GRAPH.QUERY gid 'MATCH (t:token {token: "some-string"}) RETURN t'
Is over 300 milliseconds for just this retrieval which is absolutely unacceptable.
Am I missing an obvious way to improve the retrieval performance, or is that currently the limit of RedisGraph?
Thanks
Adding an index will speed things up a lot when matching.
CREATE INDEX ON :token(token)
From my investigations, I think that at least one instance of the item must exist for an index to be created, but I've not done any numbers on extra overhead of creating the index early and then adding most of the new nodes, rather than after all items are in the tree and they can be indexed en-mass.
In case all nodes are labeled as "token" then redisgraph will have to scan 3.5 million entities, comparing each entity "token" attribute against the value you've provided ("some-string")
for speed up I would recommend either adding an index, or limiting the number of results you would like to receive using LIMIT.
Also worth mentioning is that the first query to be served might take awhile longer then following queries due to internal memory management.
I am not so experienced in neo4j and have the requirement of searching for all graphs from a selection A of nodes to a selection B of nodes.
Around 600 nodes in the db with some relationships per node.
Node properties:
riskId
de_DE_description
en_GB_description
en_US_description
impact
Selection:
Selection A is determined by a property match (property: 'riskId')
Selection B is a known constant list of nodes (label: 'Core')
The following query returns the result I want, but it seems a bit slow to me:
match p=(node)-[*]->(:Core)
where node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN extract (n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description] )
as `risks`, length(p)
This query results in 7 rows with between 1 and 4 nodes per row, so not much.
I get around 270ms or more response time in my local environment.
I have not created any indices or done any other performance attempts.
Any hints how I can craft the query in more intelligent way or apply any performance tuning tricks?
Thank you very much,
Manuel
If there is not yet a single label that is shared by all the nodes that have the riskId property, you should add such a label (say, :Risk) to all those nodes. For example:
MATCH (n)
WHERE EXISTS(n.riskId)
SET n:Risk;
A node can have multiple labels. This alone can make your query faster, as long as you specify that node label in your query, since it would restrict scanning to only Risk nodes instead of all nodes.
However, you can do much better by first creating an index, like this:
CREATE INDEX ON :Risk(riskId);
After that, this slightly altered version of your query should be much faster, as it would use the index to quickly get the desired Risk nodes instead of scanning:
MATCH p=(node:Risk)-[*]->(:Core)
WHERE node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN
EXTRACT(n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description]) AS risks,
LENGTH(p);
I'm looking for a fast Cypher statement that returns all relationships between a known set of nodes (I have their Neo4j ID's), so that I can assemble the subgraph for that particular set of nodes. I'm working within a label called label which has around 50K nodes and 800K edges between these nodes.
I have several working approaches for this, but none are fast enough for my application, even at small set sizes (less than 1000 nodes).
For example, the following statement does the trick:
MATCH (u:label)-[r]->(v:label)
WHERE (ID(u) IN {ids}) AND (ID(v) IN {ids})
RETURN collect(r)
Where {ids} is a list of numeric Neo4j ids given as parameter to the Py2Neo cypher.execute(statement, parameters) method. The problem is that it takes around 34 seconds for a set of 838 nodes, which returns all 19K relationships between them. I realize the graph is kind of dense, but it takes 1.76 seconds for every 1000 edges returned. I just don't think that's acceptable.
If I use the START clause instead (shown below), the time is actually a little worse.
START u=node({ids}), v=node({ids})
MATCH (u:label)-[r]->(v:label)
RETURN collect(r)
I've found many similar questions/answers, however they all fall short in some aspect. Is there a better statement for doing this, or even a better graph schema, so that it can scale to sets of thousands of nodes?
UPDATE
Thanks for the fast replies. First, to run my current query for 528 nodes as input (len(ids)=528) it takes 32.1 seconds and the query plan is below.
NodeByIdSeek: 528 hits
Filter : 528 hits
Expand(All) : 73,773 hits
Filter : 73,245 hits
Projection : 0 hits
Filter : 0 hits
Brian Underwood's query, with the same input, takes 27.8 seconds. The query plan is identical, except for the last 2 steps (Projection and Filter), which don't exist on for his query. However the db hits sum is the same.
Michael Hunger's query takes 26.9 seconds and the query plan is identical to Brian's query.
I've restarted the server between experiments to avoid cache effects (there's probably a smarter way to do it). I'm also querying straight from the web interface to by pass possible bottlenecks in my code and the libs I'm using.
Bottomline, Neo4j seems smart enough to optimize my query, however it's still pretty slow even with fairly small sets. Any suggestions?
I think the problem is that the query is doing a Cartesian product to get all combinations of the 838 node, so you end up searching 838*838=702,244 combinations.
I'm curious how this would perform:
MATCH (u:label)-[r]->(v:label)
WHERE (ID(u) IN {ids})
WITH r, v
WHERE (ID(v) IN {ids})
RETURN collect(r)
Also, why do the collect at the end?
How big are your id-lists?
Try this:
MATCH (u) WHERE ID(u) IN {ids}
WITH u
MATCH (v)-[r]->(v)
WHERE ID(v) IN {ids}
RETURN count(*)
MATCH (u) WHERE (ID(u) IN {ids})
WITH u
MATCH (v)-[r]->(v)
WHERE ID(v) IN {ids}
RETURN r
Also try to create a query plan by prefixing your query with PROFILE then you see where the cost is.
Using Neo4j 2.1.4. I have a graph with 'IS A' relationships (and other types of relationships) between nodes. I have some hierarchies inside the graph (IS A relationships) and I need to know the descendants (IS A relationship) of one hierarchy that has a particular-known relationship with some descendant of second hierarchy. If that particular-known relationship exists, I return the descendant/s of the first hierarchy.
INPUTS: 'ID_parentnode_hierarchy_01', 'ID_relationship', 'ID_parentnode_hierarchy_02'.
OUTPUT: Descendants (IS A relationship) of 'ID_parentnode_hierarchy_01' that has 'ID_relationship' with some descendant of 'ID_parentnode_hierarchy_02'.
Note: The graph has 500.000 nodes and 2 million relationships.
I am using this cypher query but it is very slow (aprox. 40s in a 4GB RAM and 3GHz Pentium Dual Core 64 bit PC). It is possible to build a faster query?
MATCH (parentnode_hierarchy_01: Node{nodeid : {ID_parentnode_hierarchy_01}})
WITH parentnode_hierarchy_01
MATCH (parentnode_hierarchy_01) <- [:REL* {reltype: {isA}}] - (descendants01: Node)
WITH descendants01
MATCH (descendants01) - [:REL {reltype: {ID_relationship}}] -> (descendants02: Node)
WITH descendants02, descendants01
MATCH (parentnode_hierarchy_02: Node {nodeid: {ID_parentnode_hierarchy_02} })
<- [:REL* {reltype: {isA}}] - (descendants02)
RETURN DISTINCT descendants01;
Thank you very much.
Well, I can slightly clean up your query - this might help us understand the issues better. I doubt this one will run faster, but using the cleaned up version we can discuss what's going on: (mostly eliminating unneeded uses of MATCH/WITH)
MATCH (parent:Node {nodeid: {ID_parentnode_hierarchy_01}})<-[:REL* {reltype:{isA}}]-
(descendants01:Node)-[:REL {reltype:{ID_relationship}}]->(descendants02:Node),
(parent2:Node {nodeid: {ID_parentnode_hierarchy_02}})<-[:REL* {reltype:{isA}}]-
(descendants02)
RETURN distinct descendants01;
This looks like you're searching two (probably large) trees, starting from the root, for two nodes somewhere in the tree that are linked by an {ID_relationship}.
Unless you can provide some query hints about which node in the tree might have an ID_relationship or something like that, at worst, this looks like you could end up comparing every two nodes in the two trees. So this looks like it could take n * k time, where n is the number of nodes in the first tree, k the number of nodes in the second tree.
Here are some strategy things to think about - which you should use depends on your data:
Is there some depth in the tree where these links are likely to be found? Can you put a range on the depth of [:REL* {reltype:{isA}}]?
What other criteria can you add to descendants01 and descendants02? Is there anything that can help make the query more selective so that you're not comparing every node in one tree to every node in the other?
Another strategy you might try is this: (this might be a horrible idea, but it's worth trying) -- basically look for a path from one root to the other, over any number of undirected edges of either isa type, or the other. Your data model has :REL relationships with a reltype attribute. This is probably an antipattern; instead of a reltype attribute, why is the relationship type not just that? This prevents the query that I want to write, below:
MATCH p=shortestPath((p1:Node {nodeid: {first_parent_id}})-[:isA|ID_relationship*]-(p2:Node {nodeid: {second_parent_id}}))
return p;
This would return the path from one "root" to the other, via the bridge you want. You could then use path functions to extract whatever nodes you wanted. Note that this query isn't possible currently because of your data model.