Performance issues related to Neo4j path query

I have the following Cypher query, which tries to find paths between nodes drawn from the same set such that each returned path contains all 5 specified relationship types.
match p=(n)-[r*1..10]->(m)
where (m.URI IN ['http://.../x86_64/2#this', ... ,'http://.../CSP52369'])
AND (n.URI IN ['http://.../x86_64/2#this', ... ,'http://.../CSP52369'])
AND filter(x IN r where type(x)=~'.*isOfFormat.*')
AND filter(y IN r where type(y)=~'.*Processors.*')
AND filter(z IN r where type(z)=~'.*hasProducts.*')
AND filter(u IN r where type(u)=~'.*ProcessorFamilies.*')
AND filter(v IN r where type(v)=~'.*hasProductCategory.*')
return p;
The query I had above worked just fine and I got the paths I wanted. However, the execution time for the query was quite long. Below is some information about the query and the graph I used:
1) the graph contains 107,387 nodes and 226,468 relationships;
2) the set of source (and destination) nodes has 120 members; in other words, there are 120 strings in (n.URI IN ['x86_64/2#this', ... ,'/CSP52369']) and in (m.URI IN ['x86_64/2#this', ...,'/CSP52369']).
The query execution time for the above query is 212,840 ms.
Then, in order to find nodes by their URI property faster, I added a label URI to those nodes and created an index on :URI(URI). The modified query looks like this:
match p=(n:URI)-[r*1..10]->(m:URI)
where (m.URI IN ['http://.../x86_64/2#this', ... ,'http://.../CSP52369'])
AND (n.URI IN ['http://.../x86_64/2#this', ... ,'http://.../CSP52369'])
AND filter(x IN r where type(x)=~'.*isOfFormat.*')
AND filter(y IN r where type(y)=~'.*Processors.*')
AND filter(z IN r where type(z)=~'.*hasProducts.*')
AND filter(u IN r where type(u)=~'.*ProcessorFamilies.*')
AND filter(v IN r where type(v)=~'.*hasProductCategory.*')
return p;
I ran the query again and the execution time was 5,841 ms, a large improvement. However, I am not sure how the index helped here. I profiled both queries; below is what I got.
The top figure is the profiling result for the first query and the bottom figure is the result for the second.
Comparing the two execution plans, I didn't see any index-related operators such as "NodeIndexSeek". Further, according to both plans, the system first computed all paths between n and m and only then applied the filter to choose which ones to keep. In that case, how would the index help?
Can anybody help me clear up my doubts? Thanks in advance!!!

It seems your query runs with the RULE-based optimizer, while it should run with the cost-based one. I hope you are using the latest Neo4j version, 2.3.1.
I would also change your filter (which is not really a predicate) into an ALL/ANY predicate, as in the last line of the WHERE clause below.
You might also need to add the index lookup hints:
WITH ["isOfFormat","Processors","hasProducts","ProcessorFamilies","hasProductCategory"] as types
MATCH p=(n:URI)-[rels*1..10]->(m:URI)
USING INDEX n:URI(URI)
USING INDEX m:URI(URI)
WHERE (m.URI IN ['http://.../x86_64/2#this', ... ,'http://.../CSP52369'])
AND (n.URI IN ['http://.../x86_64/2#this', ... ,'http://.../CSP52369'])
AND ALL(t in types WHERE ANY(r in rels WHERE type(r) = t))
RETURN p;
But you might be better off expressing your path as a concrete pattern with the relevant relationship types in between, like Nicole suggested:
MATCH (n:URI)-[rels:isOfFormat|:Processors|:hasProducts|
:ProcessorFamilies|:hasProductCategory*..10]-(m:URI)
...
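One caveat worth spelling out: restricting the pattern to those five types only limits which relationships may appear on a path; it does not by itself require that all five appear, so the ALL/ANY predicate is probably still needed on top of it. Below is a minimal Python sketch of the two checks, using made-up paths given as lists of relationship type names (the function names are illustrative and not part of any Neo4j API):

```python
REQUIRED = ["isOfFormat", "Processors", "hasProducts",
            "ProcessorFamilies", "hasProductCategory"]


def has_all_types(rel_types, required):
    """True when every required type occurs at least once on the path.

    Mirrors ALL(t IN types WHERE ANY(r IN rels WHERE type(r) = t)).
    """
    return all(any(r == t for r in rel_types) for t in required)


def uses_only_types(rel_types, allowed):
    """True when every relationship on the path has one of the allowed types.

    This is all that a type-restricted variable-length pattern guarantees.
    """
    return all(r in allowed for r in rel_types)


# Hypothetical paths, each written as the list of its relationship types.
complete_path = ["Processors", "isOfFormat", "hasProducts",
                 "ProcessorFamilies", "hasProductCategory"]
partial_path = ["Processors", "Processors", "isOfFormat"]

for path in (complete_path, partial_path):
    print(uses_only_types(path, REQUIRED), has_all_types(path, REQUIRED))
```

Both paths pass the "only these types" check, but only the first passes the "all of these types" check. Note also that the rewritten predicate compares types for exact equality, whereas the original regular expressions matched substrings.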

Related

Cypher: slow query optimization

I am using redisgraph with a custom implementation of ioredis.
The query takes 3 to 6 seconds on a database that has millions of nodes. It filters (b:brand) by several relationship counts, repeating the following MATCH and WHERE for different node types.
(:brand) - 1mil nodes
(:w) - 20mil nodes
(:e) - 10mil nodes
// matching b before this codeblock
MATCH (b)-[:r1]->(p:p)<-[:r2]-(w:w)
WHERE w.deleted IS NULL
WITH count(DISTINCT w) as count, b
WHERE count >= 0 AND count <= 10
The full query would look like this.
MATCH (b:brand)
WHERE b.deleted IS NULL
MATCH (b)-[:r1]->(p:p)<-[:r2]-(w:w)
WHERE w.deleted IS NULL
WITH count(DISTINCT w) as count, b
WHERE count >= 0 AND count <= 10
MATCH (c)-[:r3]->(d:d)<-[:r4]-(e:e)
WHERE e.deleted IS NULL
WITH count(DISTINCT e) as count, b
WHERE count >= 0 AND count <= 10
WITH b ORDER by b.name asc
WITH count(b) as totalCount, collect({id: b.id})[$cursor..($cursor+$limit)] AS brands
RETURN brands, totalCount
How can I optimize this query as it's really slow?
A few thoughts:
Property lookups are expensive; is there a way you can get around all the .deleted checks?
If possible, can you avoid naming r1, r2, etc.? It's faster when it doesn't have to check the relationship type.
You're essentially traversing the entire graph several times. If the paths b-->p<--w and c-->d<--e don't overlap, you can include them both in the match statement, separated by a comma, and aggregate both counts at once
I don't know if it'll help much, but you don't need to name p and d since you never refer to them
This is a very small improvement, but I don't see a reason to check count >= 0
Also, I'm sure you have your reasons, but why does the c-->d<--e path matter? This would make more sense to me if it were b-->d<--e to mirror the first portion.
EDIT/UPDATE: A few things I said need clarification:
First bullet:
The fastest lookup is on a node label; up to 4 labels are essentially O(1). (Well, for anchor nodes; it's slower for downstream nodes.)
The second-fastest lookup is on an INDEXED property. My comment above assumed UNINDEXED lookups.
Second bullet: I think I was just wrong here. Relationships are stored as doubly-linked lists grouped by relationship type. Therefore, always specify relationship type for better performance. Similarly, always specify direction.
Third bullet: What I said is generally correct; however, beware of Cartesian products when you have two patterns separated by a comma in one MATCH. In general, you would only use that structure when the patterns share a common element, for example when you want directors, actors, and cinematographers all connected to a movie. Still, there is no overlap between these paths.
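To make the Cartesian product warning concrete, here is a small Python sketch. The row values are made up; they stand for the matches of two unrelated patterns for a single brand:

```python
from itertools import product

# Hypothetical matches for one brand: 3 rows from the first pattern,
# 2 rows from the second, unrelated pattern.
w_rows = ["w1", "w2", "w3"]
e_rows = ["e1", "e2"]

# Matching both patterns together pairs every w row with every e row.
combined = list(product(w_rows, e_rows))

# Distinct counts are still right, but they are computed over the
# multiplied row set.
distinct_w = len({w for w, _ in combined})
distinct_e = len({e for _, e in combined})

print(len(combined), distinct_w, distinct_e)
```

With millions of nodes per label the intermediate row count can grow very quickly, which is why aggregating after each pattern (as the original query does with WITH) may still be the cheaper option.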

Optimize Neo4j cypher query on huge dataset

The following query can't run on a dataset with ~2M nodes. What should I do to make it run faster?
MATCH (cc:ConComp)-[r1:IN_CONCOMP]-(p1:Person)-[r2:SAME_CLUSTER]-(p2:Person)
WHERE cc.cluster_type = "household"
MERGE (cluster:Cluster {CLUSTER_TMP_ID:cc.CONCOMP_ID + '|' + r2.root_id, cluster_type:cc.cluster_type })
MERGE (cluster)-[r3:IN_CLUSTER]-(p1)
A number of suggestions:
adding directions to your relationships will decrease the number of paths in the MATCH
make sure that you have indexes on all properties that you MERGE on
in the second MERGE, also add direction.
I finally found a solution by using the following query (and by indexing cc.cluster_type and cc.CONCOMP_ID):
CALL apoc.periodic.iterate('MATCH (cc:ConComp)<-[r1:IN_CONCOMP]-(p1:Person)-[r2:SAME_CLUSTER]-(p2:Person) WHERE cc.cluster_type = "household" WITH DISTINCT cc.CONCOMP_ID + "|" + r2.root_id as id_name, cc.cluster_type as cluster_type_name, p1 RETURN id_name, cluster_type_name, p1', '
MERGE (cluster:Cluster {CLUSTER_TMP_ID: id_name, cluster_type: cluster_type_name})
MERGE (cluster)-[r3:IN_CLUSTER]->(p1)', {batchSize:10000, parallel:false})
To be precise, I had previously run the query from my original question with apoc.periodic.iterate, without success.
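For readers unfamiliar with the idea behind apoc.periodic.iterate: the first statement produces rows, and the second statement is applied to those rows in batches of batchSize, each batch in its own transaction. The Python sketch below only models that batching idea; it is not APOC's implementation, and the names batches, run_in_batches and apply_batch are made up for illustration:

```python
from itertools import islice


def batches(rows, batch_size):
    """Yield consecutive lists of at most batch_size rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def run_in_batches(rows, batch_size, apply_batch):
    """Apply apply_batch to each chunk in order; return the number of batches.

    In the real procedure each batch would be committed separately, so no
    single transaction has to hold all the changes at once.
    """
    done = 0
    for chunk in batches(rows, batch_size):
        apply_batch(chunk)
        done += 1
    return done


# Usage with 25 hypothetical rows and a batch size of 10.
seen_sizes = []
n_batches = run_in_batches(range(25), 10, lambda chunk: seen_sizes.append(len(chunk)))
print(n_batches, seen_sizes)
```

The WITH DISTINCT in the first statement probably also matters here, since it reduces the number of rows fed into the MERGE step.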

NRediSearch - Getting total documents matched count

Is there a way to get a total results count when calling Aggregate function?
Note that I'm not using Aggregate function to aggregate results, but as an advanced search query, because Search function does not allow to sort by multiple fields.
RediSearch returns total documents matched count, but I can't find a way to get this number using NRediSearch library.
With NRediSearch
Using NRediSearch, you would need to build and execute an aggregation that runs a GROUPBY 0 with the COUNT reducer. Say you have a person-idx index and you want to count all the Person documents in Redis:
var client = new Client("person-idx", muxer.GetDatabase());
var result = await client.AggregateAsync(new AggregationBuilder().GroupBy(new List<string>(), new List<Reducer>{Reducers.Count()}));
Console.WriteLine(result.GetResults().First().Values.First());
This will get the count you are looking for.
With Redis.OM
There's a newer library, Redis.OM, which makes these aggregations a bit simpler. The same operation would be done with the following:
var peopleAggregations = provider.AggregationSet<Person>();
Console.WriteLine(peopleAggregations.Count());

Neo4j query taking long time

I am currently working on a social media site with a typical user timeline: users can follow, create and share posts, block, unblock, etc. For that, we have created 2 labels, "User" and "Post", and several relationship types such as follow, block, private, etc.
Currently, we have approximately 41,000 nodes and 650,000 relationships.
Hardware conf:
8 gb ram
2 core
50 GB HDD
1 Master and 2 Slave
We are using the following query to get a user's timeline:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'}),(p:Post{is_deleted:'0'}),(po:User{user_id:p.owner_id})
WHERE (p.post_type = '1' OR p.post_type = '4') WITH n,p,po
WHERE po.is_active='1' AND (n)-[:CREATED{own_status:'1'}]->(p) OR
(n)-[:FOLLOWS{follow_status:'1'}]->(:User{is_active:'1'})-[:CREATED{own_status:'1'}]->(p)
OR (n)-[:FOLLOWS{follow_status:'1'}]->(:Keyword{is_deleted:'0'})-[:KEYWORD]->(p)
WITH n,p,po
OPTIONAL MATCH (n)-[fr:FOLLOWS]->(po)
WHERE fr.follow_status='1' WITH p,n,po,fr
WHERE NOT ((n)-[:FOLLOWS{is_blocked:true}]->(po) OR (n)-[:FOLLOWS{is_mute:true}]->(po)) WITH p,n,po,fr
WHERE NOT (n)<-[:FOLLOWS{is_blocked:true}]-(po) WITH p,n,po,fr
WHERE (fr is not null and toInteger(po.is_private) <= 1 AND po.user_id <> n.user_id)
OR (toInteger(po.is_private) <= 1 AND po.user_id = n.user_id)
OR (toInteger(po.is_private) = 0 AND po.user_id <> n.user_id) WITH p,n,po
RETURN p,po,SIZE(()-[:LIKED]->(p)) as likecount,
SIZE((n)-[:LIKED]->(p)) as likestatus,count(*) as postcount
ORDER BY p.created_at DESC
SKIP 0 LIMIT 10
This query takes more than 10 seconds, which is too long.
Here is the profile of the above query
Here is the index list
Can anybody suggest what I am doing wrong?
If you're trying to get a user's timeline, I would think you'd start with the specific user, then connect to other nodes via the relationships you're interested in. The current query isn't taking advantage of pattern matching or the connected nature of a graph database.
The first match statement of the query as it's currently written finds a specific user, then all Post nodes that have the property is_deleted:'0' and then all User nodes that are connected to any of the Post nodes. Searching this way is giving you more database hits (54,984) in the first middle Expand(All) than there are nodes in the database (41,000).
Where you should get the most lift in optimizing this query is to focus your search on the single user then expand out from there using the relationships:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r]-(p:Post{is_deleted:'0'})
This will match the user and all qualifying posts connected to the user via a relationship. Note, if a user isn't connected to any qualifying posts, there won't be any matches even if that user does exist in the database.
If you only want to include certain relationship types, you can specify that in this first MATCH statement like this:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r:CREATED|FOLLOWS|KEYWORD]-(p:Post{is_deleted:'0'})
Or you can put it in the WHERE clause like this:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r]-(p:Post{is_deleted:'0'})
WHERE type(r) in ['CREATED', 'FOLLOWS' , 'KEYWORD']
I didn't follow all your conditional statements (and I think you might be able to remove some of them once you convert it to pattern matching), but once you have your initial pattern you can add in whatever conditional statements you need. Example:
WHERE (p.post_type = '1' OR p.post_type = '4')
AND (r.own_status = '1' OR r.follow_status = '1')
AND NOT r.is_blocked = true
For more on pattern matching, check out section 2.9 of the Neo4j Cypher Manual.

Slow Neo4j query despite indices

Here I'm trying to find all Twitter users who are followed by and who follow any members of some group G:
MATCH (x:User)-[:FOLLOWS]->(t:User)-[:FOLLOWS]->(y:User)
WHERE (x.screen_name IN {{G_SCREEN_NAMES}} OR x.id IN {{G_IDS}})
AND (y.screen_name IN {{G_SCREEN_NAMES}} OR y.id IN {{G_IDS}})
RETURN t.id
But for the group G I sometimes have their screen names and sometimes their ids, hence the OR clause above. Unfortunately this query is long-running and doesn't appear to ever return.
I have indices and constraints on both id and screen_name:
Indexes
ON :User(screen_name) ONLINE (for uniqueness constraint)
ON :User(id) ONLINE (for uniqueness constraint)
Constraints
ON (user:User) ASSERT user.screen_name IS UNIQUE
ON (user:User) ASSERT user.id IS UNIQUE
If I get rid of the OR clause (for instance if I happen to have all screen_names or all ids for group G) then the query runs quite fast.
I'm using neo4j-community-2.1.3 on a Mac. My graph has 286039 nodes, all of which have the User label.
Any ideas to improve this? Otherwise I'll have to chop this up into 4 queries to get all possible combinations of members. This is even more problematic because I really want to keep track of how commonly a user appears in a G-->user-->G relationship, and I'll need to do a lot of extra bookkeeping if the counts are spread among 4 different queries.
Update
I created an issue related to this: https://github.com/neo4j/neo4j/issues/2834
I ended up using
MATCH (x:User) WHERE x.screen_name IN ["apple","banana","coconut"]
WITH collect(id(x)) as x_ids
MATCH (x:User) WHERE x.id in [12345,98765]
WITH x_ids+collect(id(x)) as x_ids
MATCH (y:User) WHERE y.screen_name IN ["apple","banana","coconut"]
WITH x_ids,collect(id(y)) as y_ids
MATCH (y:User) WHERE y.id in [12345,98765]
WITH x_ids,y_ids+collect(id(y)) as y_ids
MATCH (x:User)-[:FOLLOWS]->(t:User)-[:FOLLOWS]->(y:User)
WHERE id(x) in x_ids AND id(y) in y_ids
RETURN count(*) as c, t.screen_name,t.id
ORDER BY c DESC
LIMIT 1000
But this basically represents a hack to get around a place where neo4j isn't using the indices that it could be.
I guess the query does not make use of indexes because of the OR condition; you can verify this by prefixing the query with PROFILE and running it in neo4j-shell.
If the profile shows no index usage, you might split the query into two parts. The first one fetches the combined list of user ids; instead of the OR, we do a UNION of two queries (each using an index lookup):
MATCH (x:User) WHERE x.screen_name in {G_SCREEN_NAMES} RETURN id(x) as ids UNION
MATCH (x:User) WHERE x.id in {G_IDS} RETURN id(x) as ids
On the client side, use the list of node ids as parameter for the next query:
MATCH (x:User)-[:FOLLOWS]->(t)-[:FOLLOWS]->(y)
WHERE id(x) in {ids} AND id(y) in {ids}
RETURN t.id
I've intentionally removed the labels for t and y, on the assumption that you can only follow User nodes and no other kind of node. This removes an unnecessary label check.
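The client-side step and the per-user counting the question asks about can be sketched in plain Python. This is only an in-memory model under stated assumptions: follows is a hypothetical dict standing in for the FOLLOWS relationships, and merge_ids and count_intermediaries are illustrative names, not driver functions:

```python
from collections import Counter


def merge_ids(*id_lists):
    """Combine several id lists into one, dropping duplicates (like UNION)."""
    merged, seen = [], set()
    for ids in id_lists:
        for node_id in ids:
            if node_id not in seen:
                seen.add(node_id)
                merged.append(node_id)
    return merged


def count_intermediaries(follows, group_ids):
    """Count, for each t, how many x -> t -> y paths start and end in the group."""
    group = set(group_ids)
    counts = Counter()
    for x in group:
        for t in follows.get(x, ()):
            for y in follows.get(t, ()):
                if y in group:
                    counts[t] += 1
    return counts


# Hypothetical data: ids found by screen name and by id, plus a tiny graph.
ids = merge_ids([1, 2], [2])
follows = {1: {10}, 2: {10, 11}, 10: {1, 2}, 11: {3}}
print(ids, count_intermediaries(follows, ids))
```

Because the ids are merged before the traversal, the counts come out of a single pass rather than being spread over four separate queries.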
JnBrymn,
How about this query?
MATCH (x:User)
WHERE x.screen_name IN {{G_SCREEN_NAMES}} OR x.id IN {{G_IDS}}
WITH x
MATCH (x)-[:FOLLOWS]->(t:User)
WITH t
MATCH (t)-[:FOLLOWS]->(y:User)
WHERE y.screen_name IN {{G_SCREEN_NAMES}} OR y.id IN {{G_IDS}}
RETURN t.id
Grace and peace,
Jim
