Cypher: slow query optimization - performance

I am using RedisGraph with a custom implementation of ioredis.
The query takes 3 to 6 seconds on a database with millions of nodes. It essentially filters (b:brand) by different relationship counts, adding the following MATCH/WHERE pair multiple times for different node types.
(:brand) - 1mil nodes
(:w) - 20mil nodes
(:e) - 10mil nodes
// matching b before this codeblock
MATCH (b)-[:r1]->(p:p)<-[:r2]-(w:w)
WHERE w.deleted IS NULL
WITH count(DISTINCT w) as count, b
WHERE count >= 0 AND count <= 10
The full query would look like this.
MATCH (b:brand)
WHERE b.deleted IS NULL
MATCH (b)-[:r1]->(p:p)<-[:r2]-(w:w)
WHERE w.deleted IS NULL
WITH count(DISTINCT w) as count, b
WHERE count >= 0 AND count <= 10
MATCH (c)-[:r3]->(d:d)<-[:r4]-(e:e)
WHERE e.deleted IS NULL
WITH count(DISTINCT e) as count, b
WHERE count >= 0 AND count <= 10
WITH b ORDER BY b.name ASC
WITH count(b) AS totalCount, collect({id: b.id})[$cursor..($cursor+$limit)] AS brands
RETURN brands, totalCount
How can I optimize this query as it's really slow?

A few thoughts:
Property lookups are expensive; is there a way you can get around all the .deleted checks?
If possible, can you avoid naming r1, r2, etc.? It's faster when it doesn't have to check the relationship type.
You're essentially traversing the entire graph several times. If the paths b-->p<--w and c-->d<--e don't overlap, you can include them both in the MATCH statement, separated by a comma, and aggregate both counts at once (see the sketch after this list).
I don't know if it'll help much, but you don't need to name p and d since you never refer to them
This is a very small improvement, but I don't see a reason to check count >= 0
Also, I'm sure you have your reasons, but why does the c-->d<--e path matter? This would make more sense to me if it were b-->d<--e to mirror the first portion.
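Putting those suggestions together, here's a rough sketch of what the rewrite could look like. It assumes the second path is meant to hang off b rather than c (per the last point), keeps the .deleted property checks, and drops the count >= 0 tests and the unused p/d names; treat it as a sketch, not a drop-in replacement:
MATCH (b:brand)
WHERE b.deleted IS NULL
MATCH (b)-[:r1]->(:p)<-[:r2]-(w:w), (b)-[:r3]->(:d)<-[:r4]-(e:e)
WHERE w.deleted IS NULL AND e.deleted IS NULL
WITH b, count(DISTINCT w) AS wCount, count(DISTINCT e) AS eCount
WHERE wCount <= 10 AND eCount <= 10
WITH b ORDER BY b.name ASC
WITH count(b) AS totalCount, collect({id: b.id})[$cursor..($cursor+$limit)] AS brands
RETURN brands, totalCount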
EDIT/UPDATE: A few things I said need clarification:
First bullet:
The fastest lookup is on a node label; filtering on up to 4 labels is essentially free. (That's for anchor nodes; it's slower for downstream nodes.)
The second-fastest lookup is on an INDEXED property. My comment above assumed UNINDEXED lookups.
Second bullet: I think I was just wrong here. Relationships are stored as doubly-linked lists grouped by relationship type. Therefore, always specify relationship type for better performance. Similarly, always specify direction.
Third bullet: What I said is generally correct, HOWEVER beware of Cartesian products when you have two patterns in a single MATCH separated by a comma. In general, you would only use that structure when the patterns share a common element, like when you want directors, actors, and cinematographers all connected to the same movie. Still, there's no overlap between these paths.
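For illustration, a minimal sketch of that "common element" case (the labels and relationship types here are hypothetical, not from the question):
// All three patterns share the movie node m, so this is a join rather than a Cartesian product
MATCH (d:Director)-[:DIRECTED]->(m:Movie),
      (a:Actor)-[:ACTED_IN]->(m),
      (c:Cinematographer)-[:SHOT]->(m)
RETURN m.title, collect(DISTINCT d.name) AS directors, collect(DISTINCT a.name) AS actors, collect(DISTINCT c.name) AS cinematographers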

Related

Neo4j cypher query improvement (performance)

I have the following Cypher query:
CALL apoc.index.nodes('node_auto_index','pref_label:(Foo)')
YIELD node, weight
WHERE node.corpus = 'my_corpus'
WITH node, weight
MATCH (selected:ontoterm{corpus:'my_corpus'})-[:spotted_in]->(:WEBSITE)<-[:spotted_in]-(node:ontoterm{corpus:'my_corpus'})
WHERE selected.uri = 'http://uri1'
OR selected.uri = 'http://uri2'
OR selected.uri = 'http://uri3'
RETURN DISTINCT node, weight
ORDER BY weight DESC LIMIT 10
The first part (until the WITH) runs very fast (Lucene legacy index) and returns ~100 nodes. The uri property is also unique (selected matches 3 nodes).
I have ~300 WEBSITE nodes. The execution time is 48749 ms.
Profile:
How can I restructure the query to improve performance? And why are there ~13.8 million rows in the profile?
I think the problem was in the WITH clause, which expanded the results enormously. InverseFalcon's answer makes the query faster: 49 sec -> 18 sec (but still not fast enough). To avoid the enormous expansion I collected the websites first. The following query takes 60 ms:
MATCH (selected:ontoterm)-[:spotted_in]->(w:WEBSITE)
WHERE selected.uri IN ['http://avgl.net/carbon_terms/Faser', 'http://avgl.net/carbon_terms/Carbon', 'http://avgl.net/carbon_terms/Leichtbau']
AND selected.corpus = 'carbon_terms'
WITH collect(DISTINCT w) AS websites
CALL apoc.index.nodes('node_auto_index','pref_label:(Fas OR Fas*)^10 OR pref_label_deco:(Fas OR Fas*)^3 OR alt_label:(Fa)^5') YIELD node, weight
WHERE node.corpus = 'carbon_terms' AND node:ontoterm
WITH websites, node, weight
MATCH (node)-[:spotted_in]->(w:WEBSITE)
WHERE w IN websites
RETURN node, weight
ORDER BY weight DESC
LIMIT 10
I don't see any occurrence of NodeUniqueIndexSeek in your plan, so the selected node isn't being looked up efficiently.
Make sure you have a unique constraint on :ontoterm(uri).
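If that constraint isn't in place yet, it can be created like this (syntax for the Neo4j 3.x era this question appears to target; newer versions use CREATE CONSTRAINT ... FOR/REQUIRE instead):
CREATE CONSTRAINT ON (o:ontoterm) ASSERT o.uri IS UNIQUE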
After the unique constraint is up, give this a try:
PROFILE CALL apoc.index.nodes('node_auto_index','pref_label:(Foo)')
YIELD node, weight
WHERE node.corpus = 'my_corpus' AND node:ontoterm
WITH node, weight
MATCH (selected:ontoterm)
WHERE selected.uri in ['http://uri1', 'http://uri2', 'http://uri3']
AND selected.corpus = 'my_corpus'
WITH node, weight, selected
MATCH (selected)-[:spotted_in]->(:WEBSITE)<-[:spotted_in]-(node)
RETURN DISTINCT node, weight
ORDER BY weight DESC LIMIT 10
Take a look at the query plan. You should see a NodeUniqueIndexSeek somewhere in there, and hopefully you should see a drop in db hits.

Neo4j performance with cycles

I have a relatively large Neo4j graph with 7 million vertices and 5 million relationships.
When I try to find the subtree size for one node, Neo4j gets stuck traversing 600,000 nodes, only 130 of which are unique.
It does it because of cycles.
Looks like it applies distinct only after it traverses the whole graph to maximum depth.
Is it possible to change this behaviour somehow?
The query is:
match (a1)-[o1*1..]->(a2) WHERE a1.id = '123' RETURN distinct a2
You can iteratively step through the subgraph a "layer" at a time while avoiding reprocessing the same node multiple times, by using the APOC procedure apoc.periodic.commit. That procedure iteratively processes a query until it returns 0.
Here is an example of this technique. It:
Uses a temporary TempNode node to keep track of a couple of important values between iterations, one of which will eventually contain the distinct ids of the nodes in the subgraph (except for the "root" node's id, since your question's query also leaves that out).
Assumes that all the nodes you care about share the same label, Foo, and that you have an index on Foo(id). This is for speeding up the MATCH operations, and is not strictly necessary.
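For reference, a minimal sketch of that index (using the legacy CREATE INDEX ON syntax from the Neo4j 3.x era; newer versions use CREATE INDEX ... FOR):
CREATE INDEX ON :Foo(id)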
Step 1: Create TempNode (using MERGE, to reuse existing node, if any)
WITH '123' AS rootId
MERGE (temp:TempNode)
SET temp.allIds = [rootId], temp.layerIds = [rootId];
Step 2: Perform iterations (to get all subgraph nodes)
CALL apoc.periodic.commit("
MATCH (temp:TempNode)
UNWIND temp.layerIds AS id
MATCH (n:Foo) WHERE n.id = id
OPTIONAL MATCH (n)-->(next)
WHERE NOT next.id IN temp.allIds
WITH temp, COLLECT(DISTINCT next.id) AS layerIds
SET temp.allIds = temp.allIds + layerIds, temp.layerIds = layerIds
RETURN SIZE(layerIds);
");
Step 3: Use subgraph ids
MATCH (temp:TempNode)
// ... use temp.allIds, which contains the distinct ids in the subgraph ...
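For example, a sketch of how the collected ids could be resolved back to nodes and the temporary node cleaned up afterwards (this assumes the same :Foo(id) index as above):
MATCH (temp:TempNode)
UNWIND temp.allIds AS id
MATCH (n:Foo) WHERE n.id = id
RETURN n;
// Optional cleanup once you are done with the ids
MATCH (temp:TempNode)
DELETE temp;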

Slicing neo4j Cypher results in chunks

I want to slice Cypher results in chunks of 100 rows, and be able to retrieve a specific chunk.
At the moment, the only way to ensure that rows are not mixed up is to use ORDER BY, which makes the query very inefficient (3 sec. for me is too much).
MATCH (p:Person) RETURN p.id ORDER BY p.id SKIP {chunk}*100 LIMIT 100
where {chunk} is an external parameter to identify a specific chunk.
Any suggestions?
PS: the property p.id is indexed.
You may try something like adding a label to Person nodes before extracting chunks and then using a query like:
MATCH (p:Chunk:Person) WITH p LIMIT 100
MATCH (p) REMOVE p:Chunk
RETURN *
If the p.id values are unique and dense (say, the value starts at 1 and increments, without any gaps), then this query will take advantage of the index on :Person(id) to efficiently get each hundred-Person chunk:
WITH (({chunk} - 1) * 100 + 1) AS startId
MATCH (p:Person)
WHERE p.id IN RANGE(startId, startId + 99)
RETURN p.id
ORDER BY p.id
Now, practically speaking, your id space will probably not remain dense, even if it started out that way. Person nodes will be deleted over time. In that case, the above query can return fewer than 100 rows. So, you can make your chunk size bigger than 100 and do some post-processing to get the 100 you need. In the worst case, you may need to make multiple requests to get the 100 you need, but each request will be fast. (Ideally, you would want to assign no-longer-unused id values to new Person nodes, to fill up gaps in the id space -- but this would require you to scan for the gaps.)
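As a sketch of that over-fetch idea (the 120-id window is arbitrary, and you would still post-process on the client and issue another request if a chunk comes back short):
WITH (({chunk} - 1) * 120 + 1) AS startId
MATCH (p:Person)
WHERE p.id IN RANGE(startId, startId + 119)
RETURN p.id
ORDER BY p.id
LIMIT 100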

Slow Neo4j query despite indices

Here I'm trying to find all Twitter users who are followed by and who follow any members of some group G:
MATCH (x:User)-[:FOLLOWS]->(t:User)-[:FOLLOWS]->(y:User)
WHERE (x.screen_name IN {{G_SCREEN_NAMES}} OR x.id IN {{G_IDS}})
AND (y.screen_name IN {{G_SCREEN_NAMES}} OR y.id IN {{G_IDS}})
RETURN t.id
But for the group G I sometimes have their screen names and sometimes have their ids, hence the OR clause above. Unfortunately this query is long-running and doesn't appear to ever return.
I have indices and constraints on both id and screen_name:
Indexes
ON :User(screen_name) ONLINE (for uniqueness constraint)
ON :User(id) ONLINE (for uniqueness constraint)
Constraints
ON (user:User) ASSERT user.screen_name IS UNIQUE
ON (user:User) ASSERT user.id IS UNIQUE
If I get rid of the OR clause (for instance if I happen to have all screen_names or all ids for group G) then the query runs quite fast.
I'm using neo4j-community-2.1.3 on a Mac. My graph has 286039 nodes, all of which have the User label.
Any ideas to improve this? Otherwise I'll have to chop this up into 4 queries to get all possible combinations of members. This is even more problematic because I really want to keep track of how commonly a user appears in a G-->user-->G relationship, and I'll need to do a lot of extra bookkeeping if the counts are spread among 4 different queries.
Update
I created an issue related to this: https://github.com/neo4j/neo4j/issues/2834
I ended up using
MATCH (x:User) WHERE x.screen_name IN ["apple","banana","coconut"]
WITH collect(id(x)) as x_ids
MATCH (x:User) WHERE x.id in [12345,98765]
WITH x_ids+collect(id(x)) as x_ids
MATCH (y:User) WHERE y.screen_name IN ["apple","banana","coconut"]
WITH x_ids,collect(id(y)) as y_ids
MATCH (y:User) WHERE y.id in [12345,98765]
WITH x_ids,y_ids+collect(id(y)) as y_ids
MATCH (x:User)-[:FOLLOWS]->(t:User)-[:FOLLOWS]->(y:User)
WHERE id(x) in x_ids AND id(y) in y_ids
RETURN count(*) as c, t.screen_name,t.id
ORDER BY c DESC
LIMIT 1000
But this basically represents a hack to get around a place where neo4j isn't using the indices that it could be.
I guess the query does not make use of indexes due to the OR condition; you can verify this by prefixing the query with PROFILE and running it in neo4j-shell.
If there's no sign of index usage, you might split the query into two parts. The first one fetches the combined list of user ids; instead of the OR we do a UNION of two queries (each using an index lookup):
MATCH (x:User) WHERE x.screen_name in {G_SCREEN_NAMES} RETURN id(x) as ids UNION
MATCH (x:User) WHERE x.id in {G_IDS} RETURN id(x) as ids
On the client side, use the list of node ids as parameter for the next query:
MATCH (x:User)-[:FOLLOWS]->(t)-[:FOLLOWS]->(y)
WHERE id(x) in {ids} AND id(y) in {ids}
RETURN t.id
I've intentionally removed the labels for t and y with the assumption that you can only follow User nodes and no other kind of node. This removes an unnecessary label check.
JnBrymn,
How about this query?
MATCH (x:User)
WHERE x.screen_name IN {{G_SCREEN_NAMES}} OR x.id IN {{G_IDS}}
WITH x
MATCH (x)-[:FOLLOWS]->(t:User)
WITH t
MATCH (t)-[:FOLLOWS]->(y:User)
WHERE y.screen_name IN {{G_SCREEN_NAMES}} OR y.id IN {{G_IDS}}
RETURN t.id
Grace and peace,
Jim

pl/sql: Functions

I have three column values in an Excel sheet:
A: # of unsuccessful transfers to CCR (CTI) = 11986
B: # of calls NOT wrapped = 8585
C: # of wrapped calls = 15283
The total of the three columns should be # of incoming calls (CTI) = 37017 (that is, # of wrapped calls + # of unsuccessful transfers to CCR (CTI) + # of calls NOT wrapped).
I also calculate # of unaccounted calls (this is # of incoming calls (CTI) minus # of wrapped calls, # of unsuccessful transfers to CCR (CTI) and # of calls NOT wrapped).
So my # of unaccounted calls = 1163.
Now I have to find out the percentage of unaccounted calls, so I divide 1163 by 37017.
So my percentage is 3%; ideally it should be 0%. How do I find out in Oracle, out of that 3%, what percentage falls in A, B or C?
A, B and C come from database queries; the source is the same, but with a bunch of different filters for each of the A, B and C queries.
Running the "minus" query below might allow you to spot a pattern in the rows that aren't picked up by any of A, B or C, though you'd still need to work out which of the three queries you would have expected each row (or pattern of rows) to have been picked up by, and why they were missed.
Since the sum of the counts from the three queries with additional filters is lower than the count from the query without those filters, you seem to have a gap in the filters themselves. If I had to guess, the first place I'd look is for incorrect handling of null values, trying to equate them (since null is neither equal nor not equal to anything, even itself). But that's clearly speculation, and without seeing the filters and knowing which columns can be null it isn't very helpful.
You can maybe isolate the 1163 rows that aren't showing up by using minus to find the rows picked up by the 'total' query and not included by any of those producing A, B and C; something like:
select *
from xx_new.xx_cti_call_details#appsread.prd.com
where dealer_name = 'XYG'
and TRUNC(CREATION_DATE) BETWEEN '01-JUL-2012' AND '31-JUL-2012'
minus
select *
from xx_new.xx_cti_call_details#appsread.prd.com
where dealer_name = 'XYG'
and TRUNC(CREATION_DATE) BETWEEN '01-JUL-2012' AND '31-JUL-2012'
and <additional filters for A>
minus
select *
from xx_new.xx_cti_call_details#appsread.prd.com
where dealer_name = 'XYG'
and TRUNC(CREATION_DATE) BETWEEN '01-JUL-2012' AND '31-JUL-2012'
and <additional filters for B>
minus
select *
from xx_new.xx_cti_call_details#appsread.prd.com
where dealer_name = 'XYG'
and TRUNC(CREATION_DATE) BETWEEN '01-JUL-2012' AND '31-JUL-2012'
and <additional filters for C>
I'm curious about you having a distinct in your initial query though, since it suggests you're counting the switches calls are made from rather than the calls themselves. It also might mean the counts should not add up - though in that case I'd perhaps expect A+B+C to be greater than the simple total, as there would be the potential for overlaps - and that select * might actually return more than 1163 rows, in which case you might only want to select the columns you think might be a problem.
Incidentally, if creation_date is indexed then you might get better performance with where creation_date >= date '2012-07-01' and creation_date < date '2012-08-01', as the trunc() function would prevent the index from being used. Might not be an issue for you though.
