Neo4j performance with cycles

I have a relatively large Neo4j graph with 7 million vertices and 5 million relationships.
When I try to find the subtree size for one node, Neo4j gets stuck traversing 600,000 nodes, only 130 of which are unique.
It does this because of cycles.
It looks like DISTINCT is applied only after the whole graph has been traversed to maximum depth.
Is it possible to change this behaviour somehow?
The query is:
match (a1)-[o1*1..]->(a2) WHERE a1.id = '123' RETURN distinct a2

You can iteratively step through the subgraph one "layer" at a time, while avoiding reprocessing the same node multiple times, by using the APOC procedure apoc.periodic.commit. That procedure repeatedly executes a query until it returns 0.
Here is an example of this technique. It:
Uses a temporary TempNode node to keep track of a couple of important values between iterations, one of which will eventually contain the distinct ids of the nodes in the subgraph (except for the "root" node's id, since your question's query also leaves that out).
Assumes that all the nodes you care about share the same label, Foo, and that you have an index on Foo(id). This speeds up the MATCH operations but is not strictly necessary.
Step 1: Create TempNode (using MERGE, to reuse existing node, if any)
WITH '123' AS rootId
MERGE (temp:TempNode)
SET temp.allIds = [rootId], temp.layerIds = [rootId];
Step 2: Perform iterations (to get all subgraph nodes)
CALL apoc.periodic.commit("
MATCH (temp:TempNode)
UNWIND temp.layerIds AS id
MATCH (n:Foo) WHERE n.id = id
OPTIONAL MATCH (n)-->(next)
WHERE NOT next.id IN temp.allIds
WITH temp, COLLECT(DISTINCT next.id) AS layerIds
SET temp.allIds = temp.allIds + layerIds, temp.layerIds = layerIds
RETURN SIZE(layerIds)
");
Step 3: Use subgraph ids
MATCH (temp:TempNode)
// temp.allIds contains the distinct ids of the subgraph nodes
RETURN SIZE(temp.allIds) AS subtreeSize, temp.allIds AS subtreeIds;
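
If you're driving this from application code, the three steps might be chained like this minimal sketch using the official Neo4j Python driver. The Bolt URI and credentials are placeholders, and the Foo label and rootId '123' are carried over from the example above:

from neo4j import GraphDatabase  # assumed: official Neo4j Python driver

STEP_1 = """
WITH $rootId AS rootId
MERGE (temp:TempNode)
SET temp.allIds = [rootId], temp.layerIds = [rootId]
"""

STEP_2 = """
CALL apoc.periodic.commit("
  MATCH (temp:TempNode)
  UNWIND temp.layerIds AS id
  MATCH (n:Foo) WHERE n.id = id
  OPTIONAL MATCH (n)-->(next)
  WHERE NOT next.id IN temp.allIds
  WITH temp, COLLECT(DISTINCT next.id) AS layerIds
  SET temp.allIds = temp.allIds + layerIds, temp.layerIds = layerIds
  RETURN SIZE(layerIds)
")
"""

STEP_3 = "MATCH (temp:TempNode) RETURN temp.allIds AS ids"

with GraphDatabase.driver("bolt://localhost:7687",
                          auth=("neo4j", "password")) as driver:
    with driver.session() as session:
        session.run(STEP_1, rootId="123")   # Step 1: seed TempNode
        session.run(STEP_2).consume()       # Step 2: iterate layer by layer
        ids = session.run(STEP_3).single()["ids"]  # Step 3: read the result
        print(f"subtree size: {len(ids)}")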

Cypher: slow query optimization

I am using RedisGraph with a custom implementation of ioredis.
The query takes 3 to 6 seconds on a database that has millions of nodes. It basically filters (b:brand) by different relationship counts, adding the following MATCH and WHERE clauses multiple times for different nodes.
(:brand) - 1mil nodes
(:w) - 20mil nodes
(:e) - 10mil nodes
// matching b before this codeblock
MATCH (b)-[:r1]->(p:p)<-[:r2]-(w:w)
WHERE w.deleted IS NULL
WITH count(DISTINCT w) as count, b
WHERE count >= 0 AND count <= 10
The full query would look like this.
MATCH (b:brand)
WHERE b.deleted IS NULL
MATCH (b)-[:r1]->(p:p)<-[:r2]-(w:w)
WHERE w.deleted IS NULL
WITH count(DISTINCT w) as count, b
WHERE count >= 0 AND count <= 10
MATCH (c)-[:r3]->(d:d)<-[:r4]-(e:e)
WHERE e.deleted IS NULL
WITH count(DISTINCT e) as count, b
WHERE count >= 0 AND count <= 10
WITH b ORDER by b.name asc
WITH count(b) as totalCount, collect({id: b.id})[$cursor..($cursor+$limit)] AS brands
RETURN brands, totalCount
How can I optimize this query as it's really slow?
A few thoughts:
Property lookups are expensive; is there a way you can get around all the .deleted checks?
If possible, can you avoid naming r1, r2, etc.? It's faster when it doesn't have to check the relationship type.
You're essentially traversing the entire graph several times. If the paths b-->p<--w and c-->d<--e don't overlap, you can include them both in the match statement, separated by a comma, and aggregate both counts at once
I don't know if it'll help much, but you don't need to name p and d since you never refer to them
This is a very small improvement, but I don't see a reason to check count >= 0
Also, I'm sure you have your reasons, but why does the c-->d<--e path matter? This would make more sense to me if it were b-->d<--e to mirror the first portion.
EDIT/UPDATE: A few things I said need clarification:
First bullet:
The fastest lookup is on a node label; up to 4 labels are essentially free. (Well, that's for anchor nodes; it's slower for downstream nodes.)
The second-fastest lookup is on an INDEXED property. My comment above assumed UNINDEXED lookups.
Second bullet: I think I was just wrong here. Relationships are stored as doubly-linked lists grouped by relationship type. Therefore, always specify relationship type for better performance. Similarly, always specify direction.
Third bullet: What I said is generally correct; HOWEVER, beware of Cartesian joins when you have two MATCH patterns separated by a comma. In general, you would only use that structure when they share a common element, as when you want directors, actors, and cinematographers all connected to a movie. Still, there is no overlap between these paths here.

How to find parent records efficiently based on both key and time intervals?

Definitions
A parent record has the type 'P', an ancestor key, and a date interval.
Its child record has the type 'C', an identical ancestor key, and a date interval that matches or falls within its parent's interval.
All records are unique
Parent records can share the same ancestor key, but their date intervals cannot overlap
A parent record can have many child records
Example
Parent records can be:
P, 12345, (1000-01-01, 1000-12-31)
P, 12345, (1001-01-01, 1001-12-31) // No overlapping dates, valid
Valid children for the first parent record can be
C, 12345, (1000-01-01, 1000-12-31) // Matches on everything, valid
C, 12345, (1000-05-05, 1000-09-09) // Matches on ancestor key, is within parent's date interval, valid
Problem
Given a randomized set of records with both parents and children, how can I efficiently categorize the set into different groups of one unique parent and its valid children based on both key and time intervals?
It is guaranteed that for every child there is one and only one parent, but it is possible that a parent does not have any children.
Brute force solution
Identify all of the parent records in linear time. Then loop through them and pairwise match all of the other records in quadratic time.
Question
Is there a faster approach?
The easiest way is to make a list of P records and a list of C records, and then sort both of them by (ancestor_key, interval.start).
Then you walk through the parents list and, for each parent, extract its children from the child list. Because of the sorting, the parent and child lists will be in corresponding order, so the position of interest in both lists only moves forward.
Total complexity is O(n log n), dominated by the sorting.
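
As a concrete illustration, here's a minimal Python sketch of this sort-and-walk approach, assuming each record is a (type, key, start, end) tuple; the names Record and group_children are mine:

from collections import namedtuple

# Assumed record shape: type is 'P' or 'C', key is the ancestor key,
# and (start, end) is the date interval.
Record = namedtuple("Record", ["type", "key", "start", "end"])

def group_children(records):
    # Sort parents and children by (ancestor key, interval start),
    # then walk both lists in step; the child index only moves forward.
    parents = sorted((r for r in records if r.type == "P"),
                     key=lambda r: (r.key, r.start))
    children = sorted((r for r in records if r.type == "C"),
                      key=lambda r: (r.key, r.start))
    groups = {p: [] for p in parents}
    i = 0
    for p in parents:
        # Every child has exactly one parent, so the next children in
        # sorted order that match p's key and fall inside p's interval
        # are exactly p's children.
        while (i < len(children)
               and children[i].key == p.key
               and p.start <= children[i].start
               and children[i].end <= p.end):
            groups[p].append(children[i])
            i += 1
    return groups

Sorting dominates at O(n log n); the final walk is linear because i never moves backward.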

Neo4j cypher query improvement (performance)

I have the following cypher query:
CALL apoc.index.nodes('node_auto_index','pref_label:(Foo)')
YIELD node, weight
WHERE node.corpus = 'my_corpus'
WITH node, weight
MATCH (selected:ontoterm{corpus:'my_corpus'})-[:spotted_in]->(:WEBSITE)<-[:spotted_in]-(node:ontoterm{corpus:'my_corpus'})
WHERE selected.uri = 'http://uri1'
OR selected.uri = 'http://uri2'
OR selected.uri = 'http://uri3'
RETURN DISTINCT node, weight
ORDER BY weight DESC LIMIT 10
The first part (up to the WITH) runs very fast (Lucene legacy index) and returns ~100 nodes. The uri property is also unique, so selected matches just 3 nodes.
I have ~300 WEBSITE nodes. The execution time is 48,749 ms.
Profile: (query plan screenshot not included)
How can I restructure the query to improve performance? And why are there ~13.8 million rows in the profile?
UPDATE: I think the problem was in the WITH clause, which expanded the results enormously. InverseFalcon's answer makes the query faster: 49 s -> 18 s (but still not fast enough). To avoid the enormous expansion I collected the websites first. The following query takes 60 ms:
MATCH (selected:ontoterm)-[:spotted_in]->(w:WEBSITE)
WHERE selected.uri in ['http://avgl.net/carbon_terms/Faser', 'http://avgl.net/carbon_terms/Carbon', 'http://avgl.net/carbon_terms/Leichtbau']
AND selected.corpus = 'carbon_terms'
with collect(distinct(w)) as websites
CALL apoc.index.nodes('node_auto_index','pref_label:(Fas OR Fas*)^10 OR pref_label_deco:(Fas OR Fas*)^3 OR alt_label:(Fa)^5') YIELD node, weight
WHERE node.corpus = 'carbon_terms' AND node:ontoterm
WITH websites, node, weight
match (node)-[:spotted_in]->(w:WEBSITE)
where w in websites
return node, weight
ORDER BY weight DESC
LIMIT 10
I don't see any occurrence of NodeUniqueIndexSeek in your plan, so the selected node isn't being looked up efficiently.
Make sure you have a unique constraint on :ontoterm(uri).
After the unique constraint is up, give this a try:
PROFILE CALL apoc.index.nodes('node_auto_index','pref_label:(Foo)')
YIELD node, weight
WHERE node.corpus = 'my_corpus' AND node:ontoterm
WITH node, weight
MATCH (selected:ontoterm)
WHERE selected.uri in ['http://uri1', 'http://uri2', 'http://uri3']
AND selected.corpus = 'my_corpus'
WITH node, weight, selected
MATCH (selected)-[:spotted_in]->(:WEBSITE)<-[:spotted_in]-(node)
RETURN DISTINCT node, weight
ORDER BY weight DESC LIMIT 10
Take a look at the query plan. You should see a NodeUniqueIndexSeek somewhere in there, and hopefully you should see a drop in db hits.

Efficient program to keep track of movies and search frequency?

In studying for my exam I came across this question.
A website streams movies to customers’ TVs or other devices. Movies are in one of several genres such as action, drama, mystery, etc. Every movie is in exactly one genre (so that if a movie is an action movie as well as a comedy, it is in a genre called “action-Comedy”). The site has around 10 million customers, and around 25,000 movies, but both are growing rapidly. The site wants to keep track of the most popular movies streamed. You have been hired as the lead engineer to develop a tracking program.
i) Every time a movie is streamed to a customer, its name (e.g. “Harold and Kumar: Escape from Guantanamo Bay”) and genre (“Comedy”) are sent to your program so it can update the data structures it maintains.
(Assume your program can get the current year with a call to an appropriate Java class, in O(1) time.)
ii) Also, every once in a while, customers want to know what were the top k most streamed movies in genre g in year y. (If y is the current year, then accounting is done up to the current date.) For example, what were the top 10 most streamed comedy movies in 2010? Here k = 10, g = "comedy", and y = 2010. This query is sent to your program, which should output the top k movie names.
Describe the data structures and algorithms used to implement both requirements. For (i), analyze the big O running time to update the data structures, and for (ii) the big O running time to output the top k streamed movies.
My thought process was to create a hash table, with every new movie added to its respective genre in the hash table in a linked list. As for the second part, my only idea is to keep the linked list sorted, but that seems way too expensive. What is a better alternative?
I use a heap to keep track of the top k objects of a class (k fixed). You can find the details of this data structure in any CS text, but basically it's a binary tree in which every node is smaller than either of its children. The main operation, which we will call reheap(node), assumes that both children of node are heaps; it compares node with the smaller of its two children, does the swap if necessary, and recursively calls reheap on the modified child. The class needs to have an overloaded operator< or the equivalent defined to do this.
At any point in time, the heap holds the top k objects, with the smallest of these at the top of the heap. When a new object arrives that is bigger than the top of the heap, it replaces that object on the heap, and then reheap is called. This can also happen at a node other than the top node if an object already on the heap becomes bigger than its smaller child. Another type of update occurs if an object already on the heap becomes smaller than its parent (this probably won't happen in the case you describe); here it gets swapped with its parent, and we then compare recursively against the grandparent, etc.
All of these updates have complexity O(log(k)). If you need to output the heap sorted from the top down, the same structure works well in time O(k log(k)). (This process is known as heapsort.)
Since swapping objects can be expensive, I usually keep the objects in a fixed array somewhere and implement the heap as an array, A, of pointers, where the children of A[i] are A[2i+1] and A[2i+2].
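
To make that concrete, here's a minimal Python sketch of the fixed-size-k min-heap using the standard heapq module (the function names are mine; it covers the new-object case, while in-place updates of objects already on the heap would need the reheap logic described above):

import heapq

def offer(heap, k, count, title):
    # Keep only the k largest (count, title) pairs. The smallest kept
    # pair sits at heap[0], so the admission test is O(1) and each
    # push/replace is O(log k).
    if len(heap) < k:
        heapq.heappush(heap, (count, title))
    elif (count, title) > heap[0]:
        heapq.heapreplace(heap, (count, title))

def top_down(heap):
    # Output largest-first in O(k log k) -- the heapsort step.
    return sorted(heap, reverse=True)

heap = []
for count, title in [(5, "a"), (9, "b"), (2, "c"), (7, "d")]:
    offer(heap, 3, count, title)
print(top_down(heap))  # [(9, 'b'), (7, 'd'), (5, 'a')]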
You could do this in O(1) using one hash table "HT1" to map from (genre, year, movie_title) to an iterator into a linked list of (num_times_streamed, hash table of movie titles) elements. Use the iterator to see if the next element in the list is for a streaming count one greater: if so, insert your movie title there and remove it from the current element's hash table (and if that leaves the hash table empty, remove its element from the list); otherwise, if the existing hash table holds no other titles, just increment num_times_streamed in place; otherwise, insert a new element into the list and add your title to it. Update the iterator recorded in HT1 as necessary.
Note that, as described above, the list operations work from the endpoints or an existing iterator and step through by no more than one position as the num_times_streamed value is incremented, so each update is O(1).
To get the top k titles you'll need a second hash table, HT2, from (genre, year) to each of the linked lists: simply iterate from the end of the list and you'll encounter a hash table with the movie or movies with the highest streaming count, then the next highest, and so on. If the year has just changed, you may not find k entries; handle that however you like. If, when looking up a movie title, it's found not to exist in HT1, you'd add a new list for that genre and the current year to HT2.
More visually, using { } around hash tables (whether mappings or sets), [ ] around linked lists, and ( ) around grouped struct/tuple data:
HT2 = { "comedy 2015": [ (1, { "title1", "title2" }),
(2, { "title3" }), <--------\
(4, { "title4" }) ], |
"drama 2012": [ (1, { "title5" }), |
(3, { "title6" }) ], |
... | .
}; | .
| .
HT1 = { "title3", -----------------------------------/ |
"title2", ---------------------------------------/
...
};
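
Here's a minimal, hypothetical Python sketch of the same scheme; the class and method names are mine, and sentinel buckets stand in for the list endpoints:

class _Bucket:
    # One list element: a streaming count plus the set of titles
    # streamed exactly that many times.
    __slots__ = ("count", "titles", "prev", "next")
    def __init__(self, count):
        self.count = count
        self.titles = set()
        self.prev = self.next = None

class TopStreamedTracker:
    # ht1 maps (genre, year, title) -> the bucket holding that title.
    # ht2 maps (genre, year) -> (head, tail) sentinels of a linked list
    # of buckets kept in ascending count order.
    def __init__(self):
        self.ht1 = {}
        self.ht2 = {}

    def _list_for(self, genre, year):
        ends = self.ht2.get((genre, year))
        if ends is None:
            head, tail = _Bucket(0), _Bucket(float("inf"))  # sentinels
            head.next, tail.prev = tail, head
            ends = self.ht2[(genre, year)] = (head, tail)
        return ends

    def stream(self, genre, year, title):
        # O(1): a title only ever hops to the neighbouring bucket.
        head, _tail = self._list_for(genre, year)
        key = (genre, year, title)
        cur = self.ht1.get(key)
        if cur is None:
            cur, target = head, 1          # first stream this year
        else:
            cur.titles.discard(title)
            target = cur.count + 1
        if cur.next.count == target:       # next bucket already has the count
            bucket = cur.next
        else:                              # otherwise splice in a new bucket
            bucket = _Bucket(target)
            bucket.prev, bucket.next = cur, cur.next
            cur.next.prev = bucket
            cur.next = bucket
        bucket.titles.add(title)
        if cur is not head and not cur.titles:   # drop an emptied bucket
            cur.prev.next, cur.next.prev = cur.next, cur.prev
        self.ht1[key] = bucket

    def top_k(self, genre, year, k):
        # Walk backwards from the tail: highest counts first.
        titles = []
        _head, tail = self._list_for(genre, year)
        node = tail.prev
        while node.count and len(titles) < k:    # head sentinel has count 0
            titles.extend(list(node.titles)[: k - len(titles)])
            node = node.prev
        return titles

Each stream() call touches at most the neighbouring bucket, so updates stay O(1); top_k() costs O(k) per query.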

Neo4j - adding extra relationships between nodes in a list

I have a list of nodes representing a history of events for users, forming the following pattern:
()-[:s]->()-[:s]->() and so on
Each of the nodes of the list belongs to a user (is connected via a relationship).
I'm trying to create individual user histories (add a :succeeds_for_user relationship between all events that happened for a particular user, such that each event has only one consecutive event).
I was trying to do something like this to extract nodes that should be in a relationship:
start u = node:class(_class = "User")
match p = shortestPath(n-[:s*..]->m), n-[:belongs_to]-u-[:belongs_to]-m
where n <> m
with n, MIN(length(p)) as l
match p = n-[:s*1..]->m
where length(p) = l
return n._id, m._id, extract(x IN nodes(p): x._id)
but it is painfully slow.
Does anyone know a better way to do it?
Neo4j is calculating a lot of shortest paths there.
Assuming that you have a history start node (which, for the purposes of my query, has id x), you can get an ordered list of event nodes with the corresponding user id like this:
"START n=node(x) # history start
MATCH p = n-[:FOLLOWS*1..]->(m)<-[:DID]-u # match from start up to user nodes
return u._id,
reduce(id=0,
n in filter(n in nodes(p): n._class != 'User'): n._id)
# get the id of the last node in the path that is not a User
order by length(p) # ordered by path length, thus place in history"
You can then iterate the result in your program and add relationships between nodes belonging to the same user. I don't have a fitting big dataset, but it might be faster.
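
For example, here's a hypothetical sketch of that client-side pass using the modern official Neo4j Python driver (the URI, credentials, the _id property, and the relationship direction are all assumptions):

from neo4j import GraphDatabase  # assumed: official Neo4j Python driver

# `histories` stands for the result of the query above, already grouped
# per user into an ordered list of event node ids.
histories = {
    "user1": [11, 12, 15],
    "user2": [13, 14],
}

# An index on _id would speed up these lookups.
LINK = """
MATCH (a {_id: $earlier}), (b {_id: $later})
CREATE (b)-[:succeeds_for_user]->(a)
"""

with GraphDatabase.driver("bolt://localhost:7687",
                          auth=("neo4j", "password")) as driver:
    with driver.session() as session:
        for user, events in histories.items():
            # Link each event to its successor in this user's history.
            for earlier, later in zip(events, events[1:]):
                session.run(LINK, earlier=earlier, later=later)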
