Cypher performance on matching a million users

I am using RedisGraph and the query is simple. How can I make getting a list of country codes like this faster?
> GRAPH.profile g "MATCH (u:user) return collect(distinct u.countryCode) as codes"
1) "Results | Records produced: 1, Execution time: 0.001353 ms"
2) " Aggregate | Records produced: 1, Execution time: 238.989679 ms"
3) " Node By Label Scan | (u:user) | Records produced: 833935, Execution time: 81.158457 ms"

Here's what your query is doing:
Finding every user node in the graph. If you have these nodes indexed, then it'll be faster, but index lookups are always slower than graph traversals. You're doing 833,935 index lookups in 81ms.
Looking up the countryCode property on each node. Property retrieval also takes time, but the bulk of the time here is spent dropping duplicate records. There are only 180 or so countries, so about 833k of those user nodes didn't contribute anything to your end result. This took 239 ms.
Returning results: super fast.
I don't see a great way to speed this up, with the graph designed as is. Make sure user nodes and countryCode are indexed though. You could consider splitting out Country as its own node type, and then you can just match (c:Country). However, you run the risk of creating dense nodes because the USA, for example, probably has more users than Albania.
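If you can alter the graph, a rough sketch of that Country-node model might look like the following (the :Country label, its code property, and the :LIVES_IN relationship type are names I'm assuming here, and the one-off migration assumes your RedisGraph version supports MERGE after MATCH):
MATCH (u:user)
MERGE (c:Country {code: u.countryCode})
MERGE (u)-[:LIVES_IN]->(c)
The original question then becomes a scan over roughly 180 nodes instead of roughly 834k:
GRAPH.QUERY g "MATCH (c:Country) RETURN collect(c.code) AS codes"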
If you're going to need a list of country codes often and you can't alter the graph, then you could look at trickier things, like adding a :FirstInCountry label to :user nodes, or reserving a known node ID range (say, 10000-10180) for a set of users with unique country codes.
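As a rough sketch of the label trick, assuming your RedisGraph version supports adding labels with SET (the :FirstInCountry label is the hypothetical one mentioned above), you could tag one representative user per country code once:
MATCH (u:user)
WITH u.countryCode AS code, collect(u)[0] AS rep
SET rep:FirstInCountry
and then read the codes from those few labelled nodes:
MATCH (f:FirstInCountry)
RETURN collect(f.countryCode) AS codes
Keep in mind those labels would have to be maintained as users are added or removed.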
Edit: I said the wrong thing originally. The initial :user lookup is based on the label store, so an index there is irrelevant.

Related

Neo4j optimization: Query for all graphs from selected to selected nodes

I am not very experienced with Neo4j and need to search for all paths from a selection A of nodes to a selection B of nodes.
There are around 600 nodes in the database, with a few relationships per node.
Node properties:
riskId
de_DE_description
en_GB_description
en_US_description
impact
Selection:
Selection A is determined by a property match (property: 'riskId')
Selection B is a known constant list of nodes (label: 'Core')
The following query returns the result I want, but it seems a bit slow to me:
match p=(node)-[*]->(:Core)
where node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN extract (n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description] )
as `risks`, length(p)
This query results in 7 rows with between 1 and 4 nodes per row, so not much.
I get around 270ms or more response time in my local environment.
I have not created any indices or done any other performance attempts.
Any hints on how I can craft the query in a more intelligent way, or apply any performance-tuning tricks?
Thank you very much,
Manuel
If there is not yet a single label that is shared by all the nodes that have the riskId property, you should add such a label (say, :Risk) to all those nodes. For example:
MATCH (n)
WHERE EXISTS(n.riskId)
SET n:Risk;
A node can have multiple labels. This alone can make your query faster, as long as you specify that node label in your query, since it would restrict scanning to only Risk nodes instead of all nodes.
However, you can do much better by first creating an index, like this:
CREATE INDEX ON :Risk(riskId);
After that, this slightly altered version of your query should be much faster, as it would use the index to quickly get the desired Risk nodes instead of scanning:
MATCH p=(node:Risk)-[*]->(:Core)
WHERE node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN
EXTRACT(n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description]) AS risks,
LENGTH(p);
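To verify that the index is actually used, you can prefix the query with PROFILE and look for a NodeIndexSeek (rather than a NodeByLabelScan) at the start of the plan, for example:
PROFILE
MATCH p=(node:Risk)-[*]->(:Core)
WHERE node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN length(p);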

Cypher recommendation query performance

I am working with rNeo4j for a recommendation application and I am having some issues writing an efficient query. The goal of the query is to recommend an item to a user, with the stipulation that they have not used the item before.
I want to return the item's name, the nodes on the path (for a visualization of the recommendation), and some additional measures to be able to make the recommendation as relevant as possible. Currently I'm returning the number of users that have used the item before, the length of the path to the recommendation, and a sum of the qCount relationship property.
Current query:
MATCH (subject:User {id: {idQ}}), (rec:Item),
p = shortestPath((subject)-[*]-(rec))
WHERE NOT (subject)-[:ACCESSED]->(rec)
MATCH (users:User)-[:ACCESSED]->(rec)
RETURN rec.Name as Item,
count(users) as popularity,
length(p) as pathLength,
reduce(weight = 0, q IN relationships(p)| weight + toInt(q.qCount)) as Strength,
nodes(p) as path
ORDER BY pathLength, Strength DESCENDING, popularity DESCENDING
LIMIT {resultLimit}
The query appears to be working correctly, but it takes too long for the desired application (around 8 seconds). Does anyone have some suggestions for how to improve my query's performance?
I am new to cypher so I apologize if it is something obvious to a more advanced user.
One thing to consider is specifying an upper bound on the variable-length path pattern, like this: p = shortestPath((subject)-[*2..5]->(rec)). This limits the number of relationships in the pattern to a maximum of 5. Without setting a maximum, performance can be poor, as paths of all lengths are considered.
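Applied to the query from the question, an upper bound on its own might look like this (keeping the original undirected pattern; the cap of 5 is only an illustrative value to tune for your data):
MATCH (subject:User {id: {idQ}}), (rec:Item),
p = shortestPath((subject)-[*..5]-(rec))
WHERE NOT (subject)-[:ACCESSED]->(rec)
MATCH (users:User)-[:ACCESSED]->(rec)
RETURN rec.Name AS Item,
count(users) AS popularity,
length(p) AS pathLength,
reduce(weight = 0, q IN relationships(p) | weight + toInt(q.qCount)) AS Strength,
nodes(p) AS path
ORDER BY pathLength, Strength DESCENDING, popularity DESCENDING
LIMIT {resultLimit}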
Another thing to consider: by summing the relationship property qCount across all relationships in the path and then sorting by this sum, you are looking for the shortest weighted path. Neo4j includes some graph algorithms (such as Dijkstra) for finding these paths efficiently; however, they are not exposed via Cypher. See the Neo4j documentation on graph algorithms for more info.

Neo4j: Fast query for getting relationships between a set of nodes

I'm looking for a fast Cypher statement that returns all relationships between a known set of nodes (I have their Neo4j ID's), so that I can assemble the subgraph for that particular set of nodes. I'm working within a label called label which has around 50K nodes and 800K edges between these nodes.
I have several working approaches for this, but none are fast enough for my application, even at small set sizes (less than 1000 nodes).
For example, the following statement does the trick:
MATCH (u:label)-[r]->(v:label)
WHERE (ID(u) IN {ids}) AND (ID(v) IN {ids})
RETURN collect(r)
Where {ids} is a list of numeric Neo4j ids given as parameter to the Py2Neo cypher.execute(statement, parameters) method. The problem is that it takes around 34 seconds for a set of 838 nodes, which returns all 19K relationships between them. I realize the graph is kind of dense, but it takes 1.76 seconds for every 1000 edges returned. I just don't think that's acceptable.
If I use the START clause instead (shown below), the time is actually a little worse.
START u=node({ids}), v=node({ids})
MATCH (u:label)-[r]->(v:label)
RETURN collect(r)
I've found many similar questions/answers, however they all fall short in some aspect. Is there a better statement for doing this, or even a better graph schema, so that it can scale to sets of thousands of nodes?
UPDATE
Thanks for the fast replies. First, running my current query with 528 nodes as input (len(ids) = 528) takes 32.1 seconds; the query plan is below.
NodeByIdSeek: 528 hits
Filter : 528 hits
Expand(All) : 73,773 hits
Filter : 73,245 hits
Projection : 0 hits
Filter : 0 hits
Brian Underwood's query, with the same input, takes 27.8 seconds. The query plan is identical, except for the last 2 steps (Projection and Filter), which don't exist for his query. However, the sum of db hits is the same.
Michael Hunger's query takes 26.9 seconds and the query plan is identical to Brian's query.
I've restarted the server between experiments to avoid cache effects (there's probably a smarter way to do it). I'm also querying straight from the web interface to bypass possible bottlenecks in my code and the libs I'm using.
Bottom line: Neo4j seems smart enough to optimize my query, but it's still pretty slow even with fairly small sets. Any suggestions?
I think the problem is that the query is doing a Cartesian product to get all combinations of the 838 nodes, so you end up searching 838*838 = 702,244 combinations.
I'm curious how this would perform:
MATCH (u:label)-[r]->(v:label)
WHERE (ID(u) IN {ids})
WITH r, v
WHERE (ID(v) IN {ids})
RETURN collect(r)
Also, why do the collect at the end?
How big are your id-lists?
Try this:
MATCH (u) WHERE ID(u) IN {ids}
WITH u
MATCH (u)-[r]->(v)
WHERE ID(v) IN {ids}
RETURN count(*)
MATCH (u) WHERE (ID(u) IN {ids})
WITH u
MATCH (u)-[r]->(v)
WHERE ID(v) IN {ids}
RETURN r
Also, try creating a query plan by prefixing your query with PROFILE; then you can see where the cost is.

An effective way to lookup duplicate nodes in Neo4j 1.8?

I'm trying to programmatically locate all duplicate nodes in a Neo4j 1.8 database.
The nodes that need examination all have a (non-indexed) property externalId, for which I want to find duplicates. This is the Cypher query I've got:
START n=node(*), dup=node(*) WHERE
HAS(n.externalId) AND HAS(dup.externalId) AND
n.externalId=dup.externalId AND
ID(n) < ID(dup)
RETURN dup
There are fewer than 10K nodes in the data and fewer than 1K nodes with an externalId.
The query above is working but seems to perform badly. Is there a less memory consuming way to do this?
Try this query:
START n=node(*)
WHERE HAS(n.externalId)
WITH n.externalId AS extId, COLLECT(n) AS cn
WHERE LENGTH(cn) > 1
RETURN extId, cn;
It avoids taking the Cartesian product of your nodes. It finds the distinct externalId values, collects all the nodes with the same id, and then filters out the non-duplicated ids. Each row in the result will contain an externalId and a collection of the duplicate nodes with that id.
The START clause does a full graph scan, then assembles a Cartesian product of the entire set of nodes (10k * 10k = 100M pairs to start from), and then narrows that very large list down based on the criteria in the WHERE clause. (Maybe there are Cypher optimizations here? I'm not sure.)
I think adding an index on externalId would be a clear win and may provide enough of a performance gain for now, but you could also look at finding duplicates in a different way, perhaps something like this:
START n=node(*)
WHERE HAS(n.externalId)
WITH n
ORDER BY ID(n) ASC
WITH count(*) AS occurrences, n.externalId AS externalId, collect(ID(n)) AS ids
WHERE occurrences > 1
RETURN externalId, TAIL(ids)

Efficient point-in-time query of group membership

We have a scenario like this:
Millions of records (Record 1, Record 2, Record 3...)
Partitioned into millions of small non-intersecting groups (Group A, Group B, Group C...)
Membership gradually changes over time, i.e. a record may be reassigned to another group.
We are redesigning the data schema, and one use case we need to support is given a particular record, find all other records that belonged to the same group at a given point in time. Alternatively, this can be thought of as two separate queries, e.g.:
To which group did Record 15544 belong, three years ago? (Call this Group g).
What records belonged to Group g, three years ago?
Supposing we use a relational database, the association between records and groups is easily modelled using a two-column table of record id and group id. A common approach for allowing historical queries is to add a timestamp column. This allows us to answer the question above as follows:
Find the row for Record 15544 with the most recent timestamp prior to the given date. This tells us Group g.
Find all records that have at any time belonged to Group g.
For each of these records, find the row with the most recent timestamp prior to the given date. If this indicates that the record was in Group g at that time, then add it to the result set.
This is not too bad (assuming the table is separately indexed by both record id and group id), and may even be the optimal algorithm for the naive table structure just described, but it does cost an index lookup for every record found in step 2. Is there an alternative data structure that would answer the query more efficiently?
ETA: This is only one of several use cases for the system, so we don't want to speed up this query at the expense of making queries about current groupings slower, nor do we want to pay a huge price in space consumption, etc.
How about creating two tables:
(recordID, time -> groupID) - keyed by (recordID, time), sorted by recordID and secondarily by time (call this map1)
(groupID, time -> List of recordIDs) - keyed by (groupID, time), sorted by groupID and secondarily by time (call this map2)
At each record change:
Retrieve the current groupID of the record you are changing
set t <- current time
Add a new entry to map2 for the old group: (oldGroupID, t, list') - where list' is the old group's current list without the record you just moved out.
Add a new entry to map2 for the new group: (newGroupID, t, list'') - where list'' is the new group's current list with the changed record added to it.
Add a new entry (recordID, t, newGroupID) to map1.
During query:
Find the entry in map1 whose key is closest to, but not greater than, (recordID, desired_time) - this is a classic O(log N) operation in a sorted data structure.
This gives you the group g the record belonged to at the desired time.
Now look in map2 similarly for the entry whose key is closest to, but not greater than, (g, desired_time). Its value is the list of all records that were in the group at the desired time.
This requires quite a bit more space (by a constant factor, though), but every operation is O(log N), where N is the number of record changes.
An efficient sorted data structure for entries that are mostly stored on disk is a B+ tree, which many relational database implementations also use.
