I'm trying to programmatically locate all duplicate nodes in a Neo4j 1.8 database.
The nodes that need examination all have a (non-indexed) property externalId, for which I want to find duplicates. This is the Cypher query I've got:
START n=node(*), dup=node(*) WHERE
HAS(n.externalId) AND HAS(dup.externalId) AND
n.externalId=dup.externalId AND
ID(n) < ID(dup)
RETURN dup
There are fewer than 10K nodes in the data and fewer than 1K nodes with an externalId.
The query above works, but it seems to perform badly. Is there a less memory-consuming way to do this?
Try this query:
START n=node(*)
WHERE HAS(n.externalId)
WITH n.externalId AS extId, COLLECT(n) AS cn
WHERE LENGTH(cn) > 1
RETURN extId, cn;
It avoids taking the Cartesian product of your nodes. It finds the distinct externalId values, collects all the nodes with the same id, and then filters out the non-duplicated ids. Each row in the result will contain an externalId and a collection of the duplicate nodes with that id.
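If the end goal is to clean the data up, the same grouping can drive a deletion pass. A minimal sketch, assuming the duplicate nodes have no relationships (in Neo4j 1.8, DELETE fails on a node that still has relationships):

```
START n=node(*)
WHERE HAS(n.externalId)
WITH n.externalId AS extId, COLLECT(n) AS cn
WHERE LENGTH(cn) > 1
FOREACH (d IN TAIL(cn) : DELETE d)
```

TAIL skips the first node of each collection, so one node per externalId survives.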
The START clause performs a full graph scan, then assembles a Cartesian product of the entire node set (10K * 10K = 100M pairs to start from), and then narrows that very large list down based on the criteria in the WHERE clause. (Maybe there are Cypher optimizations here? I'm not sure.)
I think adding an index on externalId would be a clear win and may provide enough of a performance gain for now, but you could also look at finding duplicates in a different way, perhaps something like this:
START n=node(*)
WHERE HAS(n.externalId)
WITH n
ORDER BY ID(n) ASC
WITH count(*) AS occurrences, n.externalId AS externalId, collect(ID(n)) AS ids
WHERE occurrences > 1
RETURN externalId, TAIL(ids)
Related
I am using RedisGraph and the query is simple. How do I make it faster at getting a list of countries like this?
> GRAPH.profile g "MATCH (u:user) return collect(distinct u.countryCode) as codes"
1) "Results | Records produced: 1, Execution time: 0.001353 ms"
2) " Aggregate | Records produced: 1, Execution time: 238.989679 ms"
3) " Node By Label Scan | (u:user) | Records produced: 833935, Execution time: 81.158457 ms"
Here's what your query is doing:
Finding every user node in the graph. If you have these nodes indexed, then it'll be faster, but index lookups are always slower than graph traversals. You're doing 833,935 index lookups in 81ms.
Looking up every country code on each node. Property retrieval also takes time, but the bulk of the time here is dropping duplicate records. There are only 180 or so countries, so about 833k of those user nodes didn't contribute to your end result. This took 239ms.
Returning results: super fast.
I don't see a great way to speed this up, with the graph designed as is. Make sure user nodes and countryCode are indexed though. You could consider splitting out Country as its own node type, and then you can just match (c:Country). However, you run the risk of creating dense nodes because the USA, for example, probably has more users than Albania.
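If you do split countries out, a one-off refactor could look like the following sketch (the :Country label and :FROM relationship type are assumptions, and this requires a RedisGraph version with MERGE support):

```
# One-off refactor (sketch): one Country node per distinct code
GRAPH.QUERY g "MATCH (u:user) MERGE (c:Country {code: u.countryCode}) MERGE (u)-[:FROM]->(c)"
# Afterwards, listing codes only has to touch the ~180 Country nodes
GRAPH.QUERY g "MATCH (c:Country) RETURN collect(c.code) AS codes"
```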
If you're going to need a list of country codes often and you can't alter the graph, then you could look at trickier things, like adding a :FirstInCountry label to some :user nodes, or assigning node ids like 10000 - 10180 to a known per-country set of user nodes.
Edit: I said the wrong thing originally. The initial :user lookup is based on the label store, so an index there is irrelevant.
I am not very experienced with Neo4j and need to search for all paths from a selection A of nodes to a selection B of nodes.
There are around 600 nodes in the db, with a few relationships per node.
Node properties:
riskId
de_DE_description
en_GB_description
en_US_description
impact
Selection:
Selection A is determined by a property match (property: 'riskId')
Selection B is a known constant list of nodes (label: 'Core')
The following query returns the result I want, but it seems a bit slow to me:
match p=(node)-[*]->(:Core)
where node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN extract(n IN nodes(p) | [n.riskId, n.impact, n.en_GB_description]) as `risks`, length(p)
This query results in 7 rows with between 1 and 4 nodes per row, so not much.
I get around 270ms or more response time in my local environment.
I have not created any indices or done any other performance attempts.
Any hints on how I can craft the query more intelligently, or any performance-tuning tricks I could apply?
Thank you very much,
Manuel
If there is not yet a single label that is shared by all the nodes that have the riskId property, you should add such a label (say, :Risk) to all those nodes. For example:
MATCH (n)
WHERE EXISTS(n.riskId)
SET n:Risk;
A node can have multiple labels. This alone can make your query faster, as long as you specify that node label in your query, since it would restrict scanning to only Risk nodes instead of all nodes.
However, you can do much better by first creating an index, like this:
CREATE INDEX ON :Risk(riskId);
After that, this slightly altered version of your query should be much faster, as it would use the index to quickly get the desired Risk nodes instead of scanning:
MATCH p=(node:Risk)-[*]->(:Core)
WHERE node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN
EXTRACT(n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description]) AS risks,
LENGTH(p);
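Since the question notes that the matched paths contain at most 4 nodes, you could also consider bounding the variable-length pattern so the traversal stops early. A sketch, where the upper bound of 5 is an assumption to tune against your data:

```
MATCH p=(node:Risk)-[*..5]->(:Core)
WHERE node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN
EXTRACT(n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description]) AS risks,
LENGTH(p);
```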
I am working with rNeo4j for a recommendation application and I am having some issues writing an efficient query. The goal of the query is to recommend an item to a user, with the stipulation that they have not used the item before.
I want to return the item's name, the nodes on the path (for a visualization of the recommendation), and some additional measures to be able to make the recommendation as relevant as possible. Currently I'm returning the number of users that have used the item before, the length of the path to the recommendation, and a sum of the qCount relationship property.
Current query:
MATCH (subject:User {id: {idQ}}), (rec:Item),
p = shortestPath((subject)-[*]-(rec))
WHERE NOT (subject)-[:ACCESSED]->(rec)
MATCH (users:User)-[:ACCESSED]->(rec)
RETURN rec.Name as Item,
count(users) as popularity,
length(p) as pathLength,
reduce(weight = 0, q IN relationships(p)| weight + toInt(q.qCount)) as Strength,
nodes(p) as path
ORDER BY pathLength, Strength DESCENDING, popularity DESCENDING
LIMIT {resultLimit}
The query appears to be working correctly, but it takes too long for the desired application (around 8 seconds). Does anyone have some suggestions for how to improve my query's performance?
I am new to cypher so I apologize if it is something obvious to a more advanced user.
One thing to consider is specifying an upper bound on the variable-length path pattern, like this: p = shortestPath((subject)-[*2..5]-(rec)). This limits the number of relationships in the pattern to a maximum of 5. Without a maximum, performance can be poor, as paths of all lengths are considered.
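Applied to the query from the question, the bounded pattern would look like this sketch (the upper bound of 5 is an assumption to tune against your data):

```
MATCH (subject:User {id: {idQ}}), (rec:Item),
p = shortestPath((subject)-[*2..5]-(rec))
WHERE NOT (subject)-[:ACCESSED]->(rec)
// ... the rest of the query is unchanged
```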
Another thing to consider: by summing the relationship property qCount across all nodes in the path and then sorting by this sum you are looking for the shortest weighted path. Neo4j includes some graph algorithms (such as Dijkstra) for finding these paths efficiently, however they are not exposed via Cypher. See this page for more info.
Is it possible to implement reliable paging of elasticsearch search results if multiple documents have equal scores?
I'm experimenting with custom scoring in elasticsearch. Many of the scoring expressions I try yield result sets where many documents have equal scores. They seem to come in the same order each time I try, but can it be guaranteed?
As far as I understand, it can't be guaranteed, especially not if there is more than one shard in the cluster. Documents with equal scores for a given Elasticsearch query are returned in a non-deterministic order that can change between invocations of the same query, even if the underlying data does not change (and therefore paging is unreliable), unless one of the following holds:
I use function_score to guarantee that the score is unique for each document (e.g. by using a unique number field).
I use sort and guarantee that the sorting defines a total order (e.g. by using a unique field as fallback if everything else is equal).
Can anyone confirm (and maybe point at some reference)?
Does this change if I know that there is only one primary shard without any replicas (see this other, similar question: Inconsistent ordering of results across primary/replica for documents with equivalent score)? E.g., if I guarantee that there is one shard AND there is no change in the database between two invocations of the same query, will that query return results in the same order?
What are other alternatives (if any)?
I ended up using additional sort in cases where equal scores are likely to happen - for example searching by product category. This additional sort could be id, creation date or similar. The setup is 2 servers, 3 shards and 1 replica.
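For reference, such a tie-breaking sort might look like the following sketch in the request body, assuming each document has a unique id field (the field name is an assumption):

```json
{
  "sort": [
    { "_score": { "order": "desc" } },
    { "id": { "order": "asc" } }
  ]
}
```

With _score first, relevance ordering is preserved and the unique field only decides ties, which is what makes paging stable.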
PART-I:
I have a Lucene index on property a1 of node n, and I have a Cypher query with
ORDER BY n.a1 DESC
Will it take advantage of the lucene index while sorting the results?
PART-II:
Let's assume I have similar indexes on a1, a2, a3...aN (individually), and I have a Cypher query with
ORDER BY n.a1, n.a2 DESC, n.a3... n.aN DESC
Will it take advantage of the indexes, or do I have to define some kind of multi-field index separately for this particular combination of fields and asc/desc?
Part I.
No. From the Java API you can add Lucene sort query objects.
Part II.
No, see above.
The sorting happens without using any indexes; it is applied to just the results produced by your query.
The index is only used to look up nodes for starting points.
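To illustrate the Java API route mentioned in Part I, here is a sketch using the legacy index's QueryContext, which lets Lucene perform the sort itself (the index name "nodes" and an existing GraphDatabaseService graphDb are assumptions):

```java
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.index.lucene.QueryContext;

// Query the legacy Lucene index and have Lucene sort the hits by a1, then a2.
Index<Node> index = graphDb.index().forNodes("nodes");
IndexHits<Node> hits = index.query(new QueryContext("a1:*").sort("a1", "a2"));
for (Node node : hits) {
    // nodes arrive already sorted; no in-memory ORDER BY needed
}
```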