Specifying a starting point for a graph traversal algorithm in Neo4j

I'm trying to write an algorithm which will propagate values from a starting node to the entire connected component. Basically, if A receives 5 requests, and A sends 5 requests to B for each request A receives, B will receive 25 requests.
So basically, I'm trying to go from this [image: graph before propagation] to this [image: graph after propagation].
I've written the following snippet in neo4j:
MATCH (a:Loc)-[r:ROAD]->(b:Loc)
SET b.volume = b.volume + a.volume * r.cost
RETURN a,r,b
What I don't know is how to specify a starting point for the propagation. Neo4j appears to update the values correctly in this case, but I don't think this will work for a larger graph. I want to explicitly make the algorithm start propagating values from the START node.
Thanks.

I'm sure there will be a better answer, and this approach has some limitations since some assumptions are made about the graph, but this works for your example.
Note that I added an id property to the :Loc nodes, but I only used it to select the start (and for printing the node id at the end).
MATCH p=(n:Loc)<-[:ROAD*]-(:Loc {id: 0})
WITH DISTINCT n, max(length(p)) as maxLp
ORDER BY maxLp // order the nodes by their maximum distance from start
MATCH (n)<-[r:ROAD]-(p:Loc)
SET n.volume = n.volume + r.cost * p.volume
RETURN DISTINCT n.id, n.volume
And here's the result:
n.id n.volume
1 4000
2 200000
3 200000
4 16400000
5 508000000
6 21632000000
The idea here is to find, for each node, the length of its longest path from the starting node. The nodes are then ordered by this "closeness" and the volumes are updated in that order, so a node is only updated after the nodes upstream of it.
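To make the ordering idea concrete outside of Cypher, here is a minimal Python sketch of the same propagation. It is only an illustration under assumptions: the function name `propagate`, the tuple layout `(src, dst, cost)` and the dictionary of volumes are hypothetical, and the graph is assumed to be acyclic and fully listed in `volumes`.

```python
from functools import lru_cache

def propagate(volumes, roads, start):
    """Return updated volumes; `roads` is a list of (src, dst, cost) tuples.

    Hypothetical helper for illustration. Assumes the ROAD graph is acyclic
    and that every node appears as a key in `volumes`.
    """
    incoming = {}
    for src, dst, cost in roads:
        incoming.setdefault(dst, []).append((src, cost))

    @lru_cache(maxsize=None)
    def longest(node):
        # Length of the longest path from `start` to `node`; -1 if unreachable.
        if node == start:
            return 0
        dists = [longest(src) for src, _ in incoming.get(node, [])]
        reachable = [d for d in dists if d >= 0]
        return max(reachable) + 1 if reachable else -1

    result = dict(volumes)
    # Nodes with the smallest longest-path distance are updated first, so every
    # predecessor already holds its final volume when a node is processed.
    order = sorted((n for n in volumes if longest(n) > 0), key=longest)
    for node in order:
        for src, cost in incoming[node]:
            result[node] += cost * result[src]
    return result

# The example from the question: A receives 5 and forwards 5 per request.
print(propagate({"A": 5, "B": 0}, [("A", "B", 5)], start="A"))  # {'A': 5, 'B': 25}
```

Sorting by the longest path rather than the shortest is what makes this safe: a node that can be reached both directly and through an intermediate node is processed after that intermediate node.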

In this case the planner will use the labels to find starting places for the query (you can run an EXPLAIN of the query to see the query plan), so it's going to match to all :Loc nodes and expand the pattern and modify the properties accordingly.
This will apply to all :Loc nodes. Is that what you want, or do you only want it to apply to the smaller portion of your graph that is reachable from some starting node?

Related

Neo4j optimization: Query for all graphs from selected to selected nodes

I am not so experienced in neo4j and have the requirement of searching for all graphs from a selection A of nodes to a selection B of nodes.
Around 600 nodes in the db with some relationships per node.
Node properties:
riskId
de_DE_description
en_GB_description
en_US_description
impact
Selection:
Selection A is determined by a property match (property: 'riskId')
Selection B is a known constant list of nodes (label: 'Core')
The following query returns the result I want, but it seems a bit slow to me:
match p=(node)-[*]->(:Core)
where node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN extract (n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description] )
as `risks`, length(p)
This query results in 7 rows with between 1 and 4 nodes per row, so not much.
I get around 270ms or more response time in my local environment.
I have not created any indices or tried any other performance tuning.
Any hints on how I can craft the query in a more intelligent way or apply any performance tuning tricks?
Thank you very much,
Manuel

If there is not yet a single label that is shared by all the nodes that have the riskId property, you should add such a label (say, :Risk) to all those nodes. For example:
MATCH (n)
WHERE EXISTS(n.riskId)
SET n:Risk;
A node can have multiple labels. This alone can make your query faster, as long as you specify that node label in your query, since it would restrict scanning to only Risk nodes instead of all nodes.
However, you can do much better by first creating an index, like this:
CREATE INDEX ON :Risk(riskId);
After that, this slightly altered version of your query should be much faster, as it would use the index to quickly get the desired Risk nodes instead of scanning:
MATCH p=(node:Risk)-[*]->(:Core)
WHERE node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN
EXTRACT(n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description]) AS risks,
LENGTH(p);

Cypher recommendation query performance

I am working with RNeo4j for a recommendation application and I am having some issues writing an efficient query. The goal of the query is to recommend an item to a user, with the stipulation that they have not used the item before.
I want to return the item's name, the nodes on the path (for a visualization of the recommendation), and some additional measures to be able to make the recommendation as relevant as possible. Currently I'm returning the number of users that have used the item before, the length of the path to the recommendation, and a sum of the qCount relationship property.
Current query:
MATCH (subject:User {id: {idQ}}), (rec:Item),
p = shortestPath((subject)-[*]-(rec))
WHERE NOT (subject)-[:ACCESSED]->(rec)
MATCH (users:User)-[:ACCESSED]->(rec)
RETURN rec.Name as Item,
count(users) as popularity,
length(p) as pathLength,
reduce(weight = 0, q IN relationships(p)| weight + toInt(q.qCount)) as Strength,
nodes(p) as path
ORDER BY pathLength, Strength DESCENDING, popularity DESCENDING
LIMIT {resultLimit}
The query appears to be working correctly, but it takes too long for the desired application (around 8 seconds). Does anyone have some suggestions for how to improve my query's performance?
I am new to Cypher, so I apologize if it is something obvious to a more advanced user.

One thing to consider is specifying an upper bound on the variable-length path pattern, like this: p = shortestPath((subject)-[*..5]-(rec)). This limits the number of relationships in the path to a maximum of 5. Without a maximum, performance can be poor, because paths of all lengths are considered.
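The effect of an upper bound can be illustrated with a small breadth-first search in Python. This is only a sketch of the general idea, not how Neo4j implements shortestPath; the function name `shortest_path`, the `max_hops` parameter and the toy graph are assumptions made for the example.

```python
from collections import deque

def shortest_path(neighbors, source, target, max_hops):
    """Breadth-first search that gives up beyond `max_hops` relationships.

    `neighbors` maps a node to the nodes it is connected to (hypothetical layout).
    """
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        if len(path) - 1 == max_hops:
            continue  # the bound: do not expand this path any further
        for nxt in neighbors.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy undirected graph: u - a - b - item
graph = {"u": ["a"], "a": ["u", "b"], "b": ["a", "item"], "item": ["b"]}
print(shortest_path(graph, "u", "item", max_hops=5))  # ['u', 'a', 'b', 'item']
print(shortest_path(graph, "u", "item", max_hops=2))  # None
```

With the bound in place, the search stops expanding once a path reaches `max_hops` relationships, which is where the saving comes from on a large, densely connected graph.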
Another thing to consider: by summing the relationship property qCount across all relationships in the path and then sorting by this sum, you are effectively looking for the shortest weighted path. Neo4j includes some graph algorithms (such as Dijkstra) for finding these paths efficiently; however, they are not exposed via Cypher. See this page for more info.
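For readers unfamiliar with Dijkstra's algorithm, here is a minimal Python sketch of a weighted shortest-path search. It is not Neo4j's implementation; the function name `dijkstra` and the adjacency layout are assumptions, and whether qCount should be treated as a cost or as a strength is a modeling decision this sketch does not make.

```python
import heapq

def dijkstra(edges, source, target):
    """Return (total_weight, path) for the cheapest path, or None if none exists.

    `edges` maps a node to a list of (neighbor, weight) pairs (hypothetical
    layout); weights must be non-negative for the algorithm to be correct.
    """
    heap = [(0, source, [source])]
    done = set()
    while heap:
        weight, node, path = heapq.heappop(heap)
        if node == target:
            return weight, path
        if node in done:
            continue  # a cheaper route to this node was already expanded
        done.add(node)
        for nxt, w in edges.get(node, ()):
            if nxt not in done:
                heapq.heappush(heap, (weight + w, nxt, path + [nxt]))
    return None

edges = {"u": [("a", 1), ("b", 4)], "a": [("b", 1)], "b": [("item", 1)]}
print(dijkstra(edges, "u", "item"))  # (3, ['u', 'a', 'b', 'item'])
```

Note that the cheapest path here goes through "a" even though "u" is directly connected to "b", which is the difference between a weighted search and a plain hop-count search.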

Neo4j match by range performance

I've got the following setup:
About 1.5 million nodes of type IpRangeBlock, each with start and end properties, both of type Long. There's an index on the start property.
What I then do is find a range containing a given IP. So, e.g. for IP 0.0.0.2, I convert it to a long and then look for a node satisfying n.start <= 2 && n.end >= 2.
The cypher query I run looks like this:
MATCH (n:IpRangeBlock) WHERE n.start <= {ip} AND n.end >= {ip} RETURN n LIMIT 1
All is fine, though as I mentioned, with the 1.5 million nodes I have, it can take up to 20 s for Neo4j to find the matching range. My question is: is there a way to speed up this operation, or is the fault in my DB design?

OK, I tried caching node references and performing the comparison on the app side. As you might expect, pulling that many nodes takes time.
So I tried another approach: I examined our data set, and it turned out that each IP range's start and end properties share the same first octet. I used those octets as grouping nodes to quickly narrow down the subset of candidate IP ranges. This worked well, as our data set is well distributed across all IP ranges. Now, instead of comparing 100k nodes' properties, each query has to do it for "only" around 8-10k.
I know it's not a perfect approach, but it worked for me. I got this idea from a Neo4j article.
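The grouping idea can be sketched in plain Python as follows. This is an in-memory illustration only, not the Neo4j data model from the answer; the function names and the tuple layout `(start, end)` are assumptions, and it relies on the stated property that a range's start and end share the same first octet.

```python
from collections import defaultdict

def ip_to_long(ip):
    """Convert a dotted IPv4 string to its integer value."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def build_buckets(ranges):
    """Group (start, end) pairs by the first octet of `start`.

    Assumes start and end of each range share the same first octet.
    """
    buckets = defaultdict(list)
    for start, end in ranges:
        buckets[start >> 24].append((start, end))
    return buckets

def find_range(buckets, ip):
    """Scan only the bucket for the IP's first octet instead of all ranges."""
    value = ip_to_long(ip)
    for start, end in buckets.get(value >> 24, ()):
        if start <= value <= end:
            return start, end
    return None

ranges = [
    (ip_to_long("0.0.0.0"), ip_to_long("0.0.0.255")),
    (ip_to_long("10.1.0.0"), ip_to_long("10.1.255.255")),
]
buckets = build_buckets(ranges)
print(find_range(buckets, "0.0.0.2"))   # (0, 255)
print(find_range(buckets, "11.0.0.1"))  # None
```

Each lookup only compares against the ranges in one bucket, which mirrors the reduction in compared nodes described above.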

An effective way to lookup duplicate nodes in Neo4j 1.8?

I'm trying to programmatically locate all duplicate nodes in a Neo4j 1.8 database.
The nodes that need examination all have a (non-indexed) property externalId, and I want to find the nodes that share the same value. This is the Cypher query I've got:
START n=node(*), dup=node(*) WHERE
HAS(n.externalId) AND HAS(dup.externalId) AND
n.externalId=dup.externalId AND
ID(n) < ID(dup)
RETURN dup
There are fewer than 10K nodes in the data and fewer than 1K nodes with an externalId.
The query above works but seems to perform badly. Is there a less memory-consuming way to do this?

Try this query:
START n=node(*)
WHERE HAS(n.externalId)
WITH n.externalId AS extId, COLLECT(n) AS cn
WHERE LENGTH(cn) > 1
RETURN extId, cn;
It avoids taking the Cartesian product of your nodes. It finds the distinct externalId values, collects all the nodes with the same id, and then filters out the non-duplicated ids. Each row in the result will contain an externalId and a collection of the duplicate nodes with that id.
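The same group-then-filter idea, written as a short Python sketch for comparison. It is an illustration only: the function name `find_duplicates` and the `(node_id, properties)` layout are assumptions, not part of any Neo4j API.

```python
from collections import defaultdict

def find_duplicates(nodes):
    """Map each duplicated externalId to the ids of the nodes carrying it.

    `nodes` is an iterable of (node_id, properties_dict) pairs (hypothetical
    layout). One pass over the nodes, no pairwise comparison.
    """
    groups = defaultdict(list)
    for node_id, props in nodes:
        if "externalId" in props:  # same role as HAS(n.externalId)
            groups[props["externalId"]].append(node_id)
    # Keep only the ids that occur on more than one node.
    return {ext: ids for ext, ids in groups.items() if len(ids) > 1}

nodes = [(1, {"externalId": "x"}), (2, {"externalId": "y"}),
         (3, {"externalId": "x"}), (4, {})]
print(find_duplicates(nodes))  # {'x': [1, 3]}
```

Because every node is visited once and placed into a bucket, the work grows with the number of nodes rather than with the number of node pairs.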
Your START clause performs a full graph scan and then builds a Cartesian product of the entire set of nodes (10k * 10k = 100m pairs to start from), which the WHERE clause then narrows down. (Maybe there are Cypher optimizations here? I'm not sure.)
I think adding an index on externalId would be a clear win and may provide enough of a performance gain for now, but you could also look at finding duplicates in a different way, perhaps something like this:
START n=node(*)
WHERE HAS(n.externalId)
WITH n
ORDER BY ID(n) ASC
WITH count(*) AS occurrences, n.externalId AS externalId, collect(ID(n)) AS ids
WHERE occurrences > 1
RETURN externalId, TAIL(ids)

Neo4j - slow cypher query - big graph with hierarchies

Using Neo4j 2.1.4. I have a graph with 'IS A' relationships (and other types of relationships) between nodes. The graph contains several hierarchies (built from IS A relationships), and I need to find the descendants of one hierarchy that have a particular known relationship with some descendant of a second hierarchy. If that relationship exists, I return the descendant(s) of the first hierarchy.
INPUTS: 'ID_parentnode_hierarchy_01', 'ID_relationship', 'ID_parentnode_hierarchy_02'.
OUTPUT: Descendants (IS A relationship) of 'ID_parentnode_hierarchy_01' that have 'ID_relationship' with some descendant of 'ID_parentnode_hierarchy_02'.
Note: The graph has 500,000 nodes and 2 million relationships.
I am using this Cypher query, but it is very slow (approx. 40 s on a PC with 4 GB RAM and a 3 GHz 64-bit Pentium Dual Core). Is it possible to build a faster query?
MATCH (parentnode_hierarchy_01: Node{nodeid : {ID_parentnode_hierarchy_01}})
WITH parentnode_hierarchy_01
MATCH (parentnode_hierarchy_01) <- [:REL* {reltype: {isA}}] - (descendants01: Node)
WITH descendants01
MATCH (descendants01) - [:REL {reltype: {ID_relationship}}] -> (descendants02: Node)
WITH descendants02, descendants01
MATCH (parentnode_hierarchy_02: Node {nodeid: {ID_parentnode_hierarchy_02} })
<- [:REL* {reltype: {isA}}] - (descendants02)
RETURN DISTINCT descendants01;
Thank you very much.

Well, I can slightly clean up your query, which might help us understand the issues better. I doubt this one will run faster, but using the cleaned-up version we can discuss what's going on (mostly it eliminates unneeded uses of MATCH/WITH):
MATCH (parent:Node {nodeid: {ID_parentnode_hierarchy_01}})<-[:REL* {reltype:{isA}}]-
(descendants01:Node)-[:REL {reltype:{ID_relationship}}]->(descendants02:Node),
(parent2:Node {nodeid: {ID_parentnode_hierarchy_02}})<-[:REL* {reltype:{isA}}]-
(descendants02)
RETURN distinct descendants01;
This looks like you're searching two (probably large) trees, starting from the root, for two nodes somewhere in the tree that are linked by an {ID_relationship}.
Unless you can provide some query hints about which node in the tree might have an ID_relationship or something like that, at worst, this looks like you could end up comparing every two nodes in the two trees. So this looks like it could take n * k time, where n is the number of nodes in the first tree, k the number of nodes in the second tree.
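As a rough mental model of what the query is asking for, here is a plain-Python sketch. It says nothing about how Neo4j actually executes the Cypher; the function names and the `is_a_children` and `links` layouts are assumptions made for illustration.

```python
from collections import deque

def descendants(is_a_children, root):
    """All nodes below `root`; `is_a_children` maps a node to its direct children."""
    found, queue = set(), deque([root])
    while queue:
        for child in is_a_children.get(queue.popleft(), ()):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found

def linked_descendants(is_a_children, links, root1, root2):
    """Descendants of `root1` that have a link to some descendant of `root2`.

    `links` is a list of (source, target) pairs for the one relationship of
    interest (the ID_relationship input in the question).
    """
    first = descendants(is_a_children, root1)
    second = descendants(is_a_children, root2)
    return {src for src, dst in links if src in first and dst in second}

is_a_children = {"A": ["A1", "A2"], "A1": ["A11"], "B": ["B1"]}
links = [("A11", "B1"), ("A2", "X")]
print(linked_descendants(is_a_children, links, "A", "B"))  # {'A11'}
```

Here "A2" is excluded because its link points outside the second hierarchy, which is the filtering the last MATCH in the query performs.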
Here are some strategies to think about; which ones you should use depends on your data:
Is there some depth in the tree where these links are likely to be found? Can you put a range on the depth of [:REL* {reltype:{isA}}]?
What other criteria can you add to descendants01 and descendants02? Is there anything that can help make the query more selective so that you're not comparing every node in one tree to every node in the other?
Another strategy you might try (this might be a horrible idea, but it's worth trying) is to look for a path from one root to the other over any number of undirected edges of either the isA type or the other relationship type. Your data model has :REL relationships with a reltype attribute. This is probably an antipattern: instead of storing a reltype attribute, why not make that value the relationship type itself? The current model prevents the query that I want to write, below:
MATCH p=shortestPath((p1:Node {nodeid: {first_parent_id}})-[:isA|ID_relationship*]-(p2:Node {nodeid: {second_parent_id}}))
return p;
This would return the path from one "root" to the other, via the bridge you want. You could then use path functions to extract whatever nodes you wanted. Note that this query isn't possible currently because of your data model.

Resources