How to fetch a subgraph of first neighbors in neo4j? - performance

I fetch the first n neighbors of a node in Neo4j with the query below (in this example, n = 6).
The graph is weighted, so I also order the results by weight:
START start_node=node(1859988)
MATCH start_node-[rel]-(neighbor)
RETURN DISTINCT neighbor,
rel.weight AS weight ORDER BY weight DESC LIMIT 6;
I would like to fetch a whole subgraph, including second neighbors (the first neighbors of each of the first six neighbors).
I tried something like:
START start_node=node(1859988)
MATCH start_node-[rel]-(neighbor)
FOREACH (neighbor | MATCH neighbor-[rel2]-(neighbor2) )
RETURN DISTINCT neighbor1, neighbor2, rel.proximity AS proximity ORDER BY proximity DESC LIMIT 6, rel2.proximity AS proximity ORDER BY proximity DESC LIMIT 6;
The syntax is still wrong, and I am also unsure about the output.
I would like a table of (parent, child, weight) tuples:
[node_A - node_B - weight]
I would also like to know whether one query performs better than six separate queries.
Can someone clarify how to iterate a query (FOREACH) and how to format the output?
Thank you!

Ok, I think I understand. Here's another attempt based on your comment:
// Top 6 first neighbors of the start node(s), highest proximity first
MATCH (start_node)-[rel]-(neighbor)
WHERE ID(start_node) IN {source_ids}
WITH neighbor, rel
ORDER BY rel.proximity DESC
WITH collect({neighbor: neighbor, rel: rel})[0..6] AS neighbors_and_rels
UNWIND neighbors_and_rels AS neighbor_and_rel
WITH
  neighbor_and_rel.neighbor AS neighbor,
  neighbor_and_rel.rel AS rel
// Top 6 second neighbors for each first neighbor
MATCH (neighbor)-[rel2]-(neighbor2)
WITH neighbor, rel, neighbor2, rel2
ORDER BY rel2.proximity DESC
WITH
  neighbor,
  rel,
  collect([neighbor2, rel2])[0..6] AS neighbors_and_rels2
UNWIND neighbors_and_rels2 AS neighbor_and_rel2
RETURN
  neighbor,
  rel,
  neighbor_and_rel2[0] AS neighbor2,
  neighbor_and_rel2[1] AS rel2
It's a bit long, but hopefully it at least gives you the idea.
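The two-level selection above can also be sketched in memory, which may help when checking what the query is supposed to return. This is a minimal Python sketch, not Neo4j code: the adjacency dictionary, the function names `top_neighbors` and `two_level_subgraph`, and the sample weights are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory weighted graph: node -> {neighbor: weight}.
Graph = Dict[str, Dict[str, float]]
Row = Tuple[str, str, float]  # (parent, child, weight)


def top_neighbors(graph: Graph, node: str, n: int) -> List[Row]:
    """Return up to n (parent, child, weight) rows for node, heaviest first."""
    ranked = sorted(graph.get(node, {}).items(), key=lambda kv: kv[1], reverse=True)
    return [(node, child, weight) for child, weight in ranked[:n]]


def two_level_subgraph(graph: Graph, start: str, n: int) -> List[Row]:
    """Top-n neighbors of start, followed by the top-n neighbors of each of them."""
    first = top_neighbors(graph, start, n)
    rows = list(first)
    for _, child, _ in first:
        rows.extend(top_neighbors(graph, child, n))
    return rows


if __name__ == "__main__":
    sample: Graph = {
        "A": {"B": 0.9, "C": 0.5, "D": 0.1},
        "B": {"A": 0.9, "E": 0.7},
        "C": {"A": 0.5},
        "D": {"A": 0.1},
        "E": {"B": 0.7},
    }
    for row in two_level_subgraph(sample, "A", 2):
        print(row)
```

Each row has the [node_A - node_B - weight] shape asked for in the question; note that a second-level row can point back at the start node, exactly as the undirected MATCH would.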

First, you should avoid using START, as it is deprecated and will (hopefully) eventually go away.
To get a neighborhood, you can use a variable-length path to collect all of the paths leading away from the node:
MATCH path=(start_node)-[rel*1..3]-(neighbor)
WHERE ID(start_node) = 1859988
RETURN path, nodes(path) AS nodes, EXTRACT(rel IN rels(path) | rel.weight) AS weights;
Then you can take the path / nodes and combine them in memory with your language of choice.
EDIT:
Also take a look at this SO Question: Fetch a tree with Neo4j
It shows how to get the output as a set of start/end nodes for each of the relationships which can be nicer in many cases.
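As an illustration of the "combine them in memory" step, here is a minimal Python sketch. It assumes each returned row has already been converted by the driver into a pair of lists (node identifiers and relationship weights); the `paths_to_edges` name and that row shape are assumptions, not part of any Neo4j API.

```python
from typing import Hashable, Sequence, Set, Tuple

Edge = Tuple[Hashable, Hashable, float]  # (parent, child, weight)


def paths_to_edges(
    paths: Sequence[Tuple[Sequence[Hashable], Sequence[float]]]
) -> Set[Edge]:
    """Flatten (nodes, weights) rows into unique (parent, child, weight) tuples."""
    edges: Set[Edge] = set()
    for nodes, weights in paths:
        # A path with k relationships always has k + 1 nodes.
        if len(nodes) != len(weights) + 1:
            raise ValueError("each path needs exactly one more node than weights")
        for parent, child, weight in zip(nodes, nodes[1:], weights):
            edges.add((parent, child, weight))
    return edges
```

Because longer paths repeat the prefixes of shorter ones, collecting into a set removes the duplicated relationships.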


Finding optimal swapping paths in employees moving to different cities

We have a problem where we want to find the optimal path for swapping employees' locations across the country.
Hypothetically, a company allows for employees to request to move to another city only if a vacancy is available in that city, and also if someone is willing to take their soon-to-be vacant position. Examine the example:
Employee A who currently works in Los Angeles wants to move to Boston.
Employee B who currently works in Boston wants to move to New York.
Employee C who currently works in New York wants to move to Los Angeles.
In the above triangle, we can grant all three employees the permission to do the move, since there won't be any vacancies once they move. But the situation gets more complex when:
Multiple employees are competing for the same location. We can solve this with a hypothetical score of some sort, like more years working for the company gets the priority.
We have more cities to consider. (in the hundreds)
We have more employees to consider. (in the hundreds of thousands)
Ultimately the goal is to grant the highest number of move permissions without leading to any vacancies in the system.
We're currently exploring the idea of simulating all the swapping paths, and then selecting the one that generates the highest number of moves.
But I feel that this problem existed in the wild before, I just don't know what keywords to look for in order to get more insights. Any ideas? What algorithms should we look into?
Remove the impossible move requests, like this:
A and B are specific cities; n is any city
RAB is a request to move from A to B
RAn is a request to move from A
RnA is a request to move to A
CAn is the number of requests to move from A
CnA is the number of requests to move to A
set flag = TRUE
WHILE ( flag == TRUE )
    set flag = FALSE
    LOOP A over all cities
        IF CAn > CnA then not all RAn can be permitted.
            Remove lower scoring requests until CAn == CnA.
            set flag = TRUE
Once these "impossible" moves are removed, all of the remaining move requests are "in balance": for every city, the number of requests to move to it equals the number of requests to move from it. From that point on it no longer matters which cycles you choose to implement: once you implement one, the remaining requests are still in balance. Whichever move cycles are chosen, and in whatever order they are implemented, the requests stay in balance until none remain, and the total number of moves is exactly the same. ( This explanation is due to https://stackoverflow.com/users/109122/rbarryyoung )
Here is C++ code implementing this
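// Relies on sCity, vRequest, RequestCountFrom() and RequestCountTo(),
// which are defined in the complete application linked below.
void removeForbidden()
{
    bool flag = true;
    while (flag)
    {
        flag = false;
        for (auto &city : sCity)
        {
            auto vFrom = RequestCountFrom(city);
            auto vTo = RequestCountTo(city);
            if (vFrom.size() > vTo.size())
            {
                // More requests to leave than to arrive:
                // deny the surplus requests at the tail of vFrom
                for (size_t k = vTo.size(); k < vFrom.size(); k++)
                {
                    vFrom[k]->allowed = false;
                }
                flag = true;
            }
        }
    }
    std::cout << "Permitted moves:\n";
    for (auto &R : vRequest)
    {
        if (R.allowed)
            std::cout << R.text();
    }
}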
The complete application code is at https://gist.github.com/JamesBremner/5f49beaca59a7a7043e356fbb35f0d09
The input is a space delimited text file with 4 columns: employee name, employee score, from city, to city
Here is sample input based on your example but adding another request that cannot be permitted
e1 1 a b
e2 1 b c
e3 1 c a
e4 0 a c
The output from this is
Permitted moves:
e1 1 a b
e2 1 b c
e3 1 c a
Note: I have not implemented the scoring. For simplicity I assume that move requests are entered in order of descending score, so which requests are dropped depends on the order in which you enter them. I assume you will be able to implement whatever scoring system you require. Also note that unless you calculate a unique score for every request from a city, which requests are denied may vary with the order of input.
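For readers who want to see how scoring might slot into the balancing loop, here is a minimal Python sketch. It is not a port of the gist: the `Request` dataclass and `remove_forbidden` function are hypothetical names, and "higher score wins, input order breaks ties" is an assumed policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    name: str
    score: int
    src: str   # city the employee wants to leave
    dst: str   # city the employee wants to move to
    allowed: bool = True


def remove_forbidden(requests: List[Request]) -> List[Request]:
    """Deny the lowest-scoring departures until every city is in balance."""
    changed = True
    while changed:
        changed = False
        cities = {r.src for r in requests} | {r.dst for r in requests}
        for city in sorted(cities):
            leaving = [r for r in requests if r.allowed and r.src == city]
            arriving = [r for r in requests if r.allowed and r.dst == city]
            if len(leaving) > len(arriving):
                # Highest score first; the sort is stable, so input order breaks ties.
                leaving.sort(key=lambda r: -r.score)
                for r in leaving[len(arriving):]:
                    r.allowed = False
                changed = True
    return [r for r in requests if r.allowed]
```

With the sample input above (e4 scoring lowest), this keeps e1, e2 and e3. If e4 is instead given the highest score, e1 is denied first, which then forces e2 out as well, leaving only the two-city swap e4/e3.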
I was about to post this as a comment, but it exceeded the allowed number of characters.
I'm not sure about existing advanced algorithms that could potentially solve this problem, but you can custom fit some fundamental ones:
An employee wanting to move from city1 to some city2 is a directed edge from city1 to city2. Make sure that if 2 employees want to move from A to B, you add 2 directed edges for that or somehow keep count of the quantity.
Find disjoint components of the graph.
In each disjoint component, find the largest possible cycle. A cycle means A -> B -> C -> A.
Remove those edges and keep count of the number of successful swaps.
Repeat until there are no cycles in any of the disjoint components.
This is a greedy algorithm. At the moment I'm still not quite sure if it would produce the optimal solution in each and every situation. Any input is appreciated.
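A simplified version of this greedy idea can be sketched in Python. One loud assumption: the sketch removes whichever cycle a depth-first search finds first, not the largest one, because finding a longest cycle is expensive in general. The names `find_cycle` and `grant_cycles` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def find_cycle(adj: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return the cities of some directed cycle, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2
    color: Dict[str, int] = defaultdict(int)
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        stack.append(node)
        for nxt in adj.get(node, []):
            if color[nxt] == GREY:
                # nxt is on the current DFS stack, so the stack tail is a cycle.
                return stack[stack.index(nxt):]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for start in list(adj):
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


def grant_cycles(requests: List[Tuple[str, str]]) -> int:
    """Repeatedly remove a cycle of (from, to) requests; return the moves granted."""
    adj: Dict[str, List[str]] = defaultdict(list)
    for src, dst in requests:
        adj[src].append(dst)  # parallel requests become repeated entries
    granted = 0
    while True:
        cycle = find_cycle(adj)
        if cycle is None:
            return granted
        for src, dst in zip(cycle, cycle[1:] + cycle[:1]):
            adj[src].remove(dst)
        granted += len(cycle)
```

The sketch also illustrates the doubt raised above: for the requests a->b, b->c, c->a, a->c it grants 3 moves, but listing a->c first makes the search pick the two-city cycle and grant only 2. Picking an arbitrary cycle is therefore not optimal on unbalanced input.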

How to get level(depth) number of two connected nodes in neo4j

I'm using Neo4j as a graph database to store users' connection details. I want to show the level of one user with respect to another user in their connections, as LinkedIn does: for example, first-level connection, second-level connection, third-level connection, and "3+" for anything beyond the third level. I don't know how to do this in Neo4j; I searched but couldn't find a solution. If anybody knows how, please help me implement this functionality.
To find the shortest "connection level" between 2 specific people, just get the shortest path and add 1:
MATCH path = shortestpath((p1:Person)-[*..]-(p2:Person))
WHERE p1.id = 1 AND p2.id = 2
RETURN LENGTH(path) + 1 AS level
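NOTE: You may want to put a reasonable upper bound on the variable-length relationship pattern (e.g., [*..6]) to avoid having the query take too long or run out of memory in a large DB. You should probably ignore very distant connections anyway. Also note that LENGTH(path) counts relationships, so a direct connection already has length 1; drop the + 1 if direct connections should count as level 1.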
It would be something like this:
// get all persons (or users)
MATCH (p:Person)
// create a set of unique combinations , assuring that you do
// not do double work
WITH COLLECT(p) AS personList
UNWIND personList AS personA
UNWIND personList AS personB
WITH personA,personB
WHERE id(personA) < id(personB)
// find the shortest path between any two nodes
MATCH path=shortestPath( (personA)-[:LINKED_TO*]-(personB) )
// return the distance ( = path length) between the two nodes
RETURN personA.name AS nameA,
personB.name AS nameB,
CASE WHEN length(path) > 3 THEN '3+'
ELSE toString(length(path))
END AS distance
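The capped distance can also be computed outside the database with a bounded breadth-first search, which never explores past the level that would be displayed anyway. This is a minimal Python sketch under assumptions: the adjacency dictionary and the `connection_level` name are hypothetical, and unreachable users are reported as "3+" as well, since a bounded search cannot tell them apart from distant ones.

```python
from collections import deque
from typing import Dict, Hashable, Set


def connection_level(
    adj: Dict[Hashable, Set[Hashable]], a: Hashable, b: Hashable, cap: int = 3
) -> str:
    """Return the hop count from a to b as a string, or f'{cap}+' beyond the cap."""
    if a == b:
        return "0"
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == cap:
            continue  # do not expand beyond the cap
        for nxt in adj.get(node, ()):
            if nxt == b:
                return str(depth + 1)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return f"{cap}+"
```

Here a direct connection is level "1", matching the CASE expression in the query above.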

Edit Distance Data Structure

I am new to data structures and want to understand the flow of this diagram, which is for calculating the minimum edit distance between two strings. I understand that String 1 and String 2 both have length three, so the tutorial's graph starts from eD(3,3). But why does the graph then split into eD(3,2), eD(2,3), and eD(2,2) at the second level of recursion? What does that signify? Please give a detailed explanation. Why can't we split level 2 into only eD(3,2) and eD(2,3)?
I am following this URL: https://www.geeksforgeeks.org/dynamic-programming-set-5-edit-distance/
[Image: recursion tree from the tutorial, rooted at eD(3,3)]
Ok, so basically in case of Edit Distance, we are trying to either insert, update or delete an element. So, our basic approach is to try all three of these available operations at each point and check which case gives the best result.
Specific to the case that you are trying to understand, the following is the scenario:
eD(3,3) = eD(2,2) if str1[3] == str2[3] # Without incurring any cost, we find the edit distance of the remaining strings.
else 1 + min(del, ins, rep)
where
del = eD(2,3): delete the last character of str1, then find the edit distance of the remaining strings.
ins = eD(3,2): insert the last character of str2 at the end of str1, then find the edit distance of the remaining strings. E.g. for 'adc' and 'axf', if we add 'f' to the first string it becomes 'adcf'. Now the last characters of both strings are the same, so we eventually have to find the edit distance between 'adc' and 'ax'. Thus it becomes eD(3,2).
rep = eD(2,2): replace the last character of str1 with the last character of str2, then find the edit distance of the remaining strings. E.g. for 'abc' and 'adf', if we replace the last character of the first string with the last character of the second, we get 'abf' and 'adf'. Now the last characters of both strings are the same, so we eventually have to find the edit distance between 'ab' and 'ad'. Thus, eD(2,2).
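The recurrence described above can be written down directly. The following is a minimal Python sketch (the `edit_distance` name is an assumption, and memoization via `lru_cache` is added so shared subproblems such as eD(2,2) are only computed once); it shows why all three branches are needed whenever the last characters differ.

```python
from functools import lru_cache


def edit_distance(str1: str, str2: str) -> int:
    """Minimum number of insertions, deletions and replacements to turn str1 into str2."""

    @lru_cache(maxsize=None)
    def ed(m: int, n: int) -> int:
        # ed(m, n) compares the first m characters of str1 with the first n of str2.
        if m == 0:
            return n  # insert the n remaining characters
        if n == 0:
            return m  # delete the m remaining characters
        if str1[m - 1] == str2[n - 1]:
            return ed(m - 1, n - 1)  # last characters match: no cost
        return 1 + min(
            ed(m - 1, n),      # del
            ed(m, n - 1),      # ins
            ed(m - 1, n - 1),  # rep
        )

    return ed(len(str1), len(str2))
```

Leaving out the eD(2,2) branch would remove the replace operation, so a single substitution would have to be paid for as a delete plus an insert.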

ArangoDB 3.2 traversal: exclude edge collection

I am doing an AQL traversal with ArangoDB 3.2 in which I retrieve the nodes connected to my vertexCollection like this:
For v, e, p IN 1..10 ANY vertexCollection GRAPH myGraph OPTIONS {uniqueVertices: "global", bfs:true}
RETURN v._id
and now I want to skip the nodes from paths where a particular edge collection is used. I know I can filter for particular attributes in lists, like FILTER p.edges[*].type ALL == 'whatever' but I do not find how to apply this into IS_SAME_COLLECTION() to filter by collection.
I discard the option of specifying exactly the edgeCollection in the traversal instead of the GRAPH because it's just one particular edgeCollection that I want to skip vs. many that I want to go through.
I don't know whether there is already an implementation for 'skip edge collection' or something like that in a graph traversal, so far I could not find it.
Note:
I tried to filter like this
For v, e, p IN 1..10 ANY vertexCollection GRAPH myGraph OPTIONS {uniqueVertices: "global", bfs:true}
FILTER NOT IS_SAME_COLLECTION('edgeToSkip', e._id)
RETURN v._id
But here I simply skip the nodes directly connected with edge 'edgeToSkip' but not all nodes within the path where 'edgeToSkip' is present. So I need, not only to exclude that particular edge, but stop traversing when it is found.
Thanks
UPDATE:
I found a workaround: I gather the collection names of all edges present in a path and then filter the path out if the edge collection I want to skip is among them. Note that I changed uniqueVertices: "global" to uniqueVertices: "path".
For v, e, p IN 1..10 ANY vertexCollection GRAPH myGraph OPTIONS {uniqueVertices: "path", bfs:true}
// collect edge names (collection name) in the current path
LET ids = (
    FOR edge IN p.edges
    RETURN PARSE_IDENTIFIER(edge)["collection"]
)
// filter out if edge name (edgeToSkip) is present
FILTER 'edgeToSkip' NOT IN ids
RETURN v._id
This way, once edgeToSkip is found in a path, no vertices beyond it are returned, but the vertices before edgeToSkip still are.
If the graph is like this:
vertexA --edge1--> vertexB --edge2--> vertexC --edgeToSkip--> vertexD --edge3--> vertexE
Will return:
vertexA, vertexB and vertexC (but not vertexD and vertexE)
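The intended result can be checked with a small in-memory model. This is a minimal Python sketch, not ArangoDB code: edges are hypothetical (collection, from, to) tuples, `reachable_without` is an invented name, and the start vertex is included in the result to match the example above. Dropping every edge of the skipped collection before searching is equivalent to discarding every path that contains such an edge.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (edge collection, from vertex, to vertex)


def reachable_without(
    edges: Iterable[Edge], start: str, skip: str, max_depth: int = 10
) -> List[str]:
    """Vertices reachable from start (in ANY direction) without crossing `skip` edges."""
    neighbors: Dict[str, Set[str]] = {}
    for collection, src, dst in edges:
        if collection == skip:
            continue  # never traverse the skipped edge collection
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)
    order = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nxt in sorted(neighbors.get(vertex, ())):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order
```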

Can this Neo4j query be optimized?

I have rather large dataset (20mln nodes, 200mln edges), simplest shortestPath queries finish in milliseconds, everything is great.
But... I need to allow shortestPath to have ZERO or ONE relation of type 999 and it can be only the first from the start node.
So, my query became like this:
MATCH (one:Obj{oid:'startID'})-[r1*0..1]-(b:Obj)
WHERE all(rel in r1 where rel.val = 999)
WITH one, b
MATCH (two:Obj{oid:'endID'}), path=shortestPath((one) -[*0..21]-(two))
WHERE ALL (x IN RELATIONSHIPS(path)
WHERE (x.val > -1 and x.val<101) or (x.val=999 or x.val=998)) return path
It runs in milliseconds when there's a short path (up to 2-4), but can take 5 or 20 seconds for paths of length 5 or more. Maybe I've composed an inefficient query?
This question will be bountied when available.
Some of your requirements are a bit unclear to me, so I'll reiterate my understanding and offer a solution.
You want to inspect the shortest paths between a start and end node.
The paths returned should have ZERO or ONE relationship with a val of 999. If it's ONE relationship with that value, it should be the first.
Here's an attempt based on that logic:
MATCH (start:Obj {oid:'startID'}),
      (end:Obj {oid:'endID'}),
      path=shortestPath((start)-[*1..21]->(end))
WITH path, relationships(path) AS rels
WHERE all(r IN rels WHERE r.val <> 999)
   OR (rels[0].val = 999
       AND all(r IN rels[1..] WHERE r.val <> 999))
RETURN path
I haven't had a chance to test on actual data, but hopefully this logic and approach at least point you in the right direction.
Also note: it's possible the entire WHERE clause at the end could be reduced to:
WHERE all(r IN rels[1..] WHERE r.val <> 999)
Meaning you don't even need to check the first relationship.
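The claimed reduction can be sanity-checked by brute force outside the database. This is a minimal Python sketch: the function names are hypothetical, and each path is modelled simply as a tuple of `val` values in relationship order.

```python
from itertools import product
from typing import Sequence, Tuple


def full_predicate(vals: Sequence[int]) -> bool:
    """No 999 anywhere, or 999 first and nowhere else."""
    return all(v != 999 for v in vals) or (
        vals[0] == 999 and all(v != 999 for v in vals[1:])
    )


def reduced_predicate(vals: Sequence[int]) -> bool:
    """No 999 after the first relationship."""
    return all(v != 999 for v in vals[1:])


def predicates_agree(max_len: int = 5, domain: Tuple[int, ...] = (1, 998, 999)) -> bool:
    """Compare both predicates on every path of length 1..max_len over domain."""
    return all(
        full_predicate(vals) == reduced_predicate(vals)
        for n in range(1, max_len + 1)
        for vals in product(domain, repeat=n)
    )
```

The two predicates agree on every non-empty path because the first relationship is unconstrained in both: it may be 999 or not, and only the remaining relationships must avoid 999.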
