Efficient Cypher query matching a subgraph connecting two groups of nodes - performance

My problem is the following. I have a small but dense network in Neo4j (~280 nodes, ~3600 relationships). There is only one type of node and one type of edge (i.e. a single label for each). Now, I'd like to specify two distinct groups of nodes, given by values of their "group" property, and match the subgraph consisting of all paths up to a certain length connecting the two groups. In addition, I would like to add constraints on the relationships. So, at the moment I have this:
MATCH (n1) WHERE n1.group={group1}
MATCH (n2) WHERE n2.group={group2}
MATCH p=(n1)-[r*1..3]-(n2)
WHERE ALL(c IN r WHERE c.weight > {w})
AND ALL(n in NODES(p) WHERE 1=length(filter(m in NODES(p) WHERE m=n)))
WITH DISTINCT r AS dr, NODES(p) AS ns
UNWIND dr AS udr UNWIND ns AS uns
RETURN COLLECT(DISTINCT udr), COLLECT(DISTINCT uns)
which achieves what I want, but in some cases seems to be too slow. Here, the WHERE clause filters out paths containing relationships whose weight property is below a threshold, as well as paths containing cycles.
The last three lines produce the desired output format. Given the matching subgraph (paths), I want all unique relationships in one list and all unique nodes in another (for visualization with d3.js). The only way I found to do this is to UNWIND all elements and then COLLECT them as DISTINCT.
Also note that the group properties and the weight limit are passed in as query parameters.
Now, is there any way to achieve the same result faster? E.g., with paths up to length 3 the query takes about 5-10 seconds on my local machine (depending on the connectedness of the chosen node groups), and returns on the order of ~50 nodes and a few hundred relationships. This is within reach of acceptable performance. Paths up to length 4, however, are already prohibitive (several minutes, or the query never returns).
Bonus question: is there any way to specify the upper limit on the path length as a parameter? Or does a different limit imply a totally different query plan?

This probably won't work at all, but it might give you something to play with. I tried changing a few things that may or may not work.
MATCH (n1) WHERE n1.group={group1}
MATCH (n2) WHERE n2.group={group2}
MATCH p=(n1)-[r*1..3]-(n2)
WHERE ALL(c IN r WHERE c.weight > {w})
WITH DISTINCT n1, NODES(p) AS ns, n2, r AS dr
WHERE length(ns) = 1
UNWIND dr AS udr UNWIND ns AS uns
RETURN COLLECT(DISTINCT udr), COLLECT(DISTINCT uns)

Related

Neo4J - Finding the widest path on very large graphs

I have created a very large directed, weighted graph, and I'm trying to find the widest path between two points.
Each edge has a count property.
I have found this example and modified the query so that the path matching is directed, like so:
MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH p, EXTRACT(c IN RELATIONSHIPS(p) | c.count) AS counts
UNWIND(counts) AS b
WITH p, MIN(b) AS count
ORDER BY count DESC
RETURN NODES(p) AS `Widest Path`, count
LIMIT 1
This query seems to require an enormous amount of memory, and fails even on partial data.
Update: for clarification, the query runs until it runs out of memory.
I've found this link, which combines the use of Spark and Neo4j. Unfortunately, Mazerunner for Neo4j does not support a "widest path" algorithm out of the box. What would be the right approach to running the "widest path" query on a very large graph?
The reason your algorithm is taking a long time to run is that (a) you have a big graph, (b) your memory parameters probably need tweaking (see comments), and (c) you're enumerating every possible path between ENTRY and EXIT. Depending on how your graph is structured, this could be a huge number of paths.
Note that for the widest path, the width is determined by the smallest weight on any edge of the path. This means that you're probably computing and re-computing many paths you can ignore.
Wikipedia has good information on this algorithm that you should consider. In particular:
It is possible to find maximum-capacity paths and minimax paths with a single source and single destination very efficiently even in models of computation that allow only comparisons of the input graph's edge weights and not arithmetic on them.[12][18] The algorithm maintains a set S of edges that are known to contain the bottleneck edge of the optimal path; initially, S is just the set of all m edges of the graph. At each iteration of the algorithm, it splits S into an ordered sequence of subsets S1, S2, ... of approximately equal size; the number of subsets in this partition is chosen in such a way that all of the split points between subsets can be found by repeated median-finding in time O(m). The algorithm then reweights each edge of the graph by the index of the subset containing the edge, and uses the modified Dijkstra algorithm on the reweighted graph; based on the results of this computation, it can determine in linear time which of the subsets contains the bottleneck edge weight. It then replaces S by the subset Si that it has determined to contain the bottleneck weight, and starts the next iteration with this new set S. The number of subsets into which S can be split increases exponentially with each step, so the number of iterations is proportional to the iterated logarithm function, O(log* n), and the total time is O(m log* n).[18] In a model of computation where each edge weight is a machine integer, the use of repeated bisection in this algorithm can be replaced by a list-splitting technique of Han & Thorup (2002), allowing S to be split into O(√m) smaller sets Si in a single step and leading to a linear overall time bound.
You should consider implementing this approach rather than your current "enumerate all paths" Cypher approach, as the latter has you re-checking the same edge counts for every path that involves that particular edge.
There's no ready-made software that will just do this for you; I'd recommend taking that paragraph (and checking its citations for further information) and implementing it yourself. I think performance-wise you can do much better than your current query.
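If it helps to get started, here is a minimal Python sketch of the simpler textbook variant of this idea: a Dijkstra-style search that maximizes the minimum edge count along a path, run outside Neo4j on an exported edge list. The toy edge list, node names, and the assumption that counts are non-negative are mine, not from the question:
import heapq
from collections import defaultdict

def widest_path(edges, source, target):
    # edges: iterable of (u, v, count) tuples for a directed graph.
    # Returns (width, path), where width is the smallest count on the path,
    # maximized over all source->target paths, or (0, None) if unreachable.
    graph = defaultdict(list)
    for u, v, count in edges:
        graph[u].append((v, count))

    best = {source: float("inf")}     # best bottleneck found so far per node
    parent = {source: None}
    heap = [(-best[source], source)]  # max-heap via negated widths

    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(node, 0):
            continue                  # stale heap entry
        if node == target:
            path = []
            while node is not None:   # walk parent pointers back to the source
                path.append(node)
                node = parent[node]
            return width, list(reversed(path))
        for nxt, count in graph[node]:
            new_width = min(width, count)
            if new_width > best.get(nxt, 0):
                best[nxt] = new_width
                parent[nxt] = node
                heapq.heappush(heap, (-new_width, nxt))
    return 0, None

# Tiny made-up example: the widest ENTRY->EXIT path is ENTRY->b->EXIT, width 4.
edges = [("ENTRY", "a", 5), ("a", "EXIT", 3), ("ENTRY", "b", 4), ("b", "EXIT", 10)]
print(widest_path(edges, "ENTRY", "EXIT"))
Unlike the all-paths Cypher query, this only revisits an edge when it improves some node's best-known bottleneck, which is what keeps the memory footprint small.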
Some thoughts.
Your query (and the original example query) can be simplified. This may or may not be sufficient to prevent your memory issue.
For each matched path, there is no need to: (a) create a collection of counts, (b) UNWIND it into rows, and then (c) perform a MIN aggregation. The same result could be obtained by using the REDUCE function instead:
MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH p, REDUCE(m = 2147483647, c IN RELATIONSHIPS(p) | CASE WHEN c.count < m THEN c.count ELSE m END) AS count
ORDER BY count DESC
RETURN NODES(p) AS `Widest Path`, count
LIMIT 1;
(I assume that the count property value is an int. 2147483647 is the max int value.)
You should create an index (or, perhaps more appropriately, a uniqueness constraint) on the name property of the Vertex label. For example:
CREATE INDEX ON :Vertex(name)
EDITED
This enhanced version of the above query might solve your memory problem:
MERGE (t:Temp) SET t.count = 0, t.widest_path = NULL
WITH t
OPTIONAL MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH t, p, REDUCE(m = 2147483647, c IN RELATIONSHIPS(p) | CASE WHEN c.count < m THEN c.count ELSE m END) AS count
WHERE count > t.count
SET t.count = count, t.widest_path = NODES(p)
WITH COLLECT(DISTINCT t)[0] AS t
WITH t, t.count AS count, t.widest_path AS `Widest Path`
DELETE t
RETURN `Widest Path`, count;
It creates (and ultimately deletes) a temporary :Temp node to keep track of the currently "winning" count and the corresponding path nodes. (You must make sure that the label Temp is not otherwise used.)
The WITH clause starting with COLLECT(DISTINCT t) uses aggregation of distinct :Temp nodes (of which there is only 1) to ensure that Cypher only keeps a single reference to the :Temp node, no matter how many paths satisfy the WHERE clause. Also, that WITH clause does NOT include p, so that Cypher does not accumulate paths that we do not care about. It is this clause that might be the most important in helping to avoid your memory issues.
I have not tried this out.

Neo4j traversal performance

I want to perform an undirected traversal to extract all node ids connected through a certain type of relationship. When I perform the following query, it returns the values fast enough:
MATCH path=(s:Node {entry:"a"})-[:RelType*1..10]-(x:Node)
RETURN collect(distinct ID(x))
However, doing
MATCH path=(s:Node {entry:"a"})-[:RelType*]-(x:Node)
RETURN collect(distinct ID(x))
takes a huge amount of time. I suspect that by using * it searches every path from s to x, but since I only want the ids, these paths could be discarded. What I really want is a BFS or DFS search to find the nodes connected to s.
Both queries return the exact same result, since there are no nodes at a shortest-path distance greater than 5 (only in the test example!).
Did you add an index: CREATE INDEX ON :Node(entry)?
Also, depending on the number of relationships per node along your path, you get rels^10 (or in general rels^steps) potential paths through your graph that may be returned.
Can you try first with a smaller upper limit, like 3, and work up from there?
Also, leaving off the direction really hurts, as you then get cycles.
What you can also try to do is:
MATCH path=(s:Node {entry:"a"})-[:RelType*]->(x:Node)
RETURN ID(x)
and stream the results and do the uniqueness in the client
Or use this if you don't want to do the uniqueness in the client:
MATCH path=(s:Node {entry:"a"})-[:RelType*]->(x:Node)
RETURN distinct ID(x)
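To illustrate the "stream and deduplicate in the client" option, here is a minimal sketch assuming the official neo4j Python driver; the connection URI and credentials are placeholders:
from neo4j import GraphDatabase  # official Neo4j Python driver (assumed installed)

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH path=(s:Node {entry:"a"})-[:RelType*]->(x:Node)
RETURN ID(x) AS id
"""

unique_ids = set()
with driver.session() as session:
    # Stream the records and collect distinct ids on the client side.
    for record in session.run(query):
        unique_ids.add(record["id"])
driver.close()

print(len(unique_ids))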

Getting all relationships among neighbors of a node

I have an embedded graph DB of nodes (Twitter users) and directed edges (follows).
I'm trying to get all relationships among the users (set A) who are followed by a specified user (node U), as well as the relationships between the nodes in A and the specified node U.
My query:
START u=node:user_id(user_id={id_of_U})
MATCH p = u-->following, p2= following-[?]->u, p3 = following-[?]->()<--u
RETURN distinct rels(p),rels(p2),rels(p3)
This query gives me what I expect, but the problem is that it takes a very long time when the specified user follows too many users.
I tried lots of queries, and the query above is the best one so far. Yet I'm sure there are more efficient ways to do this, because when I get those relationships in a Java method, by walking through all users in "A", getting all relationships for each of them (Direction.BOTH), and then filtering out the relationships whose start or end node does not belong to "A", it takes just 8 seconds for a user following 500 people, whereas the Cypher query blows up my heap before it can even fail...
Can you try this one?
start u=node:user_id(user_id={id_of_U})
MATCH u-[r]->following
with u, r, following
match following-[r2?]->u, following-[r3?]->()<-[r4]-u
RETURN distinct r, r2, r3, r4
Also, are you using the latest 1.9?
Starting with p = u-->following is not optimal, since it takes all related nodes and only later tries to filter on them. I'd suggest picking up fewer nodes first and expanding this set a little bit afterwards:
START u=node:user_id(user_id={id_of_U})
MATCH u-[:FOLLOWS]->following
WITH u,following
MATCH u-[r]-following
RETURN distinct r;
This will give you all the relationships between the nodes in set A that are also followed by node U.
In case you don't have a FOLLOWS relationship type in your graph: you should have one, otherwise your graph design isn't optimal. I noticed you are not using any specific relationship type in your query; this is only optimal if you have just one relationship type in your data. As far as I understand your question, you have more than one relationship type.
Edit:
START u=node:user_id(user_id={id_of_U})
MATCH u-[]-following
WITH u, following
MATCH u-[r]-again, again-[r2]-following
RETURN r, r2

Document retrieval with unwanted words

I am building a data structure that indexes a collection S of documents of total length n, such that it supports the following query: given two words P1 and P2, count all the documents that contain P1 but not P2. I want the answer to be complete (not to miss results).
I've built a generalized suffix tree, picked every sqrt(n)-th leaf and its ancestors (and deleted every node with only one child). For each pair of the remaining internal nodes v and u, I pre-calculate the answer to the query.
With this, if the query words correspond to two of the picked nodes v and u, I can return the answer in O(1); but what can I do when the words are not among the nodes we picked?
I could do it easily by keeping an O(n^2) data structure built during pre-processing, with all possible answers ready for O(1) retrieval, but the goal is to build this data structure in O(n) space and make the queries as efficient as possible.
It sounds like an inverted index would still be useful to you. It's a map from words onto ordered lists of the documents containing them. The documents need to have a common, total ordering, and it is in this order that they appear in their per-word buckets.
Assuming your n is the total length of the corpus in word occurrences (and not the vocabulary size), it can be constructed in O(n log n) time and linear space.
Given P1 and P2, you make two separate queries to get the documents containing the two terms, respectively. Since the two lists share a common ordering, you can do a linear merge-like pass and select just those documents containing P1 but not P2:
c1 <- cursor to first element of list of docs containing P1
c2 <- cursor to first element of list of docs containing P2
results <- []  # our return value

while c1 not exhausted
    if c2 exhausted or *c1 < *c2
        results.append(c1++)
    else if *c1 == *c2
        c1++
        c2++
    else  # *c1 > *c2
        c2++

return results
Notice that every pass of the loop advances at least one cursor, so it runs in time linear in the sum of the sizes of the two initial result lists. Since only things from the c1 cursor enter results, we know all results contain P1.
Finally, note we always advance only the "lagging" cursor: this (and the total document ordering) guarantees that if a document appears in both initial queries, there will be a loop iteration where both cursors point to that document. When this iteration occurs, the middle clause necessarily kicks in and the document is skipped by advancing both cursors. Thus documents containing P2 necessarily do not get added to results.
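For concreteness, here is a small runnable Python version of the index and the merge described above; the toy corpus and the whitespace tokenization are assumptions for illustration only:
from collections import defaultdict

def build_index(docs):
    # Inverted index: word -> sorted list of doc ids (the ids give the total ordering).
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def docs_with_p1_not_p2(index, p1, p2):
    # Linear merge over the two sorted postings lists, as in the pseudocode above.
    l1 = index.get(p1, [])
    l2 = index.get(p2, [])
    i = j = 0
    results = []
    while i < len(l1):
        if j >= len(l2) or l1[i] < l2[j]:
            results.append(l1[i]); i += 1   # doc contains P1 but not P2
        elif l1[i] == l2[j]:
            i += 1; j += 1                  # doc contains both; skip it
        else:                               # l1[i] > l2[j]
            j += 1
    return results

docs = ["the cat sat", "the dog sat", "cat and dog"]
print(docs_with_p1_not_p2(build_index(docs), "cat", "dog"))  # -> [0]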
This query is an example of a general class called Boolean queries; it's possible to extend this algorithm to cover almost any Boolean expression. Certain queries break the efficiency of the algorithm (by forcing it to walk over the entire vocabulary space), but basically, as long as you don't negate every term (i.e. don't ask for "not P1 and not P2"), you're fine. See this for an in-depth treatment.

How to determine correspondence between two lists of names?

I have:
1 million university student names and
3 million bank customer names
I managed to convert the strings into numerical values based on hashing (similar strings have similar hash values). I would like to know how I can determine the correlation between these two sets, to see whether the values pair up at least 60%.
Can I achieve this using ICC? How does ICC 2-way random work?
Please kindly answer ASAP as I need this urgently.
This kind of entity resolution is normally easy, but I am surprised by the hashing approach here. Hashing loses information that is critical to entity resolution, so, if possible, you shouldn't use the hashes but rather the original strings.
Assuming the original strings are an option, you would want to do something like this:
List A (1M), List B (3M)

// First, match the entities that match very well, and REMOVE them.
for a in List A
    for b in List B
        if compare(a,b) >= MATCH_THRESHOLD   // This may be 90% etc.
            add (a,b) to matchedList
            remove a from List A
            remove b from List B

// Now, match the entities that match well, and run bipartite matching.
// Bipartite matching is required because each entity can match "acceptably well"
// with more than one entity on the other side.
for a in List A
    for b in List B
        compute compare(a,b)
        set edge(a,b) = compare(a,b)
        if compare(a,b) < THRESHOLD   // This seems to be 60%
            set edge(a,b) = 0

// Now, run the bipartite matcher and take the results.
The time complexity of this algorithm is O(n1 * n2), which is not very good. There are ways to avoid this cost, but they depend on your specific entity-resolution function. For example, if the last name has to match (to make the 60% cut), then you can simply create sublists of A and B that are partitioned by the first couple of characters of the last name, and run this algorithm only between corresponding sublists. But it may very well be that the last name "Nuth" is supposed to match "Knuth", etc. So some local knowledge of what your name comparison function does can help you divide and conquer this problem better.
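As an illustration of that divide-and-conquer idea, here is a hypothetical Python sketch that blocks on a last-name prefix before comparing; the use of difflib.SequenceMatcher as the compare() function, the prefix length of 2, and the toy names are all my own assumptions:
from collections import defaultdict
from difflib import SequenceMatcher

def compare(a, b):
    # Stand-in similarity score in [0, 1]; swap in your real name comparator.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def block_key(name, prefix_len=2):
    # Block on the first characters of the last name.
    return name.split()[-1][:prefix_len].lower()

def candidate_pairs(list_a, list_b, threshold=0.6):
    # Compare only names that fall into the same block, instead of all n1*n2 pairs.
    blocks = defaultdict(list)
    for b in list_b:
        blocks[block_key(b)].append(b)
    pairs = []
    for a in list_a:
        for b in blocks.get(block_key(a), []):
            score = compare(a, b)
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs  # feed these scored edges into a bipartite matcher

students = ["Donald Knuth", "Ada Lovelace"]
customers = ["Don Knuth", "A. Lovelace", "Alan Turing"]
print(candidate_pairs(students, customers))
Note that this blocking scheme would indeed miss a "Nuth"/"Knuth" pair, which is exactly the caveat above.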
