I want to perform an undirected traversal to extract all ids connected through a certain type of relationship
When I perform the following query, it returns the values fast enough:
MATCH path=(s:Node {entry:"a"})-[:RelType*1..10]-(x:Node)
RETURN collect(distinct ID(x))
However, doing
MATCH path=(s:Node {entry:"a"})-[:RelType*]-(x:Node)
RETURN collect(distinct ID(x))
takes a huge amount of time. I suspect that by using * it searches every path from s to x, but since I only want the ids, these paths can be discarded. What I really want is a BFS or DFS search to find the nodes connected to s.
Both queries return the exact same result, since no element has a shortest path longer than 5 (only in the test example!).
Did you add an index, i.e. create index on :Node(entry)?
Also, depending on the number of rels per node in your path, you get rels^10 (or in general rels^steps) paths through your graph that are potentially returned.
Can you try first with a smaller upper limit like 3 and work from there?
Also leaving off the direction really hurts as you then get cycles.
What you can also try to do is:
MATCH path=(s:Node {entry:"a"})-[:RelType*]->(x:Node)
RETURN ID(x)
and stream the results, doing the uniqueness filtering in the client (see the sketch below)
Or this if you don't want to do uniqueness in the client
MATCH path=(s:Node {entry:"a"})-[:RelType*]->(x:Node)
RETURN DISTINCT ID(x)
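For example, with the official Neo4j Python driver, streaming and de-duplicating client-side might look like the following. This is a minimal sketch; the bolt URI, the credentials, and the $entry query parameter are assumptions about your setup.

from neo4j import GraphDatabase

# Hypothetical connection details; adjust the URI and credentials for your setup.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (s:Node {entry: $entry})-[:RelType*]->(x:Node)
RETURN ID(x) AS id
"""

seen = set()
with driver.session() as session:
    # Records are streamed one at a time; uniqueness is handled client-side.
    for record in session.run(query, entry="a"):
        seen.add(record["id"])

print(len(seen), "distinct connected nodes")
driver.close()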
I want to find an efficient data structure that can handle the following use case.
I can add new elements to this data structure, e.g.
I call the add() API, e.g. add([2,3,4,5,3]), and the data structure then stores [2,3,3,4,5]. I can also query some target and get back how many stored numbers are smaller than that target, e.g. query(4) returns 3 (since there are one 2 and two 3s). The frequencies of calling add and query are of the same order.
Firstly, I thought of a segment tree; however, the input numbers can be any int value, so the space would be O(2^32).
Could you give me some advice about which data structure I should use?
You can do this using an order statistic tree, which is a kind of binary search tree where each node also stores the cardinality of its own subtree. Inserting into an order statistic tree still takes O(log n) time, because it's a binary search tree, although the insert operation is a little more complicated because it has to keep the cardinalities of each node up-to-date.
Computing the number of members less than a given target also takes O(log n) time; start at the root node:
If the target is less than or equal to the root node's value, then recurse on the left subtree.
Otherwise, return the left subtree's cardinality, plus 1 for the current node, plus the result of recursing on the right subtree.
The base case is that you always return 0 for an empty subtree.
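For illustration, here is a minimal Python sketch of the idea, using a plain (unbalanced) BST that tracks subtree cardinalities; a real order statistic tree would add self-balancing rotations (updating the sizes during each rotation) to guarantee the O(log n) bounds:

class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.size = 1                      # cardinality of the subtree rooted here

def insert(node, value):
    # Standard BST insert, keeping subtree sizes up to date on the way down.
    if node is None:
        return Node(value)
    node.size += 1
    if value < node.value:
        node.left = insert(node.left, value)
    else:
        node.right = insert(node.right, value)
    return node

def count_less(node, target):
    # Number of stored values strictly less than target.
    if node is None:
        return 0                           # base case: empty subtree
    if target <= node.value:
        return count_less(node.left, target)
    left_size = node.left.size if node.left else 0
    return left_size + 1 + count_less(node.right, target)   # +1 for this node

root = None
for v in [2, 3, 4, 5, 3]:                  # add([2,3,4,5,3])
    root = insert(root, v)
print(count_less(root, 4))                 # query(4) -> 3 (one 2 and two 3s)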
I have created a very large directed, weighted graph, and I'm trying to find the widest path between two points.
Each edge has a count property.
Here is a small portion of the graph: [image of the sample subgraph omitted]
I have found this example and modified the query so that the path collection is directed, like so:
MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH p, EXTRACT(c IN RELATIONSHIPS(p) | c.count) AS counts
UNWIND(counts) AS b
WITH p, MIN(b) AS count
ORDER BY count DESC
RETURN NODES(p) AS `Widest Path`, count
LIMIT 1
This query seems to require an enormous amount of memory, and fails even on partial data.
Update: for clarification, the query keeps running until it runs out of memory.
I've found this link, which combines the use of Spark and Neo4j. Unfortunately, Mazerunner for Neo4j does not support the "widest path" algorithm out of the box. What would be the right approach for running the "widest path" query on a very large graph?
The reason your algorithm is taking a long time to run is that (a) you have a big graph, (b) your memory parameters probably need tweaking (see comments), and (c) you're enumerating every possible path between ENTRY and EXIT. Depending on how your graph is structured, this could be a huge number of paths.
Note that if you're looking for the widest path, a path's width is the smallest weight on any of its edges, and you want the path whose smallest edge weight is as large as possible. This means that you're probably computing and re-computing many paths you can ignore.
Wikipedia has good information on this algorithm you should consider. In particular:
It is possible to find maximum-capacity paths and minimax paths with a
single source and single destination very efficiently even in models
of computation that allow only comparisons of the input graph's edge
weights and not arithmetic on them.[12][18] The algorithm maintains a
set S of edges that are known to contain the bottleneck edge of the
optimal path; initially, S is just the set of all m edges of the
graph. At each iteration of the algorithm, it splits S into an ordered
sequence of subsets S1, S2, ... of approximately equal size; the
number of subsets in this partition is chosen in such a way that all
of the split points between subsets can be found by repeated
median-finding in time O(m). The algorithm then reweights each edge of
the graph by the index of the subset containing the edge, and uses the
modified Dijkstra algorithm on the reweighted graph; based on the
results of this computation, it can determine in linear time which of
the subsets contains the bottleneck edge weight. It then replaces S by
the subset Si that it has determined to contain the bottleneck weight,
and starts the next iteration with this new set S. The number of
subsets into which S can be split increases exponentially with each
step, so the number of iterations is proportional to the iterated
logarithm function, O(log* n), and the total time is O(m log* n).[18] In
a model of computation where each edge weight is a machine integer,
the use of repeated bisection in this algorithm can be replaced by a
list-splitting technique of Han & Thorup (2002), allowing S to be
split into O(√m) smaller sets Si in a single step and leading to a
linear overall time bound.
You should consider implementing this approach with cypher rather than your current "enumerate all paths" approach, as the "enumerate all paths" approach has you re-checking the same edge counts for as many paths as there are that involve that particular edge.
There's no ready-made software that will just do this for you; I'd recommend taking that paragraph (and checking its citations for further information) and then implementing it. I think performance-wise you can do much better than your current query.
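If you do end up implementing this outside of Cypher, a simpler starting point than the full algorithm above is the "modified Dijkstra" it refers to: a max-bottleneck variant of Dijkstra that runs in O(m log n). Here is a minimal Python sketch, assuming you have exported the graph as an adjacency list of (neighbor, count) pairs keyed by node name:

import heapq
from math import inf

def widest_path(graph, source, target):
    # graph: dict mapping node -> list of (neighbor, count) tuples.
    # Returns (bottleneck, path) or (0, None) if target is unreachable.
    prev = {}                        # predecessor pointers to rebuild the path
    best = {source: inf}             # best bottleneck found so far per node
    heap = [(-inf, source)]          # negate widths so heapq acts as a max-heap
    done = set()
    while heap:
        neg_width, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            path = [u]
            while path[-1] != source:
                path.append(prev[path[-1]])
            return -neg_width, list(reversed(path))
        for v, count in graph.get(u, ()):
            cand = min(-neg_width, count)      # bottleneck of the extended path
            if cand > best.get(v, 0):
                best[v] = cand
                prev[v] = u
                heapq.heappush(heap, (-cand, v))
    return 0, None

graph = {"ENTRY": [("A", 5), ("B", 3)], "A": [("EXIT", 2)], "B": [("EXIT", 4)]}
print(widest_path(graph, "ENTRY", "EXIT"))     # -> (3, ['ENTRY', 'B', 'EXIT'])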
Some thoughts.
Your query (and the original example query) can be simplified. This may or may not be sufficient to prevent your memory issue.
For each matched path, there is no need to: (a) create a collection of counts, (b) UNWIND it into rows, and then (c) perform a MIN aggregation. The same result could be obtained by using the REDUCE function instead:
MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH p, REDUCE(m = 2147483647, c IN RELATIONSHIPS(p) | CASE WHEN c.count < m THEN c.count ELSE m END) AS count
ORDER BY count DESC
RETURN NODES(p) AS `Widest Path`, count
LIMIT 1;
(I assume that the count property value is an int. 2147483647 is the max int value.)
You should create an index (or, perhaps more appropriately, a uniqueness constraint) on the name property of the Vertex label. For example:
CREATE INDEX ON :Vertex(name)
EDITED
This enhanced version of the above query might solve your memory problem:
MERGE (t:Temp) SET t.count = 0, t.widest_path = NULL
WITH t
OPTIONAL MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH t, p, REDUCE(m = 2147483647, c IN RELATIONSHIPS(p) | CASE WHEN c.count < m THEN c.count ELSE m END) AS count
WHERE count > t.count
SET t.count = count, t.widest_path = NODES(p)
WITH COLLECT(DISTINCT t)[0] AS t
WITH t, t.count AS count, t.widest_path AS `Widest Path`
DELETE t
RETURN `Widest Path`, count;
It creates (and ultimately deletes) a temporary :Temp node to keep track of the currently "winning" count and the corresponding path nodes. (You must make sure that the label Temp is not otherwise used.)
The WITH clause starting with COLLECT(DISTINCT t) uses aggregation of distinct :Temp nodes (of which there is only 1) to ensure that Cypher only keeps a single reference to the :Temp node, no matter how many paths satisfy the WHERE clause. Also, that WITH clause does NOT include p, so that Cypher does not accumulate paths that we do not care about. It is this clause that might be the most important in helping to avoid your memory issues.
I have not tried this out.
My problem is the following. I have a small but dense network in Neo4j (~280 nodes, ~3600 relationships). There is only one type of node and one type of edge (i.e. a single label for each). Now, I'd like to specify two distinct groups of nodes, given by values of their "group" property, and match the subgraph consisting of all paths up to a certain length connecting the two groups. In addition, I would like to add constraints on the relationships. So, at the moment I have this:
MATCH (n1) WHERE n1.group={group1}
MATCH (n2) WHERE n2.group={group2}
MATCH p=(n1)-[r*1..3]-(n2)
WHERE ALL(c IN r WHERE c.weight > {w})
AND ALL(n in NODES(p) WHERE 1=length(filter(m in NODES(p) WHERE m=n)))
WITH DISTINCT r AS dr, NODES(p) AS ns
UNWIND dr AS udr UNWIND ns AS uns
RETURN COLLECT(DISTINCT udr), COLLECT(DISTINCT uns)
which achieves what I want but in some cases seems to be too slow. Here the WHERE statement filters out paths with relationships whose weight property is below a threshold as well as those containing cycles.
The last three lines have to do with the desired output format. Given the matching subgraph (paths), I want all unique relationships in one list, and all unique nodes in another (for visualization with d3.js). The only way I found to do this is to UNWIND all elements and then COLLECT them as DISTINCT.
Also note that the group properties and the weight limit are passed in as query parameters.
Now, is there any way to achieve the same result faster? E.g., with paths up to a length of 3 the query takes about 5-10 seconds on my local machine (depending on the connectedness of the chosen node groups), and returns on the order of ~50 nodes and a few hundred relationships. This seems to be within reach of acceptable performance. Paths up to length 4, however, are already prohibitive (several minutes, or the query never returns).
Bonus question: is there any way to specify the upper limit on path length as a parameter? Or does a different limit imply a totally different query plan?
This probably won't work at all, but it might give you something to play with. I tried changing a few things that may or may not work.
MATCH (n1) WHERE n1.group={group1}
MATCH (n2) WHERE n2.group={group2}
MATCH p=(n1)-[r*1..3]-(n2)
WHERE r.weight > {w}
WITH n1, NODES(p) AS ns, n2, DISTINCT r AS dr
WHERE length(ns) = 1
UNWIND dr AS udr UNWIND ns AS uns
RETURN COLLECT(DISTINCT udr), COLLECT(DISTINCT uns)
We need to maintain mobile numbers and their locations in memory.
The challenge is that we have more than 5 million users,
and storing the location for each user individually would mean a hash map of 5 million records.
To resolve this problem, we have to work with ranges.
We are given ranges of phone numbers like
range1 start="9899123446" end="9912345678" location="a"
range2 start="9912345679" end="9999999999" location="b"
A number can belong to single location only.
We need a data structure to store these ranges in the memory.
It has to support two functions
findLocation(Integer number): it should return the name of the location to which the number belongs.
changeLocation(Integer number, String location): it changes the location of the number from its old location to the new one.
This is a completely in-memory design.
I am planning to use a tree structure where each node contains (startOfRange, endOfRange, location).
I will keep the nodes in sorted order. I have not finalized anything yet.
The main problem is: when the second function is called to change the location of, say, 9899123448 to b,
the range1 node should split into 3 nodes: 1st node (9899123446, 9899123447, a),
2nd node (9899123448, 9899123448, b), 3rd node (9899123449, 9912345678, a).
Please suggest a suitable approach.
Thanks in advance.
Normally you could use a specialized data structure to store ranges and implement the queries, e.g. an interval tree.
However, since phone number ranges do not overlap, you can just store the ranges in a standard tree-based data structure (a binary search tree, AVL tree, red-black tree, or B-tree would all work), sorted only by [begin].
For findLocation(number), use the corresponding tree search algorithm to find the element with the largest [begin] value that is not greater than the number (a predecessor search), check its [end] value, and verify that the number is in that range. If a match is found, return the location; otherwise the number is not in any range.
For the changeLocation() operation:
Find the old node containing the number
If an existing node is found in step 1, delete it and insert new nodes (splitting the old range around the changed number, as in your example)
If no existing node is found, insert a new node and try to merge it with adjacent nodes.
I am assuming you are using the same operation for simply adding new nodes.
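Here is a minimal in-memory Python sketch of this approach. It uses the bisect module over a sorted list of (begin, end, location) tuples as a stand-in for the balanced tree search (the class and method names are just illustrative), and the changeLocation split mirrors the example from the question:

import bisect

class RangeLocations:
    def __init__(self, ranges):
        # ranges: iterable of (begin, end, location) tuples, assumed non-overlapping
        self.ranges = sorted(ranges)

    def _find_index(self, number):
        # Predecessor search: rightmost range whose begin is <= number.
        i = bisect.bisect_right(self.ranges, (number, float("inf"), "")) - 1
        if i >= 0 and self.ranges[i][1] >= number:
            return i
        return -1

    def find_location(self, number):
        i = self._find_index(number)
        return self.ranges[i][2] if i >= 0 else None

    def change_location(self, number, new_location):
        i = self._find_index(number)
        if i < 0:
            return
        begin, end, old_location = self.ranges[i]
        if old_location == new_location:
            return
        # Split the old range into up to three pieces around the changed number.
        pieces = []
        if begin <= number - 1:
            pieces.append((begin, number - 1, old_location))
        pieces.append((number, number, new_location))
        if number + 1 <= end:
            pieces.append((number + 1, end, old_location))
        self.ranges[i:i + 1] = pieces          # the list stays sorted by begin

ranges = RangeLocations([(9899123446, 9912345678, "a"),
                         (9912345679, 9999999999, "b")])
ranges.change_location(9899123448, "b")
print(ranges.find_location(9899123448))       # -> b
print(ranges.find_location(9899123450))       # -> a

(Merging adjacent ranges that end up with the same location, as in step 3, is left out to keep the sketch short.)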
More practically, you can store all the entries in a database and build an index on [begin].
First of all, a range = [begin; end; location].
Use two structures:
A sorted array to store the ranges' begin values
A hash table to access the end and location by begin
Apply the following algorithm:
Use binary search to find the nearest begin value that is not greater than the number
Use the hash table to find the end and location for that begin, and check that the number does not exceed the end
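A tiny Python sketch of this two-structure layout, using the illustrative values from the question:

import bisect

begins = [9899123446, 9912345679]               # sorted array of range begins
info = {9899123446: (9912345678, "a"),           # hash table: begin -> (end, location)
        9912345679: (9999999999, "b")}

def find_location(number):
    i = bisect.bisect_right(begins, number) - 1  # nearest begin <= number
    if i < 0:
        return None
    end, location = info[begins[i]]
    return location if number <= end else None

print(find_location(9912345680))                 # -> b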
Apparently, when a binary search tree needs to support duplicate keys, you could either keep all values for a key together in a single node (e.g. in a collection) or store each duplicate as its own node, and the former is more common.
Why would you choose the latter, and how does it work?
I read this: http://www.drdobbs.com/cpp/stls-red-black-trees/184410531, which made me think that they did it. It says:
insert_always is a status variable that tells rb_tree whether multiple instances of the same key value are allowed. This variable is set by the constructor and is used by the STL to distinguish between set and multiset and between map and multimap. set and map can only have one occurrence of a particular key, whereas multiset and multimap can have multiple occurrences.
Although now I think it doesn't necessarily mean that; they might still be using containers.
I'm thinking all the nodes with the same key would have to be in a row, because you have to store equal keys consistently on either the right side or the left side. So if you store equal nodes to the right and insert 1000 1s and one 2, you'd basically have a linked list, which would ruin the properties of the red-black tree.
Is the reason I can't find much on it that it's just a bad idea?
Downsides of storing duplicates as multiple nodes:
It expands the tree size, which makes searches slower.
If you want to retrieve all values for key K, you need O(M * log N) time, where N is the total number of nodes and M is the number of values for key K, unless you introduce extra code (which complicates the data structure) to link these values together. (If you store a collection per key, finding it only takes O(log N), and it's simple to implement.)
Deletion is more costly: with the multi-node method you need to remove a node on every delete, but with collection storage you only need to remove the node for key K when the last value of key K is deleted.
I can't think of any good side of the multi-node method.
Binary search trees by definition cannot contain duplicates. If you use one to produce a sorted list, throwing out the duplicates would produce an incorrect result.
I was working on an implementation of red-black trees in PHP when I ran into the duplicate issue. We are going to use the tree for sorting and searching.
I am considering adding an occurrence count to the node data type. When a duplicate is encountered, just increment the count. When walking the tree to produce output, just repeat the value by the number of occurrences. I think I would still have a valid BST, and I would avoid a whole chain of duplicate values, which preserves the optimal search time.
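As a rough Python sketch of that idea (the red-black rebalancing is omitted; only the occurrence-count bookkeeping and the repeated output are shown):

class Node:
    def __init__(self, key):
        self.key = key
        self.occurrences = 1                   # how many times this key was inserted
        self.left = None
        self.right = None
        # a real red-black node would also carry a color bit

def insert(node, key):
    if node is None:
        return Node(key)
    if key == node.key:
        node.occurrences += 1                  # duplicate: bump the counter, no new node
    elif key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    return node

def in_order(node, out):
    # Walk the tree, repeating each key by its occurrence count.
    if node is None:
        return out
    in_order(node.left, out)
    out.extend([node.key] * node.occurrences)
    in_order(node.right, out)
    return out

root = None
for k in [5, 1, 1, 1, 2, 5]:
    root = insert(root, k)
print(in_order(root, []))                      # -> [1, 1, 1, 2, 5, 5]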