Neo4j Graph depth traversal Cypher - performance

I'm using neo4j as a graph database, and from a starting node I want to return that node's neighbors and all related neighbors, to a depth varying from 1 to 3. I'm doing this, but it gets stuck:
Note that it is a large graph.
start n = node(*) where n.NID contains "9606.ENS3"
MATCH (n)-[Rel1*1..3]-(m) RETURN m;
Does anyone have a clue how to do traversals on a graph and actually get a result?

Your question uses old Cypher syntax. The docs say this about the START clause:
The START clause should only be used when accessing legacy indexes. In
all other cases, use MATCH instead (see Section 3.3.1, “MATCH”).
I believe this should work:
MATCH (n)-[Rel1*1..3]->(m)
WHERE n.NID CONTAINS "9606.ENS3"
RETURN m

Related

Search for node in unorganized binary tree?

This is a conceptual question. I have a tree where the data is stored as strings, but not in alphabetical order. How do I search through the entire tree to find the node with the string I'm looking for? So far I can only search through one side of the tree.
Here is what you can do:
1. Traverse the tree in any manner, say `DFS` or `BFS`.
2. While traversing, check whether the current node's value is equal to the key string you are searching for.
2.1. Compare each character of your search string with each character of the current node's value.
2.2. If a match is found, process your result.
2.3. If not, continue with point 2.
3. If all the nodes are exhausted, there is no match; stop the algorithm.
The complexity of the above algorithm is:
O(N) * O(M) = O(N*M)
N - the number of nodes in your tree.
M - the length of a node's value plus the length of your search key (the cost of one comparison).
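A minimal Python sketch of this traverse-and-compare search, assuming a simple binary-tree node with `value`, `left` and `right` attributes (names chosen for illustration):

from collections import deque

def find_node(root, key):
    # BFS over an unordered binary tree; return the first node whose value equals key.
    if root is None:
        return None
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.value == key:      # string equality compares character by character
            return node
        if node.left:
            queue.append(node.left)
        if node.right:
            queue.append(node.right)
    return None                    # all nodes exhausted, no match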
You may iterate through all tree levels and check every node on each level; the depth of the tree equals the number of iterations.
You may recursively go down each branch and stop all iterations when the node is found (using an external variable or flag), or when there are no child nodes left.

Neo4j traversal performance

I want to perform an undirected traversal to extract all the ids connected through a certain type of relationship.
When I perform the following query, it returns the values fast enough:
MATCH path=(s:Node {entry:"a"})-[:RelType*1..10]-(x:Node)
RETURN collect(distinct ID(x))
However, doing
MATCH path=(s:Node {entry:"a"})-[:RelType*]-(x:Node)
RETURN collect(distinct ID(x))
takes a huge amount of time. I suspect that by using * it searches every path from s to x, but since I only want the ids, these paths could be discarded. What I really want is a BFS or DFS search that finds the nodes connected to s.
Both queries return exactly the same result, since there are no elements whose shortest path is longer than 5 (only in the test example!).
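For reference, here is the BFS the question asks for, sketched in Python over a plain adjacency map; the names are illustrative and this is not Neo4j's traversal API:

from collections import deque

def connected_ids(adjacency, start):
    # adjacency: dict mapping node id -> iterable of neighbour ids (undirected).
    # Returns the set of ids reachable from start, visiting each node only once.
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:   # each node is expanded once, no matter how many paths reach it
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}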
Did you add an index with `create index on :Node(entry)`?
Also, depending on the number of relationships per node along your path, you potentially get rels^10 (or, in general, rels^steps) paths through your graph returned.
Can you first try a smaller upper limit, like 3, and work up from there?
Also, leaving off the direction really hurts, as you then get cycles.
What you can also try to do is:
MATCH path=(s:Node {entry:"a"})-[:RelType*]->(x:Node)
RETURN ID(x)
and stream the results, doing the uniqueness check in the client.
Or use this if you don't want to do the uniqueness check in the client:
MATCH path=(s:Node {entry:"a"})-[:RelType*]->(x:Node)
RETURN distinct ID(x)

Getting all relationships among neighbors of a node

I have an embedded graph db of nodes (twitter users) and directed edges (follows).
I'm trying to get all relationships among the users (set A) who are followed by a specified user (node U), as well as the relationships between the nodes in A and the specified node U.
My query:
START u=node:user_id(user_id={id_of_U})
MATCH p = u-->following, p2= following-[?]->u, p3 = following-[?]->()<--u
RETURN distinct rels(p),rels(p2),rels(p3)
This query gives me what I expect, but the problem is that it takes a very long time when the specified user follows many users.
I tried lots of queries, and the query above is the best one so far. Yet I'm sure there are more efficient ways to do this, because when I get those relationships in a Java method by walking through all users in A, getting all relationships for each of them (Direction.BOTH), and then filtering out relationships whose start or end node does not belong to A, it takes just 8 seconds for a user following 500 people, whereas the Cypher query can't even fail without blowing up my heap...
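Roughly, that client-side filtering looks like the following Python sketch; `get_followed` and `get_relationships` are hypothetical stand-ins for whatever the embedded API exposes, not a specific Neo4j API:

def relationships_within(u, get_followed, get_relationships):
    # get_followed(node)      -> iterable of nodes followed by node    (hypothetical helper)
    # get_relationships(node) -> iterable of (start, end, rel) triples (hypothetical helper)
    a = set(get_followed(u))      # set A: users followed by u
    allowed = a | {u}
    result = set()
    for node in a:
        for start, end, rel in get_relationships(node):   # both directions
            if start in allowed and end in allowed:        # keep only edges whose endpoints are in A or are u
                result.add(rel)
    return result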
Can you try this one?
start u=node:user_id(user_id={id_of_U})
MATCH u-[r]->following
with u, r, following
match following-[r2?]->u, following-[r3?]->()<-[r4]-u
RETURN distinct r, r2, r3, r4
Also, are you using the latest 1.9?
Starting with p = u-->following is not optimal, since it picks up all related nodes and only later tries to filter them. I'd suggest picking up fewer nodes first and then expanding this set a little:
START u=node:user_id(user_id={id_of_U})
MATCH u-[:FOLLOWS]->following
WITH u,following
MATCH u-[r]-following
RETURN distinct r;
This will give you all the relationships between the nodes in set A, who are also followed by node U.
In case you don't have a FOLLOWS relationship in your graph: you should, otherwise your graph design isn't optimal. I noticed you are not using any specific relationship type in your query; this is only optimal if you have just one relationship type in your data, and as far as I understand your question, you have more than one.
edit:
START u=node:user_id(user_id={id_of_U})
MATCH u-[]-following
WITH u, following
MATCH u-[r]-again, again-[r2]-following
RETURN r, r2

Where to find balanced tree implementation?

I was wondering whether anyone knows of somewhere on the Internet where I can copy and paste (for learning purposes) an implementation of a binary search tree that implements both balancing and an algorithm for counting how many nodes are less than a particular value. I want to be able to query this data structure and ask, "How many nodes are < x in this data structure?" The whole purpose is to answer this latter type of query, but the balancing is important too, because I want to be able to handle large, unbalanced sequences of entries.
I prefer implementations in Python, C, C++, or Java, but others are welcome.
If it's still relevant, you could maintain in each node the size of the subtree, like in
http://www.codeproject.com/script/Articles/ViewDownloads.aspx?aid=12347&zep=HeightBalancedTree.h&rzp=%2fKB%2farchitecture%2favl_cpp%2f%2favl_cpp.zip
Notice the FindComparable function. If the object is found, it returns its index (the number of nodes that are smaller than the searched object).
In case the searched object is not in the tree, you want to get the index of the minimal node that is larger than the searched object.
Notice what happens to the index when the object is not found.
The last Node::GetSize(Node::GetRight(node)) or Node::GetSize(Node::GetLeft(node)) will evaluate to 0, so you have two cases:
If the last turn was right (cmp > 0), you get the index of the maximal node that is smaller than the searched object, plus one, which is exactly the value you want.
If the last turn was left (cmp < 0), you get the index of the minimal node that is larger than the searched object, minus one.
Therefore, instead of returning Size() when the object is not found, you could return:
index + ((cmp < 0) ? 1 : 0);
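To illustrate the core idea, here is a minimal Python sketch of the "how many nodes are < x" query on a size-augmented BST; balancing is omitted and the node layout is an assumption, not the code from the linked article:

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1                 # number of nodes in the subtree rooted here

def size(node):
    return node.size if node else 0

def count_less_than(root, x):
    # Return how many keys in the tree are < x.
    count = 0
    node = root
    while node:
        if x <= node.key:
            node = node.left                  # this node and its right subtree are >= x
        else:
            count += size(node.left) + 1      # left subtree and this node are all < x
            node = node.right
    return count

Each insert, delete and rotation has to keep the size fields up to date for this to stay O(log n) per query.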

Finding the width of a directed acyclic graph... with only the ability to find parents

I'm trying to find the width of a directed acyclic graph... as represented by an arbitrarily ordered list of nodes, without even an adjacency list.
The graph/list is for a parallel GNU Make-like workflow manager that uses files as its criteria for execution order. Each node has a list of source files and target files. We have a hash table in place so that, given a file name, the node which produces it can be determined. In this way, we can figure out a node's parents by examining the nodes which generate each of its source files using this table.
That is the ONLY ability I have at this point, without changing the code severely. The code has been in public use for a while, and the last thing we want to do is to change the structure significantly and have a bad release. And no, we don't have time to test rigorously (I am in an academic environment). Ideally we're hoping we can do this without doing anything more dangerous than adding fields to the node.
I'll be posting a community-wiki answer outlining my current approach and its flaws. If anyone wants to edit that, or use it as a starting point, feel free. If there's anything I can do to clarify things, I can answer questions or post code if needed.
Thanks!
EDIT: For anyone who cares, this will be in C. Yes, I know my pseudocode is in some horribly botched Python look-alike. I'm sort of hoping the language doesn't really matter.
I think the "width" you're considering here isn't really what you want - the width depends on how you assign levels to each node where you have some choice. You noticed this when you were deciding whether to assign all sources to level 0 or all sinks to the max level.
Instead, you just want to count the number of nodes and divide by the "critical path length", which is the longest path in the dag. This gives the average parallelism for the graph. It depends only on the graph itself, and it still gives you an indication of how wide the graph is.
To compute the critical path length, just do what you're doing - the critical path length is the maximum level you end up assigning.
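As a tiny illustration of that definition (the numbers are made up for the example):

def average_parallelism(node_count, critical_path_length):
    # Average width of the DAG: total work divided by the longest chain of dependent nodes.
    return node_count / critical_path_length

print(average_parallelism(120, 8))   # e.g. 120 nodes, longest path of 8 nodes -> 15.0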
In my opinion, when you're doing this type of last-minute development, it's best to keep the new structures separate from the ones you are already using. At this point, if I were pressed for time, I would go for a simpler solution.
Create an adjacency matrix for the graph using the parent data (should be easy).
Perform a topological sort using this matrix (or even use tsort if pressed for time).
Now that you have a topological sort, create an array level, one element for each node.
For each node:
If the node has no parents, set its level to 0.
Otherwise, set its level to the maximum of its parents' levels, plus 1.
Count the nodes at each level; the largest count is the width.
The question is, as Keith Randall asked: is this the right measurement for what you need?
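A compact Python sketch of those steps, building child lists from the parent-only data (in place of a full adjacency matrix) and assuming each node exposes a parents list; the names are illustrative:

from collections import deque, defaultdict

def level_width(nodes):
    # nodes: list of objects with a .parents list. Returns (width, level map).
    children = defaultdict(list)
    indegree = {}
    for n in nodes:
        indegree[n] = len(n.parents)
        for p in n.parents:
            children[p].append(n)

    # Kahn-style topological sort, assigning level = 1 + max(parent levels).
    level = {n: 0 for n in nodes}
    queue = deque(n for n in nodes if indegree[n] == 0)
    while queue:
        n = queue.popleft()
        for c in children[n]:
            level[c] = max(level[c], level[n] + 1)
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)

    # Width = largest number of nodes sharing a level.
    counts = defaultdict(int)
    for n in nodes:
        counts[level[n]] += 1
    return max(counts.values()), level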
Here's what I (Platinum Azure, the original author) have so far.
Preparations/augmentations:
Add "children" field to linked list ("DAG") node
Add "level" field to "DAG" node
Add "children_left" field to "DAG" node. This is used to make sure that all children are examined before a parent is examined (in a later stage of the algorithm).
Algorithm:
Find the number of immediate children for all nodes; also, determine the leaves by adding nodes with children == 0 to a list.
for l in L:
    l.children = 0
for l in L:
    l.level = 0
    for p in l.parents:
        p.children += 1
Leaves = []
for l in L:
    l.children_left = l.children
    if l.children == 0:
        Leaves.append(l)
Assign every node a "reverse depth" level. Normally, by depth I mean: topologically sort and assign depth=0 to nodes with no parents. However, I'm thinking I need to reverse this, with depth=0 corresponding to leaves. Also, we want to make sure that no node is added to the queue until all of its children have "looked at it" first (to determine its proper "depth level").
max_level = 0
while Leaves:
    l = Leaves.pop()
    for p in l.parents:
        p.level = max(p.level, l.level + 1)
        if p.level > max_level:
            max_level = p.level
        p.children_left -= 1
        if p.children_left == 0:
            # we only want to append parents whose level is for sure correct
            Leaves.append(p)
Now that every node has a level, simply create an array and then go through the list once more to count the number of nodes in each level.
level_count = [0] * (max_level + 1)
for l in L:
    level_count[l.level] += 1
width = max(level_count)
So that's what I'm thinking so far. Is there a way to improve on it? It's linear time all the way, but it has five or six linear scans, and there will probably be a lot of cache misses and the like. I have to wonder if there isn't a way to exploit some locality with a better data structure, without actually changing the underlying code beyond node augmentation.
Any thoughts?

Resources