I have an embedded graph DB of nodes (Twitter users) and directed edges (follows).
I'm trying to get all relationships among the users (Set A) who are followed by a specified user (Node U), as well as the relationships between the nodes in A and U itself.
My query:
START u=node:user_id(user_id={id_of_U})
MATCH p = u-->following, p2 = following-[?]->u, p3 = following-[?]->()<--u
RETURN distinct rels(p),rels(p2),rels(p3)
This query gives me what I expect, but the problem is that it takes far too long when the specified user follows many users.
I tried lots of queries, and the query above is the best one so far. Yet I'm sure there are more efficient ways to do this, because when I get those relationships in a Java method, by walking through all users in "A", getting all relationships for each of them (Direction.BOTH), and then filtering out the relationships whose start or end node does not belong to "A", it takes just 8 seconds for a user following 500 people, whereas the Cypher query blows up my heap before it can even finish...
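For reference, a simplified sketch of that Java method against the 1.9-era embedded API (identifiers here are illustrative, not my actual code):

import java.util.HashSet;
import java.util.Set;
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;

class FollowFilter {
    // setA = the nodes directly followed by u.
    // (On Neo4j 2.0+ this would need to run inside a transaction.)
    static Set<Relationship> relationshipsWithin(Node u, Set<Node> setA) {
        Set<Node> members = new HashSet<>(setA);
        members.add(u); // keep the U<->A relationships as well
        Set<Relationship> result = new HashSet<>();
        for (Node n : setA) {
            for (Relationship rel : n.getRelationships(Direction.BOTH)) {
                // drop relationships whose start or end node is outside the set
                if (members.contains(rel.getStartNode())
                        && members.contains(rel.getEndNode())) {
                    result.add(rel);
                }
            }
        }
        return result;
    }
}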
Can you try this one?
start u=node:user_id(user_id={id_of_U})
MATCH u-[r]->following
with u, r, following
match following-[r2?]->u, following-[r3?]->()<-[r4]-u
RETURN distinct r, r2, r3, r4
Also, are you using the latest 1.9?
Starting with p = u-->following is not optimal, since it pulls in all related nodes and only filters on them later. I'd suggest picking up fewer nodes first and then expanding that set a little:
START u=node:user_id(user_id={id_of_U})
MATCH u-[:FOLLOWS]->following
WITH u,following
MATCH u-[r]-following
RETURN distinct r;
This will give you all the relationships between the nodes in set A, i.e. those followed by node U.
In case you don't have a FOLLOWS relationship type in your graph: you should, otherwise your graph design isn't optimal. I noticed you are not using any specific relationship type in your query; that is only optimal if you have just one relationship type in your data, and as far as I understand your question, you have more than one.
Edit: to get both the relationships between U and its neighbors and those among the neighbors themselves:
START u=node:user_id(user_id={id_of_U})
MATCH u-[]-following
WITH u, following
MATCH u-[r]-again, again-[r2]-following
RETURN r, r2
We're given two strings that act as search queries. We need to determine if they're the same.
For example:
Query 1: stock price rate
Query 2: share cost rate
We're also given a list where each entry has two words that are synonyms. Words can be repeated across entries, meaning a transitive relation exists. Something like this:
[
[cost,price]
[rate,price]
[share,equity]
]
The goal is to determine whether the queries mean the same thing.
I proposed a solution where I group words with similar meanings into lists, do an exhaustive search until we find the word from query 1, and then search its group for the word from query 2. But the interviewer wanted a more efficient approach, which I couldn't figure out. Is there a more efficient way to solve this?
Here is a solution that can tell whether two queries are similar in near-constant time (O(size of the queries)), after precomputation in O(number of words in the database).
Precomputation: we assume you have a list L of synonym lists, where each inner list is already a complete synonym class (the edit below covers building such classes from raw pairs).
function build_hashmap(L):
    H <- new Hashmap()
    i <- 0
    for each synonyms_list in L do:
        for each word in synonyms_list do:
            H[word] <- i
        i <- i + 1
    return H
Now we can test whether two words are synonyms using H:
function is_synonym(w1, w2, H):
    if H[w1] == H[w2]:
        return true
    else:
        return false
From there it should be rather easy to tell if two queries have the same meaning.
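To make that last step concrete, here is a compact Java version of the two functions plus a position-by-position query check (a sketch with my own naming; it assumes queries are compared word by word, in order, and treats a word absent from H as synonymous only with itself):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SynonymQueries {
    // build_hashmap: map every word to the id of its synonym class.
    static Map<String, Integer> buildHashmap(List<List<String>> synonymLists) {
        Map<String, Integer> h = new HashMap<>();
        int i = 0;
        for (List<String> synonyms : synonymLists) {
            for (String word : synonyms) {
                h.put(word, i);
            }
            i++;
        }
        return h;
    }

    // is_synonym, with a guard for words that are not in H.
    static boolean isSynonym(String w1, String w2, Map<String, Integer> h) {
        if (w1.equals(w2)) return true;
        Integer c1 = h.get(w1), c2 = h.get(w2);
        return c1 != null && c1.equals(c2);
    }

    // Two queries mean the same thing if they have the same number of
    // words and the words at each position are synonyms.
    static boolean sameMeaning(String[] q1, String[] q2, Map<String, Integer> h) {
        if (q1.length != q2.length) return false;
        for (int i = 0; i < q1.length; i++) {
            if (!isSynonym(q1[i], q2[i], h)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // classes already merged transitively: {cost, price, rate}, {share, equity}
        Map<String, Integer> h = buildHashmap(List.of(
                List.of("cost", "price", "rate"),
                List.of("share", "equity")));
        System.out.println(sameMeaning("price rate".split(" "),
                                       "cost rate".split(" "), h)); // true
    }
}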
Edit:
A fast solution could be to implement the union-find algorithm in order to build the hashmap.
Another way would be to first model the words as vertices of a graph and add edges for the synonymy relations.
Then you can build your hashmap by finding the connected components of the graph; finding connected components can be done by traversing it.
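In Java, a minimal union-find over the raw synonym pairs might look like this (a sketch; the class and method names are mine). The representative returned by find() can then serve directly as the value in the hashmap:

import java.util.HashMap;
import java.util.Map;

class SynonymUnionFind {
    private final Map<String, String> parent = new HashMap<>();

    // Representative of a word's synonym class, with path compression.
    String find(String w) {
        parent.putIfAbsent(w, w);
        String p = parent.get(w);
        if (!p.equals(w)) {
            p = find(p);
            parent.put(w, p);
        }
        return p;
    }

    // Merge the classes of two words listed as synonyms.
    void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    public static void main(String[] args) {
        SynonymUnionFind uf = new SynonymUnionFind();
        uf.union("cost", "price");
        uf.union("rate", "price"); // transitively: cost ~ rate
        uf.union("share", "equity");
        System.out.println(uf.find("cost").equals(uf.find("rate"))); // true
    }
}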
I'm using neo4j as a graph database and, from a starting node, I want to return that node's neighbors and all related neighbors up to a depth varying from 1 to 3. I'm doing this, but it gets stuck:
Note that it is a large graph.
start n = node(*) where n.NID contains "9606.ENS3"
MATCH (n)-[Rel1*1..3]-(m) RETURN m;
Does anyone have a clue how to do traversals on a graph and actually get a result?
Your question uses an old Cypher syntax. The docs say this about the START clause:
The START clause should only be used when accessing legacy indexes. In
all other cases, use MATCH instead (see Section 3.3.1, “MATCH”).
I believe this should work:
MATCH (n)-[Rel1*1..3]->(m)
WHERE n.NID contains "9606.ENS3"
RETURN m
My problem is the following. I have a small but dense network in Neo4j (~280 nodes, ~3600 relationships). There is only one type of node and one type of edge (i.e. a single label for each). Now I'd like to specify two distinct groups of nodes, given by values of their "group" property, and match the subgraph consisting of all paths up to a certain length connecting the two groups. In addition, I would like to add constraints on the relationships. At the moment I have this:
MATCH (n1) WHERE n1.group={group1}
MATCH (n2) WHERE n2.group={group2}
MATCH p=(n1)-[r*1..3]-(n2)
WHERE ALL(c IN r WHERE c.weight > {w})
AND ALL(n in NODES(p) WHERE 1=length(filter(m in NODES(p) WHERE m=n)))
WITH DISTINCT r AS dr, NODES(p) AS ns
UNWIND dr AS udr UNWIND ns AS uns
RETURN COLLECT(DISTINCT udr), COLLECT(DISTINCT uns)
which achieves what I want, but in some cases seems to be too slow. Here the WHERE clause filters out paths containing relationships whose weight property is below a threshold, as well as paths containing cycles.
The last three lines have to do with the desired output format. Given the matching subgraph (paths), I want all unique relationships in one list, and all unique nodes in another (for visualization with d3.js). The only way I found to do this is to UNWIND all elements and then COLLECT them as DISTINCT.
Also note that the group properties and the weight limit are passed in as query parameters.
Now, is there any way to achieve the same result faster? E.g., with paths up to length 3 the query takes about 5-10 seconds on my local machine (depending on the connectedness of the chosen node groups) and returns on the order of ~50 nodes and a few hundred relationships, which is within reach of acceptable performance. Paths up to length 4, however, are already prohibitive (several minutes, or the query never returns).
Bonus question: is there any way to specify the upper limit on path length as a parameter? Or does a different limit imply a totally different query plan?
This probably won't work at all, but it might give you something to play with. I tried changing a few things that may or may not work.
MATCH (n1) WHERE n1.group={group1}
MATCH (n2) WHERE n2.group={group2}
MATCH p=(n1)-[r*1..3]-(n2)
WHERE ALL(c IN r WHERE c.weight > {w})
WITH DISTINCT n1, n2, NODES(p) AS ns, r AS dr
WHERE ALL(n IN ns WHERE single(m IN ns WHERE m = n))
UNWIND dr AS udr UNWIND ns AS uns
RETURN COLLECT(DISTINCT udr), COLLECT(DISTINCT uns)
I want to perform an undirected traversal to extract all ids connected through a certain type of relationship.
When I perform the following query, it returns the values fast enough:
MATCH path=(s:Node {entry:"a"})-[:RelType*1..10]-(x:Node)
RETURN collect(distinct ID(x))
However, doing
MATCH path=(s:Node {entry:"a"})-[:RelType*]-(x:Node)
RETURN collect(distinct ID(x))
takes a huge amount of time. I suspect that by using * it searches every path from s to x, but since I want only the ids, these paths could be discarded. What I really want is a BFS or DFS search to find the nodes connected to s.
Both queries return exactly the same result, since no element has a shortest path longer than 5 (in the test example only!).
Did you add an index, i.e. CREATE INDEX ON :Node(entry)?
Also, depending on the number of relationships per node along your path, you potentially get rels^10 (in general, rels^steps) paths returned through your graph.
Can you try a smaller upper limit like 3 first and work from there?
Also, leaving off the direction really hurts, as you then get cycles.
What you can also try to do is:
MATCH path=(s:Node {entry:"a"})-[:RelType*]->(x:Node)
RETURN ID(x)
and stream the results, doing the uniqueness filtering in the client.
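For example, against the embedded Java API (a sketch assuming Neo4j 2.x's GraphDatabaseService.execute(); adapt it to whatever driver you use), a HashSet takes care of the client-side uniqueness:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;
import org.neo4j.graphdb.Transaction;

class ConnectedIds {
    // Stream the rows and deduplicate the node ids on the client side.
    static Set<Long> connectedIds(GraphDatabaseService db) {
        Set<Long> ids = new HashSet<>();
        try (Transaction tx = db.beginTx();
             Result result = db.execute(
                     "MATCH (s:Node {entry:'a'})-[:RelType*]->(x:Node) RETURN ID(x) AS id")) {
            while (result.hasNext()) {
                Map<String, Object> row = result.next();
                ids.add((Long) row.get("id")); // HashSet enforces uniqueness
            }
            tx.success();
        }
        return ids;
    }
}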
Or this if you don't want to do uniqueness in the client
MATCH path=(s:Node {entry:"a"})-[:RelType*]->(x:Node)
RETURN distinct ID(x)
We have a social graph that is later broken into clusters of high cohesion, something called a truss by Jonathan Cohen [1].
Now that I have those clusters, I would like to come up with names for them.
The cluster name should allow insignificant changes to the cluster size without changing the name.
For example:
Let's assume we have cluster M:
M : {A, B, C, D, E, F}
and let's assume that the "naming algorithm" generated the name "m" for it.
After some time, vertex A has left the cluster, while vertex J has joined:
M : {B, C, D, E, F, J}
The newly generated name is "m'".
Desired feature:
m' == m for insignificant cluster changes
[1] http://www.cslu.ogi.edu/~zak/cs506-pslc/trusses.pdf
Based on your example, I assume you mean "insignificant changes to the cluster composition", not to the "cluster size".
If your naming function f() cannot use the information about the existing name for the given cluster, you would have to allow that sometimes it does rename despite the change being small. Indeed, suppose that f() never renames a cluster when it changes just a little. Starting with cluster A, you can get to any other cluster B by adding or removing only one element at a time. By construction, the function will return the same name for A and B. Since A, B were arbitrary, f() will return the same name for all possible clusters - clearly useless.
So, you have two alternatives:
(1) the naming function relies on the existing name of a cluster, or
(2) the naming function sometimes (rarely) renames a cluster after a very tiny change.
If you go with alternative (1), it's trivial. You can simply assign names randomly, and then keep them unchanged whenever the cluster is updated as long as it's not too different (however you define different). Given how simple it is, I suppose that's not what you want.
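For what it's worth, a minimal sketch of that bookkeeping could look like this (Jaccard similarity is just one concrete choice for "not too different"; all names here are mine):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Random;
import java.util.Set;

class StableNamer {
    private final Map<String, Set<String>> lastMembers = new HashMap<>();
    private final Random rng = new Random();

    // Keep the old name while the cluster stays similar enough to the
    // membership it had when the name was last assigned; otherwise
    // assign a fresh random name.
    String nameFor(String oldName, Set<String> cluster, double threshold) {
        if (oldName != null
                && jaccard(lastMembers.get(oldName), cluster) >= threshold) {
            lastMembers.put(oldName, new HashSet<>(cluster));
            return oldName;
        }
        String fresh = "c" + rng.nextInt(1_000_000); // random name
        lastMembers.put(fresh, new HashSet<>(cluster));
        return fresh;
    }

    // Jaccard similarity: |intersection| / |union|.
    private static double jaccard(Set<String> a, Set<String> b) {
        if (a == null || b == null) return 0.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }
}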
If you go with alternative (2), you'll need to use some information about the underlying objects in the cluster. If all you have are links to various objects with no internal structure, it can't be done, since the function wouldn't have anything to work with apart from cluster size.
So let's say you have some information about the objects. For example, you may have their names. Call the first k letters of each object's name the object's prefix. Count all the different prefixes in your cluster, and find the n most common ones. Order these n prefixes alphabetically, and append them to each other in that order. For a reasonable choice of k, n (which should depend on the number of your clusters and typical object name lengths), you would get the result you seek - as long as you have enough objects in each cluster.
For instance, if objects have human names, try k = 2; and if you have hundreds of clusters, perhaps try n = 2.
This, of course, can be greatly improved by remapping names to achieve a more uniform distribution, handling the cases where two prefixes have similar frequencies, etc.
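If it helps, here is a small sketch of the prefix scheme (class and method names, plus the sample data, are mine):

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ClusterNamer {
    // Name a cluster by its n most frequent k-letter name prefixes,
    // concatenated in alphabetical order.
    static String name(Collection<String> memberNames, int k, int n) {
        Map<String, Long> freq = memberNames.stream()
                .map(s -> s.substring(0, Math.min(k, s.length())).toLowerCase())
                .collect(Collectors.groupingBy(p -> p, Collectors.counting()));
        return freq.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        List<String> cluster = List.of("Alice", "Albert", "Alan", "Bella", "Ben", "Bob");
        // prefixes: "al" x3, "be" x2, "bo" x1 -> top two are "al" and "be"
        System.out.println(name(cluster, 2, 2)); // albe
    }
}

Swapping a single member in or out usually leaves the top prefixes, and hence the name, unchanged, which is exactly the stability asked for.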