DELETE relation when using FOREACH in neo4j - performance

I need to DELETE relationships of a particular type from nodes while iterating over them with FOREACH.
In detail:
PROFILE MATCH (n:Label1)-[r1:REL1]-(a:Label2)
WHERE a.prop1 = 2
WITH n
WITH COLLECT(n) AS rows
WITH [a IN rows WHERE a.prop2 < 1484764200] AS less_than_rows,
[b IN rows WHERE b.prop2 = 1484764200 AND b.prop3 < 2] AS other_rows
WITH size(less_than_rows) + size(other_rows) AS count, less_than_rows, other_rows
FOREACH (sub IN less_than_rows |
MERGE (sub)-[r:REL2]-(:Label2)
DELETE r
MERGE(l2:Label2{id:540})
MERGE (sub)-[:APPEND_TO {s:0}]->(l2)
SET sub.prop3=1, sub.prop2=1484764200)
WITH DISTINCT other_rows, count
FOREACH (sub IN other_rows |
MERGE(l2:Label2{id:540})
MERGE (sub)-[:APPEND_TO {s:0}]->(l2)
SET sub.prop3=sub.prop3+1)
RETURN count
As FOREACH does not support MATCH, I used MERGE to achieve it. But the query is very slow when I execute it (it takes around 1 min).
If I execute it without the FOREACH clauses (skipping the updates), it takes around 1 sec.
Problem: clearly the problem is with FOREACH, or with the operations inside FOREACH.
I want to delete a particular relationship, create another relationship, and set some properties on the node.
Note: I showed the whole query in case there is another way to achieve the same requirement (besides FOREACH, I also tried CASE WHEN).

I noticed a few things about your original query:
MERGE(l2:Label2 {id:540}) should be moved out of both FOREACH clauses, since it only needs to be done once. This is slowing down the query. In fact, if you expect the node to already exist, you can use a MATCH instead.
MERGE (sub)-[:APPEND_TO {s:0}]->(l2) may not do what you intended, since it will only match existing relationships in which the s property is still 0. If s is not 0, you will end up creating an additional relationship. To ensure that there is a single relationship and that its s value is (reset to) 0, you should remove the {s:0} test from the pattern and use SET to set the s value; this should also speed up the MERGE, since it will not need to do a property value test.
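For illustration, a minimal sketch of the difference (a and b stand for hypothetical, already-bound nodes, and suppose (a)-[:APPEND_TO {s:1}]->(b) already exists):
// the property test in the pattern fails (s is 1, not 0), so a second relationship is created
MERGE (a)-[:APPEND_TO {s:0}]->(b)
// no property test: the existing relationship matches, and SET then resets s to 0
MERGE (a)-[r:APPEND_TO]->(b)
SET r.s = 0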
This version of your query should fix the above issues, and be faster (but you will have to try it out to see how much faster):
PROFILE
MATCH (n:Label1)-[:REL1]-(a:Label2)
WHERE a.prop1 = 2
WITH COLLECT(n) AS rows
WITH
[a IN rows WHERE a.prop2 < 1484764200] AS less_than_rows,
[b IN rows WHERE b.prop2 = 1484764200 AND b.prop3 < 2] AS other_rows
WITH size(less_than_rows) + size(other_rows) AS count, less_than_rows, other_rows
MERGE(l2:Label2 {id:540})
FOREACH (sub IN less_than_rows |
MERGE (sub)-[r:REL2]-(:Label2)
DELETE r
MERGE (sub)-[r2:APPEND_TO]->(l2)
SET r2.s = 0, sub.prop3 = 1, sub.prop2 = 1484764200)
WITH DISTINCT l2, other_rows, count
FOREACH (sub IN other_rows |
MERGE (sub)-[r3:APPEND_TO]->(l2)
SET r3.s = 0, sub.prop3 = sub.prop3+1)
RETURN count;
If you only intend to set the s value to 0 when the APPEND_TO relationship is being created, then use the ON CREATE clause instead of SET:
PROFILE
MATCH (n:Label1)-[:REL1]-(a:Label2)
WHERE a.prop1 = 2
WITH COLLECT(n) AS rows
WITH
[a IN rows WHERE a.prop2 < 1484764200] AS less_than_rows,
[b IN rows WHERE b.prop2 = 1484764200 AND b.prop3 < 2] AS other_rows
WITH size(less_than_rows) + size(other_rows) AS count, less_than_rows, other_rows
MERGE(l2:Label2 {id:540})
FOREACH (sub IN less_than_rows |
MERGE (sub)-[r:REL2]-(:Label2)
DELETE r
MERGE (sub)-[r2:APPEND_TO]->(l2)
ON CREATE SET r2.s = 0
SET sub.prop3 = 1, sub.prop2 = 1484764200)
WITH DISTINCT l2, other_rows, count
FOREACH (sub IN other_rows |
MERGE (sub)-[r3:APPEND_TO]->(l2)
ON CREATE SET r3.s = 0
SET sub.prop3 = sub.prop3+1)
RETURN count;

Instead of FOREACH, you can UNWIND the collection of rows and process those. You can also use OPTIONAL MATCH instead of MERGE, so you avoid the fallback creation behavior of MERGE when a match isn't found. See how this compares:
PROFILE
MATCH (n:Label1)-[:REL1]-(a:Label2)
WHERE a.prop1 = 2
WITH COLLECT(n) AS rows
WITH [a IN rows WHERE a.prop2 < 1484764200] AS less_than_rows,
[b IN rows WHERE b.prop2 = 1484764200 AND b.prop3 < 2] AS other_rows
WITH size(less_than_rows) + size(other_rows) AS count, less_than_rows, other_rows
// faster to do it here, only 1 row so it executes once
MERGE(l2:Label2{id:540})
UNWIND less_than_rows as sub
OPTIONAL MATCH (sub)-[r:REL2]-(:Label2)
DELETE r
MERGE (sub)-[:APPEND_TO {s:0}]->(l2)
SET sub.prop3=1, sub.prop2=1484764200
WITH DISTINCT other_rows, count, l2
UNWIND other_rows as sub
MERGE (sub)-[:APPEND_TO {s:0}]->(l2)
SET sub.prop3=sub.prop3+1
RETURN count

Related

Take top n results from table in power query, where n is dynamic based on an if function

I want to use Power Query to extract by field (the field is [Project]), then get the top 3 scoring rows from the master table for each project; but if there are more than 3 rows with a score of over 15, they should all be included. 3 rows must be extracted every time as a minimum.
Essentially I'm trying to combine the Keep Rows function with my formula of "=if(score>=15,1,0)".
Setting the query to records with a score greater than 15 doesn't work for projects where the highest scores are, for example, 1, 7 and 15. This would only return 1 row, but we need 3 as a minimum.
Setting it to the top 3 scores only would omit rows in a table where the highest scores are 18, 19, 20.
Is there a way to combine the two functions to say "Choose the top 3 rows, but choose the top n rows if there are n rows with score >= 15"?
As far as I understand, you are trying to do the following (Alexis Olson proposed the very same):
let
Source = Excel.CurrentWorkbook(){[Name="Table"]}[Content],
group = Table.Group(Source, {"Project"}, {"temp", each Table.SelectRows(Table.AddIndexColumn(Table.Sort(_, {"Score", 1}), "i", 1, 1), each [i]<=3 or [Score]>=15)}),
expand = Table.ExpandTableColumn(group, "temp", {"Score"})
in
expand
Or:
let
Source = Excel.CurrentWorkbook(){[Name="Table"]}[Content],
group = Table.Group(Source, {"Project"}, {"temp", each [a = Table.Sort(_, {"Score", 1}), b = Table.FirstN(a, 3) & Table.SelectRows(Table.Skip(a,3), each [Score]>=15)][b]}),
expand = Table.ExpandTableColumn(group, "temp", {"Score"})
in
expand
Or:
let
Source = Excel.CurrentWorkbook(){[Name="Table"]}[Content],
group = Table.Group(Source, {"Project"}, {"Score", each [a = List.Sort([Score], 1), b = List.FirstN(a,3)&List.Select(List.Skip(a,3), each _ >=15)][b]}),
expand = Table.ExpandListColumn(group, "Score")
in
expand
Note: if there are more columns in the table that you want to keep, for the first and second variants you may just add these columns to the last step. The last variant doesn't offer that option, and the code would have to be modified.
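For example, if the table also had a hypothetical Date column to keep, the last step of the first two variants would become:
expand = Table.ExpandTableColumn(group, "temp", {"Score", "Date"})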
Sort by the Score column in descending order and then add an Index column (go to Add Column > Index Column > From 1).
Then filter on the Index column choosing to keep values less than or equal to 3. This should produce a step with this M code:
= Table.SelectRows(#"Added Index", each [Index] <= 3)
Now you just need to make a small adjustment to also include any score 15 or greater:
= Table.SelectRows(#"Added Index", each [Index] <= 3 or [Score] >= 15)

UNWIND is not returning in neo4j

Consider the following Cypher query:
MATCH (n:Label1) WITH n
OPTIONAL MATCH (n)-[r:REL_1]-(:Label2 {id: 5})
WHERE r is NULL OR r.d < 12345 OR (r.d = 12345 OR r.c < 2)
WITH n,r LIMIT 100
WITH COLLECT({n: n, r: r}) AS rows
MERGE (c:Label2 {id: 5})
WITH c,
[b IN rows WHERE b.r.d IS NULL OR b.r.d < 12345] AS null_less_rows,
[c IN rows WHERE (c.r.d = 12345 AND c.r.c < 2)] AS other_rows
WITH null_less_rows, other_rows, c, null_less_rows+other_rows AS rows, size(null_less_rows+other_rows) AS count
UNWIND null_less_rows AS null_less_row
MERGE(s:Label1 {id: null_less_row.n.id})
MERGE(s)-[:REL_1 {d: 12345, c: 1}]->(c)
WITH DISTINCT other_rows, c, rows, count
UNWIND other_rows AS other_row
MATCH(s:Label1 {id: other_row.n.id})-[str:REL_1]->(c) SET str.c = str.c + 1
WITH rows, count
RETURN rows, count
When I execute the query, it should return rows and count (according to the query). But instead of returning rows and count, it gives the result statement:
Set 200 properties, created 100 relationships, statement completed in 13 ms.
Is there a problem with the query structure, or is this an improper use of the UNWIND clause?
If other_rows is null or empty, UNWIND will not produce any rows.
You could solve it with:
UNWIND CASE coalesce(size(other_rows), 0) WHEN 0 THEN [null] ELSE other_rows END AS other_row
Addition to Michael Hunger's answer:
UNWIND (CASE other_rows WHEN [] THEN [{n: {id: -2}}] ELSE other_rows END) AS other_row
As I am operating on the values of the array rather than on null, I need this extra condition (a dummy map instead of null) so that it doesn't throw any error messages.
This applies to both cases (other_rows and null_less_rows).
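To make the failure mode concrete, here is a minimal sketch (the empty list is hypothetical): without the guard, UNWIND of an empty list produces zero rows and the final RETURN never runs; with the guard, one placeholder row passes through.
WITH [] AS other_rows
UNWIND (CASE WHEN size(other_rows) = 0 THEN [null] ELSE other_rows END) AS other_row
RETURN count(other_row) AS count // returns 0 instead of returning nothing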

How to compute the difference between counts after grouping?

I have grouped data in the format (GroupID, count), like the following. I would like to compute the difference between consecutive counts while preserving the GroupID, so it becomes (1, 288), (2, 2), (3, 66), and so on.
I tried to use the SUBTRACT function, but I'm not sure how to subtract the previous record from the current one. The second image shows the count part; the subtraction part failed.
This is a little tricky to achieve, but it can be done using a JOIN. Generate another relation starting with the second row but with the ID shifted down by 1 (i.e., $0 - 1). Join the two relations and generate the difference. For the ID, add 1 to get back the original IDs. Finally, UNION the first row with the rows that contain the differences.
A = foreach win_grouped generate $0 as id, COUNT($1) as c; -- (1,228),(2,230)... and so on
A1 = filter A by ($0 > 1); -- (2,230),(3,296)... and so on
B = foreach A1 generate ($0 - 1) as id, $1 as c; -- (1,230),(2,296)... and so on
AB = join A by id, B by id; -- (1,228,1,230),(2,230,2,296)... and so on
C = foreach AB generate (A::id + 1), (B::c - A::c); -- (2,2),(3,66)... and so on
D = limit A 1; -- (1,288)
E = UNION D, C; -- (1,288),(2,2),(3,66)... and so on
DUMP E;

How many common items

Let's say we have this information:
UPDATE
Group A - Item 1, Item 2, Item 3
Group B - Item 1, Item 3
Group C - Item 3, Item 4
I'd like to know which groups contain the most common items:
Output:
Group A - (Item 1 and Item 3)
Group B - (Item 1 and Item 3)
What algorithm would you use?
First of all you have to represent the dataset:
data[A] = {1,2,3}
data[B] = {1,3}
data[C] = {3,4}
It is better to use numbers so you can use for loops, counters, etc., so:
data[0] = {1,2,3}
data[1] = {1,3}
data[2] = {3,4}
Then I would have another data structure with a counter of how many matches there are between groups, so for example matches[A][B] = 2, matches[A][C] = 1, and so on. That is the data structure you will need to calculate. If you do that, then your problem is reduced to finding the maximum value in that data structure.
for i = 0; i < 3; i++
for item in data[i]
for j = 0; j < 3; j++
//optimize a little bit (match[A][A] doesn't make sense)
if j == i
next
if item in data[j]
matches[i][j]++
Of course you can optimize this some more. For example, we know that matches[A][B] is going to be equal to matches[B][A], so you can skip those iterations.
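For reference, a runnable Python sketch of this idea (the data and matches names follow the answer; the symmetric-skip optimization is included):
# Item sets for each group, indexed 0..2 for A, B, C
data = [{1, 2, 3}, {1, 3}, {3, 4}]
# matches[i][j] counts the items shared between group i and group j
matches = [[0] * len(data) for _ in data]
for i in range(len(data)):
    for j in range(i + 1, len(data)):  # skip j <= i: matches is symmetric
        shared = sum(1 for item in data[i] if item in data[j])
        matches[i][j] = matches[j][i] = shared
print(matches)  # [[0, 2, 1], [2, 0, 1], [1, 1, 0]]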
So given a list of groups and their contained items, you want to output the identities of all the groups that have the same, maximum number of items in common with one other group.
Let's get a list of groups and items:
group_items = (
    ('Group A', ('Item 1', 'Item 2', 'Item 3')),
    ('Group B', ('Item 1', 'Item 3')),
    ('Group C', ('Item 3', 'Item 4')),
)
Then let's store the maximum number of shared items for each group, so we can collect all matching groups at the end. We'll also track the max of the maxes as we go (rather than going back and re-computing it).
max_shared = {item[0]:0 for item in group_items}
num_groups = len(group_items)
group_sets = {}
max_max = 0
Now we're going to have to compare every group with every other group, but we can ignore certain comparisons. As @Perroloco mentions, comparing Group A with Group A isn't useful, and computing intersect(A,B) is symmetric with computing intersect(B,A), so we can range from 0 to N and then from i+1 to N, instead of doing 0..N cross 0..N.
I'm using the set data type, which costs something to construct. So I cached the sets because we aren't modifying the membership, just counting the membership of the intersection.
It's worth pointing out that while intersection(A,B) == intersection(B,A), it is not the case that the MAX for A is the same as the MAX for B. Thus, there are separate comparisons for the inner max and the outer max.
for i in range(num_groups):
    outer_name, outer_mem = group_items[i]
    if outer_name not in group_sets:
        group_sets[outer_name] = set(outer_mem)
    outer_set = group_sets[outer_name]
    outer_max = max_shared[outer_name]
    for j in range(i+1, num_groups):
        inner_name, inner_mem = group_items[j]
        if inner_name not in group_sets:
            group_sets[inner_name] = set(inner_mem)
        inner_set = group_sets[inner_name]
        ni = len(outer_set.intersection(inner_set))
        if ni > outer_max:
            outer_max = max_shared[outer_name] = ni
        if ni > max_max:
            max_max = ni
        if ni > max_shared[inner_name]:
            max_shared[inner_name] = ni

print("Overall max # of shared items:", max_max)
results = [grp for grp, mx in max_shared.items() if mx == max_max]
print("Groups with that many shared items:", results)

How to remove those rows of matrix A, which have equal values with matrix B in specified columns in Matlab?

I have two matrices in Matlab A and B, which have equal number of columns but different number of rows. The number of rows in B is also less than the number of rows in A. B is actually a subset of A.
How can I remove those rows efficiently from A, where the values in columns 1 and 2 of A are equal to the values in columns 1 and 2 of matrix B?
At the moment I'm doing this:
for k = 1:size(B, 1)
A(find((A(:,1) == B(k,1) & A(:,2) == B(k,2))), :) = [];
end
and Matlab complains that this is inefficient and that I should try to use any, but I'm not sure how to do it with any. Can someone help me out with this? =)
I tried this, but it doesn't work:
A(any(A(:,1) == B(:,1) & A(:,2) == B(:,2), 2), :) = [];
It complains the following:
Error using ==
Matrix dimensions must agree.
Example of what I want:
A-B in the results means that the rows of B are removed from A. The same goes with A-C.
Try using setdiff. For example:
c=setdiff(a,b,'rows')
Note: if order is important, use:
c = setdiff(a,b,'rows','stable')
Edit: reading the edited question and the comments to this answer, the specific usage of setdiff you are looking for is (as noticed by Shai):
[temp c] = setdiff(a(:,1:2),b(:,1:2),'rows','stable')
c = a(c,:)
Alternative solution:
you can just use ismember:
a(~ismember(a(:,1:2),b(:,1:2),'rows'),:)
Use bsxfun:
compare = bsxfun( @eq, permute( A(:,1:2), [1 3 2]), permute( B(:,1:2), [3 1 2] ) );
twoEq = all( compare, 3 );
toRemove = any( twoEq, 2 );
A( toRemove, : ) = [];
Explaining the code:
First we use bsxfun to compare the first two columns of every row of A against every row of B, resulting in compare of size numRowsA-by-numRowsB-by-2, which is true where compare(ii,jj,kk) = (A(ii,kk) == B(jj,kk)).
Then we use all to create twoEq of size numRowsA-by-numRowsB, where each entry indicates whether both corresponding entries of A and B are equal.
Finally, we use any to select the rows of A that match at least one row of B.
What's wrong with the original code:
By removing rows of A inside a loop (i.e., A( ... ) = []), you are actually resizing A at almost every iteration. See this post on why exactly this is a bad practice.
Using setdiff
In order to use setdiff (as suggested by natan) on only the first two columns, you'll need to use its second output argument:
[ignore, ia] = setdiff( A(:,1:2), B(:,1:2), 'rows', 'stable' );
A = A( ia, : ); % keeping only relevant rows, beyond first two columns.
Here's another bsxfun implementation -
A(~any(squeeze(all(bsxfun(@eq,A(:,1:2),permute(B(:,1:2),[3 2 1])),2)),2),:)
One more that is dangerously close to Shai's solution, but reduces the two permute calls to one -
A(~any(all(bsxfun(@eq,A(:,1:2),permute(B(:,1:2),[3 2 1])),2),3),:)
