How to compute the difference between counts after grouping?


I have grouped data in the format (GroupID, count). I would like to compute the difference between consecutive counts while preserving the GroupID, so that the result becomes (1, 228), (2, 2), (3, 66), ....
I tried to use the SUBTRACT function, but I am not sure how to subtract the previous record from the current one. The second image shows the count part. The subtraction part fails.

This is a little tricky to achieve but can be done using a JOIN. Generate another relation that starts with the second row but has each ID decremented by 1, i.e. ($0 - 1). Join the two relations on ID and generate the difference. Add 1 to the ID to get back the original IDs. Finally, union the first row with the rows that contain the differences.
A = foreach win_grouped generate $0 as id, COUNT($1) as c; -- (1,228),(2,230)... and so on
A1 = filter A by (id > 1); -- (2,230),(3,296)... and so on
B = foreach A1 generate (id - 1) as id, c; -- (1,230),(2,296)... and so on
AB = join A by id, B by id; -- (1,228,1,230),(2,230,2,296)... and so on
C = foreach AB generate (A::id + 1) as id, (B::c - A::c) as c; -- (2,2),(3,66)... and so on
D = filter A by (id == 1); -- (1,228); LIMIT without ORDER BY does not guarantee the first row
E = UNION D, C; -- (1,228),(2,2),(3,66)... and so on
DUMP E;
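To sanity-check the expected result on a small sample, the same computation can be sketched outside Pig. This is only an illustrative Python sketch, not part of the Pig solution; the function name `count_deltas` is made up here, and it assumes the IDs are unique and define the row order.

```python
def count_deltas(rows):
    """Keep the first (id, count) pair unchanged and replace every later
    count with its difference from the previous row's count."""
    ordered = sorted(rows)  # assumption: ids are unique and define the order
    result = []
    previous = None
    for group_id, count in ordered:
        if previous is None:
            result.append((group_id, count))             # first row stays as-is
        else:
            result.append((group_id, count - previous))  # current minus previous
        previous = count
    return result


rows = [(1, 228), (2, 230), (3, 296)]
print(count_deltas(rows))  # [(1, 228), (2, 2), (3, 66)]
```

The JOIN on shifted IDs in the Pig script plays the role of the `previous` variable here: it puts each count next to its predecessor.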

Related

Take top n results from table in power query, where n is dynamic based on an IF function

I want to use Power Query to split the master table by a field ([Project]) and then get the top 3 scoring rows for each project. However, if more than 3 rows have a score of 15 or higher, they should all be included. At least 3 rows must be extracted every time.
Essentially, I'm trying to combine the Keep Rows function with my formula "=if(score>=15,1,0)".
Filtering for records with a score of 15 or higher doesn't work for projects where the highest scores are, for example, 1, 7 and 15. This would return only 1 row, but we need 3 as a minimum.
Keeping only the top 3 scores would omit rows in a table where the highest scores are 18, 19, 20.
Is there a way to combine the two functions to say "Choose the top 3 rows, but choose the top n rows if there are n rows with score >= 15"?

As far as I understand, you are trying to do the following (Alexis Olson proposed the very same approach):
let
Source = Excel.CurrentWorkbook(){[Name="Table"]}[Content],
group = Table.Group(Source, {"Project"}, {"temp", each Table.SelectRows(Table.AddIndexColumn(Table.Sort(_, {"Score", 1}), "i", 1, 1), each [i]<=3 or [Score]>=15)}),
expand = Table.ExpandTableColumn(group, "temp", {"Score"})
in
expand
Or:
let
Source = Excel.CurrentWorkbook(){[Name="Table"]}[Content],
group = Table.Group(Source, {"Project"}, {"temp", each [a = Table.Sort(_, {"Score", 1}), b = Table.FirstN(a, 3) & Table.SelectRows(Table.Skip(a,3), each [Score]>=15)][b]}),
expand = Table.ExpandTableColumn(group, "temp", {"Score"})
in
expand
Or:
let
Source = Excel.CurrentWorkbook(){[Name="Table"]}[Content],
group = Table.Group(Source, {"Project"}, {"Score", each [a = List.Sort([Score], 1), b = List.FirstN(a,3)&List.Select(List.Skip(a,3), each _ >=15)][b]}),
expand = Table.ExpandListColumn(group, "Score")
in
expand
Note: if the table has more columns that you want to keep, in the first and second variants you can simply add those columns in the last step. The last variant has no such option, so the code would have to be modified.

Sort by the Score column in descending order and then add an Index column (go to Add Column > Index Column > From 1).
Then filter on the Index column choosing to keep values less than or equal to 3. This should produce a step with this M code:
= Table.SelectRows(#"Added Index", each [Index] <= 3)
Now you just need to make a small adjustment to also include any score 15 or greater:
= Table.SelectRows(#"Added Index", each [Index] <= 3 or [Score] >= 15)
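The rule the answers apply (after sorting by score in descending order, keep a row if its rank is at most 3 or its score is at least 15) can be checked against sample data with a short Python sketch. This is an illustration only, not Power Query code; the function name `top_rows`, its parameters, and the sample projects are assumptions made for the example.

```python
from collections import defaultdict


def top_rows(rows, minimum=3, threshold=15):
    """rows: iterable of (project, score) pairs.
    Returns {project: scores kept}, highest score first."""
    by_project = defaultdict(list)
    for project, score in rows:
        by_project[project].append(score)

    kept = {}
    for project, scores in by_project.items():
        ranked = sorted(scores, reverse=True)
        # Keep the first `minimum` rows, plus any later row at or above the threshold.
        kept[project] = [
            score
            for rank, score in enumerate(ranked, start=1)
            if rank <= minimum or score >= threshold
        ]
    return kept


rows = [("P1", 1), ("P1", 7), ("P1", 15), ("P1", 0),
        ("P2", 18), ("P2", 19), ("P2", 20), ("P2", 16), ("P2", 3)]
print(top_rows(rows))  # {'P1': [15, 7, 1], 'P2': [20, 19, 18, 16]}
```

Because the rows are ranked within each project, the grouping step matters: ranking over the whole table would apply the minimum of 3 only once, not once per project.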

DELETE relation when using FOREACH in neo4j

I need to DELETE relationships of a particular type from each node that a FOREACH is iterating over.
In detail:
PROFILE MATCH (n:Label1)-[r1:REL1]-(a:Label2)
WHERE a.prop1 = 2
WITH n
WITH COLLECT(n) AS rows
WITH [a IN rows WHERE a.prop2 < 1484764200] AS less_than_rows,
[b IN rows WHERE b.prop2 = 1484764200 AND b.prop3 < 2] AS other_rows
WITH size(less_than_rows) + size(other_rows) AS count, less_than_rows, other_rows
FOREACH (sub IN less_than_rows |
MERGE (sub)-[r:REL2]-(:Label2)
DELETE r
MERGE(l2:Label2{id:540})
MERGE (sub)-[:APPEND_TO {s:0}]->(l2)
SET sub.prop3=1, sub.prop2=1484764200)
WITH DISTINCT other_rows, count
FOREACH (sub IN other_rows |
MERGE(l2:Label2{id:540})
MERGE (sub)-[:APPEND_TO {s:0}]->(l2)
SET sub.prop3=sub.prop3+1)
RETURN count
As FOREACH does not support MATCH, I used MERGE to achieve it. But it is very slow when I execute it (it takes around 1 minute).
If I execute it without the FOREACH clauses (that is, without the updates), it takes around 1 second.
Problem: clearly the problem is with FOREACH or with the operations inside FOREACH.
I want to delete a particular relationship, create another relationship, and set some properties on the node.
Note: I showed the whole query in case there is another way to achieve the same requirement without this FOREACH (I also tried CASE WHEN).

I noticed a few things about your original query:
MERGE(l2:Label2 {id:540}) should be moved out of both FOREACH clauses, since it only needs to be done once. This is slowing down the query. In fact, if you expect the node to already exist, you can use a MATCH instead.
MERGE (sub)-[:APPEND_TO {s:0}]->(l2) may not do what you intended, since it will only match existing relationships in which the s property is still 0. If s is not 0, you will end up creating an additional relationship. To ensure that there is a single relationship and that its s value is (reset to) 0, you should remove the {s:0} test from the pattern and use SET to set the s value; this should also speed up the MERGE, since it will not need to do a property value test.
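The second point can be illustrated with a toy model in Python. This is not Neo4j's API: the two functions below are hypothetical stand-ins that only mimic the match-or-create behaviour described above, with relationships held as plain dictionaries.

```python
def merge_with_property(rels, start, end, s):
    """Toy stand-in for MERGE (start)-[:APPEND_TO {s: s}]->(end):
    only a relationship whose s already equals the given value matches."""
    for rel in rels:
        if rel["start"] == start and rel["end"] == end and rel["s"] == s:
            return rel
    rel = {"start": start, "end": end, "s": s}
    rels.append(rel)  # nothing matched, so a new relationship is created
    return rel


def merge_then_set(rels, start, end, s):
    """Toy stand-in for MERGE (start)-[r:APPEND_TO]->(end) SET r.s = s:
    match on the endpoints only, then overwrite s."""
    for rel in rels:
        if rel["start"] == start and rel["end"] == end:
            rel["s"] = s
            return rel
    rel = {"start": start, "end": end, "s": s}
    rels.append(rel)
    return rel


existing = [{"start": "sub", "end": "l2", "s": 5}]

a = [dict(r) for r in existing]
merge_with_property(a, "sub", "l2", 0)
print(len(a))  # 2: the s=5 relationship did not match, so a second one was added

b = [dict(r) for r in existing]
merge_then_set(b, "sub", "l2", 0)
print(len(b), b[0]["s"])  # 1 0
```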
This version of your query should fix the above issues, and be faster (but you will have to try it out to see how much faster):
PROFILE
MATCH (n:Label1)-[:REL1]-(a:Label2)
WHERE a.prop1 = 2
WITH COLLECT(n) AS rows
WITH
[a IN rows WHERE a.prop2 < 1484764200] AS less_than_rows,
[b IN rows WHERE b.prop2 = 1484764200 AND b.prop3 < 2] AS other_rows
WITH size(less_than_rows) + size(other_rows) AS count, less_than_rows, other_rows
MERGE(l2:Label2 {id:540})
FOREACH (sub IN less_than_rows |
MERGE (sub)-[r:REL2]-(:Label2)
DELETE r
MERGE (sub)-[r2:APPEND_TO]->(l2)
SET r2.s = 0, sub.prop3 = 1, sub.prop2 = 1484764200)
WITH DISTINCT l2, other_rows, count
FOREACH (sub IN other_rows |
MERGE (sub)-[r3:APPEND_TO]->(l2)
SET r3.s = 0, sub.prop3 = sub.prop3+1)
RETURN count;
If you only intend to set the s value to 0 when the APPEND_TO relationship is being created, then use the ON CREATE clause instead of SET:
PROFILE
MATCH (n:Label1)-[:REL1]-(a:Label2)
WHERE a.prop1 = 2
WITH COLLECT(n) AS rows
WITH
[a IN rows WHERE a.prop2 < 1484764200] AS less_than_rows,
[b IN rows WHERE b.prop2 = 1484764200 AND b.prop3 < 2] AS other_rows
WITH size(less_than_rows) + size(other_rows) AS count, less_than_rows, other_rows
MERGE(l2:Label2 {id:540})
FOREACH (sub IN less_than_rows |
MERGE (sub)-[r:REL2]-(:Label2)
DELETE r
MERGE (sub)-[r2:APPEND_TO]->(l2)
ON CREATE SET r2.s = 0
SET sub.prop3 = 1, sub.prop2 = 1484764200)
WITH DISTINCT l2, other_rows, count
FOREACH (sub IN other_rows |
MERGE (sub)-[r3:APPEND_TO]->(l2)
ON CREATE SET r3.s = 0
SET sub.prop3 = sub.prop3+1)
RETURN count;

Instead of FOREACH, you can UNWIND the collection of rows and process those. You can also use OPTIONAL MATCH instead of MERGE, so you avoid the fallback creation behavior of MERGE when a match isn't found. See how this compares:
PROFILE
MATCH (n:Label1)-[:REL1]-(a:Label2)
WHERE a.prop1 = 2
WITH COLLECT(n) AS rows
WITH [a IN rows WHERE a.prop2 < 1484764200] AS less_than_rows,
[b IN rows WHERE b.prop2 = 1484764200 AND b.prop3 < 2] AS other_rows
WITH size(less_than_rows) + size(other_rows) AS count, less_than_rows, other_rows
// faster to do it here, only 1 row so it executes once
MERGE(l2:Label2{id:540})
UNWIND less_than_rows as sub
OPTIONAL MATCH (sub)-[r:REL2]-(:Label2)
DELETE r
MERGE (sub)-[:APPEND_TO {s:0}]->(l2)
SET sub.prop3=1, sub.prop2=1484764200
WITH DISTINCT other_rows, count, l2
UNWIND other_rows as sub
MERGE (sub)-[:APPEND_TO {s:0}]->(l2)
SET sub.prop3=sub.prop3+1
RETURN count

PIG - trying to find max of group of months in table

The image above shows the output of the GENERATE and DESCRIBE statements below:
D = FOREACH C GENERATE $0 AS time, $1 AS perf_temp_count;
DUMP D;
DESCRIBE D;
My question: currently the output above is grouped by month and hour (military time), and I am trying to find the max count for each month, 1 through 12. Right now I am just showing the month, hour, and count.
My expected output is
(1, 4) 9
....
remaining months
....
(12, 3) 10
where each line is (month, hour), max count.

B = GROUP A BY (month, hour);
C = FOREACH B GENERATE group AS time, COUNT(A.temp) AS cnt;
X = GROUP C BY time;
Y = FOREACH X GENERATE group, MAX(C.cnt) AS mcount;
I have no idea why, but aggregating (MAX) right after another aggregate (COUNT) is a problem, or I am not referencing the names correctly.
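For comparison, here is a Python sketch of one reading of the expected output: count the records per (month, hour), then, for each month, keep the hour with the highest count. That reading is an assumption, as are the function name `busiest_hour_per_month` and the sample data. The point it illustrates is that the maximum is taken per month alone; taking it per (month, hour) would just return each count unchanged.

```python
from collections import Counter


def busiest_hour_per_month(readings):
    """readings: iterable of (month, hour) pairs, one per record.
    Returns {month: ((month, hour), count)} for the hour with the highest count."""
    counts = Counter(readings)  # (month, hour) -> number of records
    best = {}
    for (month, hour), count in counts.items():
        # Compare within the month only; on a tie the first hour seen wins.
        if month not in best or count > best[month][1]:
            best[month] = ((month, hour), count)
    return best


readings = [(1, 4)] * 9 + [(1, 5)] * 2 + [(12, 3)] * 10 + [(12, 7)] * 4
print(busiest_hour_per_month(readings))
# {1: ((1, 4), 9), 12: ((12, 3), 10)}
```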

How to remove those rows of matrix A, which have equal values with matrix B in specified columns in Matlab?

I have two matrices in Matlab A and B, which have equal number of columns but different number of rows. The number of rows in B is also less than the number of rows in A. B is actually a subset of A.
How can I remove those rows efficiently from A, where the values in columns 1 and 2 of A are equal to the values in columns 1 and 2 of matrix B?
At the moment I'm doing this:
for k = 1:size(B, 1)
A(find((A(:,1) == B(k,1) & A(:,2) == B(k,2))), :) = [];
end
and Matlab complains that this is inefficient and that I should try to use any, but I'm not sure how to do it with any. Can someone help me out with this? =)
I tried this, but it doesn't work:
A(any(A(:,1) == B(:,1) & A(:,2) == B(:,2), 2), :) = [];
It reports the following error:
Error using ==
Matrix dimensions must agree.
Example of what I want:
A-B in the results means that the rows of B are removed from A. The same goes with A-C.

Try using setdiff. For example:
c=setdiff(a,b,'rows')
Note, if order is important use:
c = setdiff(a,b,'rows','stable')
Edit: reading the edited question and the comments to this answer, the specific usage of setdiff you look for is (as noticed by Shai):
[temp c] = setdiff(a(:,1:2),b(:,1:2),'rows','stable')
c = a(c,:)
Alternative solution:
You can just use ismember:
a(~ismember(a(:,1:2),b(:,1:2),'rows'),:)
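The idea behind the ismember variant (collect the key columns of B, then keep only the rows of A whose key columns are not among them) can be sketched in plain Python. This is an illustration rather than MATLAB code; the function name `remove_matching_rows` and the sample matrices are made up, and the columns are 0-based here where MATLAB's are 1-based.

```python
def remove_matching_rows(a, b, key_columns=(0, 1)):
    """Return the rows of `a` whose values in `key_columns` do not occur
    together in any row of `b`. The order of the remaining rows is preserved."""
    keys = {tuple(row[i] for i in key_columns) for row in b}
    return [row for row in a if tuple(row[i] for i in key_columns) not in keys]


A = [[1, 2, 10], [3, 4, 20], [5, 6, 30], [1, 2, 40]]
B = [[1, 2, 99], [5, 6, 30]]
print(remove_matching_rows(A, B))  # [[3, 4, 20]]
```

Only the key columns are compared, so the row [1, 2, 40] is removed even though its third value differs from every row of B.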

Use bsxfun:
compare = bsxfun( @eq, permute( A(:,1:2), [1 3 2]), permute( B(:,1:2), [3 1 2] ) );
twoEq = all( compare, 3 );
toRemove = any( twoEq, 2 );
A( toRemove, : ) = [];
Explaining the code:
First we use bsxfun to compare every row of A with every row of B over the first two columns, resulting in compare of size numRowsA-by-numRowsB-by-2, where compare( ii, jj, kk ) is true if and only if A(ii,kk) == B(jj,kk).
Then we use all to create twoEq of size numRowsA-by-numRowsB where each entry indicates if both corresponding entries of A and B are equal.
Finally, we use any to select the rows of A that match at least one row of B.
What's wrong with the original code:
By removing rows of A inside a loop (i.e., A( ... ) = []) you are actually resizing A at almost every iteration. See this post on why exactly this is a bad practice.
Using setdiff
In order to use setdiff (as suggested by natan) on only the first two columns, you'll need to use its second output argument:
[ignore, ia] = setdiff( A(:,1:2), B(:,1:2), 'rows', 'stable' );
A = A( ia, : ); % keeping only relevant rows, beyond first two columns.

Here's another bsxfun implementation -
A(~any(squeeze(all(bsxfun(@eq,A(:,1:2),permute(B(:,1:2),[3 2 1])),2)),2),:)
One more that is dangerously close to Shai's solution, but uses one permute instead of two -
A(~any(all(bsxfun(@eq,A(:,1:2),permute(B(:,1:2),[3 2 1])),2),3),:)

Pig 10.0 - group the tuples and merge bags in a foreach

I'm using Pig 10.0. I want to Merge bags in a foreach. Let's say I have the following visitors alias:
(a, b, {1, 2, 3, 4}),
(a, d, {1, 3, 6}),
(a, e, {7}),
(z, b, {1, 2, 3})
I want to group the tuples on the first field and merge the bags with set semantics to get the following tuples:
({1, 2, 3, 4, 6, 7}, a, 6)
({1, 2, 3}, z, 3)
The first field is the union of the bags with set semantics. The second field of the tuple is the group field. The third field is the number of items in the bag.
I tried several variations around the following code (replaced SetUnion by Group/Distinct etc.) but always failed to achieve the wanted behavior:
DEFINE SetUnion datafu.pig.bags.sets.SetUnion();
grouped = GROUP visitors by (FirstField);
merged = FOREACH grouped {
VU = SetUnion(visitors.ThirdField);
GENERATE
VU as Vu,
group as FirstField,
COUNT(VU) as Cnt;
}
dump merged;
Can you explain where I'm wrong and how to implement the desired behavior?

I finally managed to achieve the wanted behavior. A self contained example of my solution follows:
Data file:
a b 1
a b 2
a b 3
a b 4
a d 1
a b 3
a b 6
a e 7
z b 1
z b 2
z b 3
Code:
-- Prepare data
in = LOAD 'data' USING PigStorage()
AS (One:chararray, Two:chararray, Id:long);
grp = GROUP in by (One, Two);
cnt = FOREACH grp {
ids = DISTINCT in.Id;
GENERATE
ids as Ids,
group.One as One,
group.Two as Two,
COUNT(ids) as Count;
}
-- Interesting code follows
grp2 = GROUP cnt by One;
cnt2 = FOREACH grp2 {
ids = FOREACH cnt.Ids generate FLATTEN($0);
GENERATE
ids as Ids,
group as One,
COUNT(ids) as Count;
}
describe cnt2;
dump grp2;
dump cnt2;
Describe:
Cnt: {Ids: {(Id: long)},One: chararray,Two: chararray,Count: long}
grp2:
(a,{({(1),(2),(3),(4),(6)},a,b,5),({(1)},a,d,1),({(7)},a,e,1)})
(z,{({(1),(2),(3)},z,b,3)})
cnt2:
({(1),(2),(3),(4),(6),(1),(7)},a,7)
({(1),(2),(3)},z,3)
Since the code uses a FOREACH nested in a FOREACH it requires Pig > 10.0.
I will leave the question unresolved for a few days, since a cleaner solution probably exists.

Found a simpler solution for this.
current_input = load '/idn/home/ksing143/tuple_related_data/tough_grouping.txt' USING PigStorage() AS (col1:chararray, col2:chararray, col3:int);
/* But we do not need column 2. Hence eliminating to avoid confusion */
relevant_input = foreach current_input generate col1, col3;
relevant_distinct = DISTINCT relevant_input;
relevant_grouped = group relevant_distinct by col1;
/* This will give */
(a,{(a,1),(a,2),(a,3),(a,4),(a,6),(a,7)})
(z,{(z,1),(z,2),(z,3)})
relevant_grouped_advance = foreach relevant_grouped generate (relevant_distinct.col3) as col3, group, COUNT(relevant_distinct.col3) as count_val;
/* This will give desired result */
({(1),(2),(3),(4),(6),(7)},a,6)
({(1),(2),(3)},z,3)
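As a cross-check of the desired result, the same grouping with set semantics can be sketched in Python. This is not Pig and only illustrates the expected output; the function name `merge_ids` is made up, and the rows are the ones from the data file in the earlier answer.

```python
def merge_ids(rows):
    """rows: iterable of (group, second_field, id) tuples.
    Returns a list of (sorted unique ids, group, number of unique ids)."""
    merged = {}
    for group, _second, value in rows:
        # A set drops duplicates, which gives the union its set semantics.
        merged.setdefault(group, set()).add(value)
    return [(sorted(ids), group, len(ids)) for group, ids in sorted(merged.items())]


data = [("a", "b", 1), ("a", "b", 2), ("a", "b", 3), ("a", "b", 4),
        ("a", "d", 1), ("a", "b", 3), ("a", "b", 6), ("a", "e", 7),
        ("z", "b", 1), ("z", "b", 2), ("z", "b", 3)]
print(merge_ids(data))
# [([1, 2, 3, 4, 6, 7], 'a', 6), ([1, 2, 3], 'z', 3)]
```

The set here does what DISTINCT does before the GROUP in the Pig script: duplicates are removed before counting, so the count is 6 rather than 7 for group a.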
