Delete rows based on certain logic in power query - powerquery

I need to delete rows based on the logic below:
For each product, sum column B and compare the total with the value in column D for that product.
If the sum > the value in column D, delete the rows holding the extra ReceiptQty. In this case, for product AAA, the ReceiptQty total is 12000, which is > 10000, so row 7 should be deleted.
Is there any way to achieve this in Power Query? Thanks~

This code should work:
let
Source = Excel.CurrentWorkbook(){[Name="Data"]}[Content],
group = Table.Group(Source, {"ProductID"}, {"temp", each _}),
list = Table.AddColumn(group, "list", each List.Skip(List.Accumulate([temp][ReceiptQty], {0}, (a, b) => a & {List.Last(a) + b}))),
table = Table.AddColumn(list, "table", each Table.FromColumns(Table.ToColumns([temp])&{[list]}, Table.ColumnNames(Source)&{"RunningQty"})),
final = Table.SelectRows(Table.Combine(table[table]), each [OnhandQty] >= [RunningQty])
in
final
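To sanity-check what the query should return, the same running-total rule can be sketched outside Power Query. This is a minimal Python sketch, not the M code above: the sample rows are made up, and unlike the grouped M version it keeps rows in their original order rather than grouped by product.

```python
from collections import defaultdict

# Hypothetical sample rows: (ProductID, ReceiptQty, OnhandQty)
rows = [
    ("AAA", 5000, 10000),
    ("AAA", 5000, 10000),
    ("AAA", 2000, 10000),  # running total 12000 > 10000, so this row is dropped
    ("BBB", 300, 1000),
    ("BBB", 400, 1000),
]

def keep_within_onhand(rows):
    """Keep rows whose per-product running ReceiptQty stays <= OnhandQty."""
    running = defaultdict(int)  # running ReceiptQty per product
    kept = []
    for product, receipt, onhand in rows:
        running[product] += receipt
        # Same test as the final Table.SelectRows step: OnhandQty >= RunningQty
        if onhand >= running[product]:
            kept.append((product, receipt, onhand))
    return kept

result = keep_within_onhand(rows)
```

As in the M query, the running total keeps accumulating even for rows that are dropped, so every later row of that product is dropped too.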

Related

Power query, iterate over the column records to apply a custom cumulative calculation

Using Power Query in Excel. I am trying to implement a custom column that would iteratively calculate the row based on the previous row's value of the same column.
I have a 3 column table and the 4th column will be the calculation column that I am failing to implement.
The calculation is very easy to apply in Excel which goes as follows:
Formula in cell D3 --> =IF(A3=1,C3+6.4,IF(C3+D2>=12.8,12.8,IF(C3+D2<=1.28,1.28,C3+D2)))
The same formula is applied to the whole column by dragging.
The idea behind it:
For each category, I have an index column starting from 1,
If Index = 1, then Calculation is Value + 6.4,
else if Value + Value(previous row Custom cumulative) >= 12.8 then 12.8
else if Value + Value(previous row Custom cumulative) <= 1.28 then 1.28
else Value + Value(previous row Custom cumulative)
So, the calculation is a cumulative sum with an upper and lower cap built into it.
How can I implement this in Power Query and M-Language?
I really appreciate your help!
I have tried to use the List.Generate and List.Accumulate functions; however, I got stuck creating records that have values from multiple columns in them.
Try this
(edited to make more efficient with single pass process)
let Source = Excel.CurrentWorkbook(){[Name="Table15"]}[Content],
process = (zzz as list) => let x= List.Accumulate( zzz,{0},( state, current ) =>
if List.Last(state) =0 then List.Combine ({state,{6.4+current}}) else
if List.Last(state)+current >=12.8 then List.Combine ({state,{12.8}}) else
if List.Last(state)+current <=1.28 then List.Combine ({state,{1.28}}) else
List.Combine ({state,{List.Last(state)+current}})
) in x,
#"Grouped Rows" = Table.Group(Source, {"Category"}, {{"data", each
let a=process(_[Values])
in Table.AddColumn(_, "Custom Cumulative", each a{[Index]}), type table }}),
#"Expanded data" = Table.ExpandTableColumn(#"Grouped Rows", "data", {"Index", "Values", "Custom Cumulative"}, {"Index", "Values", "Custom Cumulative"})
in #"Expanded data"
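For checking the numbers against the Excel formula, the capped cumulative sum for a single category can be sketched in Python. This is only an illustrative sketch: the function name and the sample values are assumptions, and it detects the first row by position (Index = 1) rather than by testing whether the previous state is 0 as the M code does.

```python
def capped_cumulative(values, first_offset=6.4, low=1.28, high=12.8):
    """Cumulative sum for one category with the caps from the Excel formula.

    First row:  value + first_offset (not capped, as in the formula).
    Other rows: previous result + value, clamped to [low, high].
    """
    out = []
    for i, value in enumerate(values):
        if i == 0:
            out.append(value + first_offset)
        else:
            total = value + out[-1]
            out.append(min(high, max(low, total)))
    return out

# Hypothetical values for one category, ordered by Index
sample = [1.0, 5.0, 3.0, -20.0, 0.5]
result = capped_cumulative(sample)
```

With these sample values the third row hits the upper cap (12.8) and the fourth row hits the lower cap (1.28).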

(Power Query) Complicated sort

I have a complicated sorting that I want, and I'm just not sure how to get power query to do it. The TLDR version is "oldest new ones first, then newest old ones." So I want to split the sort between ascending/descending depending on what data are in the columns.
Certain columns on my sheet (I through K) contain the word 'Yes' if it is a new item, otherwise blank. Possible combinations of columns that have 'yes' in them:
I only, J only, K only, I + J, J + K, I + J + K
Here's the sort logic I want:
All rows with a Yes in K are listed first, ascending by date (column H), whether they have 'Yes' in columns I or J or not.
Then, of only the rows that are left, all rows with a Yes in J, ascending by date (column H)
Next, of only the rows that are left, all rows with a Yes in I, ascending by date (column H)
Finally, the only rows left should not have a Yes in any columns I-K. Of those rows, DEscending by date (Column H).
I can sort of maybe figure out how to do the sort up through step 3 by creating a custom column to label and identifying whether the row will go in the first, second, or third sort, then sorting by that custom column before sorting the others.
But step 4 is stumping me because of the reverse to descending instead of ascending. I'm thinking maybe grouping the data, sorting it within the group descending and outside the group ascending (as a 4th entry in the custom column that sorted the first 3), and then expanding it back out again after the external sort, or something?
Please help!
Currently I'm only able to sort the sheet ascending and can't sort part of it descending.
Filter a column, then sort it. Filter another column and sort it. etc. Put them together
Load your data into Power Query (Data ... From Table/Range ...) and paste the code below into Home ... Advanced Editor. It assumes your data is loaded as Table1 with column headers A, H, I, J, K, so change that to reflect your actual table name and column names. If you already have your own code, remove the first row and change Source in the second row to your #"PriorStepName".
sample code to transform image below on left to image on right:
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"A", Int64.Type}, {"H", type date}, {"I", type text}, {"J", type text}, {"K", type text}}),
Part1 = Table.Sort(Table.SelectRows(#"Changed Type", each ([K] = "Yes")),{{"H", Order.Ascending}}),
Part2 = Table.Sort(Table.SelectRows(#"Changed Type", each ([K] <> "Yes" and [J] = "Yes")),{{"H", Order.Ascending}}),
Part3 = Table.Sort(Table.SelectRows(#"Changed Type", each ([K] <> "Yes" and [J] <> "Yes" and [I] = "Yes")),{{"H", Order.Ascending}}),
Part4 = Table.Sort(Table.SelectRows(#"Changed Type", each ([K] <> "Yes" and [J] <> "Yes" and [I] <> "Yes")),{{"H", Order.Descending}}),
Combined = Table.Combine({Part1,Part2,Part3,Part4})
in Combined
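The filter, sort, and combine idea is easy to verify on a small data set. Below is a rough Python sketch of the same four-part split; the rows and the `split_sort` helper are hypothetical, and blank cells are represented as `None`.

```python
from datetime import date

# Hypothetical rows using the column names assumed in the answer (A, H, I, J, K)
rows = [
    {"A": 1, "H": date(2023, 3, 1), "I": None, "J": None, "K": "Yes"},
    {"A": 2, "H": date(2023, 1, 1), "I": "Yes", "J": None, "K": None},
    {"A": 3, "H": date(2023, 2, 1), "I": None, "J": None, "K": None},
    {"A": 4, "H": date(2023, 1, 15), "I": None, "J": "Yes", "K": "Yes"},
    {"A": 5, "H": date(2023, 4, 1), "I": None, "J": None, "K": None},
    {"A": 6, "H": date(2023, 2, 10), "I": None, "J": "Yes", "K": None},
    {"A": 7, "H": date(2023, 1, 20), "I": "Yes", "J": "Yes", "K": None},
]

def split_sort(rows):
    """Filter into four parts, sort each by date H, then concatenate."""
    def yes(row, col):
        return row[col] == "Yes"

    def part(keep, descending=False):
        return sorted((r for r in rows if keep(r)),
                      key=lambda r: r["H"], reverse=descending)

    part1 = part(lambda r: yes(r, "K"))
    part2 = part(lambda r: not yes(r, "K") and yes(r, "J"))
    part3 = part(lambda r: not yes(r, "K") and not yes(r, "J") and yes(r, "I"))
    part4 = part(lambda r: not (yes(r, "K") or yes(r, "J") or yes(r, "I")),
                 descending=True)
    return part1 + part2 + part3 + part4

order = [r["A"] for r in split_sort(rows)]
```

Each row lands in exactly one part, so no row is duplicated or lost when the parts are concatenated.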

Powerquery: passing column value to custom function

I'm struggling to pass a column value to a formula. I've tried many different combinations, but it only works when I hard-code the column:
(tbl as table, col as list) =>
let
avg = List.Average(col),
sdev = List.StandardDeviation(col)
in
Table.AddColumn(tbl, "newcolname" , each ([column] - avg)/sdev)
I'd like to replace [column] by a variable. In fact, it's the column I use for the average and the standard deviation.
Any help, please.
Thank you
This probably does what you want. Call it as x = fctn(Source,"ColumnA")
It computes the average and standard deviation of ColumnA in the Source table and applies them to that same column.
(tbl as table, col as text) =>
let
avg = List.Average(Table.Column(tbl,col)),
sdev = List.StandardDeviation(Table.Column(tbl,col))
in Table.AddColumn(tbl, "newcolname" , each (Record.Field(_, col) - avg)/sdev)
Or potentially you want this: it computes the average and standard deviation on the list provided (which can come from any table) and then applies them to the named column of the table passed in.
Call it as x = fctn(Source,"ColumnNameInSource",SomeSource[SomeColumn])
(tbl as table, cname as text, col as list) =>
let
avg = List.Average(col),
sdev = List.StandardDeviation(col)
in Table.AddColumn(tbl, "newcolname" , each (Record.Field(_, cname) - avg)/sdev)
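The key point in both versions is that the column is looked up by name at run time instead of being hard-coded. A small Python sketch of the first variant shows the expected numbers; `add_zscore` and the sample rows are hypothetical. `statistics.stdev` is the sample standard deviation, which is also what List.StandardDeviation is documented to return.

```python
from statistics import mean, stdev

def add_zscore(rows, col, new_col="newcolname"):
    """Return copies of rows with (value - average) / standard deviation added.

    rows: list of dicts (one per table row); col: column name passed as text.
    """
    values = [row[col] for row in rows]
    avg = mean(values)
    sdev = stdev(values)  # sample standard deviation (n - 1 denominator)
    # row[col] plays the role of Record.Field(_, col) in the M code
    return [{**row, new_col: (row[col] - avg) / sdev} for row in rows]

# Hypothetical table with a single numeric column
source = [{"ColumnA": 2}, {"ColumnA": 4}, {"ColumnA": 6}]
result = add_zscore(source, "ColumnA")
```

For this sample the average is 4 and the sample standard deviation is 2, so the new column holds -1, 0 and 1.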

Filter inner bag in Pig

The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
I want to filter those rows where event_list contains 2. I thought initially to flatten the data and then filter those rows that have 2. Somehow flatten doesn't work on this dataset.
Can anyone please help?
There might be a simpler way of doing this, like a bag lookup etc. Otherwise with basic pig one way of achieving this is:
data = load 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);
-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;
-- keep only distinct ids.
dist = distinct (foreach filtered generate $0 as (id:chararray));
-- join distinct ids to original relation (the loaded relation is named data)
jnd = join data by id, dist by id;
-- remove extra fields, keep original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;
(22678,{(112),(110),(2)})
(6676,{(2),(112)})
You can filter the bag and project a boolean that says whether 2 is present in the bag. Then filter the rows on whether that projection is true.
So..
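-- input and output are reserved words in Pig, so other aliases are used here.
-- The inner field is named val_0 in the schema so the nested FILTER can refer to it.
raw = LOAD 'data.txt' AS (id:chararray, event_list:bag{t:tuple(val_0:chararray)});
raw_filt = FOREACH raw {
    bag_filter = FILTER event_list BY (val_0 matches '2');
    GENERATE
        id,
        event_list,
        (IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
result = FILTER raw_filt BY is_2_present;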
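As a cross-check of the expected output, the same "keep rows whose bag contains 2" rule can be written in a few lines of Python. This is only a reference sketch on the three sample rows from the question, with each bag modelled as a list of strings; it is not a replacement for the Pig script.

```python
# Sample rows from the question: (id, event_list)
rows = [
    ("22678", ["112", "110", "2"]),
    ("656565", ["110", "109"]),
    ("6676", ["2", "112"]),
]

def rows_with_event(rows, event="2"):
    """Keep rows whose event_list contains the given event value."""
    return [(row_id, events) for row_id, events in rows if event in events]

kept = rows_with_event(rows)
kept_ids = [row_id for row_id, _ in kept]
```

The ids that survive are 22678 and 6676, matching the dump shown in the first answer.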

Hadoop Pig GROUP by id, get owner_id?

In Hadoop I have many records that look like this:
(item_id,owner_id,counter) - there could be duplicates but ALWAYS the item_id has the same owner_id!
I want to get the SUM of the counter for each item_id so I have the following script:
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id);
data = FOREACH group_by_item GENERATE group AS item_id, OWNER_ID_COLUMN_SOMEHOW, SUM(known_items.counter) AS items_count;
The problem is that in the FOREACH if I want to take known_items.owner_id - that would be a tuple that has the sum of all grouped item_id. What would be the most efficient way to get the first one of the owners?
The simplest solution gives you the right answer if your assumption that each item_id has the same owner_id is correct, and will let you know if it is not: include the owner_id as part of the group.
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id, owner_id);
data = FOREACH group_by_item GENERATE FLATTEN(group), SUM(known_items.counter) AS items_count;
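The effect of grouping on the (item_id, owner_id) pair can be illustrated with a short Python sketch. The records below are made up, and `sum_by_item_and_owner` is a hypothetical helper, not part of the Pig script.

```python
from collections import defaultdict

# Hypothetical records: (item_id, owner_id, counter)
records = [
    (1, 10, 3),
    (1, 10, 4),
    (2, 20, 5),
    (3, 0, 9),   # owner_id is not > 0, so this record is filtered out
    (2, 20, 1),
]

def sum_by_item_and_owner(records):
    """Sum counter per (item_id, owner_id), keeping only owner_id > 0."""
    totals = defaultdict(int)
    for item_id, owner_id, counter in records:
        if owner_id > 0:
            totals[(item_id, owner_id)] += counter
    # Flatten the group key, like FLATTEN(group) in the Pig script
    return [(item, owner, total) for (item, owner), total in totals.items()]

result = sum_by_item_and_owner(records)
```

If an item_id ever appeared with two different owner_id values, it would show up as two output rows, which is how this approach reveals a broken assumption.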
