Aggregation Operation in Kettle / Pentaho - business-intelligence

I'm trying to do an aggregate operation between some columns from an Excel file input. I have the following case:
Column 1 Column 2 Column 3
X $15 A
X $20 A
Y $1 B
Y $1 B
Y $3 C
And I want to achieve this aggregation:
Column 1 Column 2 Column 3
X $35 A
Y $2 B
Y $3 C
As you can see, Columns 1 and 3 are the grouping criteria, and I want the sum of Column 2 for each group.
Is there any way to do this in Pentaho Data Integration? I've tried "Join Rows" and "Join Rows (Cartesian product)", but I got no results.

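Take a look at the Group By step. It lets you group by Column 1 and Column 3 and sum Column 2. Note that Group By expects its input to be sorted on the grouping fields, so put a Sort Rows step in front of it or use Memory Group By instead.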
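To make the intended result concrete, here is a minimal sketch in Python of what grouping by Column 1 and Column 3 and summing Column 2 amounts to. This is not PDI code; the function name `group_sum` is made up for illustration, and the amounts are assumed to be plain numbers already (the "$" formatting stripped).

```python
from collections import defaultdict

def group_sum(rows):
    """Sum the middle value of each (col1, amount, col3) row per (col1, col3) key."""
    totals = defaultdict(int)
    for col1, amount, col3 in rows:
        totals[(col1, col3)] += amount
    # Dicts keep insertion order, so groups come out in first-seen order.
    return [(col1, total, col3) for (col1, col3), total in totals.items()]

rows = [("X", 15, "A"), ("X", 20, "A"), ("Y", 1, "B"), ("Y", 1, "B"), ("Y", 3, "C")]
for row in group_sum(rows):
    print(row)
```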

Related

Show number of distinct values in a grouped row in a matrix visual in Power BI

I have a matrix visual in Power BI with two row fields, rf1 and rf2. This groups row field 2 (rf2) by row field 1 (rf1) such that each value in rf1 contains multiple values from rf2. rf1 and rf2 are stored in different tables in the data model, but the tables are connected directly.
I would like to show on the matrix visual the number of unique rf2 values within each rf1 against the corresponding row.
For example (first two columns as collapsible groups, as in the matrix visual):
rf1     rf2   rf2 count   Values
group1        3           10
        a                 3
        b                 1
        c                 6
group2        2           5
        a                 2
        d                 3
Tot           5           15
What measure do I need to be able to generate this view?

Best approach for formula of matching values in sheet 2

I need to populate column A in sheet two based on multiple columns in sheet one.
For example, here are two of multiple conditions:
If columns A,B,C,D (of sheet 2) are all 5/6 then populate corresponding row in sheet one with "mid".
If columns A,B,C,D (of sheet 2) contain at least one 3 and L,M,O contain all 0s, populate "low".
I believe using SWITCH would make the most sense, unless someone can recommend a simpler approach.
My main issue is with the syntax of writing this, I am getting a formula parse error:
=SWITCH(Sheet 1!G2:G&K2:K,ISBETWEEN(5,6),"mid")
Sheet 1
A B C D E F G H I J K L M N O
2 2 3 2 0 0 0 0
5 5 6 6
In row one of my example sheet 2 would get "mid" and row 2 would get "low"
try:
=ARRAYFORMULA(
IF( 4=LEN(REGEXREPLACE(FLATTEN(QUERY(TRANSPOSE(A1:D5),,9^9)), "[^5-6]+", )), "mid",
IF((4=LEN(REGEXREPLACE(FLATTEN(QUERY(TRANSPOSE(L1:O5),,9^9)), "[^0]+", )))*(REGEXMATCH(FLATTEN(QUERY(TRANSPOSE(A1:D5),,9^9)), "3")), "low", )))
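For readers who find the nested formula hard to follow, the two rules it encodes can be sketched in Python. This is only an illustration of the logic, not spreadsheet code: `classify` is a hypothetical name, each row is assumed to arrive as two lists of cell values (columns A:D and L:O), and blank cells are assumed to be `None`.

```python
def classify(abcd, lmno):
    """Return "mid", "low", or "" for one row, mirroring the two rules above."""
    # Rule 1: every one of A, B, C, D is a 5 or a 6.
    if all(v in (5, 6) for v in abcd):
        return "mid"
    # Rule 2: at least one 3 in A:D, and every cell in L:O is 0.
    if 3 in abcd and all(v == 0 for v in lmno):
        return "low"
    return ""

print(classify([5, 5, 6, 6], [None, None, None, None]))
print(classify([2, 2, 3, 2], [0, 0, 0, 0]))
```

Further conditions would slot in as extra `if` branches, in the same way further nested IFs would in the formula.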

How to convert a table into a matrix in KNIME?

Is it possible to convert a table to a matrix, and if so, how?
My output table is structured as follows:
rowid user item value
0 x A 10
1 x B 15
2 x C 0
3 y A 12
4 y B 17
5 y C 25
My goal is to create a matrix in the following form:
rowid A B C
x 10 15 0
y 12 17 25
Use a Pivoting node with the following settings:
Group column(s) user
Pivot column(s) item
Manual Aggregation > Column value
Advanced Settings > Column name: Pivot name
You can leave the Aggregation set to First.
Connect the Pivot table output to a RowID node with settings:
Replace RowID with… checked
New RowID column user
Remove selected column checked
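As a cross-check of what the Pivoting and RowID nodes should produce, here is a rough Python sketch of the same long-to-wide reshaping. It is not KNIME code; the function name `pivot` is hypothetical, and it keeps the first value seen per user/item pair, which is an assumption mirroring the "First" aggregation.

```python
def pivot(rows):
    """Turn (user, item, value) rows into column names plus one row per user."""
    items = []   # pivot column names, in first-seen order
    table = {}   # user -> {item: value}
    for user, item, value in rows:
        if item not in items:
            items.append(item)
        # setdefault keeps the first value seen for a user/item pair
        table.setdefault(user, {}).setdefault(item, value)
    matrix = {user: [cells.get(i) for i in items] for user, cells in table.items()}
    return items, matrix

rows = [("x", "A", 10), ("x", "B", 15), ("x", "C", 0),
        ("y", "A", 12), ("y", "B", 17), ("y", "C", 25)]
columns, matrix = pivot(rows)
print(columns)
for user, values in matrix.items():
    print(user, values)
```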

SAS grouping algorithm

I have the following mock up table
#n a b group
1 1 1 1
2 1 2 1
3 2 2 1
4 2 3 1
5 3 4 2
6 3 5 2
7 4 5 2
I am using SAS for this problem. In column group, the rows that are interconnected through a and b are grouped. I will try to explain why these rows are in the same group
Rows 1 and 2 are in group 1 since they both have a = 1
Row 3 is in group 1 since b = 2 in rows 2 and 3, and row 2 is in group 1
Rows 3 and 4 are in group 1 since a = 2 in both rows and row 3 is in group 1
The overall logic is that if a row x contains the same value of a or b as row y, row x belongs to the same group as row y.
Following the same logic, rows 5, 6 and 7 are in group 2.
Is there any way to make an algorithm to find these groups?
Case I:
Grouping defined as item linkage within contiguous rows.
Use the LAG function to examine the prior values of both variables. Increase the group value only if both have changed. For example
group + ( a ne lag(a) and b ne lag(b) );
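The same rule can be sketched outside SAS to check it against the mock-up table. The Python below is only an illustration (`contiguous_groups` is a made-up name); it treats the first row as having no previous values, which matches LAG returning missing on the first observation.

```python
def contiguous_groups(rows):
    """Assign group numbers to (a, b) rows, starting a new group only
    when both a and b differ from the previous row."""
    groups = []
    group = 0
    prev = None
    for a, b in rows:
        if prev is None or (a != prev[0] and b != prev[1]):
            group += 1
        groups.append(group)
        prev = (a, b)
    return groups

rows = [(1, 1), (1, 2), (2, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
print(contiguous_groups(rows))
```

This only works when linked rows sit next to each other; if the data is not ordered that way, Case II applies.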
Case II:
Grouping determined from pair item slot value linkages over all data.
From grouping pairs by either key
General statement of problem:
-----------------------------
Given: P = p{i} = (p{i,1},p{i,2}), a set of pairs (key1, key2).
Find: The distinct groups, G = g{x}, of P,
such that each pair p in a group g has this property:
key1 matches key1 of any other pair in g.
-or-
key2 matches key2 of any other pair in g.
Demonstrates
… an iterative way using hashes.
Two hashes maintain the groupId assigned to each key value.
Two additional hashes are used to maintain group mapping paths.
When the data can be passed without causing a mapping, then the groups have
been fully determined.
A final pass is done, at which point the groupIds are assigned to each
pair and the data is output to a table.
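For comparison, the Case II problem can also be sketched with a union-find (disjoint set) structure. To be clear, this is a different technique from the iterative hash approach described above, and it is written in Python rather than SAS; the name `link_groups` and the ("a", value) / ("b", value) node labels are assumptions made for the sketch.

```python
def link_groups(pairs):
    """Group (a, b) pairs that are connected through a shared a or b value."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    # Each row links its a value to its b value.
    for a, b in pairs:
        union(("a", a), ("b", b))

    # Final pass: number the groups in order of first appearance.
    ids = {}
    result = []
    for a, _ in pairs:
        root = find(("a", a))
        ids.setdefault(root, len(ids) + 1)
        result.append(ids[root])
    return result

pairs = [(1, 1), (1, 2), (2, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
print(link_groups(pairs))
```

Unlike the LAG approach, this does not depend on row order, so linked rows that are far apart still end up in the same group.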

using awk to average specified rows

I have a data file set up like
a 1
b 2
c 3
d 4
a 5
b 6
c 7
d 6
etc
and I would like to output to a new file
a average of 2nd column from all "a" rows
b average of 2nd column from all "b" rows
etc
where a, b, c... are also numbers.
I have been able to do this for specific values (1.4 in the example below) of the 1st column using awk:
awk '{ if ( $1 == 1.4) total += $2; count++ }
END {print total/10 }' data
though count is not giving me the correct number of rows (i.e. count should be 10; I have manually put in 10 to compute the average in the last line).
I assume a for loop will be required but I have not been able to implement that correctly.
Please help. Thanks.
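In your attempt, count++ sits outside the if, so it runs on every line rather than only on the matching ones. Keeping a per-key sum and count in arrays handles all keys in one pass:
awk '{a[$1]+=$2;c[$1]++}END{for(x in a)printf "average of %s is %.2f\n",x,a[x]/c[x]}' data
The output of the above line (with your example input) is as follows, though awk does not guarantee the order in which for (x in a) visits the keys: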
average of a is 3.00
average of b is 4.00
average of c is 5.00
average of d is 5.00
