Can we aggregate a dynamic number of rows using Talend Open Studio?

I'm a beginner in Talend Open Studio, and I'm trying to do the transformation below.
From a SQL table that contains:
DeltaStock            Date
-------------------   ----
+50 (initial stock)   J0
+80                   J1
-30                   J2
...                   ...
I want to produce this table:
Stock   Date
-----   ----
50      J0
130     J1
100     J2
...     ...
Do you think this is possible with TOS? I thought of using tAggregateRow, but it didn't seem appropriate for this problem.

There's probably an easier way to do this using the tMemorizeRows component but the first thought that comes to mind is to use the globalMap to store a rolling sum.
In Talend it is possible to store an object (any value of any type) in the globalMap so that it can be retrieved later on in the job. This mechanism is used automatically whenever you use a tFlowToIterate component, which lets you retrieve the values of the row currently being iterated on from the globalMap.
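For instance, if a tFlowToIterate iterates over a flow named row1 that has a column ip, a downstream component can read the current value back with a key of the form "flowName.columnName" (a minimal sketch; the flow and column names here are illustrative):
// Sketch: reading a value that tFlowToIterate put in the globalMap.
String ip = (String) globalMap.get("row1.ip");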
A very basic sample job chains a tJava, a tFixedFlowInput, a tJavaRow and a tLogRow.
The tJava component only initialises the rolling sum in the globalMap with the following code:
//Initialise the rollingSum global variable
globalMap.put("rollingSum", 0);
After this we connect the tJava to the rest of the job with an OnSubjobOk link to make sure we only carry on once we've managed to put the rollingSum into the globalMap.
I then provide my data using a tFixedFlowInput component, which makes it easy to hardcode some values for this example job; you could easily replace it with any input. I have used the sample input data from the question.
We then process the data using a tJavaRow which will do some transformations on the data row by row. I've used the following code which works for this example:
//Initialise the operator and the value variables
String operator = "";
Integer value = 0;

//Get the current rolling sum
Integer rollingSum = (Integer) globalMap.get("rollingSum");

//Extract the operator and the value from the delta
//(Pattern and Matcher need "import java.util.regex.*;" in the
//component's Advanced settings -> Import)
Pattern p = Pattern.compile("^([+-])([0-9]+)$");
Matcher m = p.matcher(input_row.deltaStock);

//If the regular expression matches then extract the operator and the value
if (m.find()) {
    operator = m.group(1);
    value = Integer.parseInt(m.group(2));
}

//Apply the operator
if ("+".equals(operator)) {
    rollingSum += value;
} else if ("-".equals(operator)) {
    rollingSum -= value;
} else {
    System.out.println("The operator provided wasn't a + or a -");
}

//Put the new rollingSum back into the globalMap
globalMap.put("rollingSum", rollingSum);

//Output the data
output_row.stock = rollingSum;
output_row.date = input_row.date;
There's quite a lot going on there, but basically it starts by getting the current rollingSum from the globalMap.
Next, it uses a regular expression to split the deltaStock string into an operator and a value. It then uses the operator provided (plus or minus) to either add the value to the rollingSum or subtract it.
After this it puts the new rollingSum back into the globalMap and outputs the two columns, stock and date (the date unchanged).
In my sample job I then output the data using a tLogRow which will print the values of the data to the console. I typically select the table formatting option in it and in this case I get the following output:
.-----+----.
|tLogRow_8 |
|=----+---=|
|stock|date|
|=----+---=|
|50   |J0  |
|130  |J1  |
|100  |J2  |
'-----+----'
Which should be what you were looking for.

You should be able to do it in Talend Open Studio.
I attach here an image with the job, the content of the tJavaRow and the execution result.
Below the tFixedFlowInput used to simulate the input I left a tJDBCInput that you should use to read the data from your DB. Hopefully you can use a specific tXXXInput for your DB instead of the generic JDBC one.
Here is the simple code in the tJavaRow:
//Code generated according to input schema and output schema
output_row.delta = input_row.delta;
output_row.date = input_row.date;

//Read the current rolling sum (a tJava connected with OnSubjobOk must
//have initialised it first, e.g. globalMap.put("rollingSum", 0);)
output_row.rollingSum =
    Integer.parseInt(globalMap.get("rollingSum").toString());
int delta = Integer.parseInt(input_row.delta);
output_row.rollingSum += delta;

//Save the rolling sum for the next row
globalMap.put("rollingSum", output_row.rollingSum);
Beware of the exceptions parseInt() can throw; you should handle them the way you feel is right.
In my projects I usually have a SafeParse library that does not throw exceptions but returns a default value that I pass in together with the value to be parsed.
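A minimal sketch of such a helper, assuming a plain static utility class (the class name and signature are illustrative, not the author's actual library):
// Hypothetical SafeParse helper: parse an int, falling back to a default.
public final class SafeParse {

    private SafeParse() {
    }

    public static int toInt(String raw, int defaultValue) {
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }
}
With it, the tJavaRow line int delta = Integer.parseInt(input_row.delta); would become int delta = SafeParse.toInt(input_row.delta, 0);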

Related

Office Script - Split strings in vector & use dynamic cell address - Run an excel script from Power Automate

I'm completely new to Office Scripts (with only old experience in Python and C++) and I'm trying to run a rather "simple" Office Script on Excel from Power Automate. The goal is to fill specific cells (always the same ones, their position shouldn't change) in the Excel file.
The Power Automate part is working; the problem is using the information sent to Excel, inside Excel.
The script takes three variables from Power Automate (all three strings) and should fill specific cells based on them:
CMQ_Name: string to use as is.
Version: string to use as is.
PT_Name: string with names separated by ";". The goal is to split it into as many strings as needed (I'm storing them in an array) and write each name in cells on top of each other, always starting at the same position (cell A2).
I'm able to use CMQ_Name & Version and put them in the cells they're supposed to go in; I've already made that work.
However, I cannot make the last part (part 2 in the code below) work.
Learning this has been pretty frustrating, as some elements seem to sometimes work and sometimes not. Newbie me is probably having syntax issues more than anything...
function main(workbook: ExcelScript.Workbook,
  CMQ_Name: string,
  Version: string,
  PT_Name: string) {
  // create a reference for each sheet in the Excel document
  let NAMES = workbook.getWorksheet("CMQ_NAMES");
  let TERMS = workbook.getWorksheet("CMQ_TERMS");

  //------Part 1: update entries in sheet CMQ_NAMES
  NAMES.getRange("A2").setValues(CMQ_Name);
  NAMES.getRange("D2").setValues(Version);
  // update entries in sheet CMQ_TERMS
  TERMS.getRange("A2").setValues(CMQ_Name);

  //-------Part 2: work with PT_Name
  // split PT_Name
  let ARRAY1: string[] = PT_Name.split(";");
  let CELL: string;
  let B: string = "B";
  for (var i = 0; i < ARRAY1.length; i++) {
    CELL = B.concat(i.toString());
    NAMES.getRange(CELL).setValues(ARRAY1[i]);
  }
}
I have several problems:
Some parts (basically anything in red) are detected as a problem and I have no idea why. Some research indicated these could be false positives, other research not. It's not the biggest problem either, as the code sometimes works despite these warnings:
Argument of type 'string' is not assignable to parameter of type '(string | number | boolean)[][]'.
I couldn't find a way to use a variable as the address to select a specific cell to write in, which is preventing the for loop at the end from working. I've been bashing my head against this for a week now without solving it.
Could you kindly take a look?
Thank you!!
I tried several workarounds and other syntaxes without much success. Writing the first two strings into cells works; working with the third string doesn't.
EDIT: Thanks to the below comment, I managed to make it work:
function main(
  workbook: ExcelScript.Workbook,
  CMQ_Name: string,
  Version: string,
  PT_Name: string) {
  // create a reference for each sheet
  let NAMES = workbook.getWorksheet("CMQ_NAMES");
  let TERMS = workbook.getWorksheet("CMQ_TERMS");

  //------Part 0: clear previous info
  TERMS.getRange("B2:B200").clear();

  //------Part 1: update entries in sheet CMQ_NAMES
  NAMES.getRange("A2").setValue(CMQ_Name);
  NAMES.getRange("D2").setValue(Version);
  // update entries in sheet CMQ_TERMS
  TERMS.getRange("A2").setValue(CMQ_Name);

  //-------Part 2: work with PT_Name
  // split PT_Name
  let ARRAY1: string[] = PT_Name.split(";");
  let CELL: string;
  let B: string = "B";
  for (var i = 2; i < ARRAY1.length + 2; i++) {
    CELL = B.concat(i.toString());
    //console.log(CELL); // debugging
    TERMS.getRange(CELL).setValue(ARRAY1[i - 2]);
  }
}
You're using setValues() (plural), which accepts a two-dimensional array of values containing the data for the given rows and columns.
You need to use setValue() instead, as that takes a single argument of type any.
https://learn.microsoft.com/en-us/javascript/api/office-scripts/excelscript/excelscript.range?view=office-scripts#excelscript-excelscript-range-setvalue-member(1)
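For completeness, a sketch of the difference on a single cell (using the names from the question; the 2-D array shape is the point):
// setValue() takes a single value for a single cell ...
NAMES.getRange("A2").setValue(CMQ_Name);
// ... while setValues() takes a 2-D array matching the range shape,
// so passing a lone string is what triggers the type error above.
NAMES.getRange("A2").setValues([[CMQ_Name]]);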
As for using a variable to retrieve a single cell (or a set of cells for that matter), you really just need to pass it to the getRange() method. Here is a basic example ...
function main(workbook: ExcelScript.Workbook) {
  let cellAddress: string = "A4";
  let range: ExcelScript.Range = workbook.getWorksheet("Data").getRange(cellAddress);
  console.log(range.getAddress());
}
If you want to specify a multi-cell range, just change cellAddress to something like this ... A4:C10
That method also accepts named ranges.
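For example (a sketch, assuming the workbook contains a named range called MyNamedRange on the Data sheet):
function main(workbook: ExcelScript.Workbook) {
  // getRange() accepts a cell address, a multi-cell address, or a range name.
  let named: ExcelScript.Range = workbook.getWorksheet("Data").getRange("MyNamedRange");
  console.log(named.getAddress());
}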

If Statement error: Expressions that yield variant data-type cannot be used to define calculated columns

I'm new to Power BI and more used to working in Excel. I'm trying to translate the following Excel formula:
=IF(A2="UPL";0;IF(MID(D2;FIND("OTP";D2)+3;1)=" ";"1";(MID(D2;FIND("OTP";D2)+3;1))))
into Power BI as follows:
Algo =
VAR FindIT = FIND("OTP", Fixed_onTop_Data[Delivery Date], 1, 0)
RETURN
    IF(Fixed_onTop_Data[Delivery Type] = "UPL", 0,
        IF(FindIT = BLANK(), 1, MID(Fixed_onTop_Data[Delivery Date], FindIT + 3, 1))
    )
Unfortunately I receive following error message:
Expressions that yield variant data-type cannot be used to define calculated columns.
Thank you so much for your help!
You can't mix two data types in your output. In one branch of the IF you return an INT (literally 0/1), and in the second you return a STRING:
MID(Fixed_onTop_Data[Delivery Date], FindIT + 3, 1)
You must unify your output data type -> everything to STRING or everything to INT.
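For instance, a sketch that unifies everything to STRING, based on the formula from the question (note that the question's FIND() call passes 0 as the value to return when "OTP" is not found, so the not-found test is written against 0 here):
Algo =
VAR FindIT = FIND("OTP", Fixed_onTop_Data[Delivery Date], 1, 0)
RETURN
    IF(Fixed_onTop_Data[Delivery Type] = "UPL", "0",
        IF(FindIT = 0, "1", MID(Fixed_onTop_Data[Delivery Date], FindIT + 3, 1))
    )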
Your code must be returning BLANK in some cells, therefore Power BI isn't able to choose a data type for the column; wrap your code inside CONVERT(..., INTEGER).
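A sketch of that suggestion applied to the same column (it assumes the character MID() extracts is always numeric; otherwise the conversion itself will fail):
Algo =
VAR FindIT = FIND("OTP", Fixed_onTop_Data[Delivery Date], 1, 0)
RETURN
    CONVERT(
        IF(Fixed_onTop_Data[Delivery Type] = "UPL", 0,
            IF(FindIT = 0, 1, MID(Fixed_onTop_Data[Delivery Date], FindIT + 3, 1))
        ),
        INTEGER
    )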

How to create missing records within date-time range in pig latin

I have input records of the form
2013-07-09T19:17Z,f1,f2
2013-07-09T03:17Z,f1,f2
2013-07-09T21:17Z,f1,f2
2013-07-09T16:17Z,f1,f2
2013-07-09T16:14Z,f1,f2
2013-07-09T16:16Z,f1,f2
2013-07-09T01:17Z,f1,f2
2013-07-09T16:18Z,f1,f2
These represent timestamps and events. I have written these by hand, but the actual data will be sorted by time.
I would like to generate a set of records which would be input to graph plotting function which needs continuous time series. I would like to fill in missing values, i.e. if there are entries for "2013-07-09T19:17Z" and "2013-07-09T19:19Z", I would like to generate entry for "2013-07-09T19:18Z" with predefined value.
My thoughts on doing this:
1. Use MIN and MAX to find the start and end dates in the series.
2. Write a UDF which takes the min and max and returns a relation with the missing timestamps.
3. Join the above 2 relations.
I cannot get my head around how to implement this in Pig, though. Would appreciate any help.
Thanks!
Generate another file using a script (outside Pig) with all timestamps between MIN and MAX, including MIN and MAX. Load this as a second data set. Here is a sample that I used from your data set. Please note I filled in only a few gaps, not all:
2013-07-09T01:17Z,d1,d2
2013-07-09T01:18Z,d1,d2
2013-07-09T03:17Z,d1,d2
2013-07-09T16:14Z,d1,d2
2013-07-09T16:15Z,d1,d2
2013-07-09T16:16Z,d1,d2
2013-07-09T16:17Z,d1,d2
2013-07-09T16:18Z,d1,d2
2013-07-09T19:17Z,d1,d2
2013-07-09T21:17Z,d1,d2
Do a COGROUP on the original dataset and the generated dataset above. Use a nested FOREACH ... GENERATE to write the output dataset: if the bag from the first dataset is empty, use the values from the second set, else use the first dataset. Here is the piece of code I used on these two datasets:
Org_Set = LOAD 'pigMissingData/timeSeries' USING PigStorage(',') AS (timeStamp, fl1, fl2);
Default_set = LOAD 'pigMissingData/timeSeriesFull' USING PigStorage(',') AS (timeStamp, fl1, fl2);
coGrouped = COGROUP Org_Set BY timeStamp, Default_set BY timeStamp;
Filled_Data_set = FOREACH coGrouped {
    x = COUNT(Org_Set); -- number of original rows for this timestamp
    y = (x == 0 ? (Default_set.fl1, Default_set.fl2) : (Org_Set.fl1, Org_Set.fl2));
    GENERATE FLATTEN(group), FLATTEN(y.$0), FLATTEN(y.$1);
};
if you need further clarification or help let me know
In addition to @Rags' answer, you could use the STREAM x THROUGH command and a simple awk script (similar to this one) to generate the date range once you have the min and max dates. Something similar to the following (untested! - you might need to put the awk script on a single line with semicolons as command delimiters, or better, ship it as a script file):
grunt> describe bounds;
(min:chararray, max:chararray)
grunt> dump bounds;
(2013/01/01,2013/01/04)
grunt> fullDateBounds = STREAM bounds THROUGH `gawk '{
  split($1,s,"/")
  split($2,e,"/")
  st=mktime(s[1] " " s[2] " " s[3] " 0 0 0")
  et=mktime(e[1] " " e[2] " " e[3] " 0 0 0")
  for (i=st;i<=et;i+=60*60*24) print strftime("%Y/%m/%d",i)  # step one day, in seconds
}'`;

Process an input file having multiple name value pairs in a line

I am writing a Kettle transformation.
My input file looks like the following:
sessionId=40936a7c-8af9|txId=40936a7d-8af9-11e|field3=val3|field4=val4|field5=myapp|field6=03/12/13 15:13:34|
Now, how do I process this file? I am completely at a loss.
The first step is a CSV file input with | as the delimiter.
My analysis will be based on the "value" part of each name=value pair.
Has anyone processed such files before?
Since you have already split the records into fields of the form 'key=value', you could use an expression transform to cut each string in two by locating the position of the = character, creating two out ports where one holds the key and the other the value.
From there it depends what you want to do with the information: if you want to store them as key/value, route them through a union, or use a router transform to send them to different targets.
You could use the Modified Java Script Value step; add it after the step that splits the line on pipes.
Then do some parsing JavaScript like this:
var mainArr = new Array();
var sessionIdSplit = sessionId.toString().split("|");
for (var y = 0; y < sessionIdSplit.length; y++) {
    mainArr[y] = sessionIdSplit[y].toString();
    // here you can add another loop to parse again and split the key=value
}
Alert("mainArr: " + mainArr);

How can I use the map datatype in Apache Pig?

I'd like to use Apache Pig to build a large key -> value mapping, look things up in the map, and iterate over the keys. However, there does not even seem to be syntax for doing these things; I've checked the manual, wiki, sample code, Elephant book, Google, and even tried parsing the parser source. Every single example loads map literals from a file... and then never uses them. How can you use Pig's maps?
First, there doesn't seem to be a way to load a 2-column CSV file into a map directly. If I have a simple map.csv:
1,2
3,4
5,6
And I try to load it as a map:
m = load 'map.csv' using PigStorage(',') as (M: []);
dump m;
I get three empty tuples:
()
()
()
So I try to load tuples and then generate the map:
m = load 'map.csv' using PigStorage(',') as (key:chararray, val:chararray);
b = foreach m generate [key#val];
ERROR 1000: Error during parsing. Encountered " "[" "[ "" at line 1, column 24.
...
Many variations on the syntax also fail (e.g., generate [$0#$1]).
OK, so I munge my map into Pig's map literal format as map.pig:
[1#2]
[3#4]
[5#6]
And load it up:
m = load 'map.pig' as (M: []);
Now let's load up some keys and try lookups:
k = load 'keys.csv' as (key);
dump k;
3
5
1
c = foreach k generate m#key; /* Or m[key], or... what? */
ERROR 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Hrm, OK, maybe since there are two relations involved, we need a join:
c = join k by key, m by /* ...um, what? */ $0;
dump c;
ERROR 1068: Using Map as key not supported.
c = join k by key, m by m#key;
dump c;
Error 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Fail. How do I refer to the key (or value) of a map? The map schema syntax doesn't seem to let you even name the key and value (the mailing list says there's no way to assign types).
Finally, I'd just like to be able to find all the keys in my map:
d = foreach m generate ...oh, forget it.
Is Pig's map type half-baked? What am I missing?
Currently Pig maps need the key to be a constant chararray (string) that you supply, not a variable which contains a string. So in map#key, the key has to be a constant string that you supply (e.g. map#'keyvalue').
The typical use case for this is to load a complex data structure in which one of the elements is a key/value pair; later, in a FOREACH statement, you can refer to a particular value based on the key you are interested in.
http://pig.apache.org/docs/r0.9.1/basic.html#map-schema
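A minimal sketch of that pattern, reusing the map.pig file from the question:
m = load 'map.pig' as (M: []);
-- the lookup key must be a constant chararray literal
v = foreach m generate M#'1';
dump v;
-- yields (2) for the [1#2] row and () for the others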
In Pig version 0.10.0 there is a new function available called "TOMAP" (http://pig.apache.org/docs/r0.10.0/func.html#tomap) that converts its odd (chararray) parameters to keys and its even parameters to values. Unfortunately I haven't found it to be that useful, since I typically deal with arbitrary dicts of varying lengths and keys.
I would find a TOMAP function that took a tuple as a single argument, instead of a variable number of parameters, much more useful.
This isn't a complete solution to your problem, but the availability of TOMAP gives you some more options for constructing a real solution.
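For instance, a sketch that builds a one-entry map per row from the question's map.csv columns (TOMAP takes alternating key/value arguments):
m = load 'map.csv' using PigStorage(',') as (key:chararray, val:chararray);
withMap = foreach m generate TOMAP(key, val) as M;
-- lookups still require a constant key: returns 2 for the "1,2" row, empty otherwise
vals = foreach withMap generate M#'1';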
Great question!
I personally do not like maps in Pig. They have a place in traditional programming languages like Java, C# etc., where it's really handy and fast to look up a key in a map. On the other hand, maps in Pig have very limited features.
As you rightly pointed out, one cannot look up a variable key in a map in Pig. The key needs to be a constant, e.g. myMap#'keyFoo' is allowed but myMap#$SOME_VARIABLE is not.
If you think about it, you do not need a map in Pig. One usually loads data from some source, transforms it, joins it with some other dataset, filters it, transforms it, and so on. JOIN actually does a good job of looking up variable keys in the data.
e.g. data1 has 2 columns A and B, and data2 has 3 columns X, Y, Z. If you join data1 BY A with data2 BY Z, JOIN does the work of a map (from a traditional language) that maps a value of column Z to a value of column B (via column A). So data1 essentially represents a map A -> B, as the sketch below shows.
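A minimal sketch of the join-as-lookup idea (the file names and schemas are illustrative):
-- data1: a mapping A -> B; data2: rows whose Z values we want to "look up"
data1 = LOAD 'data1.csv' USING PigStorage(',') AS (A:chararray, B:chararray);
data2 = LOAD 'data2.csv' USING PigStorage(',') AS (X:chararray, Y:chararray, Z:chararray);
looked_up = JOIN data2 BY Z, data1 BY A;
-- each data2 row now carries the B value its Z key maps to
result = FOREACH looked_up GENERATE data2::X, data2::Z, data1::B;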
So why do we need maps in Pig?
Usually Hadoop data are dumps of different data sources from traditional languages. If the original data sources contain maps, the HDFS data will contain a corresponding map.
How can one handle map data? There are really 2 use cases:
1. Map keys are constants. e.g. HttpRequest header data contains time, server, clientIp as the keys in the map. To access the value of a particular key, access it with a constant key, e.g. header#'clientIp'.
2. Map keys are variables. In these cases, you would most probably want to JOIN the map keys with some other data set. I usually convert the map to a bag using a UDF MapToBag, which converts the map data into a bag of 2-field tuples (key, value). Once the map data is converted to a bag of tuples, it's easy to join it with other data sets; a sketch follows below.
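A sketch of that conversion (MapToBag is the answerer's own UDF, not a Pig built-in, so the class name below is hypothetical):
-- hypothetical UDF that turns a map into a bag of (key, value) tuples
DEFINE MapToBag com.example.pig.MapToBag();
withMap = LOAD 'data' AS (M:map[]);
kv = FOREACH withMap GENERATE FLATTEN(MapToBag(M)) AS (key:chararray, value:chararray);
other = LOAD 'other_data' AS (otherKey:chararray, payload:chararray);
-- the former map keys are now a plain column and can be joined on
joined = JOIN kv BY key, other BY otherKey;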
I hope this helps.
1) If you want to load map data it should be like "[programming#SQL,rdbms#Oracle]".
2) If you want to load tuple data it should be like "(first_name_1234,middle_initial_1234,last_name_1234)".
3) If you want to load bag data it should be like "{(project_4567_1),(project_4567_2),(project_4567_3)}".
My file pigtest.csv looks like this:
1234|emp_1234@company.com|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]
4567|emp_4567@company.com|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]
My schema:
a = LOAD 'pigtest.csv' USING PigStorage('|') AS (employee_id:int, email:chararray,
    name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray),
    project_list:bag{project: tuple(project_name:chararray)}, skills:map[chararray]);
b = FOREACH a GENERATE employee_id, email, name.first_name, project_list, skills#'programming';
dump b;
I think you need to think in terms of relations, where the map is just one field of one record. Then you can apply operations on the relations, like joining the two sets, data and mapping:
Input
$ cat data.txt
1
2
3
4
5
$ cat mapping.txt
1 2
2 4
3 6
4 8
5 10
Pig
mapping = LOAD 'mapping.txt' AS (key:CHARARRAY, value:CHARARRAY);
data = LOAD 'data.txt' AS (value:CHARARRAY);

-- list keys
mapping_keys = FOREACH mapping GENERATE key;
DUMP mapping_keys;

-- join mapping to data
mapped_data = JOIN mapping BY key, data BY value;
DUMP mapped_data;
Output
> # keys
(1)
(2)
(3)
(4)
(5)
> # mapped data
(1,2,1)
(2,4,2)
(3,6,3)
(4,8,4)
(5,10,5)
This answer could also help you if you just want to do a simple lookup:
pass-a-relation-to-a-pig-udf-when-using-foreach-on-another-relation
You can load up any data and then convert and store it in key/value format to read back later:
data = LOAD 'somedata.csv' USING PigStorage(',');
STORE data INTO 'folder' USING PigStorage('#');
and then read it back as mapped data.
