I'm a newbie in Talend, and what I am trying to do is split my data into different output files. Here is an example of the file I am working on:
[image: example of the input file]
For example:
Whenever I see a column with a true value, I need to produce a separate file for it, including the row where the value is true.
So the output should be like this:
[images: the expected output files]
Thanks in advance guys, hope someone could help me.
I would try this:
tFileInputExcel -> main -> tReplicate -> main -> tFilterRow (quoteName equals quote1) -> tLogRow
                                      -> main -> tFilterRow (quoteName equals quote2) -> tLogRow
                                      -> main -> tFilterRow (quoteName equals quote3) -> tLogRow
                                      -> main -> tFilterRow (quoteName equals quote4) -> tLogRow
Or a tMap component would do the same job, using the filter option on each output.
Here is a dynamic solution.
Your input file should be sorted on "quoteName"
tFileInput : read your file
tFilterRow : filter on isLastItem : only "true" value (so you'll get only one row for each quoteName)
tFlowToIterate : convert your flow into an iteration : you'll have n iterations created (n being the number of distinct quoteNames).
tFileInput : re-read your entire file inside the current iteration
tFilterRow : filter on quote=((String)globalMap.get("row2.quote")) (with row2.quote being the value of your globalVariable created by tFlowToIterate)
tFileOutput : output file. You can put something like : "C:/Temp/"+((String)globalMap.get("row3.quote"))+".txt" in order to generate your distinct files.
Use "Outline" view to get access to global variables created by tFlowToIterate.
I have a map which contains a map.
Map<String, Map<String, Double>>
For all entries in the map, I want to calculate the sum of a particular key.
For example my map is something like this:
Key1 Key2 Value
A Z 10.10
B Z 40.10
C Y 20.10
I basically want to calculate the sum over all entries whose key2 equals Z. So in this case I want to get 50.20, as key1 C does not have key2 Z.
I am trying to do this using Java 8. I am not sure how I should collect the sum.
double sum = 0;
myMap.forEach((key1, key2) -> {
sum += key2.get("Z");
});
But then I get an error saying that a variable used inside a lambda should be final.
All external variables used within an anonymous inner class or lambda need to be final or effectively final (a non-final variable that is never reassigned).
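As a minimal illustration (my own fragment, not from the question): mutating an object through an effectively final reference compiles; reassigning the variable is what the compiler rejects.

List<String> names = Arrays.asList("a", "b");
StringBuilder joined = new StringBuilder();   // effectively final: never reassigned
names.forEach(n -> joined.append(n));         // compiles: mutation, not reassignment
// joined = new StringBuilder();              // reassigning would break the lambda above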
In your solution, you are trying to mix a classical imperative approach with a functional one.
An idiomatic Java-8 approach would be to use Stream API:
map.values().stream()
    .map(x -> x.get("Z"))
    .reduce(0.0, Double::sum);
or utilize the specialized Stream for doubles:
map.values().stream()
.mapToDouble(x -> x.get("Z"))
.sum()
Remember to properly handle edge cases. This will explode if there is no value associated with the "Z" key.
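One way to guard against that (a sketch — it treats a missing "Z" as zero, which may or may not be what you want):

map.values().stream()
    .mapToDouble(x -> x.getOrDefault("Z", 0.0))
    .sum();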
You could use a Stream. That way you could use intermediate operations, too:
myMap.entrySet().stream()
    .map(entry -> entry.getValue())
    .filter(innerMap -> innerMap.containsKey("Z"))
    .mapToDouble(innerMap -> innerMap.get("Z"))
    .sum()
I am not sure about your data structure, so this might need a little work, but I hope you get the idea.
https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html
Your approach does not work, because you try to modify a local variable in a scope where it can't be modified. See http://docs.oracle.com/javase/tutorial/java/javaOO/localclasses.html
tl;dr You cannot modify local variables in a lambda body.
Or you can use an AtomicInteger, which is also thread-safe (though for double values a DoubleAdder is the closer fit).
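A sketch of that approach, substituting java.util.concurrent.atomic.DoubleAdder since the values here are doubles (it still assumes every inner map has a "Z" entry):

DoubleAdder sum = new DoubleAdder();
myMap.forEach((key1, inner) -> sum.add(inner.get("Z")));  // sum is effectively final
double total = sum.sum();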
You already got your answer on how to do it correctly (with streams). Sometimes that is not feasible though (even inside the JDK sources there are places where an array wrapper is needed):
double[] sum = {0};
myMap.forEach((key1, key2) -> {
    sum[0] += key2.get("Z");
});
Split a grouped record into different records.
For example:
Input : (A,(3,2,3))
Output into 3 new lines:
A,3
A,2
A,3
Can anyone let me know how to do this, please?
The problem is that when you convert the ArrayList output to a tuple, it becomes difficult to achieve what you want, so I recommend this approach instead; it makes the output easy to get.
In your UDF code, instead of creating an ArrayList, append the output into a string separated by commas and return that back to the Pig script.
Your final output from the UDF should be a string like "3,2,3".
Then use the below code to get the result
C = FOREACH B GENERATE $0, NewRollingCount(BagToString($1)) AS rollingCnt;
D = FOREACH C GENERATE $0, FLATTEN(TOKENIZE(rollingCnt));
DUMP D;
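To see the mechanics in isolation, here is a self-contained sketch without the UDF (the file name and schema are assumptions; TOKENIZE splits on commas by default, and FLATTEN turns the resulting bag back into rows):

C = LOAD 'grouped.txt' USING PigStorage('\t') AS (key:chararray, rollingCnt:chararray);
D = FOREACH C GENERATE key, FLATTEN(TOKENIZE(rollingCnt));
DUMP D;
-- with an input line "A<TAB>3,2,3" this prints (A,3) (A,2) (A,3)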
I'm a beginner in Talend Open Studio, and I'm trying to do the transformation below.
From a SQL Table that contains:
DeltaStock Date
------------------------
+50 (initial stock) J0
+80 J1
-30 J2
... ...
I want to produce this table:
Stock Date
-----------
50 J0
130 J1
100 J2
... ...
Do you think this could be possible using TOS? I thought of using tAggregateRow, but I didn't find it appropriate for my issue.
There's probably an easier way to do this using the tMemorizeRows component but the first thought that comes to mind is to use the globalMap to store a rolling sum.
In Talend it is possible to store an object (any value or any type) in the globalMap so that it can be retrieved later on in the job. This is used automatically if you ever use a tFlowToIterate component which allows you to retrieve the values for that row that is being iterated on from the globalMap.
A very basic sample job might look like this:
In this we have a tJava component that only initialises the rolling sum in the globalMap with the following code:
//Initialise the rollingSum global variable
globalMap.put("rollingSum", 0);
After this we connect this component onSubjobOk to make sure we only carry on if we've managed to put the rollingSum into the globalMap.
I then provide my data using a tFixedFlowInput component which allows me to easily hardcode some values for this example job. You could easily replace this with any input. I have used your sample input data from the question:
We then process the data using a tJavaRow which will do some transformations on the data row by row. I've used the following code which works for this example:
// Requires java.util.regex.Pattern and java.util.regex.Matcher
// (add the imports in the tJavaRow Advanced settings)
//Initialise the operator and the value variables
String operator = "";
Integer value = 0;
//Get the current rolling sum
Integer rollingSum = (Integer) globalMap.get("rollingSum");
//Extract the operator
Pattern p = Pattern.compile("^([+-])([0-9]+)$");
Matcher m = p.matcher(input_row.deltaStock);
//If we have any matches from the regular expression search then extract the operator and the value
if (m.find()) {
operator = m.group(1);
value = Integer.parseInt(m.group(2));
}
//Conditional to use the operator
if ("+".equals(operator)) {
rollingSum += value;
} else if ("-".equals(operator)) {
rollingSum -= value;
} else {
System.out.println("The operator provided wasn't a + or a -");
}
//Put the new rollingSum back into the globalMap
globalMap.put("rollingSum", rollingSum);
//Output the data
output_row.stock = rollingSum;
output_row.date = input_row.date;
There's quite a lot going on there but basically it starts by getting the current rollingSum from the globalMap.
Next, it uses a regular expression to split up the deltaStock string into an operator and a value. From this it uses the operator provided (plus or minus) to either add the deltaStock to the rollingSum or subtract the deltaStock from the rollingSum.
After this it then adds the new rollingSum back into the globalMap and outputs the 2 columns of stock and date (unchanged).
In my sample job I then output the data using a tLogRow which will print the values of the data to the console. I typically select the table formatting option in it and in this case I get the following output:
.-----+----.
|tLogRow_8 |
|=----+---=|
|stock|date|
|=----+---=|
|50 |J0 |
|130 |J1 |
|100 |J2 |
'-----+----'
Which should be what you were looking for.
You should be able to do it in Talend Open Studio.
I attach here an image with the JOB, the content of the tJavaRow and the execution result.
Underneath the tFixedFlowInput used to simulate the input, I left a tJDBCInput that you should use to read the data from your DB. Hopefully you can use a specific tXXXInput for your DB instead of the generic JDBC one.
Here is some simple code in the tJavaRow.
//Code generated according to input schema and output schema
output_row.delta = input_row.delta;
output_row.date = input_row.date;
// Assumes "rollingSum" was initialised earlier in the job, e.g. in a
// tJava component: globalMap.put("rollingSum", 0); otherwise get() returns null
output_row.rollingSum =
    Integer.parseInt(globalMap.get("rollingSum").toString());
int delta = Integer.parseInt(input_row.delta);
output_row.rollingSum += delta;
// Save the rolling sum for the next row
globalMap.put("rollingSum", output_row.rollingSum);
Beware of the exceptions in parseInt(). You should handle them the way you feel is right.
In my projects I usually have a SafeParse library that does not throw exceptions but instead returns a default value that I pass in together with the value to be parsed.
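A minimal sketch of such a helper (hypothetical; modelled on the description above, not the actual library):

public final class SafeParse {

    private SafeParse() {}

    // Returns defaultValue instead of throwing when the input is null or malformed
    public static int parseInt(String value, int defaultValue) {
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }
}

In the tJavaRow above you would then write: int delta = SafeParse.parseInt(input_row.delta, 0);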
I'd like to use Apache Pig to build a large key -> value mapping, look things up in the map, and iterate over the keys. However, there does not even seem to be syntax for doing these things; I've checked the manual, wiki, sample code, Elephant book, Google, and even tried parsing the parser source. Every single example loads map literals from a file... and then never uses them. How can you use Pig's maps?
First, there doesn't seem to be a way to load a 2-column CSV file into a map directly. If I have a simple map.csv:
1,2
3,4
5,6
And I try to load it as a map:
m = load 'map.csv' using PigStorage(',') as (M: []);
dump m;
I get three empty tuples:
()
()
()
So I try to load tuples and then generate the map:
m = load 'map.csv' using PigStorage(',') as (key:chararray, val:chararray);
b = foreach m generate [key#val];
ERROR 1000: Error during parsing. Encountered " "[" "[ "" at line 1, column 24.
...
Many variations on the syntax also fail (e.g., generate [$0#$1]).
OK, so I munge my map into Pig's map literal format as map.pig:
[1#2]
[3#4]
[5#6]
And load it up:
m = load 'map.pig' as (M: []);
Now let's load up some keys and try lookups:
k = load 'keys.csv' as (key);
dump k;
3
5
1
c = foreach k generate m#key; /* Or m[key], or... what? */
ERROR 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Hrm, OK, maybe since there are two relations involved, we need a join:
c = join k by key, m by /* ...um, what? */ $0;
dump c;
ERROR 1068: Using Map as key not supported.
c = join k by key, m by m#key;
dump c;
Error 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Fail. How do I refer to the key (or value) of a map? The map schema syntax doesn't seem to let you even name the key and value (the mailing list says there's no way to assign types).
Finally, I'd just like to be able to find all they keys in my map:
d = foreach m generate ...oh, forget it.
Is Pig's map type half-baked? What am I missing?
Currently, Pig maps need the key to be a chararray (string) constant that you supply, not a variable containing a string. So in map#key, the key has to be a constant string that you supply (e.g. map#'keyvalue').
The typical use case for this is to load a complex data structure where one of the elements is a key-value pair, and then, in a later FOREACH statement, refer to a particular value based on the key you are interested in.
http://pig.apache.org/docs/r0.9.1/basic.html#map-schema
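To make that concrete — a minimal sketch reusing the map.pig file from the question (a constant key works; a field reference as the key does not):

m = LOAD 'map.pig' AS (M:[]);
v = FOREACH m GENERATE M#'1';  -- '1' is a constant chararray key
DUMP v;
-- (2)  for the [1#2] row; the other rows have no key '1' and yield ()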
In Pig version 0.10.0 there is a new function available called "TOMAP" (http://pig.apache.org/docs/r0.10.0/func.html#tomap) that converts its odd (chararray) parameters to keys and even parameters to values. Unfortunately I haven't found it to be that useful, though, since I typically deal with arbitrary dicts of varying lengths and keys.
I would find a TOMAP function that took a tuple as a single argument, instead of a variable number of parameters, to be much more useful.
This isn't a complete solution to your problem, but the availability of TOMAP gives you some more options for constructing a real solution.
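For reference, a minimal sketch of TOMAP usage (the file name and field names are assumptions); it pairs its alternating arguments into map entries:

a = LOAD 'data.txt' AS (k1:chararray, v1:chararray, k2:chararray, v2:chararray);
b = FOREACH a GENERATE TOMAP(k1, v1, k2, v2);  -- each row becomes [k1#v1, k2#v2]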
Great question!
I personally do not like maps in Pig. They have a place in traditional programming languages like Java, C#, etc., where it's really handy and fast to look up a key in a map. On the other hand, maps in Pig have very limited features.
As you rightly pointed out, one cannot look up a variable key in a map in Pig. The key needs to be a constant. e.g. myMap#'keyFoo' is allowed but myMap#$SOME_VARIABLE is not.
If you think about it, you do not need a map in Pig. One usually loads data from some source, transforms it, joins it with some other dataset, filters it, transforms it, and so on. JOIN actually does a good job of looking up variable keys in the data.
e.g. data1 has 2 columns A and B, and data2 has 3 columns X, Y, Z. If you join data1 BY A with data2 BY Z, JOIN does the work of a map (from a traditional language) by mapping a value of column Z to a value of column B (via column A). So data1 essentially represents a map A -> B.
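A minimal sketch of that idea (file names and schemas are assumptions):

data1 = LOAD 'data1.txt' AS (A:chararray, B:chararray);  -- acts as a map A -> B
data2 = LOAD 'data2.txt' AS (X:chararray, Y:chararray, Z:chararray);
joined = JOIN data2 BY Z, data1 BY A;                    -- the "lookup"
result = FOREACH joined GENERATE data2::X, data1::B;     -- each X with its looked-up value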
So why do we need Map in Pig?
Usually, Hadoop data are dumps of different data sources from traditional languages. If the original data sources contain maps, the HDFS data will contain a corresponding map.
How can one handle the Map data?
There are really 2 use cases:
Map keys are constants.
e.g. HTTP request header data contains time, server, and clientIp as the keys in a map. To access the value of a particular key, one can access it with a constant key,
e.g. header#'clientIp'.
Map keys are variables.
In these cases, you would most probably want to JOIN the map keys with some other data set. I usually convert the map to a bag using a UDF MapToBag, which converts the map data into a bag of 2-field tuples (key, value). Once the map data is converted to a bag of tuples, it's easy to join it with other data sets; a sketch of such a UDF follows below.
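MapToBag is a custom UDF; a minimal sketch of what it might look like (my reconstruction, not the author's actual code):

import java.io.IOException;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Turns a map into a bag of (key, value) tuples so it can be FLATTENed and joined
public class MapToBag extends EvalFunc<DataBag> {
    private static final TupleFactory TUPLES = TupleFactory.getInstance();
    private static final BagFactory BAGS = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        @SuppressWarnings("unchecked")
        Map<String, Object> map = (Map<String, Object>) input.get(0);
        DataBag bag = BAGS.newDefaultBag();
        for (Map.Entry<String, Object> entry : map.entrySet()) {
            Tuple t = TUPLES.newTuple(2);
            t.set(0, entry.getKey());
            t.set(1, entry.getValue());
            bag.add(t);
        }
        return bag;
    }
}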
I hope this helps.
1) If you want to load map data, it should be like "[programming#SQL,rdbms#Oracle]".
2) If you want to load tuple data, it should be like "(first_name_1234,middle_initial_1234,last_name_1234)".
3) If you want to load bag data, it should be like "{(project_4567_1),(project_4567_2),(project_4567_3)}".
My file pigtest.csv looks like this:
1234|emp_1234#company.com|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]
4567|emp_4567#company.com|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]
My schema:
a = LOAD 'pigtest.csv' using PigStorage('|') AS (employee_id:int, email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), project_list:bag{project: tuple(project_name:chararray)}, skills:map[chararray]) ;
b = FOREACH a GENERATE employee_id, email, name.first_name, project_list, skills#'programming' ;
dump b;
I think you need to think in terms of relations; the map is just one field of one record. Then you can apply operations on the relations, like joining the two data sets and mapping:
Input
$ cat data.txt
1
2
3
4
5
$ cat mapping.txt
1 2
2 4
3 6
4 8
5 10
Pig
mapping = LOAD 'mapping.txt' AS (key:CHARARRAY, value:CHARARRAY);
data = LOAD 'data.txt' AS (value:CHARARRAY);
-- list keys
mapping_keys =
FOREACH mapping
GENERATE key;
DUMP mapping_keys;
-- join mapping to data
mapped_data =
JOIN mapping BY key, data BY value;
DUMP mapped_data;
Output
> # keys
(1)
(2)
(3)
(4)
(5)
> # mapped data
(1,2,1)
(2,4,2)
(3,6,3)
(4,8,4)
(5,10,5)
This answer could also help you if you just want to do a simple lookup:
pass-a-relation-to-a-pig-udf-when-using-foreach-on-another-relation
You can load up any data, then convert and store it in key-value format, to read back for later use:
data = LOAD 'somedata.csv' USING PigStorage(',');
STORE data INTO 'folder' USING PigStorage('#');
and then read it back as map data.