I am new to Pig scripting and need some help with this issue.
I have two bags in Pig. I want to take every field from the first bag and overwrite a field's value wherever the second bag has data for that same field.
The column list is dynamic (columns may be added or removed at any time).
Set b may also contain data in other fields that are currently blank; if so, set a must be overwritten with the data available in set b.
columns - uniqueid,catagory,b,c,d,e,f,region,g,h,date,direction,indicator
Example:
all_data= COGROUP a by (uniqueid), b by (uniqueid);
Output:
(1,{(1,test,,,,,,,,city,,,,,2020-06-08T18:31:09.000Z,west,,,,,,,,,,,,,A)},{(1,,,,,,,,,,,,,,2020-09-08T19:31:09.000Z,,,,,,,,,,,,,,N)})
(2,{(2,test2,,,,,,,,dist,,,,,2020-08-02T13:06:16.000Z,east,,,,,,,,,,,,A)},{(2,,,,,,,,,,,,,,2020-09-08T18:31:09.000Z,,,,,,,,,,,,,,N)})
Expected Result:
(1,test,,,,,,,,city,,,,,2020-09-08T19:31:09.000Z,west,,,,,,,,,,,,,N)
(2,test2,,,,,,,,dist,,,,,2020-09-08T18:31:09.000Z,east,,,,,,,,,,,,N)
I was able to achieve the expected output with the following:
final = FOREACH all_data GENERATE flatten($1),flatten($2.(region)) as region ,flatten($2.(indicator)) as indicator;
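For reference, the overwrite rule described in the question can be sketched outside Pig. This is a minimal Python illustration, not Pig Latin; the function name `overlay` is hypothetical, the sample tuples are shortened versions of the rows above, and it assumes a field counts as blank when it is `None` or an empty string.

```python
def overlay(base, update):
    """Return `base` with every non-blank field of `update` written over it."""
    if len(base) != len(update):
        raise ValueError("both tuples must have the same number of fields")
    # Keep the value from `update` unless it is blank (None or empty string).
    return tuple(u if u not in (None, "") else b for b, u in zip(base, update))


# Shortened stand-ins for one row from set a and the matching row from set b.
a_row = ("1", "test", "", "city", "2020-06-08T18:31:09.000Z", "west", "A")
b_row = ("1", "", "", "", "2020-09-08T19:31:09.000Z", "", "N")

merged = overlay(a_row, b_row)
print(merged)
```

Because the rule is applied position by position and no column name is hard-coded, the same logic holds when columns are added or removed, as long as both sets share one schema.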
I am very new to Informatica PowerCenter and have just started learning it. My requirement is this: I have to extract data from a flat file (CSV) and store it in an Oracle table. Some of the target table's column values must come from the file name.
For example:
My Target Table is like below:
USER_ID Program_Code Program_Desc Visit Date Term
EACRP00127 ER Special Visits 08/02/2015 Aug 2015
My input filename is: Aug 2015 ER Special Visits EACRP00127.csv
From this file name I have to extract "AUG 2015" as Term, "ER Special Visits" as Program_Desc and "EACRP00127" as Program_Code, along with some other fields from the CSV file.
I have found one solution using "Currently Processed Filename", but with it I can get only a single value from the file name. How can I extract three values from the file name and store them in the target table? Any pointers toward a solution would be appreciated. Thank you.
Using an Expression transformation, you can create three output values from the Currently Processed Filename column.
You get the file name from the SQ through the 'Currently Processed Filename' field, then take substrings of that string to get each value. The positions below assume the field holds exactly the example file name, with no directory path in front; the lengths exclude the separating space and the ".csv" extension.
input/output = Currently Processed Filename
o_Term = substr(Currently Processed Filename,1,8)
o_Program_Desc = substr(Currently Processed Filename,10,17)
o_Program_Code = substr(Currently Processed Filename,28,10)
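Fixed positions only fit file names with exactly this layout. As an illustration of the same split without hard-coded offsets, here is a minimal Python sketch (not Informatica expression syntax). It assumes the name always follows the pattern "month year, description, code, .csv", with the code being the last space-separated token; the function name `split_filename` and the pattern are hypothetical.

```python
import re
from pathlib import PurePath

# Assumed layout: "<Mon> <YYYY> <description> <code>.csv"
FILENAME_PATTERN = re.compile(
    r"^(?P<term>[A-Za-z]{3} \d{4}) (?P<program_desc>.+) (?P<program_code>\S+)\.csv$"
)


def split_filename(path):
    """Split a file name of the assumed layout into its three parts."""
    name = PurePath(path).name  # drop any directory prefix
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name layout: {name!r}")
    return match.groupdict()


parts = split_filename("Aug 2015 ER Special Visits EACRP00127.csv")
print(parts)
```

The description is whatever sits between the term and the last token, so it can vary in length without changing the expression.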
I have a CSV flowfile with a single record. I need to build its file name from a couple of column values in the CSV. How can I do this using only the column names, not the column positions, since the positions may change? Example:
CSV File
Name , City, State, Country, Gender
John, Dallas, Texas, USA, M
File name should be John_USA.csv
I am trying the ExtractText processor, pulling the first data row using:
row = ^.*\r?\n(.*)
Then, in an UpdateAttribute processor, I pull the values from the columns using the expression below:
${row:getDelimitedField(1)}_${row:getDelimitedField(4)}.csv
But this uses the position of the column, not the column name. How can I build the file name using the column names instead of the positions?
The way I would do it (maybe not the most efficient one):
Convert the CSV to JSON (for example with ConvertRecord)
Pass the content to attributes (for example with EvaluateJsonPath), so you can access the field you want like a dictionary (key-value)
Update the filename attribute with UpdateAttribute
Convert it back to CSV (this way you control the schema and the position of the fields).
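The key-value idea behind these steps can be shown outside NiFi. The following is a minimal Python sketch, not a NiFi flow: it maps the header names to the values of the first data row and builds the name from the chosen columns. The function name `filename_from_columns` is hypothetical, and it assumes the first line is the header and the second line is the single record.

```python
import csv
import io


def filename_from_columns(csv_text, columns, extension=".csv"):
    """Build a file name from named columns of the first data row."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    # Strip stray spaces so "Name " in the header still matches "Name".
    header = [name.strip() for name in next(reader)]
    first_row = [value.strip() for value in next(reader)]
    record = dict(zip(header, first_row))  # key-value access by column name
    return "_".join(record[column] for column in columns) + extension


sample = "Name , City, State, Country, Gender\nJohn, Dallas, Texas, USA, M\n"
file_name = filename_from_columns(sample, ["Name", "Country"])
print(file_name)
```

Since the lookup goes through the header, reordering the columns in the file does not change the result.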
To count the rows of a CSV file we can use the Get Files Rows Count input step in our ETL tool. How can I find the number of columns in a CSV file?
Just read the first row of the CSV file using Text-File-Input setting header rows to 0. Usually, the first row contains field names. If you read the whole row into a single field, you can use Split-Field-To-Rows to have a single fieldname per row and the number of rows tells you the number of fields. There are other ways, but this one easily prepares for a subsequent metadata injection - if that's what you have in mind.
There is no need for metadata injection. In Split-Field-To-Rows, check "Include rownum in output" and give that field a name. Then apply Sort rows on that field and use Sample rows to keep a single row (for example, sort descending and keep the first row); its row number is the number of fields present in the file.
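The header-row approach described in these answers amounts to splitting the first line on the delimiter and counting the pieces. A minimal Python sketch of that idea, outside the ETL tool, is shown here; the function name `count_columns` is hypothetical, and it assumes the first line of the file is the header.

```python
import csv
import io


def count_columns(csv_text, delimiter=","):
    """Return the number of fields in the first (header) row, or 0 if empty."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader, None)  # None when the input has no rows at all
    return 0 if header is None else len(header)


sample = "Name,City,State,Country,Gender\nJohn,Dallas,Texas,USA,M\n"
column_count = count_columns(sample)
print(column_count)
```

Using a CSV parser rather than a plain string split keeps the count correct when a quoted header name contains the delimiter.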