Spring Batch: read whole CSV file before reading line by line

I want to read a CSV file, enrich each row with some data from an external system, and then write the new, enriched CSV to some directory.
To get the data from the external system, I need to pass each row one by one and get the new columns back.
But to query the external system with each row, I also need to pass a value which I first obtained from the external system by sending it all the values of a particular column.
e.g - my csv file is -
name, value, age
10,v1,12
11,v2,13
So to enrich the file, I first need to fetch a value based on the total age, i.e. 12 + 13, get that total's value from the external system, and then send it along with each row to the external system to get the enriched value.
I am doing this with Spring Batch, but with FlatFileItemReader I can read only one line at a time. How can I refer to the whole column before that?
Please help.
Thanks

There are two ways to do this.
OPTION 1
Go for this option if you are okay with storing all the records in memory. Whether it is feasible depends entirely on how many records you need to calculate the total age.
Reader (custom reader):
Write the logic to read one line at a time, accumulating the rows (see the sketch after this list).
Return null from read() only once you know all the lines needed for calculating the total age have been read.
NOTE: Spring Batch will keep calling read() until it returns null.
Processor: You will get the full list of records. Calculate the total age.
Connect to the external system and get the value. Build the records which need to be written and return them from the process method.
NOTE: You can return all the records with a particular field modified, or merge them into a single record. This is entirely your choice.
Writer: Write the records.
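A minimal sketch of such a custom reader, assuming a hypothetical CsvRow record type and a FlatFileItemReader delegate configured elsewhere:

    import java.util.ArrayList;
    import java.util.List;

    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.file.FlatFileItemReader;

    // Drains the whole file through a FlatFileItemReader delegate and hands
    // the processor all rows as a single item.
    public class WholeFileReader implements ItemReader<List<CsvRow>> {

        private final FlatFileItemReader<CsvRow> delegate;
        private boolean consumed = false;

        public WholeFileReader(FlatFileItemReader<CsvRow> delegate) {
            this.delegate = delegate;
        }

        @Override
        public List<CsvRow> read() throws Exception {
            if (consumed) {
                return null; // tells Spring Batch there is nothing left
            }
            List<CsvRow> allRows = new ArrayList<>();
            CsvRow row;
            while ((row = delegate.read()) != null) {
                allRows.add(row);
            }
            consumed = true;
            return allRows; // the processor receives the full list at once
        }
    }

Note that FlatFileItemReader is also an ItemStream, so the delegate still needs to be opened, e.g. by registering it as a stream on the step.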
OPTION 2
Go for this if option 1 is not feasible.
Step 1: read all the lines, calculate the total age, and pass the value to the next step.
Step 2: read all the lines again, update the records as required, and write them out. One common way to pass the value between the steps is sketched below.
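Spring Batch's ExecutionContextPromotionListener can hand the total from step 1 to step 2. A minimal sketch, assuming step 1 has already stored the total in its step ExecutionContext under the hypothetical key "totalAge", with the same hypothetical CsvRow type as above:

    import org.springframework.batch.core.configuration.annotation.StepScope;
    import org.springframework.batch.core.listener.ExecutionContextPromotionListener;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.context.annotation.Bean;

    // Registered as a listener on step 1: copies "totalAge" from the step
    // context to the job context when the step finishes.
    @Bean
    public ExecutionContextPromotionListener promotionListener() {
        ExecutionContextPromotionListener listener = new ExecutionContextPromotionListener();
        listener.setKeys(new String[] {"totalAge"});
        return listener;
    }

    // Step 2: a step-scoped processor can then read the promoted value.
    @Bean
    @StepScope
    public ItemProcessor<CsvRow, CsvRow> enrichProcessor(
            @Value("#{jobExecutionContext['totalAge']}") Integer totalAge) {
        // enrichWithExternalSystem is a hypothetical helper that calls your
        // external system with the row and the promoted total.
        return row -> enrichWithExternalSystem(row, totalAge);
    }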

Related

Apache Nifi, can I collect an attribute from multiple flow files

I have a NiFi flow that takes in CSV files and partitions each one into multiple records, with each CSV column value added as an attribute.
At one point in the flow, I'd like to collect the value of one attribute from each record that passes through. There could be from 0 to n collected. Once I have the list, it'll be emailed out.
I'm trying to avoid me (or someone else) getting bombed with emails if there are 200+ bad records in a file. So if I could collect for a fixed period of time, or until another attribute (filename) changes, that would be great.
I've tried MergeContent and MergeRecord. I even tried ReplaceText to replace the content with just the attribute value I want to save and merging those, and a slew of other things.
Is there a simple way to do this in nifi?
Have you tried UpdateAttribute with a new attribute of type array? As each flow file passes through this processor, you could keep updating the value of this attribute by appending the new value to the array.
However, as @daggett pointed out, it would be helpful if you could provide the input and expected output.

Generate new format from a non-system generated report using Power Query

I have an Excel file in a non-system-generated report format.
I wish to calculate and generate another, new output.
Given the report format below:
1) When loading this Excel file in the query, how can I create a new column that copies the first found value (1#51) down into the next records when they are empty, and then, once a new value (1#261) is detected, copies that one down into the subsequent null records until the end?
2) The final aim is to generate a new output that auto-matches/calculates the money to be assigned to the different references, as shown below:
References A ~ E share the 3 bank refs (28269, 28542 & RMP). I was thinking of reading the same data source a few times: the first time to read columns A ~ O (QueryRef), and a second time to read columns A, Q ~ V (QueryBank).
After this I have no idea how to allocate the $$ from QueryBank to QueryRef based on the sum of Total AR.
E.g.,
The total amount of BankRef 28269, $57,044.67, is sufficient to cover Ref#A ($10,947.12).
BankRef 28269 is still sufficient to cover Ref#B ($27,647.60).
BankRef 28269 then has only $18,449.95 left, hence that balance is allocated to Ref#C.
The remaining balance of Ref#C needs BankRef 28542 to cover it, i.e. $1,812.29.
Ref#D is then allocated the remaining balance of BankRef 28542, i.e. $4,595.32.
Ref#D still has $13,350.03 unallocated, hence this uses BankRef#RMP.
Ref#E only needs $597.66, and BankRef#RMP is sufficient to cover it.
I am not sure whether my case study above can be solved using Power Query, as I am still a newbie at it. Or is this too complicated to handle, so that we need to write a program to auto-match this kind of scenario?
Attached is the sample source file and output:
https://www.dropbox.com/sh/dyecwcdz2qg549y/AACzezsXBwAf8eHUNxxLD1eWa?dl=0
Any advice/opinion/guidance is very much appreciated.
Answering question one:
Power Query has a feature called Fill, with Down and Up variants.
For a selected column, it copies the first non-empty value into all the empty rows below it until the next non-empty row is found, and so on. In the Power Query editor, select the column and use Transform > Fill > Down; the generated step calls the M function Table.FillDown.
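Answering question two: the allocation described is a simple greedy (first-in-first-out) match, which is easier to express in a program than in pure Power Query. Below is a minimal sketch in Java using the amounts from the walkthrough above; the totals for 28542 and RMP are hypothetical, derived only from the figures quoted.

    import java.util.Iterator;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class GreedyAllocation {
        public static void main(String[] args) {
            // Bank refs in the order they should be consumed.
            LinkedHashMap<String, Double> bank = new LinkedHashMap<>();
            bank.put("28269", 57044.67);
            bank.put("28542", 6407.61);  // hypothetical: 1812.29 + 4595.32
            bank.put("RMP", 14000.00);   // hypothetical: >= 13350.03 + 597.66

            // Total AR per reference, from the walkthrough above.
            LinkedHashMap<String, Double> refs = new LinkedHashMap<>();
            refs.put("A", 10947.12);
            refs.put("B", 27647.60);
            refs.put("C", 20262.24);     // 18449.95 + 1812.29
            refs.put("D", 17945.35);     // 4595.32 + 13350.03
            refs.put("E", 597.66);

            Iterator<Map.Entry<String, Double>> banks = bank.entrySet().iterator();
            Map.Entry<String, Double> current = banks.next();
            double remaining = current.getValue();

            for (Map.Entry<String, Double> ref : refs.entrySet()) {
                double needed = ref.getValue();
                while (needed > 0.005) {
                    double take = Math.min(needed, remaining);
                    System.out.printf("Ref#%s takes %,.2f from BankRef %s%n",
                            ref.getKey(), take, current.getKey());
                    needed -= take;
                    remaining -= take;
                    if (remaining < 0.005) {
                        if (!banks.hasNext()) break; // funds exhausted
                        current = banks.next();
                        remaining = current.getValue();
                    }
                }
            }
        }
    }

Reproducing the walkthrough this way makes it easy to verify: the output shows Ref#C drawing from both 28269 and 28542, and Ref#D from both 28542 and RMP, exactly as described above.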

JMeter multiple CSV files

I want to use 2 CSV files: one CSV file with user IDs and passwords, and a second CSV file containing data for adding different producers (agents) against each user.
I mean user ID 1 should take the 1st row of producer data, user ID 2 should take the 2nd row of producer data, and so on. Is this possible in JMeter, and how?
Put user 1, password 1 and the other data that needs to be used together in the same row. Make sure your CSV has headers, and do not put anything in the "Variable Names" field of the CSV Data Set Config.
Now, when you run with multiple threads, the 1st thread will pick the 1st row and then move on to the second row. The other threads will do the same.
Hope it helps.
You can use multiple CSVs, but then you have to maintain the binding between login and other test data. So it is better to put everything in one file.
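For illustration, a combined file might look like the hypothetical sample below. With a header row present and "Variable Names" left empty, the CSV Data Set Config reads the first line as the variable names, so the values become available as ${username}, ${password} and ${producer}:

    username,password,producer
    user1,pass1,agentA
    user2,pass2,agentB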

What happens when two updates for the same record come in one file while loading into the DB using Informatica

Suppose I have a table xyz:
id name add city act_flg start_dtm end_dtm
1 amit abc,z pune Y 21012018 null
and this table is loaded from a file using Informatica with SCD type 2.
Suppose there is one file that contains two records with id=2,
ie. 2 vipul abc,z mumbai
2 vipul asdf bangalore
So how will these be loaded into the DB?
It depends on how you're doing the SCD type 2. If you are using a lookup with a static cache, both records will be inserted with the end date as null.
The best approach in this scenario is to use a dynamic lookup cache and read your source data in such a way that the latest record is read last. This ensures one record is expired with an end date and only one active record (i.e. end date is null) exists per id, as sketched below.
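With the dynamic-cache approach, the target would end up with something like the rows below for id=2 (the date values are placeholders; the actual start/end values depend on how your mapping stamps them):

    id name  add   city      act_flg start_dtm  end_dtm
    2  vipul abc,z mumbai    N       <load_dtm> <load_dtm>
    2  vipul asdf  bangalore Y       <load_dtm> null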
Hmm, one of two possibilities, depending on what you mean... If you mean that you're pulling data from different source systems which sometimes have the same IDs, then it's easy: just stamp both the natural key (i.e. the id) and a source-system value on the dimension row, along with the arbitrary surrogate key which is unique to your target table (this is a data-warehousing basic, so read Kimball).
If you mean that you are somehow tracing real-time changes to a single record in the source system and writing those changes to the input files of your ETL job, then you need to agree with your client whether they're happy for you to aggregate the changes based on the timestamp and just pick the most recent one, or to create 2 records, one with its expiry datetime set and the other still open (which is the standard SCD approach... again, read Kimball).

Talend loop for each record

Hi, I am designing a data generation job.
My job looks something like this:
tRowGenerator --> tMap --> tFileOutputDelimited.
Let's say my tRowGenerator produces 5 columns and 2 records. I want to iterate over these records, i.e. for each record I want to iterate a certain number of times:
for record 1, iterate 5 times to produce further data;
for record 2, iterate 3 times to produce further data.
Please suggest how to apply this multiply-by-xi logic, where xi can change for each record.
Thanks!
If you want to loop over the data generated by the tRowGenerator, you can use a tLoop, where you put the call to your business rule that determines the number of loops or when to stop looping.
An example job might look like:
Logic of the flow:
row1 is a main connection taking the generated values to the tFlowToIterate, which stores them in global variables;
the iterate link activates the tLoop, which can use the values stored in the global vars to drive your business rule (to get the number of loops, or to ask whether to continue or stop);
the tLoop activates the tJavaFlex, which uses the stored global vars to produce the output you like and passes it to the tFileOutputDelimited through a main link (row2).
You have to activate the append flag on the tFileOutputDelimited to keep the data from the different loops. If you need to, you can add a tFileDelete at the beginning to empty the output file before a new processing round. A sketch of the tJavaFlex part follows below.
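As an illustration, the main section of the tJavaFlex could look like the sketch below. This is only a sketch: the component names (row1, tLoop_1, row2) and column names are hypothetical, and it assumes tFlowToIterate stores values under its default "rowName.columnName" keys.

    // Main code section of tJavaFlex, executed once per tLoop iteration.
    // tFlowToIterate has stored the current generated record in globalMap.
    String name = (String) globalMap.get("row1.name");
    Integer value = (Integer) globalMap.get("row1.value");

    // Loop counter exposed by a "For"-type tLoop (assumed variable name).
    Integer iteration = (Integer) globalMap.get("tLoop_1_CURRENT_VALUE");

    // Build the output row that flows to tFileOutputDelimited via row2.
    row2.name = name;
    row2.value = value;
    row2.copyIndex = iteration;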
