I have a job that is scheduled to run every day.
The job works as follows:
The reader reads data from the database using a JdbcPagingItemReader.
The ItemProcessor processes each item by making an API call that returns updated data.
The writer writes the data back to the database.
The problem is that there is also an online process that reads a particular row, processes it, and updates it.
I want to maintain consistency, so that the data written is the data that was processed last.
Since the reader, processor, and writer each run in a separate method, how do I take a lock and process the row?
I am using a PostgreSQL database.
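No answer is attached to this question in the thread, but as a hedged illustration of one common way to keep the batch writer from clobbering the online update, the sketch below uses an optimistic-lock style check: the writer's UPDATE only applies if the row's version column is unchanged since the reader saw it. The table (orders), its columns, the version column itself, and the OrderRow DTO are assumptions, not from the question; a SELECT ... FOR UPDATE inside a single transaction would be the pessimistic alternative.

```java
import java.util.List;
import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical row DTO carried from the reader through the processor.
class OrderRow {
    Long id;
    String payload;   // the data updated by the API call in the processor
    long version;     // value of the version column at read time

    OrderRow(Long id, String payload, long version) {
        this.id = id;
        this.payload = payload;
        this.version = version;
    }
}

// Writer that only applies the update if the online process has not touched
// the row since the reader read it (optimistic locking).
public class VersionCheckingWriter implements ItemWriter<OrderRow> {

    private final JdbcTemplate jdbc;

    public VersionCheckingWriter(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @Override  // List-based signature used through Spring Batch 4.x
    public void write(List<? extends OrderRow> items) {
        for (OrderRow row : items) {
            int updated = jdbc.update(
                "UPDATE orders SET payload = ?, version = version + 1 "
                    + "WHERE id = ? AND version = ?",
                row.payload, row.id, row.version);
            if (updated == 0) {
                // The online process changed this row after it was read:
                // skip it, log it, or re-read and merge, as the rules require.
            }
        }
    }
}
```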
My ItemReader reads from an ever-growing table every time the job runs. I am looking for an option within Spring Batch to query only the new records on each scheduled run.
If 50,000 records have been read, the next run should start from record 50,001.
My thought is to persist the id of the last record read by the ItemReader (the last of the whole reader output, not the last of each chunk) in the database and use it in the subsequent run. I will return the data from the main table sorted by id.
How do I know, in the writer, which record is the last one? Any ideas?
I would make it explicit by passing the ID range of the records to be processed (i.e. fromId and toId) as job parameters when running the batch job. Then, in the ItemReader, you can rely on this ID range to return the data to process.
Then persist the latest ID that has already been processed to the DB (e.g. using a JobExecutionListener when the job finishes). When the next scheduled run triggers, find the next ID that has not been processed and start another job instance with that ID as a parameter.
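A rough sketch of the listener half of this idea, assuming a one-row bookkeeping table (called batch_watermark here) and a toId job parameter; the table, column, and parameter names are placeholders, not from the answer above:

```java
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.jdbc.core.JdbcTemplate;

// After a successful run, records the highest id that was handled so the
// next scheduled run can build its fromId/toId parameters from it.
public class WatermarkListener implements JobExecutionListener {

    private final JdbcTemplate jdbc;

    public WatermarkListener(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // nothing to do before the job starts
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        if (jobExecution.getStatus() == BatchStatus.COMPLETED) {
            // toId was supplied as a job parameter when the job was launched
            Long toId = jobExecution.getJobParameters().getLong("toId");
            jdbc.update("UPDATE batch_watermark SET last_processed_id = ?", toId);
        }
    }
}
```

The scheduler can then read last_processed_id from that table before launching, pass fromId = last_processed_id + 1 (plus a fresh toId) as job parameters, and the reader's query simply filters on that range.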
I have read everywhere about how to read data in a Spring Batch ItemReader and write it to the database using an ItemWriter, but I want to just read data using Spring Batch and then somehow access that list of items outside the job. I need to perform the remaining processing after the job has finished.
The reason I want to do this is that I need to perform a lot of validations on every item. I have to check whether each item's variable xyz exists in a list (which is not available within the job). After performing a lot of processing, I have to insert information into different tables using JPA. Please help me out!
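No answer is attached to this question in the thread, but one simple pattern, sketched below under the assumption that the data set fits in memory, is to give the step a writer that does not touch the database at all and instead hands each chunk to a Spring-managed holder bean; the post-job validation and JPA code can then inject that bean and read the list once the job has finished. The class names (MyItem, ReadItemsHolder, CollectingWriter) are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.springframework.batch.item.ItemWriter;
import org.springframework.stereotype.Component;

// Placeholder for the real domain type produced by the reader.
class MyItem {
}

// Holder bean that outlives the job; inject it into whatever component runs
// the validations and JPA inserts after the job has finished.
@Component
class ReadItemsHolder {
    private final List<MyItem> items = Collections.synchronizedList(new ArrayList<>());
    void addAll(List<? extends MyItem> chunk) { items.addAll(chunk); }
    List<MyItem> getItems() { return items; }
}

// "Writer" that only collects the chunks into the holder so the items are
// available outside the job.
@Component
class CollectingWriter implements ItemWriter<MyItem> {

    private final ReadItemsHolder holder;

    CollectingWriter(ReadItemsHolder holder) {
        this.holder = holder;
    }

    @Override  // List-based signature used through Spring Batch 4.x
    public void write(List<? extends MyItem> chunk) {
        holder.addAll(chunk);
    }
}
```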
I want to run a scheduled cron task in my app. I have figured out the code, but I'm not sure where to put it. Should I put all the code into the job file? The code performs the following tasks:
Retrieve XML data from 3rd-party websites.
Do some filtering and add the data to an array.
Push the data from the array to the database.
The code runs every hour.
You can create some kind of parser for the XML data retrieval and store that class in the lib directory. The other steps, such as filtering and pushing data to the database, you can implement directly in the job. As the job grows, you will see what can be moved elsewhere or extended.
I created a batch job to load data from a CSV file into a JDBC database using the filejdbc module. It worked properly, but when I scheduled the batch to run every 5 minutes, it did not do an incremental load; it loaded all the data again. Is there any feature to schedule the batch with an incremental load?
Is the solution to run the batch once and then create a stream to do the incremental load? Will the stream load all the data again, or will it continue from a certain point?
Please explain how I can achieve an incremental load using Spring XD.
Thanks,
Moha.
I suppose what is missing is the concept of 'state': the filejdbc module does not seem to know where the last import stopped. I do something similar, but I use a custom batch job with a meta store that keeps track of where the last load stopped; that is where the next incremental load picks up from.
Since you're using a module that ships with Spring XD itself, you may not have this flexibility, but you have two options:
a) Your destination table can define unique fields that prevent duplicates. That way, even if the module tries to load ALL the data again, only new rows get inserted. This assumes the module uses 'insert ignore' (or something similar) and not a basic insert, which would throw an error/exception (a sketch of this insert-ignore idea follows after this list). This, I must say, will become non-optimal pretty quickly.
b) If it's an option, write a module that deletes the file after it has been uploaded to the database. You can construct a more complex stream that first does the data load and then deletes the file.
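Not part of the answer above, but to make option a) concrete: with a unique constraint on the natural key, a statement with insert-ignore semantics simply skips rows that are already present, so re-loading the whole CSV does not create duplicates. The sketch below assumes a PostgreSQL-style ON CONFLICT DO NOTHING (MySQL's INSERT IGNORE behaves similarly); the table and column names are made up.

```java
import org.springframework.jdbc.core.JdbcTemplate;

// Assumes: CREATE TABLE imported_rows (record_id text PRIMARY KEY, payload text);
// The primary key plus DO NOTHING makes a full re-load of the file harmless:
// rows that already exist are silently skipped, only new rows are inserted.
public class IgnoreDuplicatesLoader {

    private final JdbcTemplate jdbc;

    public IgnoreDuplicatesLoader(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    public void insertRow(String recordId, String payload) {
        jdbc.update(
            "INSERT INTO imported_rows (record_id, payload) VALUES (?, ?) "
                + "ON CONFLICT (record_id) DO NOTHING",
            recordId, payload);
    }
}
```

As the answer notes, this still reads and ships all the data on every run, which is why it becomes non-optimal quickly.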
I have a MapReduce job in which the mapper is in charge of clustering data records. When a data record is read, I add it to a list. How do I know when all the data records have been read, so that I can start clustering the list?
The Mapper class provides a cleanup method that is called when a task finishes. You could use that as a hook to trigger whatever additional logic you need to perform on the list of objects. I have to ask, though: why not use a Reducer task to perform this processing?
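A minimal sketch of that cleanup hook, using the new-API org.apache.hadoop.mapreduce.Mapper; the record types and the clustering step are placeholders:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Buffers every record seen by this map task, then clusters the buffer
// once the framework signals that the task's input is exhausted.
public class ClusteringMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final List<String> records = new ArrayList<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        records.add(value.toString());   // just collect; no output yet
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Called once, after the last call to map() for this task.
        // Hypothetical clustering step over everything this mapper saw:
        for (List<String> cluster : clusterRecords(records)) {
            context.write(new Text("cluster"), new Text(String.join(",", cluster)));
        }
    }

    // Placeholder: the real clustering algorithm goes here.
    private List<List<String>> clusterRecords(List<String> all) {
        List<List<String>> clusters = new ArrayList<>();
        clusters.add(all);   // trivially put everything in one cluster
        return clusters;
    }
}
```

Note that each map task only sees its own input split, not the whole data set, which is part of why routing the records to a single Reducer is usually the more natural place for this kind of whole-collection processing.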