Spring Batch Metadata Issue

When I try to disable Spring Batch metadata creation with the option spring.batch.initialize-schema=never and then launch the batch, nothing happens and the batch terminates immediately without running the related jobs.
On the other hand, when I enable metadata creation, the batch runs, but I get the classic SERIALIZED_CONTEXT field size error. I can't always save 4 GB of data in that table when I execute the batch.
How can I disable metadata creation for good and still have my batch work?
Edit: I think I have found a workaround for this issue, and I would like your opinion on it. I am now running with metadata generation enabled. The issue occurs when you have a large set of data stored in the ExecutionContext you pass between tasklets (we all know this is the reason). In my case it is an ArrayList of elements (POJOs) retrieved from a CSV file with OpenCSV. To overcome this issue I have:
reduced the number of columns and lines in the ArrayList (Spring Batch serializes this ArrayList into the SERIALIZED_CONTEXT field, so the more columns and lines you have, the more likely you are to hit this issue),
changed the type of the SERIALIZED_CONTEXT column from TEXT to LONGTEXT,
deleted the toString() method defined in the POJO (not sure it really helps).
But I am still wondering: what if you have no choice and have to load all your columns, what is the best way to prevent this issue?

So this is not an issue with metadata generation but with passing a large amount of data between two steps.
what if you have no choice and you have to load all your columns, what is the best way to prevent this issue?
You can still load all columns, but you have to reduce the chunk size. The whole point of chunk processing in Spring Batch is to not load all the data in memory. What you can do in your case is carefully choose a chunk size that fits your requirements. There is no recipe for choosing the correct chunk size (it depends on the number of columns, the size of each column, etc.), so you need to proceed empirically.
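To make this concrete, here is a minimal sketch (not from the original answer) of a chunk-oriented step whose chunk size is externalized as a property so it can be tuned empirically. The bean names, the chunk.size property and the Record POJO are illustrative placeholders.

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ImportStepConfig {

    // Hypothetical POJO read from the CSV file.
    public static class Record { /* fields omitted */ }

    @Bean
    public Step importStep(StepBuilderFactory steps,
                           ItemReader<Record> csvReader,
                           ItemWriter<Record> dbWriter,
                           // Tune this value empirically: smaller chunks keep less
                           // data in memory (and in each transaction) at any one time.
                           @Value("${chunk.size:100}") int chunkSize) {
        return steps.get("importStep")
                .<Record, Record>chunk(chunkSize)
                .reader(csvReader)   // items are streamed one at a time instead of
                .writer(dbWriter)    // passing a whole ArrayList via the ExecutionContext
                .build();
    }
}
```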

Related

Spring Batch Transaction using Chunk based processing

I am using chunk-based processing in Spring Batch to read data in chunks from the DB using JdbcPagingItemReader.
Now, I am killing the task during the write stage of a chunk. Ideally the previous records in the chunk should have been rolled back, but that did not happen.
The DB used here is DB2.
The approach I used was to set autocommit to false on the connection and then, once the write steps were complete, issue a commit. This approach worked fine for a small set of data, but in real life there would be millions of records.
So, is this the right approach, and if not, what are the other solutions?
Thanks!
If I'm understanding you correctly: you want to do thousands of operations without committing, and only commit when everything is done?
Don't do that. You will have serious problems, both in the system and in the database.
For something like that, you are better off using another strategy, for example:
A temporary table. The job keeps committing as it goes, and when it finishes it analyzes the result and, if everything is right, executes an update from the temporary table to the final table.
You have to divide and conquer:
Run the process that generates the thousands of rows into a temporary table.
Analyze the result; this analysis may delete unsatisfactory rows or even reject the entire run.
Perform what should be done based on that analysis.
I would create three or more Spring Batch steps, one for each of the phases described above.
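As a rough illustration of that divide-and-conquer idea in Spring Batch terms, something like the following could work. The step and table names are invented, and the load/validate steps are assumed to be defined elsewhere; this is a sketch, not a drop-in implementation.

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;

@Configuration
public class StagingJobConfig {

    @Bean
    public Job stagingJob(JobBuilderFactory jobs, Step loadToStaging,
                          Step validateStaging, Step promoteToFinal) {
        // Each phase is its own step, so each chunk commits independently
        // and a failure can be analyzed and restarted per step.
        return jobs.get("stagingJob")
                .start(loadToStaging)   // chunk-oriented load into TEMP_TABLE (defined elsewhere)
                .next(validateStaging)  // delete or flag unsatisfactory rows (defined elsewhere)
                .next(promoteToFinal)   // copy validated rows into the final table
                .build();
    }

    @Bean
    public Step promoteToFinal(StepBuilderFactory steps, JdbcTemplate jdbc) {
        Tasklet promote = (contribution, chunkContext) -> {
            // Hypothetical table and column names; adjust to your schema.
            jdbc.update("INSERT INTO FINAL_TABLE SELECT * FROM TEMP_TABLE WHERE STATUS = 'OK'");
            return RepeatStatus.FINISHED;
        };
        return steps.get("promoteToFinal").tasklet(promote).build();
    }
}
```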

Spring Batch - how to avoid re-loading (writing) data that was loaded in the previous run

I have a basic Spring Batch app which loads data from a CSV file into MySQL. The program does load the file into the DB during the first run. However, when I accidentally re-ran the job/app, it threw a primary key violation (for the right reasons).
What is the best way to avoid reloading data that is already present on the target system? When the batch job is scheduled, if for any good reason the source file has not changed since the previous run, I want to see a "0 records processed" message rather than a primary key violation error. Hope that makes sense.
More information:
Thanks. I have probably not understood the answer, so let me explain my requirement in a better way. I have a file containing data from an external data source (say, new-hire data) with the fixed name hire.csv. The file should be updated with the delta changes for every run. As there is a possibility of a manual error of not removing all previously loaded rows, some new hires from the previous run could also be present in the current run. Is there a mechanism available within the ItemReader or ItemProcessor to skip records that are already present in the target DB? I could do "insert into tb where not in (select from tb)", but that runs for every row, which I don't want. Hope it is clearer now. Thanks again.
However, when I accidentally re-ran the job/app, it threw a primary key violation (for the right reasons). What is the best way to avoid reloading data that is already present on the target system?
The file you are ingesting should be an (identifying) job parameter. This way, when the first run succeeds, the job instance is complete and cannot be run again. This is by design in Spring Batch for this very use case: preventing a job from accidentally being executed twice.
Edit: adding further options based on the comments.
If deleting the file is an option, then you can use a job listener or a final step to delete the file after ingesting it. With this option, you need to add a second identifying parameter (since the file name is always hire.csv) to make sure you have a different job instance for each run. This option does not require a different file name for each run.
If the file can be renamed to hire-${timestamp}.csv and is therefore unique, then deleting the file after ingesting it and using a single job parameter with the file name is enough.
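A hedged sketch of what launching the job with identifying parameters might look like; the parameter names input.file and run.time, and the launcher class itself, are illustrative assumptions rather than anything from the original answer.

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class HireImportLauncher {

    private final JobLauncher jobLauncher;
    private final Job hireImportJob;

    public HireImportLauncher(JobLauncher jobLauncher, Job hireImportJob) {
        this.jobLauncher = jobLauncher;
        this.hireImportJob = hireImportJob;
    }

    public void launch() throws Exception {
        JobParameters params = new JobParametersBuilder()
                // Identifying parameter: the same file maps to the same job
                // instance, so a completed run cannot be accidentally re-executed.
                .addString("input.file", "hire.csv", true)
                // Because the file name never changes, a second identifying
                // parameter (here a timestamp) distinguishes each scheduled run.
                .addLong("run.time", System.currentTimeMillis(), true)
                .toJobParameters();
        jobLauncher.run(hireImportJob, params);
        // A JobExecutionListener or a final step could then delete hire.csv
        // once the ingestion has completed successfully, as described above.
    }
}
```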
Side note: I have seen people use a business key to identify records in the input file and an item processor to query the database and filter out items that have already been ingested. This works for small datasets but performs poorly with large datasets, due to the additional query for each item.
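For completeness, the item-processor filtering mentioned in the side note could be sketched roughly as below; the Hire POJO, the HIRE table and the EMPLOYEE_ID business key are hypothetical. Returning null from the processor tells Spring Batch to filter the item out.

```java
import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical POJO mapped from a line of hire.csv.
class Hire {
    private String employeeId;
    public String getEmployeeId() { return employeeId; }
    public void setEmployeeId(String employeeId) { this.employeeId = employeeId; }
}

// Filters out records whose business key already exists in the target table.
// Note: this issues one query per item, which is why it scales poorly.
public class AlreadyIngestedFilter implements ItemProcessor<Hire, Hire> {

    private final JdbcTemplate jdbcTemplate;

    public AlreadyIngestedFilter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public Hire process(Hire item) {
        Integer count = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM HIRE WHERE EMPLOYEE_ID = ?",
                Integer.class, item.getEmployeeId());
        // Returning null filters the item: it will not be passed to the writer.
        return (count != null && count > 0) ? null : item;
    }
}
```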

What is the capacity of a BluePrism Internal Work Queue?

I am working in BluePrism Robotic Process Automation and trying to load an Excel sheet with more than 100k records (it might go upwards of 300k in some cases).
I am trying to load the BluePrism internal work queue, but I get the error quoted below:
'Load Data Into Queue' ERROR: Internal : Exception of type 'System.OutOfMemoryException' was thrown.
Is there a way to avoid this problem, for example by freeing up more memory?
I plan to process records one by one from the queue and put them into new Excel sheets by category. Loading all that data into a collection and looping over it may be memory-consuming, so I am trying to find a more efficient way.
I welcome any and all help/tips.
Thanks!
Basic Solution:
Break up the number of Excel rows you pull into your Collection data item at any one time. The threshold for this will depend on your resource's system memory and architecture, as well as on the structure and size of the data in the Excel worksheet. I've been able to move 50k 10-column rows from Excel into a Collection and then into the Blue Prism queue very quickly.
You can set this up by specifying the Excel worksheet range to pull into the Collection data item, and then shifting that range each time the Collection has been successfully added to the queue.
After each successful addition to the queue, and/or before you shift the range, and/or at a predefined count limit, you can run a Clean Up or Garbage Collection action to free up memory.
You can do all of this with the provided Excel VBO and an additional Clean Up object.
Keep in mind:
Even broken up, looping over a Collection this large to amend the data will be extremely expensive and slow. The most efficient way to make changes to the data is at the Excel workbook level, or once the data is already in the Blue Prism queue.
Best Bet: esqew's alternative solution is the most elegant and probably your best bet.
Jarrick hit the nail on the head: work queue items should give the bot information about what it is to work on, plus a Control Room feedback space, but not hold the actual work data to be manipulated.
In this case you would want to use just the item's worksheet row number and/or some unique identifier from a single worksheet column as the queue item data, so that the bot can provide Control Room feedback on the status of the item. If this information is predictable enough in format, there should be no need to move any data from the Excel worksheet into a Collection and then into a work queue; you can simply build the queue based on that predictable data.
Conversely, you can also have the bot build the queue "as it happens": once it grabs a single row of data from the Excel worksheet to work it, it can also add a queue item with the row number of that data. This enables Control Room feedback and tracking. However, in almost every case this would be bad practice, as it would not prevent a row from being worked multiple times unless the bot checked the queue first, at which point you've negated the speed gains you were looking for by cutting out the initial queue build in the first place. It would also make it impossible to scale the process so that multiple bots can work the Excel worksheet data efficiently.
This is a common issue in RPA, especially when working with large Excel files. As far as I know, there is no 100% solution, only ways to reduce the symptoms. I have run into this problem several times, and these are the ways I would try to handle it:
Disable stage logging, or set it to Errors only.
Don't log parameters on action stages (especially ones that work with the Excel files).
Run the Garbage Collection process.
See if it is possible to avoid reading Excel files into BP collections and use OLEDB to query the file instead.
See if it is possible to increase the RAM on the machines.
If they’re using the 32-bit version of the app, then it doesn’t really matter how much memory you feed it, Blue Prism will cap out at 2 GB.
This may be because of the BP server, as memory is shared between processes and the work queue. A better option is to use two bots and multiple queues to avoid the memory error.
If you're using Excel documents or CSV files, you can use the OLEDB object to connect and query against them as if they were a database. You can use SQL syntax to limit the number of rows returned at a time and paginate through them until you've reached the end of the document.
For starters, you are making incorrect use of the work queue in Blue Prism. The work queue should not be used to store this type and amount of data (please read the BP documentation on work queues thoroughly).
Solving the issue at hand, namely the misuse, requires two changes:
Only store references in your item data, pointing to the Excel file that contains the data.
If you're consulting this much data many times, consider converting the file into a CSV and writing a VBO that queries the data directly from the CSV.
The first change is not just a recommendation: as your project progresses and IT architecture and InfoSec come into play, it will become mandatory.
As for the CSV VBO, take a look at C#; it will make your life a lot easier than loading all of this data into BP (time-consuming, unreliable, ...).

How to pass a String with more than 250 characters as job parameter in Spring Batch?

In the BATCH_JOB_EXECUTION_PARAMS table, the STRING_VAL column is defined as varchar(250). If a string longer than 250 characters is passed as a job parameter, the database complains that the data is too long. I did some research, and what some people did was manually change the definition of the column to hold more data. Are there any side effects to storing large params in the table? If so, what is the best way to pass a large job param?
Thanks.
There shouldn't be a side effect, especially if it is a non-identifying parameter.
But even then, the only place this could have a side effect is the generation of the JOB_KEY field in the JOB_INSTANCE table (have a look at JdbcJobInstanceDao).
The content of this field is generated using a JobKeyGenerator, and looking at the default implementation, org.springframework.batch.core.DefaultJobKeyGenerator, I don't see anything that could cause a side effect.
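As a quick, hedged way to convince yourself, you can generate the key for two parameter sets and compare them; the parameter names below are made up for the example, and the check relies on the default generator only hashing identifying parameters.

```java
import org.springframework.batch.core.DefaultJobKeyGenerator;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;

public class JobKeyDemo {
    public static void main(String[] args) {
        // Build a value well over the 250-character limit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append('x');
        }
        String longValue = sb.toString();

        JobParameters withLongParam = new JobParametersBuilder()
                .addString("run.id", "42", true)          // identifying
                .addString("payload", longValue, false)   // non-identifying
                .toJobParameters();

        JobParameters withoutLongParam = new JobParametersBuilder()
                .addString("run.id", "42", true)
                .toJobParameters();

        DefaultJobKeyGenerator keyGenerator = new DefaultJobKeyGenerator();
        // Both keys should be identical: only identifying parameters feed the
        // hash stored in BATCH_JOB_INSTANCE.JOB_KEY.
        System.out.println(keyGenerator.generateKey(withLongParam));
        System.out.println(keyGenerator.generateKey(withoutLongParam));
    }
}
```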
I would not go down that road, since that table is part of Spring Batch and developed outside of your control. Even if it is safe to change now, what if they decide to rely on the 250-character limit in some important framework functionality? You would either get funny bugs when you upgrade to a new version, or be version-locked because you changed the library's schema yourself.
I answered a similar question in this post. You can create a new table for holding the parameters (or whatever else you need) next to the Spring Batch metadata (in the same database) and pass just an ID. Inside the Spring Batch job you can then pull whatever you need from that table based on the passed ID.
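A rough sketch of that idea, assuming a hypothetical JOB_PARAMS_EXT table stored next to the Spring Batch metadata and an ID passed as the job parameter params.id:

```java
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.jdbc.core.JdbcTemplate;

// Tasklet that resolves the real (large) parameter value from a side table,
// using a short ID passed as the Spring Batch job parameter "params.id".
public class ResolveParamsTasklet implements Tasklet {

    private final JdbcTemplate jdbcTemplate;

    public ResolveParamsTasklet(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        Long paramsId = (Long) chunkContext.getStepContext()
                .getJobParameters().get("params.id");
        // JOB_PARAMS_EXT is a hypothetical table living next to the Batch metadata.
        String largeValue = jdbcTemplate.queryForObject(
                "SELECT PARAM_VALUE FROM JOB_PARAMS_EXT WHERE ID = ?",
                String.class, paramsId);
        // Use largeValue here; avoid stuffing it back into the ExecutionContext.
        System.out.println("Resolved parameter of length " + largeValue.length());
        return RepeatStatus.FINISHED;
    }
}
```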

How can I limit memory usage when generating a CSV from a large resultset?

I have a Spring web application with a functional requirement to generate a CSV/Excel spreadsheet from a result set coming from a large Oracle database. The expected row count is in the 300,000 to 1,000,000 range. Time to process is not as big an issue as keeping the application stable, and right now very large result sets cause it to run out of memory and crash.
In a normal situation like this, I would use pagination and have the UI display a limited number of results at a time. However, in this case I need to be able to produce the entire set in a single file, no matter how big it might be, for offline use.
I have isolated the issue to the ParameterizedRowMapper being used to convert the result set into objects, which is where I'm stuck.
What techniques might I be able to use to get this operation under control? Is pagination still an option?
A simple answer:
Use a JDBC result set (or something similar, with an appropriate array/fetch size) and write the data to a LOB, either a temporary one or back into the database.
Another choice:
Use PL/SQL in the database to write a file with UTL_FILE for your result set in CSV format. As the file will be on the database server, not on the client, use UTL_SMTP or JavaMail via Java Stored Procedures to mail the file. After all, I'd be surprised if someone were going to watch the hourglass turn over repeatedly waiting for a one-million-row result set to be generated.
Instead of loading the entire file in memory, you can process each row individually and use an output stream to send the output directly to the web browser. For example, in the Servlet API you can get the output stream from ServletResponse.getOutputStream() and then simply write the resulting CSV lines to that stream.
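A hedged sketch combining both suggestions (a larger JDBC fetch size and writing each row straight to the servlet output stream); the query, table and column names are invented for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import javax.servlet.http.HttpServletResponse;
import javax.sql.DataSource;

public class CsvExportService {

    private final DataSource dataSource;

    public CsvExportService(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void export(HttpServletResponse response) throws IOException, SQLException {
        response.setContentType("text/csv");
        response.setHeader("Content-Disposition", "attachment; filename=\"export.csv\"");

        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "SELECT ID, NAME, AMOUNT FROM BIG_TABLE"); // hypothetical query
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     response.getOutputStream(), StandardCharsets.UTF_8))) {

            // Ask the driver to stream rows in batches instead of buffering them all.
            ps.setFetchSize(500);

            try (ResultSet rs = ps.executeQuery()) {
                out.write("id,name,amount");
                out.newLine();
                while (rs.next()) {
                    // Only the current row is held in memory at any time.
                    out.write(rs.getLong("ID") + "," + rs.getString("NAME")
                            + "," + rs.getBigDecimal("AMOUNT"));
                    out.newLine();
                }
            }
        }
    }
}
```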
I would push back on those requirements; they sound pretty artificial.
What happens if your application fails, or the power goes out, before the user looks at that data?
From your comment above, it sounds like you know the answer: you need filesystem or Oracle access in order to do your job.
You are being asked to generate some data, something that is not repeatable by SQL?
If it were repeatable, you would just send pages of data back to the user at a time.
Since this report, I'm guessing, has something to do with the current state of your data, you need to store that result somewhere if you can't stream it out to the user. I'd write a stored procedure in Oracle; it's much faster not to send data back and forth across the network. If you have special tools or it's just easier, there's nothing wrong with doing it on the Java side instead.
Can you schedule this report to run once a week?
Have you considered the performance of an Excel spreadsheet with 1,000,000 rows?
