Spring Batch job starts processing a file that is not fully uploaded to the SFTP server

I have a spring-batch job scanning the SFTP server at a given interval. When it finds a new file, it starts processing it.
It works fine for most cases, but there is one case where it doesn't work:
The user starts uploading a new file to the SFTP server
The batch job checks the server and finds the new file
It starts processing it
But since the file is still being uploaded, the processing encounters an unexpected end of input and fails with an error.
How can I check that the file was fully uploaded to the SFTP server before the batch job starts processing it?

Locking files while uploading / Upload to temporary file name
You may have an automated system monitoring a remote folder and you want to prevent it from accidentally picking up a file that has not finished uploading yet. As the majority of SFTP and FTP servers (WebDAV being an exception) do not support file locking, you need to prevent the automated system from picking up the file by other means.
Common workarounds are:
Upload a “done” file once the upload of the data files finishes, and have the automated system wait for the “done” file before processing the data files. This is an easy solution, but it won’t work in a multi-user environment.
Upload data files to a temporary (“upload”) folder and move them atomically to the target folder once the upload finishes.
Upload data files under a distinct temporary name, e.g. with a .filepart extension, and rename them atomically once the upload finishes. Have the automated system ignore the .filepart files (see the sketch below).
Got from here
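
A minimal sketch of the third workaround in plain Java, assuming the batch job obtains a listing of remote file names before deciding what to process; the class name and predicate are illustrative, only the .filepart convention comes from the answer above:

import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class CompletedUploadFilter {

    // Files still being uploaded carry the temporary suffix; ignore them.
    private static final String IN_PROGRESS_SUFFIX = ".filepart";

    public static final Predicate<String> IS_COMPLETE =
            name -> !name.endsWith(IN_PROGRESS_SUFFIX);

    // Keep only the names of files whose upload has finished (i.e. files already renamed).
    public static List<String> completedFiles(List<String> remoteFileNames) {
        return remoteFileNames.stream()
                .filter(IS_COMPLETE)
                .collect(Collectors.toList());
    }
}

Note that the uploader must rename the .filepart file to its final name only after the transfer completes; otherwise the filter has nothing reliable to key on.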

We had a similar problem. Our solution was to configure the spring-batch cron trigger to run the job every 10 minutes (we could have configured 5 minutes, as the file transfer was taking less than 3 minutes), and then read/process only the files created more than 10 minutes earlier. We assume the FTP operation completes within 3 minutes. This gave us some additional flexibility, such as when the spring-batch app was down, etc.
For example, if the batch job is triggered at 10:20 AM, we read all the files that were created before 10:10 AM; likewise, the job that runs at 10:30 reads all the files created before 10:20.
Note: once a file has been read, you need to either delete it or move it to a history folder to avoid duplicate reads.
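
A rough sketch of that cut-off, assuming the files are visible on a local (or mounted) path and that the last-modified time is a reasonable proxy for when the upload finished; the class name, the 10-minute constant and the error handling are placeholders:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StableFileSelector {

    // Only pick files last touched at least this long before the job run.
    private static final Duration CUT_OFF = Duration.ofMinutes(10);

    public static List<Path> filesReadyAt(Path inputDir, Instant jobStart) throws IOException {
        Instant threshold = jobStart.minus(CUT_OFF);
        try (Stream<Path> files = Files.list(inputDir)) {
            return files
                    .filter(Files::isRegularFile)
                    .filter(p -> lastModified(p).isBefore(threshold))
                    .collect(Collectors.toList());
        }
    }

    private static Instant lastModified(Path p) {
        try {
            return Files.getLastModifiedTime(p).toInstant();
        } catch (IOException e) {
            // If the timestamp cannot be read, treat the file as not ready yet.
            return Instant.MAX;
        }
    }
}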

Related

Apache NiFi - How to pull all files through the GetSFTP processor only if a particular text file is available, otherwise ignore the files

There will be many files daily, but I need to pull them only if a particular text file is in the list (which indicates all files are ready to pull), through the GetSFTP processor.
This process involves pulling files from SFTP and copying them to AWS S3.
I know an alternative is to write a script and pull the files through it, but I am looking to achieve the same thing with processors, without a script.

Is there any way to avoid processing the same file twice with Spring Batch?

I am working on a 3-step Spring Batch project. Firstly, it downloads the needed text files from FTP to local storage, then it processes them, and finally it deletes the files in the local directory, every 10 minutes. And every 10 minutes new files are loaded onto the FTP server. What if some problem emerges on the FTP side and no new files are loaded? Then the Spring Batch project downloads the same file and processes it again. So my question is: how can I prevent Spring Batch from processing the same file twice?
Edit: I have used the Apache Commons library to download files from FTP.
And I am using MultiResourceItemReader to pull 2 text files at each run.
I would use the file name as a job parameter. This will create a job instance for each file.
Since Spring Batch prevents running the same job instance to completion more than once, each file will be processed only once, and you avoid processing the same file twice by design.
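A minimal sketch of that approach, assuming a JobLauncher and the configured Job are available for injection; the class name and the input.file parameter name are just conventions:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class PerFileJobLauncher {

    private final JobLauncher jobLauncher;
    private final Job importFileJob;

    public PerFileJobLauncher(JobLauncher jobLauncher, Job importFileJob) {
        this.jobLauncher = jobLauncher;
        this.importFileJob = importFileJob;
    }

    public void launchFor(String fileName) throws Exception {
        // The file name becomes an identifying job parameter, so each file maps
        // to its own job instance.
        JobParameters params = new JobParametersBuilder()
                .addString("input.file", fileName)
                .toJobParameters();
        jobLauncher.run(importFileJob, params);
    }
}

Submitting a file name whose job instance has already completed makes the launcher throw JobInstanceAlreadyCompleteException, which is exactly the "only once by design" behaviour described above.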

How best to queue file uploads to S3 in a multi-server environment?

In short, my API will accept file uploads. I (ultimately) store them in S3, but to avoid uploading them to S3 during the same request, I queue the upload process and do it in the background.
Originally I was storing the file on the server, queueing the file path in my job, then grabbing the contents from that path on the server and sending them to S3.
I develop/stage on a single server. My production environment will sit behind a load balancer, with 2-3 servers. I realised that my jobs will fail about 2/3 of the time, as the file referenced in my job may be on a different server than the one running the job.
I realised I could base64_encode the file contents and store that in Redis (as opposed to just storing the path of the file), using the following:
// Read the file from local disk and base64 encode it so the payload can live in Redis
$contents = base64_encode(file_get_contents($file));
// Queue the background job that pushes the contents to S3
UploadFileToCloud::dispatch($filePath, $contents, $lead)->onQueue('s3-uploads');
I have quite a large Redis store, so I am confident I can do this for lots of small files (most likely in my case), but some files can be quite large.
I am starting to have concerns that I may run into issues with this method, most likely my Redis store running out of memory.
I have thought about using a shared drive between all my instances and reverting to my original method of storing the file path, but I'm unsure.
Another issue: if a file upload fails and it's a big file, can the failed_jobs table handle the amount of data of, for example, a base64-encoded 20 MB PDF?
Is base64-encoding the file contents and queuing that the best method? Or can anyone recommend an alternative way to queue a file upload in a multi-server environment?

Spring batch integration file lock access

I have a Spring Batch integration where multiple servers are polling a single file directory. This causes a problem where the same file can be picked up and processed by more than one server. I have attempted to add an NIO lock onto the file once a server has got it, but this locks the file so that its contents can't even be read for processing.
Is there a Spring Batch/Integration solution to this problem, or is there a way to rename the file as soon as it is picked up by a node?
Consider using FileSystemPersistentAcceptOnceFileListFilter with a shared MetadataStore: http://docs.spring.io/spring-integration/reference/html/system-management-chapter.html#metadata-store
That way, only one instance of your application will be able to pick up each file.
Even if we found a solution for the NIO lock, you should understand that a lock means "do not touch until freed". Therefore, as soon as one instance has done its work, another one would be ready to pick up the same file. I guess that isn't your goal.
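
A minimal sketch of that suggestion using a Redis-backed metadata store shared by all polling nodes (any shared ConcurrentMetadataStore, e.g. JDBC or Zookeeper, would do just as well); the bean names, key prefix and directory path are placeholders:

import java.io.File;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.integration.file.FileReadingMessageSource;
import org.springframework.integration.file.filters.FileSystemPersistentAcceptOnceFileListFilter;
import org.springframework.integration.metadata.ConcurrentMetadataStore;
import org.springframework.integration.redis.metadata.RedisMetadataStore;

@Configuration
public class SharedPollingConfig {

    @Bean
    public ConcurrentMetadataStore metadataStore(RedisConnectionFactory connectionFactory) {
        // Shared store: every polling node records which files it has already accepted here.
        return new RedisMetadataStore(connectionFactory);
    }

    @Bean
    public FileReadingMessageSource fileSource(ConcurrentMetadataStore metadataStore) {
        FileReadingMessageSource source = new FileReadingMessageSource();
        source.setDirectory(new File("/shared/input"));  // placeholder path
        // Accept each file only once across all instances.
        source.setFilter(new FileSystemPersistentAcceptOnceFileListFilter(metadataStore, "batch-files:"));
        return source;
    }
}

Because the "already seen" flag lives in the shared store rather than in each JVM's memory, a second node that lists the same file simply skips it instead of locking or reprocessing it.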

What happens if I trigger a transfer in WebSphere MQ FTE but the folder is constantly receiving new files

I want to know what happens if I program a monitor to trigger a transfer whenever a trigger file is found in directory x, transferring all the .txt files in that folder. What happens if the directory receives other files after the trigger file is created? Are they sent in the same transfer, or will they be sent in another one?
Thanks for your help in advance
It depends on the timing between when the agent begins processing the transfer request submitted by the monitor and when the extra files are added to the directory that contains the source files to be transferred.
As an example, let's say you monitor directory x to match on the trigger file, "trigger.file". When this file is detected by a poll of the resource monitor, it submits a managed transfer request to the agent that specifies "*.txt" in directory x as the source file specification. In other words, the managed transfer request submitted will transfer any file ending in .txt in directory x (because of the wildcard).
Now, imagine the following timeline of events:
Two .txt files (file1.txt, file2.txt) are added to directory x.
The trigger file (trigger.file) is then created in directory x.
The resource monitor polls and detects the file "trigger.file", which matches the resource monitor's trigger conditions.
The resource monitor then submits a managed transfer request to the agent.
Before the agent processes this request, a new .txt file is added to directory x (file3.txt).
The agent then starts to process the managed transfer request and needs to expand the wildcard source file specification (*.txt) into a concrete list of files. So it lists directory x and picks out the files ending in .txt. At this point there are three files (file1.txt, file2.txt and file3.txt) included in the transfer, even though file3.txt was created after the resource monitor triggered on the trigger file.
Once the wildcard has been expanded and the concrete list of files determined, any new .txt file (e.g., file4.txt) will not be transferred until the trigger file is updated or replaced, causing the resource monitor to trigger again.
I hope this helps! If you need any further clarification, feel free to ask.
