Get the running uploads on the S3 service using the AWS SDK - Spring Boot

I'm using S3 from AWS. When I upload a file, I would like to keep track of all uploads in progress, so that after a certain number of parallel uploads the user cannot transfer any more:
get runningUploads
if runningUploads > maxAuthorizedUpload then stop
else upload to S3
I have no idea how to check the current number of uploads.

You can use a ThreadPoolTaskExecutor to run each upload as its own task and check the number of active threads with the getActiveCount() method; if it has reached maxAuthorizedUpload, don't submit a new task for that upload.
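A minimal sketch of that idea, assuming the AWS SDK for Java v2 and hypothetical names (maxAuthorizedUpload, bucket, key, file):

import java.nio.file.Path;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class UploadService {

    private final S3Client s3Client = S3Client.create();
    private final int maxAuthorizedUpload = 5;                 // hypothetical limit
    private final ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();

    public UploadService() {
        executor.setCorePoolSize(maxAuthorizedUpload);
        executor.setMaxPoolSize(maxAuthorizedUpload);
        executor.initialize();
    }

    public boolean tryUpload(String bucket, String key, Path file) {
        // Reject the upload once the maximum number of parallel uploads is active.
        // getActiveCount() is only an approximation, so this is not a strict guarantee.
        if (executor.getActiveCount() >= maxAuthorizedUpload) {
            return false;
        }
        executor.execute(() -> s3Client.putObject(
                PutObjectRequest.builder().bucket(bucket).key(key).build(),
                RequestBody.fromFile(file)));
        return true;
    }
}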

Related

How to check and download files from S3 bucket in a specific interval?

I want to implement the following once a file is uploaded to an S3 bucket:
Download the file to a windows server
Run a 3rd party exe to process the file and generate an output file on a Windows Server
What is the best approach to implement this using .Net Core?
Solution 1:
Create a Lambda function to trigger an API
API will download the file and process
Solution 2:
Create an executable to download the file from the S3 bucket
Create a Lambda function to trigger the executable
Solution 3:
Create a service to check and download files from the S3 bucket
The downloaded file will be processed by the service
Solution 4:
Use AWS Lambda to push the file to SQS
Create an application to monitor SQS.
Please let me know the best solution to implement this. Sorry for asking this non-technical question.
The correct architecture approach would be:
Create a trigger on the Amazon S3 bucket that sends a message to an Amazon SQS queue when the object is created
A Windows server is continually polling the Amazon SQS queue waiting for a message to appear
When a message appears, use the information in the message to download the object from S3 and process the file
Upload the result to Amazon S3 and optionally send an SQS message to signal completion (depending on what you wish to do after a file is processed)
This architecture is capable of scaling to large volumes and allows files to be processed in parallel and even across multiple servers. If a processing task fails and does not signal completion, then Amazon SQS will make the message visible again for processing.
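A minimal sketch of the polling step on the server, here using the AWS SDK for Java v2 (the same receive / download / delete pattern applies with the AWS SDK for .NET); the queue URL, the download directory and the event parsing are assumptions:

import java.nio.file.Paths;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageResponse;

public class S3EventPoller {
    public static void main(String[] args) {
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/uploads"; // hypothetical queue
        SqsClient sqs = SqsClient.create();
        S3Client s3 = S3Client.create();

        while (true) {
            // Long polling: wait up to 20 seconds for a message to appear.
            ReceiveMessageResponse response = sqs.receiveMessage(ReceiveMessageRequest.builder()
                    .queueUrl(queueUrl).maxNumberOfMessages(1).waitTimeSeconds(20).build());

            for (Message message : response.messages()) {
                // The bucket and key come from parsing the S3 event JSON in
                // message.body(); the parsing is omitted here for brevity.
                String bucket = "...";
                String key = "...";
                s3.getObject(GetObjectRequest.builder().bucket(bucket).key(key).build(),
                        Paths.get("C:/incoming/" + key));          // hypothetical local directory
                // Run the 3rd-party exe against the downloaded file here, then
                // delete the message so it is not processed again.
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(queueUrl).receiptHandle(message.receiptHandle()).build());
            }
        }
    }
}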

Laravel Lumen directly Download and Extract ZIP file to Google Cloud Storage

My goal is to download a large zip file (15 GB) and extract it to Google Cloud using Laravel Storage (https://laravel.com/docs/8.x/filesystem) and https://github.com/spatie/laravel-google-cloud-storage.
My "wish" is to sort of stream the file to Cloud Storage, so I do not need to store the file locally on my server (because it is running in multiple instances, and I want to have the disk size as small as possible).
Currently, there does not seem to be a way to do this without saving the zip file on the server, which is not ideal in my situation.
Another idea is to use a Google Cloud Function (e.g. with Python) to download, extract and store the file. However, it seems that Google Cloud Functions are limited to a maximum timeout of 9 minutes (540 seconds), and I don't think that will be enough time to download and extract 15 GB...
Any ideas on how to approach this?
You should be able to use streams for uploading big files. Here’s the example code to achieve it:
$disk = Storage::disk('gcs');
$disk->put($destFile, fopen($sourceZipFile, 'r+'));

How best to queue file uploads to S3 in a multi server environment?

In short, my API will accept file uploads. I (ultimately) store them in S3, but to save uploading them to S3 on the same request, I queue the upload process and do it in the background.
I was originally storing the file on the server; in my job I queued the file path, then grabbed the contents from that path on the server and sent them to S3.
I develop and stage on a single server, but my production environment will sit behind a load balancer with 2-3 servers. I realised that my jobs will fail two thirds of the time, because the file referenced in the job may live on a different server than the one running the job.
I then realised I could base64_encode the file contents and store that in Redis (as opposed to storing just the path of the file), using the following:
$contents = base64_encode(file_get_contents($file));
UploadFileToCloud::dispatch($filePath, $contents, $lead)->onQueue('s3-uploads');
I have quite a large Redis store, so I am confident I can do this for lots of small files (most likely in my case), but some files can be quite large.
I am starting to have concerns that I may run into issues using this method, most likely issues to do with my Redis store running out of memory.
I have thought about using a shared drive between all my instances and reverting to my original method of storing the file path, but I'm unsure.
Another concern: if the upload of a big file fails, can the failed_jobs table handle the amount of data of, for example, a base64-encoded 20 MB PDF?
Is base64 encoding the file contents and queuing that the best method? Or can anyone recommend an alternative way to queue file uploads in a multi-server environment?

Spring batch job start processing file not fully uploaded to the SFTP server

I have a spring-batch job scanning the SFTP server at a given interval. When it finds a new file, it starts the processing.
It works fine for most cases, but there is one case when it doesn't work:
User starts uploading a new file to the SFTP server
Batch job checks the server and finds a new file
It starts processing it
But since the file is still being uploaded, the processing hits an unexpected end of input block and the job fails.
How can I check that the file was fully uploaded to the SFTP server before the batch job starts processing it?
Locking files while uploading / Upload to temporary file name
You may have an automated system monitoring a remote folder, and you want to prevent it from accidentally picking up a file that has not finished uploading yet. As the majority of SFTP and FTP servers (WebDAV being an exception) do not support file locking, you need to prevent the automated system from picking up the file in some other way.
Common workarounds are:
Upload a “done” file once the upload of the data files finishes and have the automated system wait for the “done” file before processing the data files. This is an easy solution, but it won't work in a multi-user environment.
Upload data files to temporary (“upload”) folder and move them atomically to target folder once the upload finishes.
Upload data files to a distinct temporary name, e.g. with a .filepart extension, and rename them atomically once the upload finishes. Have the automated system ignore the .filepart files (a minimal filter sketch follows below).
Got from here
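For that last workaround, the consuming side only has to skip the temporary names; a minimal sketch of such a filter (the .filepart suffix is the convention mentioned above, the class and method names are made up):

import java.util.List;
import java.util.stream.Collectors;

public class PartialUploadFilter {

    // Skip files that are still being uploaded under their temporary name.
    public static List<String> readyFiles(List<String> remoteFileNames) {
        return remoteFileNames.stream()
                .filter(name -> !name.endsWith(".filepart"))
                .collect(Collectors.toList());
    }
}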
We had a similar problem. Our solution was to configure the spring-batch cron trigger to run the job every 10 minutes (we could have configured 5 minutes, since the file transfer was taking less than 3 minutes) and then read/process only the files created more than 10 minutes earlier. We assume the FTP operation completes within 3 minutes. This also gave us some additional flexibility, for example when the spring-batch app was down.
For example, the job triggered at 10:20 AM reads all the files created before 10:10 AM; likewise, the job that runs at 10:30 reads all the files created before 10:20.
Note: once read, the files need to be deleted or moved to a history folder to avoid duplicate reads.
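A minimal sketch of that age check, assuming the remote file's modification time is available as epoch milliseconds from the directory listing; the 10-minute window mirrors the cron interval above:

import java.time.Duration;
import java.time.Instant;

public class UploadAgeFilter {

    private static final Duration SAFETY_WINDOW = Duration.ofMinutes(10); // matches the 10-minute cron interval

    // Only process files whose last modification is older than the safety window,
    // so an in-progress transfer is never picked up.
    public static boolean isReadyForProcessing(long lastModifiedEpochMillis) {
        Instant cutoff = Instant.now().minus(SAFETY_WINDOW);
        return Instant.ofEpochMilli(lastModifiedEpochMillis).isBefore(cutoff);
    }
}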

Update extension for multiple files at once on Amazon S3

I have about 1 million files in my S3 bucket, and unfortunately these files were uploaded with the wrong extension. I need to add a '.gz' extension to every file in that bucket.
I can do that using the AWS CLI:
aws s3 mv s3://bucket_name/name_1 s3://bucket_name/name_1.gz
This works fine, but the script runs very slowly since it moves the files one by one; by my calculation it will take up to a week, which is not acceptable.
Is there a better and faster way to achieve this?
You can try S3 Browser, which supports multi-threaded calls.
http://s3browser.com/
I suspect other tools can do multi-threaded calls as well, but the CLI doesn't.
There is no rename operation for S3 objects/buckets, so you need to move or copy/delete the files. If the files are big, this can indeed be a bit slow.
However, nothing forces you to wait for one request to complete before "renaming" the next file in your list; you can issue several requests in parallel.
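As an illustration, a sketch with the AWS SDK for Java v2 that issues the copy/delete pairs from a small thread pool instead of one at a time; the bucket name and the pool size are assumptions:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CopyObjectRequest;
import software.amazon.awssdk.services.s3.model.DeleteObjectRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

public class BulkRename {
    public static void main(String[] args) throws InterruptedException {
        String bucket = "bucket_name";                           // hypothetical bucket
        S3Client s3 = S3Client.create();
        ExecutorService pool = Executors.newFixedThreadPool(32); // degree of parallelism

        // The paginator walks every key in the bucket, 1000 per page.
        s3.listObjectsV2Paginator(ListObjectsV2Request.builder().bucket(bucket).build())
                .contents()
                .forEach(obj -> pool.submit(() -> {
                    String key = obj.key();
                    if (key.endsWith(".gz")) {
                        return;                                  // already renamed
                    }
                    // S3 has no rename: copy to the new key, then delete the old one.
                    s3.copyObject(CopyObjectRequest.builder()
                            .sourceBucket(bucket).sourceKey(key)
                            .destinationBucket(bucket).destinationKey(key + ".gz").build());
                    s3.deleteObject(DeleteObjectRequest.builder().bucket(bucket).key(key).build());
                }));

        pool.shutdown();
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
    }
}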
