Update extension for multiple files at once on Amazon S3 - ruby

I have about 1 million files in my S3 bucket, and unfortunately they were uploaded with the wrong extension. I need to add a '.gz' extension to every file in that bucket.
I can manage to do that with the AWS CLI:
aws s3 mv bucket_name/name_1 bucket_name/name_1.gz
This works fine, but the script runs very slowly since it moves the files one by one; by my calculation it will take up to a week, which is not acceptable.
Is there a better and faster way to achieve this?

You can try S3 Browser, which supports multi-threaded calls.
http://s3browser.com/
I suspect other tools can do multi-threading as well, but the AWS CLI doesn't.

There is no rename operation for S3 objects or buckets, so you need to move or copy/delete files. If the files are big, it can indeed be a bit slow.
However, nothing forces you to wait for one request to complete before "renaming" the next file in your list; you can have many copy/delete requests in flight at once.
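For example, a minimal shell sketch of that idea (assuming object keys contain no spaces; the 20 parallel workers are an arbitrary starting point), fanning the same aws s3 mv command out with xargs:
# list every key, then run 20 renames (each a server-side copy plus delete) in parallel
aws s3 ls s3://bucket_name --recursive | awk '{print $4}' |
  xargs -P 20 -I {} aws s3 mv "s3://bucket_name/{}" "s3://bucket_name/{}.gz"
Each mv is still a copy followed by a delete on the S3 side, but with 20 or more of them in flight the total wall-clock time should drop roughly proportionally.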

Related

Laravel Lumen directly Download and Extract ZIP file to Google Cloud Storage

My goal is to download a large zip file (15 GB) and extract it to Google Cloud using Laravel Storage (https://laravel.com/docs/8.x/filesystem) and https://github.com/spatie/laravel-google-cloud-storage.
My "wish" is to sort of stream the file to Cloud Storage, so I do not need to store the file locally on my server (because it is running in multiple instances, and I want to have the disk size as small as possible).
Currently, there does not seem to be a way to do this without having to save the zip file on the server. Which is not ideal in my situation.
Another idea is to use a Google Cloud Function (eg with Python) to download, extract and store the file. However, it seems like Google Cloud Functions are limited to a max timeout of 9 mins (540 seconds). I don't think that will be enough time to download and extract 15GB...
Any ideas on how to approach this?
You should be able to use streams for uploading big files. Here’s the example code to achieve it:
// Passing a stream resource (rather than the file contents) lets Laravel
// stream the upload to GCS instead of loading the whole file into memory.
$disk = Storage::disk('gcs');
$disk->put($destFile, fopen($sourceZipFile, 'r+'));

How to correctly dockerize and continuously integrate 20GB raw data?

I have an application that uses about 20GB of raw data. The raw data consists of binaries.
The files rarely - if ever - change. Changes only happen if there are errors within the files that need to be resolved.
The simplest way to handle this would be to put the files in their own git repository and create a base image based on that, then build the application on top of the raw-data image.
Having a 20GB base image for a CI pipeline is not something I have tried and does not seem to be the optimal way to handle this situation.
The main reason for my approach here is to avoid extra deployment complexity.
Is there a best practice, "correct" or more sensible way to do this?
Huge mostly-static data blocks like this are, to me, probably the one big exception to the "Docker images should be self-contained" rule. I'd suggest keeping this data somewhere else and downloading it separately from the core docker run workflow.
I have had trouble in the past with multi-gigabyte images. Operations like docker push and docker pull in particular are prone to hanging up on the second gigabyte of individual layers. If, as you say, this static content changes rarely, there’s also a question of where to put it in the linear sequence of layers. It’s tempting to write something like
FROM ubuntu:18.04
ADD really-big-content.tar.gz /data
...
But even the ubuntu:18.04 image changes regularly (it gets security updates fairly frequently; your CI pipeline should explicitly docker pull it), and when it does, a new build will have to transfer this entire unchanged 20 GB block again.
Instead I would put them somewhere like an AWS S3 bucket or similar object storage. (This is a poor match for source control systems, which (a) want to keep old content forever and (b) tend to be optimized for text rather than binary files.) Then I'd have a script that runs on the host to download that content, and mount the corresponding host directory into the containers that need it.
curl -LO http://downloads.example.com/really-big-content.tar.gz
tar xzf really-big-content.tar.gz
docker run -v $PWD/really-big-content:/data ...
(In Kubernetes or another distributed world, I’d probably need to write a dedicated Job to download the content into a Persistent Volume and run that as part of my cluster bring-up. You could do the same thing in plain Docker to download the content into a named volume.)
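A rough sketch of that plain-Docker variant, reusing the download URL from the script above (the volume name is made up, and any image with wget and tar would do):
# download the content once into a named volume, then mount it wherever it's needed
docker volume create really-big-content
docker run --rm -v really-big-content:/data alpine \
  sh -c 'wget -O - http://downloads.example.com/really-big-content.tar.gz | tar xzf - -C /data'
docker run -v really-big-content:/data ...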

Downloading and Transferring Files Simultaneously with Limited EC2 Storage

I am using an EC2 instance in AWS to run a bash script that downloads files from a server using a CLI while simultaneously moving them into S3 using the AWS CLI (aws s3 mv). However, I usually run out of storage before I can do this because the download speeds are faster than the transfer speeds to S3. The files, which are downloaded daily, are usually hundreds of GB and I do not want to upgrade storage capacity if at all possible.
The CLI I am using for the downloads runs continuously until success/fail, but outputs status to the console as it goes (when I run it from the command line instead of a .sh). I am looking for a way to run this script given those constraints. My most recent attempt was something along the lines of:
until (CLI_is_downloading) | grep -m 1 "download complete"; do aws s3 mv --recursive path/local_memory s3://path/s3; done
But that ran out of memory and the download failed well before the move was finished.
One possible solution I thought of is to run the download CLI until a certain amount of storage remains, switch to the transfer, and then alternate back and forth. Also, I am not too experienced with AWS, so I am not sure this would work, but could I limit the download speed to match the transfer speed (like network throttling)? Any advice on the practicality of these ideas, or other suggestions on how to implement this, would be greatly appreciated.
EDIT: I checked my console output again and it seems that the aws s3 mv --recursive only moved the files that were currently there when the function was first called and then stopped. I believe if I called it repeatedly until I got my "files downloaded" message from my other CLI command, it might work. I am not sure exactly how to do this yet so suggestions would still be appreciated but otherwise, this seems like a job for tomorrow.
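A minimal sketch of that loop, assuming the downloader is started in the background (downloader_cli stands in for the actual download command, the 60-second pause is arbitrary, and the downloader should write to temporary names or a staging directory so mv never picks up half-written files):
downloader_cli --your-options &   # placeholder for the actual download command
dl_pid=$!
while kill -0 "$dl_pid" 2>/dev/null; do
  # keep draining local storage to S3 while the download is still running
  aws s3 mv --recursive path/local_memory s3://path/s3
  sleep 60
done
# one final pass for anything written after the last iteration
aws s3 mv --recursive path/local_memory s3://path/s3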

Shell Script for file monitoring

I have 2 AWS EC2 LAMP servers and I want to replicate the data in one of the folders to the other. I know I can try EFS, but for some reason it is not a viable option at the moment. So here is what I want to ask for help with:
Server A and Server B have the same file structure, but the files inside are mismatched. I want a script on Server A to look in, for example, the /var/www/html/../file/ folder, compare it with /var/www/html/../file/ on Server B, and dump all new files from Server A to B.
Any help on how to write it?
Well, I used S3FS, which is a lot easier than breaking my head over the script. It readily copies the files from one server to the other.
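For reference, a minimal sketch of that kind of setup (the bucket name and mount point are made up; both servers mount the same bucket, so files written on one are visible on the other):
# on both Server A and Server B, after installing s3fs-fuse
echo 'ACCESS_KEY_ID:SECRET_ACCESS_KEY' > ${HOME}/.passwd-s3fs
chmod 600 ${HOME}/.passwd-s3fs
s3fs shared-files-bucket /var/www/html/shared -o passwd_file=${HOME}/.passwd-s3fs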

Golang file and folder replication / mirroring across multiple servers

Consider this scenario. In a load-balanced environment, I have 3 separate instances of a CMS running on 3 different physical servers. These 3 running instances of the application share the same database.
On each server, the CMS has a /media folder where all media subfolders and files reside. My question is how I'd implement/code a file replication service/functionality in Golang, so when a subfolder or file is added/changed/deleted on one of the servers, it'll get copied/replicated/deleted on all other servers?
What packages would I need to look in to, or perhaps you have a small code snippet to help me get started? That would be awesome.
Edit:
This question has been marked as "duplicate", but it is not. It is however an alternative to setting up a shared network file system. I'm thinking that keeping a copy of the same file on all servers, synchronizing and keeping them updated might be better than sharing them.
You probably shouldn't do this. Use a distributed file system, object storage (à la S3 or GCS), or a syncing program like btsync or syncthing.
If you still want to do this yourself, it will be challenging. You are basically building a distributed database and they are difficult to get right.
At first blush you could check out something like etcd or raft, but unfortunately etcd doesn't work well with large files.
You could, on upload, also copy the file to every other server using ssh. But then what happens when a server goes down? Or what happens when two people update the same file at the same time?
Maybe you could design it such that every file gets a unique id (perhaps based on the hash of its contents so you can safely dedupe) and those files can never be updated or deleted, only added. That would solve the simultaneous update problem, but you'd still have the downtime problem.
One approach would be for each server to maintain an append-only version log when a file is added:
VERSION | FILE HASH
1 | abcd123
2 | efgh456
3 | ijkl789
With that, you can pull every file from a server, and a single number is sufficient to know when files have been added. (For example, if you think Server A is on version 5 and you are informed it is now on version 7, you know you need to sync 2 files.)
You could do this with a database table:
ID | LOCAL_SERVER_ID | REMOTE_SERVER_ID | VERSION | FILE HASH
Which you could periodically poll and do your syncing via ssh or http between machines. If a server was down you could just retry until it works.
Or if you didn't want to have a centralized database for this, you could use a library like memberlist. The local metadata for each node could be its version.
Either way, there will be some delay between when a file is uploaded to a single server and when it's available on all of them. Handling that well is hard, which is why you probably shouldn't do this.
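To make the polling idea a bit more concrete, here is a very rough shell sketch of the append-only log approach (not the Go implementation the question asks for; the log path, peer name and media directory are made up, rsync over ssh stands in for the transfer, and paths containing spaces would need extra quoting):
# each server appends "<version> <hash> <relative path>" to /var/cms/versions.log when a file is added
peer=serverB
local_ver=$(wc -l < /var/cms/versions.log)
remote_ver=$(ssh "$peer" 'wc -l < /var/cms/versions.log')
if [ "$remote_ver" -gt "$local_ver" ]; then
  ssh "$peer" "tail -n +$((local_ver + 1)) /var/cms/versions.log" |
  while read -r version hash path; do
    mkdir -p "/var/cms/media/$(dirname "$path")"
    rsync -a "$peer:/var/cms/media/$path" "/var/cms/media/$path" < /dev/null
    echo "$version $hash $path" >> /var/cms/versions.log
  done
fi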
