How can I pipe a tar compression operation to aws s3 cp? - bash

I'm writing a custom backup script in bash for personal use. The goal is to compress the contents of a directory via tar/gzip, split the compressed archive, then upload the parts to AWS S3.
On my first try writing this script a few months ago, I was able to get it working via something like:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 - /mnt/SCRATCH/backup.tgz.part
aws s3 sync /mnt/SCRATCH/ s3://backups/ --delete
rm /mnt/SCRATCH/*
This worked well for my purposes, but required /mnt/SCRATCH to have enough disk space to store the compressed directory. Now I wanted to improve this script to not have to rely on having enough space in /mnt/SCRATCH, and did some research. I ended up with something like:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 --filter "aws s3 cp - s3://backups/backup.tgz.part" -
This almost works, but the target filename on my S3 bucket is not dynamic, and it seems to just overwrite the backup.tgz.part file several times while running. The end result is just one 100MB file, vs the intended several 100MB files with endings like .part0001.
Any guidance would be much appreciated. Thanks!

When using split you can use the environment variable $FILE to get the generated file name.
See split man page:
--filter=COMMAND
write to shell COMMAND; file name is $FILE
For your use case you could use something like the following:
--filter 'aws s3 cp - s3://backups/backup.tgz.part$FILE'
(the single quotes are needed; otherwise the shell would expand $FILE before split runs)
Which will generate the following file names on aws:
backup.tgz.partx0000
backup.tgz.partx0001
backup.tgz.partx0002
...
Full example:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 --filter 'aws s3 cp - s3://backups/backup.tgz.part$FILE' -
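The x in those names is split's default prefix, since no PREFIX argument was given. If you would rather keep the whole name in one place, a small variation (untested, but relying only on $FILE being PREFIX plus suffix) is to hand the prefix to split itself and reference just $FILE in the filter:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 --filter 'aws s3 cp - s3://backups/$FILE' - backup.tgz.part
which should upload parts named backup.tgz.part0000, backup.tgz.part0001, and so on.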

You should be able to get it done quite easily and in parallel using GNU Parallel. It has the --pipe option to split the input data into blocks of size --block and distribute it amongst multiple parallel processes.
So, if you want to use 100MB blocks and use all cores of your CPU in parallel, and append the block number ({#}) to the end of the filename on AWS, your command would look like this:
tar czf - something | parallel --pipe --block 100M --recend '' aws s3 cp - s3://backups/backup.tgz.part{#}
You can use just 4 CPU cores instead of all cores with parallel -j4.
Note that I set the "record end" character to nothing so that it doesn't try to avoid splitting mid-line, which is its default behaviour and better suited to processing text files than binary data like tarballs.
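For example, to cap it at four concurrent uploads, the command above only changes by the added -j4:
tar czf - something | parallel -j4 --pipe --block 100M --recend '' aws s3 cp - s3://backups/backup.tgz.part{#}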

Related

How to tar files with a size limit and write to a remote location?

I need to move a large number of files to S3 with the timestamps intact (ctime, mtime etc. need to be preserved, which means I cannot use the aws s3 sync command), for which I use the following command:
sudo tar -c --use-compress-program=pigz -f - <folder>/ | aws s3 cp - s3://<bucket>/<path-to-folder>/
When trying to create a tar.gz file using the above command, for a folder that is 80+GB, I ran into the following error:
upload failed: - to s3://<bucket>/<path-to-folder>/<filename>.tar.gz An error occurred (InvalidArgument) when calling the UploadPart operation: Part number must be an integer between 1 and 10000, inclusive
Upon researching this, I found that there is a limit of 68GB for tar files (the size of the file-size field in the tar header).
Upon further research, I also found a solution (here) that shows how to create a set of tar.gz files using split:
tar cvzf - data/ | split --bytes=100GB - sda1.backup.tar.gz.
which can later be untarred with:
cat sda1.backup.tar.gz.* | tar xzvf -
However, split has a different signature:
split [OPTION]... [FILE [PREFIX]]
...so the obvious solution:
sudo tar -c --use-compress-program=pigz -f - folder/ | split --bytes=20GB - prefix.tar.gz. | aws s3 cp - s3://<bucket>/<path-to-folder>/
...will not work, since split uses the prefix to name local output files and writes nothing to stdout for aws s3 cp to read.
The question is: is there a way to code this such that I can effectively use a piped solution (i.e., not use additional disk space) and yet get a set of files (called prefix.tar.gz.aa, prefix.tar.gz.ab, etc.) in S3?
Any pointers would be helpful.
--PK
That looks like a non-trivial challenge. Pseudo-code might look like this:
# Start with an empty list
list = ()
counter = 1
foreach file in folder/ do
    if adding file to list exceeds tar or s3 limits then
        # Flush current list of files to S3
        write list to tmpfile
        run tar czf - --files-from=tmpfile | aws s3 cp - s3://<bucket>/<path-to-file>.<counter>
        list = ()
        counter = counter + 1
    end if
    add file to list
end foreach
if list non-empty
    write list to tmpfile
    run tar czf - --files-from=tmpfile | aws s3 cp - s3://<bucket>/<path-to-file>.<counter>
end if
This uses the --files-from option of tar to avoid needing to pass individual files as command arguments and running into limitations there.
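A rough bash sketch of that pseudo-code, assuming GNU find/stat, file names without embedded newlines, and a hypothetical 60GB per-archive budget measured on uncompressed sizes (the destination path is a placeholder):
#!/usr/bin/env bash
set -euo pipefail

LIMIT=$((60 * 1024 * 1024 * 1024))      # per-archive budget in bytes (assumed value)
DEST='s3://<bucket>/<path-to-folder>'   # placeholder destination
tmpfile=$(mktemp)
counter=1
total=0

flush() {
    [ -s "$tmpfile" ] || return 0       # nothing queued yet
    tar -c --use-compress-program=pigz --files-from="$tmpfile" \
        | aws s3 cp - "$DEST/backup.tar.gz.$counter"
    : > "$tmpfile"                      # empty the list
    total=0
    counter=$((counter + 1))
}

while IFS= read -r -d '' file; do
    size=$(stat -c %s "$file")
    # flush the current list before this file would push it over the budget
    if [ $((total + size)) -gt "$LIMIT" ]; then
        flush
    fi
    printf '%s\n' "$file" >> "$tmpfile"
    total=$((total + size))
done < <(find folder/ -type f -print0)

flush                                   # upload whatever is left
rm -f "$tmpfile"
Since the budget is checked against uncompressed sizes, the compressed parts should come in well under it; adjust LIMIT to whatever keeps each part below the tar and S3 limits being hit.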

Ruby output as input for system command

I am trying download a ton of files via gsutil (Google Cloud). You can pass a list of URLs to download:
You can pass a list of URLs (one per line) to copy on stdin instead of as command line arguments by using the -I option. This allows you to use gsutil in a pipeline to upload or download files / objects as generated by a program, such as:
some_program | gsutil -m cp -I gs://my-bucket
How can I do this from Ruby, i.e. from within the program? I tried to output the URLs but that doesn't seem to work.
urls = ["url1", "url2", "url3"]
`echo #{puts urls} | gsutil -m cp -I gs://my-bucket`
Any idea?
A potential workaround would be to save the URLs in a file and use cat file | gsutil -m cp -I gs://my-bucket but that feels like overkill.
Can you try echo '#{urls.join("\n")}' instead? The full call would be:
`echo '#{urls.join("\n")}' | gsutil -m cp -I gs://my-bucket`
If you use puts, it returns nil rather than the string you want, so the interpolation produces an empty string and gsutil receives no URLs.
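For reference, all gsutil needs to see on stdin is a newline-separated list of URLs; the shell-level equivalent of the pipeline (with url1, url2, url3 as placeholders) would be:
printf '%s\n' url1 url2 url3 | gsutil -m cp -I gs://my-bucket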

Split and copy a file from a bucket to another bucket, without downloading it locally

I'd like to split and copy a huge file from a bucket (gs://$SRC_BUCKET/$MY_HUGE_FILE) to another bucket (gs://$DST_BUCKET/), but without downloading the file locally. I expect to do this using only gsutil and shell commands.
I'm looking for something with the same final behaviour as the following commands :
gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE my_huge_file_stored_locally
split -l 1000000 my_huge_file_stored_locally a_split_of_my_file_
gsutil -m mv a_split_of_my_file_* gs://$DST_BUCKET/
But, because I'm executing these actions on a Compute Engine VM with limited disk storage capacity, getting the huge file locally is not possible (and anyway, it seems like a waste of network bandwidth).
The file in this example is split by number of lines (-l 1000000), but I will accept answers if the split is done by number of bytes.
I took a look at the docs about streaming uploads and downloads using gsutil to do something like :
gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE - | split -1000000 | ...
But I can't figure out how to upload split files directly to gs://$DST_BUCKET/, without creating them locally (creating temporarily only 1 shard for the transfer is OK though).
This can't be done without downloading, but you could use range reads to build the pieces without downloading the full file at once, e.g.,
gsutil cat -r 0-10000 gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/file1
gsutil cat -r 10001-20000 gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/file2
...
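If the byte-based split is acceptable, the range reads can be generated in a loop. A hedged sketch, assuming gsutil du reports the object size in bytes as its first field and that -r ranges are inclusive (as the example above suggests):
CHUNK=$((100 * 1024 * 1024))            # split size in bytes (assumed value)
SIZE=$(gsutil du "gs://$SRC_BUCKET/$MY_HUGE_FILE" | awk '{print $1}')

part=0
start=0
while [ "$start" -lt "$SIZE" ]; do
    end=$((start + CHUNK - 1))
    [ "$end" -ge "$SIZE" ] && end=$((SIZE - 1))
    gsutil cat -r "$start-$end" "gs://$SRC_BUCKET/$MY_HUGE_FILE" \
        | gsutil cp - "gs://$DST_BUCKET/part_$(printf '%04d' "$part")"
    start=$((end + 1))
    part=$((part + 1))
done
Nothing is written to local disk; each chunk is streamed straight from one bucket to the other.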

Using GNU Parallel for cluster computing over LAN with rsync

I have two machines, and I want to use GNU Parallel to have multiple processes 'cat' the contents of some text files from both machines.
I have the following setup.
On a local machine, in the same directory, I have the following files:
cmd.sh - a bash file with contents: 'cat "$#"'
test1.txt - a text file with contents: 'Test 1'
test2.txt - a text file with contents: 'Test 2'
test3.txt - a text file with contents: 'Test 3'
nodefile - a text file with the following contents:
2/:
4/ dan@192.168.0.3
This follows the nodefile example from the wordpress link (below); my IP is 192.168.0.2.
None of these files are replicated on the remote machine. I want to have multiple processes 'cat' the contents of each of the test?.txt files from both machines.
Preferably, this:
Wouldn't leave any artifacts on the remote machine
Would leave the contents of the local directory intact.
I have been able to execute multiprocessing commands remotely with the nodefile as per this wordpress example, but none involving file echoing remotely.
So far, I have something like the following:
parallel --sshloginfile nodefile --workdir . --basefile cmd.sh -a cmd.sh --trc ::: test1.txt test2.txt test3.txt
But this isn't working and is removing the files from my directory and not replacing them, as well as giving rsync errors. I (unfortunately) can't provide the errors at the moment, or replicate the setup.
I am very inexperienced with parallel, can anyone guide me on the syntax to accomplish this task? I haven't been able to find the answer (so far) in the man pages or on the web.
Running Ubuntu 16.04 LTS and using latest version of GNU Parallel.
You are making a few mistakes:
-a is used to give an input source. It is basically an alias for ::::
you do not give the command to run; it belongs after the options to GNU Parallel and before the :::
--trc takes an argument (namely the file to transfer back). You do not have a file to transfer back, so use --transfer --cleanup instead.
So:
chmod +x cmd.sh
parallel --sshloginfile nodefile --workdir . --basefile cmd.sh --transfer --cleanup ./cmd.sh ::: test1.txt test2.txt test3.txt
It is unclear if you want to transfer anything to the remote machine, so maybe this is really the correct answer:
parallel --sshloginfile nodefile --nonall --workdir . ./cmd.sh test1.txt test2.txt test3.txt

Can s3cmd be used to download a file and upload to s3 without storing locally?

I'd like to do something like the following but it doesn't work:
wget http://www.blob.com/file | s3cmd put s3://mybucket/file
Is this possible?
I cannot speak for s3cmd, but it's definitely possible.
You can use https://github.com/minio/mc. Minio Client, aka mc, is written in Golang and released under Apache License Version 2.
It implements the mc pipe command, which lets users stream data directly to Amazon S3 from incoming data on a pipe/stdin. mc pipe can also pipe to multiple destinations in parallel. Internally, mc pipe streams the data and does a multipart upload in parallel.
$ mc pipe
NAME:
mc pipe - Write contents of stdin to files. Pipe is the opposite of cat command.
$ mc cat
NAME:
mc cat - Display contents of a file.
Example
#!/bin/bash
mc cat https://s3.amazonaws.com/mybucket/1.txt | mc pipe https://s3-us-west-2.amazonaws.com/mywestbucket/1.txt
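Applied to the wget example from the question, the same idea would look something like this (a sketch; note that wget needs -O - to write to stdout, and the URL and bucket paths are placeholders from the question):
wget -qO - http://www.blob.com/file | mc pipe https://s3.amazonaws.com/mybucket/file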
To answer the question regarding s3cmd: no, it cannot (currently) read from STDIN.
It does support multipart upload and can also stream downloads to STDOUT, but apparently not the other way around.
Piping output from s3cmd works like this:
s3cmd get s3://my-bucket/some_key - | gpg -d | tar -g /dev/null -C / -xvj
Please be aware that there may be an issue with streaming gzip files: https://github.com/s3tools/s3cmd/issues/811
