Can s3cmd be used to download a file and upload to s3 without storing locally?

I'd like to do something like the following but it doesn't work:
wget http://www.blob.com/file | s3cmd put s3://mybucket/file
Is this possible?

I can't speak for s3cmd, but it's definitely possible.
You can use https://github.com/minio/mc . Minio Client, aka mc, is written in Go and released under the Apache License, Version 2.0.
It implements an mc pipe command that lets you stream data directly to Amazon S3 from incoming data on a pipe/os.Stdin. mc pipe can also pipe to multiple destinations in parallel. Internally, mc pipe streams the data and does a multipart upload in parallel.
$ mc pipe
NAME:
mc pipe - Write contents of stdin to files. Pipe is the opposite of cat command.
$ mc cat
NAME:
mc cat - Display contents of a file.
Example
#!/bin/bash
mc cat https://s3.amazonaws.com/mybucket/1.txt | mc pipe https://s3-us-west-2.amazonaws.com/mywestbucket/1.txt
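Applied to the original question, the HTTP download can be streamed straight into mc pipe so nothing touches local disk; a minimal sketch using the URL and bucket placeholders from the question:
#!/bin/bash
wget -qO- http://www.blob.com/file | mc pipe https://s3.amazonaws.com/mybucket/file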

To answer the question regarding s3cmd: No, it can not (currently) read from STDIN.
It does support multi-part upload and can stream downloads to STDOUT, but apparently not the other way around.
Piping output from s3cmd works like this:
s3cmd get s3://my-bucket/some_key - | gpg -d | tar -g /dev/null -C / -xvj
Please be aware that there may be an issue with streaming gzip files: https://github.com/s3tools/s3cmd/issues/811
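If the AWS CLI is an option instead, note that aws s3 cp accepts - as the source to read from STDIN, so the pipeline from the question could be written as a sketch like this (same placeholder URL and bucket):
wget -qO- http://www.blob.com/file | aws s3 cp - s3://mybucket/file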

Ruby output as input for system command

I am trying to download a ton of files via gsutil (Google Cloud). You can pass it a list of URLs to download:
You can pass a list of URLs (one per line) to copy on stdin instead of as command line arguments by using the -I option. This allows you to use gsutil in a pipeline to upload or download files / objects as generated by a program, such as:
some_program | gsutil -m cp -I gs://my-bucket
How can I do this from Ruby, i.e. from within the program itself? I tried to output the URLs, but that doesn't seem to work.
urls = ["url1", "url2", "url3"]
`echo #{puts urls} | gsutil -m cp -I gs://my-bucket`
Any idea?
A potential workaround would be to save the URLs in a file and use cat file | gsutil -m cp -I gs://my-bucket but that feels like overkill.
Can you try echo '#{urls.join("\n")}' instead?
If you use puts, it returns nil rather than the string you want, so the interpolation inserts nothing.
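For reference, the shell pipeline the backticks ultimately need to run looks like this; a sketch with placeholder URLs printed one per line:
printf '%s\n' "url1" "url2" "url3" | gsutil -m cp -I gs://my-bucket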

How can I pipe a tar compression operation to aws s3 cp?

I'm writing a custom backup script in bash for personal use. The goal is to compress the contents of a directory via tar/gzip, split the compressed archive, then upload the parts to AWS S3.
On my first try writing this script a few months ago, I was able to get it working via something like:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 - /mnt/SCRATCH/backup.tgz.part
aws s3 sync /mnt/SCRATCH/ s3://backups/ --delete
rm /mnt/SCRATCH/*
This worked well for my purposes, but it required /mnt/SCRATCH to have enough disk space to store the compressed directory. Now I want to improve the script so it doesn't rely on having enough space in /mnt/SCRATCH. After some research, I ended up with something like:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 --filter "aws s3 cp - s3://backups/backup.tgz.part" -
This almost works, but the target filename in my S3 bucket is not dynamic; it just overwrites the backup.tgz.part object repeatedly while running. The end result is a single 100MB file, instead of the intended series of 100MB files with endings like .part0001.
Any guidance would be much appreciated. Thanks!
When using split with --filter, you can use the environment variable $FILE to get the generated file name.
See split man page:
--filter=COMMAND
write to shell COMMAND; file name is $FILE
For your use case you could use something like the following:
--filter 'aws s3 cp - s3://backups/backup.tgz.part$FILE'
(the single quotes are needed, otherwise the environment variable substitution will happen immediately)
This will generate the following file names on AWS:
backup.tgz.partx0000
backup.tgz.partx0001
backup.tgz.partx0002
...
Full example:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 --filter 'aws s3 cp - s3://backups/backup.tgz.part$FILE' -
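For completeness, a restore-time sketch assuming the backup.tgz.partx0000-style names produced above (bucket as in the question): list the parts in order, stream each one, and un-tar the concatenated stream without ever assembling a local copy of the archive.
aws s3 ls s3://backups/ | awk '{print $4}' | grep '^backup.tgz.part' | sort |
  while read -r part; do
    aws s3 cp "s3://backups/$part" - </dev/null   # </dev/null keeps the loop's stdin free for read
  done | tar -xzf -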
You should be able to get it done quite easily and in parallel using GNU Parallel. It has the --pipe option to split the input data into blocks of size --block and distribute it amongst multiple parallel processes.
So, if you want to use 100MB blocks and use all cores of your CPU in parallel, and append the block number ({#}) to the end of the filename on AWS, your command would look like this:
tar czf - something | parallel --pipe --block 100M --recend '' aws s3 cp - s3://backups/backup.tgz.part{#}
You can use just 4 CPU cores instead of all cores with parallel -j4.
Note that I set the "record end" character to nothing so that it doesn't try to avoid splitting mid-line, which is its default behaviour and better suited to processing text files than binary files like tarballs.

Split and copy a file from a bucket to another bucket, without downloading it locally

I'd like to split and copy a huge file from a bucket (gs://$SRC_BUCKET/$MY_HUGE_FILE) to another bucket (gs://$DST_BUCKET/), but without downloading the file locally. I expect to do this using only gsutil and shell commands.
I'm looking for something with the same final behaviour as the following commands :
gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE my_huge_file_stored_locally
split -l 1000000 my_huge_file_stored_locally a_split_of_my_file_
gsutil -m mv a_split_of_my_file_* gs://$DST_BUCKET/
But, because I'm executing these actions on a Compute Engine VM with limited disk storage capacity, getting the huge file locally is not possible (and anyway, it seems like a waste of network bandwidth).
The file in this example is split by number of lines (-l 1000000), but I will accept answers if the split is done by number of bytes.
I took a look at the docs about streaming uploads and downloads using gsutil to do something like :
gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE - | split -l 1000000 | ...
But I can't figure out how to upload split files directly to gs://$DST_BUCKET/, without creating them locally (creating temporarily only 1 shard for the transfer is OK though).
This can't be done without downloading, but you could use range reads to build the pieces without downloading the full file at once, e.g.,
gsutil cat -r 0-10000 gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/file1
gsutil cat -r 10001-20000 gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/file2
...
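If you want fixed-size byte splits instead, here is a sketch of a loop that derives the ranges from the object's size (the 100 MiB chunk size and part_ names are arbitrary placeholders; the bucket/file variables are from the question):
#!/bin/bash
chunk=$((100 * 1024 * 1024))   # bytes per piece
size=$(gsutil stat "gs://$SRC_BUCKET/$MY_HUGE_FILE" | awk '/Content-Length/ {print $2}')
start=0; i=0
while [ "$start" -lt "$size" ]; do
  end=$((start + chunk - 1))
  gsutil cat -r "$start-$end" "gs://$SRC_BUCKET/$MY_HUGE_FILE" \
    | gsutil cp - "gs://$DST_BUCKET/part_$(printf '%04d' "$i")"
  start=$((end + 1)); i=$((i + 1))
done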

how to unzip the latest file only

I am downloading a daily FTP file with the following command:
wget -mN --ftp-user=myuser --ftp-password=mypassword ftp://ftp2.link.com/ -P /home/usr/public_html/folder/folder2
My files are named like this:
Data_69111232_2016-01-29.zip
Data_69111232_2016-01-28.zip
Data_69111232_2016-01-27.zip
Can you please let me know how I can extract only the latest downloaded file?
Usually I use the following command to unzip a file, but I don't know what I should add to extract only the latest one:
unzip -o /home/user/public_html/folder/folder2/ftp2.directory/????.zip -d /home/user/public_html/folder/folder2/
Your help is really appreciated.
Thanks in advance!
Updated Answer
I thought your question was about FTP, but it seems to be about finding the newest file to unzip.
You can get the newest file like this:
newest=$(ls -t /home/user/public_html/folder/folder2/ftp2.directory/*zip | head -1)
and see the value like this:
echo $newest
and use it like this:
unzip -o "$newest" ...
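Alternatively, since the date is embedded in the filenames in ISO format (Data_<id>_YYYY-MM-DD.zip), a plain lexical sort picks the newest without relying on modification times; a sketch using the naming from the question:
newest=$(ls /home/user/public_html/folder/folder2/ftp2.directory/Data_*_????-??-??.zip | sort | tail -1)
unzip -o "$newest" -d /home/user/public_html/folder/folder2/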
Original Answer
You can probably string something together using lftp. For example, I can get a listing in reverse time order with the newest file at the bottom like this:
lftp -e 'cd path/to/daily/file; ls -lrt; bye' -u user,password host | tail -1
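Building on that listing, a sketch that pulls out the newest remote filename and fetches only that file (host, credentials, and path are the placeholders above; assumes no spaces in the name):
newest=$(lftp -e 'cd path/to/daily/file; ls -lrt; bye' -u user,password host | tail -1 | awk '{print $NF}')
lftp -e "cd path/to/daily/file; get $newest; bye" -u user,password host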

How to combine 3 commands into a single process for runit to monitor?

I wrote a script that grabs a set of parameters from two sources using wget, stores them in variables, and then runs a video transcoding process based on the retrieved parameters. Runit was installed to monitor the process.
The problem is that when I try to stop the process, runit doesn't know that only the last transcoding process needs to be stopped, so it fails to stop it.
How can I combine all the commands in bash script to act as a single process/app?
The commands are something as follows:
wget address/id.html
res=$(cat res_id | grep id.html)
wget address/length.html
time=$(cat length_id | grep length.html)
/root/bin -i video1.mp4 -s $res.....................
Try wrapping them in a shell:
sh -c '
wget address/id.html
res=$(grep id.html res_id)
wget address/length.html
time=$(grep length.html length_id)
/root/bin -i video1.mp4 -s $res.....................
'
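With runit specifically, it also helps to exec the final command so that runsv signals the transcoder itself rather than an intermediate shell; a sketch of a run script along those lines (addresses, paths, and the elided options are the question's placeholders):
#!/bin/sh
# runit run script: the setup runs in this shell, then exec replaces the shell
# with the transcoder so that sv stop / TERM reaches the right process.
wget address/id.html
res=$(grep id.html res_id)
wget address/length.html
time=$(grep length.html length_id)
exec /root/bin -i video1.mp4 -s "$res" ...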
