How can I process every file in my S3 bucket with bash?

I have a bunch of large files (100 MB - 1 GB) in an S3 bucket. I would like to "map" a bash script over all of those files. I can't download all the files at once because my computer does not have enough storage.
Does anyone have an idea how I can do this? Anything smarter than the following solution?
for file in $(aws s3 ls s3://my-bucket | rev | cut -d' ' -f1 | rev); do
    aws s3 cp s3://my-bucket/$file $file
    ./script
    aws s3 cp $file s3://my-bucket/$file
    rm $file
done
Explanation:
$(aws s3 ls s3://my-bucket | rev | cut -d' ' -f1 | rev) gets the name of the files in the bucket over which we are iterating
aws s3 cp s3://my-bucket/$file $file downloads a single file
./script runs my custom script
aws s3 cp $file s3://my-bucket/$file overwrites the old file with the new one
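To illustrate what the rev | cut -d' ' -f1 | rev part does (aws s3 ls prints roughly date, time, size, key per line; the listing line below is made up):
echo "2024-01-01 12:00:00  104857600 some-file.bin" | rev | cut -d' ' -f1 | rev
# -> some-file.bin, i.e. the last space-separated column (the key)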

Interesting approach to reverse and reverse again - I would never have thought of that, but now I might one day.
Taking the context as bash as per the title. Depending on what your script does, you could use other services like Glue for a "smarter"/faster but more complex solution.
I suppose you don't have any "folders" (prefixes) in your bucket. A more configurable and probably more reliable way to get the list to process could be to use the s3api, e.g.:
objects=$(aws s3api list-objects \
    --bucket my-bucket \
    --output json \
    --query "Contents[].Key")

for file in $(echo $objects | jq -r '.[]'); do
    aws s3 cp s3://my-bucket/$file $file
    ./script
    aws s3 cp $file s3://my-bucket/$file
    rm $file
done
This assumes you have jq installed.
The above doesn't need handling of "folders". You can also do some nicer things like choose a "subfolder" to process, e.g.:
--prefix path/to/process/ \
or filter on a file extension, e.g.:
--query "Contents[?contains(Key, '.mp4')].Key")
It might be nice to use prefixes so you can handle restarts of the script, or list limits being hit (e.g. the default --page-size of 1000 for aws s3 ls).
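Putting those together, a sketch of the listing call with both a prefix and an extension filter (the prefix path is just the placeholder from above):
objects=$(aws s3api list-objects \
    --bucket my-bucket \
    --prefix path/to/process/ \
    --output json \
    --query "Contents[?contains(Key, '.mp4')].Key")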

Related

Shell script to fetch S3 bucket size with AWS CLI

I have this script that fetches all the buckets in AWS along with their sizes. When I run the script it fetches the buckets, but the loop that fetches the sizes throws an error. Can someone point out where I am going wrong here? Because when I run the AWS CLI commands for an individual bucket, they fetch the size without any issues.
The desired output would be as below, but for all the buckets; here I have fetched it for one bucket.
Desired output:
aws --profile aws-stage s3 ls s3://<bucket> --recursive --human-readable --summarize | awk END'{print}'
Total Size: 75.1 KiB
Error:
Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
Script:
#!/bin/bash
aws_profile=('aws-stage' 'aws-prod')

# loop over AWS profiles
for i in "${aws_profile[@]}"; do
    echo "${i}"
    buckets=$(aws --profile "${i}" s3 ls s3:// --recursive | awk '{print $3}')
    # loop over S3 buckets
    for j in "${buckets[@]}"; do
        echo "${j}"
        aws --profile "${i}" s3 ls s3://"${j}" --recursive --human-readable --summarize | awk END'{print}'
    done
done
Try this:
#!/bin/bash
aws_profiles=('aws-stage' 'aws-prod')

for profile in "${aws_profiles[@]}"; do
    echo "$profile"
    read -rd "\n" -a buckets <<< "$(aws --profile "$profile" s3 ls | cut -d " " -f3)"
    for bucket in "${buckets[@]}"; do
        echo "$bucket"
        aws --profile "$profile" s3 ls s3://"$bucket" --human-readable --summarize | awk END'{print}'
    done
done
The problem was that your buckets variable was a single string rather than an array.
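If you'd rather not deal with the array at all, a minimal alternative sketch (same profiles, untested) that streams the bucket names through a while read loop:
#!/bin/bash
aws_profiles=('aws-stage' 'aws-prod')

for profile in "${aws_profiles[@]}"; do
    echo "$profile"
    # one bucket name per line; no intermediate variable or array needed
    aws --profile "$profile" s3 ls | awk '{print $3}' | while read -r bucket; do
        echo "$bucket"
        aws --profile "$profile" s3 ls s3://"$bucket" --human-readable --summarize | awk 'END{print}'
    done
done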

Using makefile to download a file from AWS to local

I want to set up a target which downloads the latest S3 file containing _id_config within a path. So I know I can get the name of the file I am interested in with:
FILE=$(shell aws s3 ls s3:blah//xyz/mno/here --recursive | sort | tail -n 2 | awk '{print $4}' | grep id_config)
Now, I want to download the file to local with something like
download_stuff:
aws s3 cp s3://prod_an.live.data/$FILE .
But when I run this, my $FILE has some extra stuff like
aws s3 cp s3://blah/2022-02-17 16:02:21 2098880 blah//xyz/mno/here54fa8c68e41_id_config.json .
Unknown options: 2098880,blah/xyz/mno/here54fa8c68e41_id_config.json,.
Please can someone help me understand why 2098880 and the spaces are there in the output and how to resolve this. Thank you in advance.
Suggesting a trick with ls options -1 and -t to get the latest files in a folder:
FILE=$(shell aws s3 ls -1t s3:blah//xyz/mno/here |head -n 2 | grep id_config)
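Another thing worth checking, though this is only a guess since the full Makefile isn't shown: inside a Makefile, make expands $4 before the shell runs, so awk '{print $4}' reaches awk as awk '{print }' and prints the whole line, which would explain the date, size and key all showing up. Escaping it as $$4 (and referencing the variable as $(FILE) in the recipe) passes everything through intact; a sketch with a made-up bucket path:
FILE=$(shell aws s3 ls s3://my-bucket/some/path/ --recursive | sort | tail -n 2 | awk '{print $$4}' | grep id_config)

download_stuff:
	aws s3 cp s3://my-bucket/$(FILE) .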

aws s3 ls - how to recursively list objects with bash script avoid pagination error

I have on-premise, AWS-S3-like storage. I need to list all the files in a specific bucket. When I do it at the top of the bucket I get this error:
Error during pagination: The same next token was received twice:{'ContinuationToken':"file path"}
I think it happens when too many objects need to be listed. Something is wrong on the storage side, but there is no cure for that right now.
I did a workaround for that and ran s3 ls in a bash while loop. I managed to prepare a simple loop for a different bucket where I have far fewer objects. That loop operated deep inside the tree, where I knew how many directories I had.
./aws --profile us-bucket --endpoint-url https://endpoint:18082 --no-verify-ssl s3 ls us-bucket/dir1/dir2/dir3/dir4/dir5/dir6/ | tr -s ' ' | tr '/' ' ' | awk '{print $2}' | while read line0; do
    ./aws --profile us-bucket --endpoint-url https://endpoint:18082 --no-verify-ssl s3 ls us-bucket/dir1/dir2/dir3/dir4/dir5/dir6/${line0}/ | tr -s ' ' | tr '/' ' ' | awk '{print $2}' | while read line1; do
        ./aws --profile us-bucket --endpoint-url https://endpoint:18082 --no-verify-ssl s3 ls us-bucket/dir1/dir2/dir3/dir4/dir5/dir6/${line0}/${line1}/ | tr -s ' ' | tr '/' ' ' | awk '{print $2}' | while read line2; do
            ./aws --profile us-bucket --endpoint-url https://endpoint:18082 --no-verify-ssl s3 ls --recursive us-bucket/dir1/dir2/dir3/dir4/dir5/dir6/${line0}/${line1}/${line2}/
        done
    done
done > /tmp/us-bucket/us-bucket_dir2_dir3_dir4_dir5_dir6.txt
I would like to write a loop which goes from the top or root (whichever you prefer) and lists all files, no matter how many directories are on the path, working from the last directory in the path upwards, to avoid hitting:
Error during pagination: The same next token was received twice:{'ContinuationToken':"file path"}
Any help/clues appreciated. Thanks.
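One way to generalise the workaround, sketched as an untested idea (profile and endpoint flags copied from the loop above, output path made up): a recursive function that descends one "PRE" level at a time, so each individual aws s3 ls call only lists the direct children of one prefix:
list_prefix() {
    local prefix="$1"
    ./aws --profile us-bucket --endpoint-url https://endpoint:18082 --no-verify-ssl \
        s3 ls "us-bucket/${prefix}" | while read -r col1 col2 rest; do
        if [ "$col1" = "PRE" ]; then
            # "PRE subdir/" lines are sub-prefixes: recurse into them
            list_prefix "${prefix}${col2}"
        else
            # object lines are "date time size key": print the full key
            echo "${prefix}${rest#* }"
        fi
    done
}

list_prefix "dir1/dir2/dir3/" > /tmp/us-bucket/listing.txt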

Delete files after awk command

I'm trying to do an ls on a bucket, print the folder names, remove the /, sort them, and take the last 3, which are the most recent. Then I want to remove all the folders except for those 3 recent ones.
for i in $(aws s3 ls s3://portal-storage-site | awk -F '-' '{print $2}' | sed 's/\///g' | sort -n | tail -3 | xargs | sed 's/ /|/g'); do
    aws s3 ls s3://portal-storage-site | grep -Ev "PRE\s.*\-($i)\/" | awk '{print $2}' | xargs echo "aws s3 ls s3://portal-storage-site/"
done
I expect the output to be:
aws s3 ls s3://portal-storage-site/2e5d0599-120/
aws s3 ls s3://portal-storage-site/6f08a223-118/
aws s3 ls s3://portal-storage-site/ba67667e-121/
aws s3 ls s3://portal-storage-site/ba67667e-122/
but the actual output is:
aws s3 ls s3://portal-storage-site/2e5d0599-119/ 2e5d0599-120/ 6f08a223-118/ ba67667e-121/ ba67667e-122/
Instead of using xargs (which passes all the folder names as arguments to a single echo, producing the one long line you got), you can compose your second aws s3 ls command in awk and send it to bash:
aws s3 ls s3://portal-storage-site| grep -Ev "PRE\s.*\-($i)\/" | awk '{print "aws s3 ls s3://portal-storage-site/" $2}'| bash
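Once the printed commands look right, the same awk-composed pattern could presumably be pointed at aws s3 rm instead of aws s3 ls (only a sketch; review the generated commands carefully before appending | bash, since this deletes data):
aws s3 ls s3://portal-storage-site | grep -Ev "PRE\s.*\-($i)\/" | awk '{print "aws s3 rm --recursive s3://portal-storage-site/" $2}'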

How can I pipe a tar compression operation to aws s3 cp?

I'm writing a custom backup script in bash for personal use. The goal is to compress the contents of a directory via tar/gzip, split the compressed archive, then upload the parts to AWS S3.
On my first try writing this script a few months ago, I was able to get it working via something like:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 - /mnt/SCRATCH/backup.tgz.part
aws s3 sync /mnt/SCRATCH/ s3://backups/ --delete
rm /mnt/SCRATCH/*
This worked well for my purposes, but required /mnt/SCRATCH to have enough disk space to store the compressed directory. Now I wanted to improve this script to not have to rely on having enough space in /mnt/SCRATCH, and did some research. I ended up with something like:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 --filter "aws s3 cp - s3://backups/backup.tgz.part" -
This almost works, but the target filename on my S3 bucket is not dynamic, and it seems to just overwrite the backup.tgz.part file several times while running. The end result is just one 100MB file, vs the intended several 100MB files with endings like .part0001.
Any guidance would be much appreciated. Thanks!
When using split you can use the environment variable $FILE to get the generated file name.
See split man page:
--filter=COMMAND
write to shell COMMAND; file name is $FILE
For your use case you could use something like the following:
--filter 'aws s3 cp - s3://backups/backup.tgz.part$FILE'
(the single quotes are needed, otherwise the environment variable substitution will happen immediately)
Which will generate the following file names on aws:
backup.tgz.partx0000
backup.tgz.partx0001
backup.tgz.partx0002
...
Full example:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 --filter 'aws s3 cp - s3://backups/backup.tgz.part$FILE' -
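A quick way to see the quoting difference from a shell prompt (assuming FILE is not otherwise set in your environment):
echo "s3://backups/backup.tgz.part$FILE"   # double quotes: the shell expands $FILE now, leaving just ...backup.tgz.part
echo 's3://backups/backup.tgz.part$FILE'   # single quotes: the literal $FILE survives, and split substitutes it per chunk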
You should be able to get it done quite easily and in parallel using GNU Parallel. It has the --pipe option to split the input data into blocks of size --block and distribute it amongst multiple parallel processes.
So, if you want to use 100MB blocks and use all cores of your CPU in parallel, and append the block number ({#}) to the end of the filename on AWS, your command would look like this:
tar czf - something | parallel --pipe --block 100M --recend '' aws s3 cp - s3://backups/backup.tgz.part{#}
You can use just 4 CPU cores instead of all cores with parallel -j4.
Note that I set the "record end" character to nothing so that it doesn't try to avoid splitting mid-line which is its default behaviour and better suited to text file processing than binary files like tarballs.
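For completeness, a sketch of the restore side, assuming the split-generated part names like backup.tgz.partx0000 shown above (they sort lexicographically, so a plain glob concatenates them in the right order):
aws s3 cp s3://backups/ . --recursive --exclude "*" --include "backup.tgz.part*"
cat backup.tgz.partx* | tar -xzf -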
