Copy files incrementally from S3 to EBS storage using filters - bash

I wish to move a large set of files from an AWS S3 bucket in one AWS account (source), with systematic filenames following this pattern:
my_file_0_0_0.csv
...
my_file_0_7_200.csv
into an S3 bucket in another AWS account (target).
These need to be moved by an EC2 instance (to overcome IAM access restrictions) to an attached EBS volume incrementally (to overcome storage limitations).
Clarification:
In the filenames there are 3 numbers separated by underscores, like so: _a_b_c, where a is always 0, b starts at 0 and goes up to 7, and c goes from 0 up to at most 200 (it is not guaranteed to always reach 200).
(I have an SSH session to the EC2 instance through PuTTY.)
1st iteration:
So what I am trying to do in the first iteration is to copy all files from S3 that have a name matching the pattern my_file_0_0_*.csv.
This can be done with the command:
aws s3 cp s3://my_source_bucket_name/my_folder/ . --recursive --exclude "*" --include "my_file_0_0_*" --profile source_user
From here, I upload the files to my target bucket with the command:
aws s3 cp . s3://my_target_bucket_name/my_folder/ --recursive --profile source_user
And finally I delete the files from the EC2 instance's EBS volume with rm *.
2nd iteration:
aws s3 cp s3://my_source_bucket_name/my_folder/ . --recursive --exclude "*" --include "my_file_0_1_*" --profile source_user
This time, I only get some of the files with pattern my_file_0_1_*, as their combined file size reaches 100 GiB, which is the limit of my EBS volume.
Here I run into the issue that the filenames are sorted alphabetically and not numerically by the digits in their names, e.g.:
my_file_0_1_0.csv
my_file_0_1_1.csv
my_file_0_1_10.csv
my_file_0_1_100.csv
my_file_0_1_101.csv
my_file_0_1_102.csv
my_file_0_1_103.csv
my_file_0_1_104.csv
my_file_0_1_105.csv
my_file_0_1_106.csv
my_file_0_1_107.csv
my_file_0_1_108.csv
my_file_0_1_109.csv
my_file_0_1_11.csv
After moving them to the target S3 bucket and removing them from EBS,
the challenge is to move the remaining files with pattern my_file_0_1_* in a systematic way. Is there a way to achieve this, e.g. by using find, grep, awk or similar?
And do I need to cast some filename slices to integers first?

You can use the sort -V command to sort the filenames in natural (version) order, and then invoke the copy command on each file one by one, or on a list of files at a time.
ls | sort -V
If you're on a GNU system, you can also use ls -v. This won't work on macOS.
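Putting that together with your copy/upload/delete cycle, a rough sketch could look like the following. The bucket names, folder and profile are the ones from your question; BATCH_SIZE and the batching via split are my assumptions, to be tuned to what actually fits on the 100 GiB volume:
BATCH_SIZE=50                                   # assumption: files per batch
SRC="s3://my_source_bucket_name/my_folder"
DST="s3://my_target_bucket_name/my_folder"
# List the matching keys, keep only the filename column, and sort them
# naturally (-V) so that my_file_0_1_2.csv comes before my_file_0_1_10.csv.
aws s3 ls "$SRC/" --profile source_user | awk '{print $4}' | grep '^my_file_0_1_' | sort -V > all_keys.txt
# Work through the list BATCH_SIZE files at a time: download, upload, free the EBS space.
split -l "$BATCH_SIZE" all_keys.txt batch_
for batch in batch_*; do
  while read -r key; do
    aws s3 cp "$SRC/$key" . --profile source_user
  done < "$batch"
  aws s3 cp . "$DST/" --recursive --exclude "*" --include "my_file_0_1_*" --profile source_user
  rm -f my_file_0_1_*.csv
done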

Related

How to list all objects in an S3 bucket having a specific character in the key using a shell script

I have an S3 bucket and below is the directory structure:
bucketname/uid=/year=/month=/day=/files.parquet
In some cases, inside the year directory, I have some temporary objects created by Athena, e.g.:
month=11_$folder$
I want to remove all of these files whose key is month=11_$folder$.
Currently I am doing this in a loop for all uids. Is there any faster way to do that?
Using the AWS CLI list-objects-v2 command, you can search for patterns:
aws s3api list-objects-v2 \
--bucket my-bucket \
--query 'Contents[?contains(Key, `month=11_$folder$`)]'
Note that this still lists all your objects and only filters what is returned, so if you have more than 1,000 objects in your bucket, you'll need to paginate.
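To actually remove the matches instead of just listing them, a rough sketch along the same lines (same placeholder bucket name, pagination ignored for brevity) could be:
aws s3api list-objects-v2 \
  --bucket my-bucket \
  --query 'Contents[?contains(Key, `month=11_$folder$`)].Key' \
  --output text \
| tr '\t' '\n' \
| while read -r key; do
    aws s3api delete-object --bucket my-bucket --key "$key"   # delete each matching key
  done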

s3 awk bash pipeline

Following this question, Splitting out a large file, I would like to stream large gzipped files from an Amazon s3:// bucket and process them with an awk command.
Sample file to process
...
{"captureTime": "1534303617.738","ua": "..."}
...
Script to optimize
aws s3 cp s3://path/to/file.gz - \
| gzip -d \
| awk -F'"' '{date=strftime("%Y%m%d%H",$4); print > "splitted."date }'
gzip splitted.*
# make some visual checks here before copying to S3
aws s3 cp splitted.*.gz s3://path/to/splitted/
Do you think I can wrap everything in the same pipeline to avoid writing files locally?
I can use the approach from Using gzip to compress files to transfer with aws command to gzip and copy on the fly, but gzipping inside awk would be great.
Thank you.
It took me a bit to understand that your pipeline creates one "splitted.<date>" file for each distinct hour found in the source file. Since shell pipelines operate on byte streams and not files, while S3 operates on files (objects), you must turn your byte stream into a set of files on local storage before sending them back to S3. So a pipeline by itself won't suffice.
But I'll ask: what's the larger purpose you're trying to accomplish?
You're on the path to generating lots of S3 objects, one for each hour in your "large gzipped files". Is this using S3 as a key-value store? Is this the best design for the goal of your effort? In other words, is S3 the best repository for this information, or is there some other store (DynamoDB, or another NoSQL database) that would be a better solution?
All the best
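As for "gzipping inside awk": you still end up with local files, but with GNU awk you can pipe each line straight into a gzip process keyed by its hour, which at least drops the separate gzip step. A rough sketch, assuming the same field layout as your script (the output file names and the close() bookkeeping are my additions):
aws s3 cp s3://path/to/file.gz - \
| gzip -d \
| awk -F'"' '{
    date = strftime("%Y%m%d%H", $4)
    cmd  = "gzip -c > splitted." date ".gz"   # one gzip pipe per hour bucket
    print | cmd                               # gawk keeps the pipe open across lines
    pipes[cmd] = 1
  }
  END { for (c in pipes) close(c) }'          # flush and close every gzip pipe
# then, instead of a local glob, let the CLI pick up the compressed files:
aws s3 cp . s3://path/to/splitted/ --recursive --exclude "*" --include "splitted.*.gz"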
Two possible optimizations, sketched below:
On large and multiple files it will help to use all the cores to gzip the files; use xargs, pigz or GNU parallel (see Gzip with all cores).
Parallelize the S3 upload:
https://github.com/aws-samples/aws-training-demo/tree/master/course/architecting/s3_parallel_upload
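For example, a sketch of both ideas with xargs (the parallelism levels, and pigz as an alternative, are assumptions; plain gzip works too):
ls splitted.* | xargs -P "$(nproc)" -n 1 gzip            # one gzip process per core
# or, if pigz is installed:  pigz splitted.*
ls splitted.*.gz | xargs -P 4 -I {} aws s3 cp {} s3://path/to/splitted/   # a few uploads in parallel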

Bash script AWS S3 bucket delete all the files using their names containing

I'm trying to remove only the files that are older than 5 days, according to the date in file names containing "DITN1_" and "DITS1_", using a bash script against the AWS S3 bucket. The issue is that all the files I'm trying to delete look as follows:
DITN1_2016.12.01_373,
DITS1_2012.10.10_141,
DITN1_2016.12.01_3732,
DITS1_2012.10.10_1412
If someone could help me out with the code, that would be nice.
Thanks in advance.
You can use the AWS CLI for deleting things from a bash script as follows:
aws s3 rm s3://mybucket/ --recursive --exclude "*" --include "DITN1_*" --include "DITS1_*"
However, it does not support filtering by timestamp.
For details see the aws s3 CLI documentation.
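If you go this route, it can be worth previewing what would be removed first; the high-level s3 commands accept a --dryrun flag:
aws s3 rm s3://mybucket/ --recursive --exclude "*" --include "DITN1_*" --include "DITS1_*" --dryrun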
Is it important to use the name of the objects instead of metadata? You could get a list of objects in the bucket using the s3api:
aws s3api list-objects --bucket example --no-paginate # this last option will avoid pagination, don't use it if you have thousands of objects
Adding
--query Contents[]
will give you back the metadata of every object, including a LastModified field, which tells you when the object was last modified, for example "2016-12-16T13:56:23.000Z".
http://docs.aws.amazon.com/cli/latest/reference/s3api/list-objects.html
You could change this timestamp to epoch using
date "+%s" -d "put the timestamp here"
and compare it with the current time minus 5 days.
Or, if you really want to delete objects based on name, you could loop over the keys like this:
for key in $(aws s3api list-objects --bucket example --no-paginate --query Contents[].Key --output text)
And add logic to determine the date. Something like this might work, judging by your examples:
key_without_prefix=${key#*_}
key_without_suffix=${key_without_prefix%_*}
Then you have your date, which you can compare with the current time - 5 days.
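Putting those pieces together, a rough sketch of the name-based variant (the bucket name, the DITN1_/DITS1_ prefixes and the YYYY.MM.DD date format are assumptions based on the examples in the question; GNU date is assumed):
CUTOFF=$(date -d "5 days ago" +%s)                       # epoch seconds, 5 days back
for key in $(aws s3api list-objects --bucket example --no-paginate \
               --query 'Contents[].Key' --output text); do
  case "$key" in
    DITN1_*|DITS1_*) ;;                                  # only consider the relevant prefixes
    *) continue ;;
  esac
  key_without_prefix=${key#*_}                           # e.g. 2016.12.01_373
  file_date=${key_without_prefix%_*}                     # e.g. 2016.12.01
  file_epoch=$(date -d "${file_date//./-}" +%s)          # 2016.12.01 -> 2016-12-01 -> epoch
  if [ "$file_epoch" -lt "$CUTOFF" ]; then
    aws s3 rm "s3://example/$key"                        # older than 5 days, delete it
  fi
done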

How do I Copy the same AMI to multiple regions simultaneously?

I am trying to find a way to perform a simultaneous copy of an AMI to all other regions.
I have searched near and far, but besides seeing in a blog post that it can be done, I haven't found a way using the AWS CLI...
https://aws.amazon.com/blogs/aws/ec2-ami-copy-between-regions/
Currently I have written a bash script to do so, but I would like to find a better, easier way to do it.
I have 8 AMIs that need to be copied to all regions.
Using an array:
declare -a DEST=('us-east-1' ...2....3)
aws ec2 copy-image --source-region $SRC --region ${DEST[@]} --source-image-id $ami
Do you guys have any other suggestion?
Thanks.
You can make it a bash one-liner, which is especially useful if new regions appear in the future:
aws ec2 describe-regions --output text | \
cut -f 3 | \
xargs -I {} aws ec2 copy-image \
  --source-region $SRC \
  --region {} \
  --source-image-id $ami
Basically it goes like this:
aws ec2 describe-regions --output text returns the list of all available regions for EC2; it's a 3-column table ("REGIONS", endpoint, region-name).
cut -f 3 takes the 3rd column of the previous table (read as a list).
xargs substitutes the current region into {} so it can be passed to the --region parameter of the copy-image command.
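To cover all 8 AMIs in one go, a rough sketch along the same lines (the AMI IDs are placeholders, and reusing the AMI ID in the required --name of the copy is my assumption):
SRC="us-east-1"                                          # source region (placeholder)
AMIS="ami-11111111 ami-22222222"                         # your 8 AMI IDs (placeholders)
REGIONS=$(aws ec2 describe-regions --output text | cut -f 3)
for ami in $AMIS; do
  for region in $REGIONS; do
    [ "$region" = "$SRC" ] && continue                   # skip the source region
    aws ec2 copy-image \
      --source-region "$SRC" \
      --source-image-id "$ami" \
      --region "$region" \
      --name "$ami-copy" &                               # copy-image requires --name; run copies in the background
  done
done
wait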

Amazon S3 Command Line Copy all objects to themselves setting Cache control

I have an Amazon S3 bucket with about 300K objects in it and need to set the Cache-control header on all of them. Unfortunately it seems like the only way to do this, besides one at a time, is by copying the objects to themselves and setting the cache control header that way:
http://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
is the documentation for the Amazon S3 CLI copy command, but I have been unsuccessful in setting the Cache-Control header using it. Does anyone have an example command that would work for this? I am trying to set Cache-Control to max-age=1814400.
Some background material:
Set cache-control for entire S3 bucket automatically (using bucket policies?)
https://forums.aws.amazon.com/thread.jspa?messageID=567440
By default, aws-cli only copies a file's current metadata, EVEN IF YOU SPECIFY NEW METADATA.
To use the metadata that is specified on the command line, you need to add the '--metadata-directive REPLACE' flag. Here are some examples.
For a single file
aws s3 cp s3://mybucket/file.txt s3://mybucket/file.txt --metadata-directive REPLACE \
--expires 2100-01-01T00:00:00Z --acl public-read --cache-control max-age=2592000,public
For an entire bucket:
aws s3 cp s3://mybucket/ s3://mybucket/ --recursive --metadata-directive REPLACE \
--expires 2100-01-01T00:00:00Z --acl public-read --cache-control max-age=2592000,public
A little gotcha I found: if you only want to apply it to a specific file type, you need to exclude all the files, then include the ones you want.
Only jpgs and pngs
aws s3 cp s3://mybucket/ s3://mybucket/ --exclude "*" --include "*.jpg" --include "*.png" \
--recursive --metadata-directive REPLACE --expires 2100-01-01T00:00:00Z --acl public-read \
--cache-control max-age=2592000,public
Here are some links to the manual if you need more info:
http://docs.aws.amazon.com/cli/latest/userguide/using-s3-commands.html
http://docs.aws.amazon.com/cli/latest/reference/s3/cp.html#options
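To check afterwards that the header actually landed on an object, head-object should show a CacheControl field (mybucket/file.txt being the placeholder names from the examples above):
aws s3api head-object --bucket mybucket --key file.txt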
