Organizing and Lifecycle management of S3 Objects - bash

I am currently working on a script that handles both lifecycle management and organizing of files.
We already have lifecycle management in place; the biggest limitation I found is that lifecycle management only changes the storage class while the files stay in the same place where they were.
Example
aws s3api copy-object \
--copy-source s3://dshare/folderM/alia_b_2_3 \
--key alia_b_2_3 \
--bucket s3://dshare/folderIA/alia_b_2_3 \
--storage-class STANDARD_IA
I tried the above command, however I get a regex error. Let me know if I'm doing anything wrong.
Error:
Parameter validation failed:
Invalid bucket name "s3://dshare/folderIA/alia_b_2_3": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$"
My idea, basically, is to copy/move the files into another folder under the same or a different bucket and, at the same time, change the storage class of the files being copied/moved.

aws s3api copy-object expects a bucket name for --bucket, not an S3 URI starting with s3://, so in your case you only need to supply dshare as the value. Likewise, --copy-source takes the source as bucket/key without the s3:// prefix, and --key is the destination key (which can include the target folder prefix).
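For example, a corrected version of the command above might look like the following sketch (assuming the destination should be the folderIA/ prefix in the same dshare bucket, as in your attempt):
aws s3api copy-object \
--copy-source dshare/folderM/alia_b_2_3 \
--bucket dshare \
--key folderIA/alia_b_2_3 \
--storage-class STANDARD_IA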

Related

Copy files incrementally from S3 to EBS storage using filters

I wish to move a large set of files from an AWS S3 bucket in one AWS account (source), having systematic filenames following this pattern:
my_file_0_0_0.csv
...
my_file_0_7_200.csv
into an S3 bucket in another AWS account (target).
These need to be moved by an EC2 instance (to overcome IAM access restrictions) to an attached EBS volume incrementally (to overcome storage limitations).
Clarification:
in the filenames, there are 3 numbers separated by underscores, like so: _a_b_c, where a is always 0, b starts at 0 and goes up to 7, and c goes from 0 to at most 200 (not guaranteed it will always reach 200).
(I have an SSH session to the EC2 instance through PuTTY.)
1st iteration:
So what I am trying to do in the first iteration is to copy all files from S3
that have a name matching the pattern my_file_0_0_*.csv.
This can be done with the command:
aws s3 cp s3://my_source_bucket_name/my_folder/ . --recursive --exclude "*" --include "my_file_0_0_*" --profile source_user
From here, I upload it to my target bucket with the command:
aws s3 cp . s3://my_target_bucket_name/my_folder/ --recursive --profile source_user
And finally delete the files from the ec2 instance's ebs volume with
rm *.
2nd iteration:
aws s3 cp s3://my_source_bucket_name/my_folder/ . --recursive --exclude "*" --include "my_file_0_1_*" --profile source_user
This time, I only get some of the files with pattern my_file_0_1_*, as their combined file size reaches 100 GiB, which is the limit of my EBS volume.
Here I run into the issue that the filenames are sorted alphabetically rather than numerically by the digits in their names, e.g.:
my_file_0_1_0.csv
my_file_0_1_1.csv
my_file_0_1_10.csv
my_file_0_1_100.csv
my_file_0_1_101.csv
my_file_0_1_102.csv
my_file_0_1_103.csv
my_file_0_1_104.csv
my_file_0_1_105.csv
my_file_0_1_106.csv
my_file_0_1_107.csv
my_file_0_1_108.csv
my_file_0_1_109.csv
my_file_0_1_11.csv
After moving them to the target S3 bucket and removing them from EBS,
the challenge is to move the remaining files with pattern my_file_0_1_* in a systematic way. Is there a way to achieve this, e.g. by using find, grep, awk or similar?
And do I need to cast some filename slices to integers first?
You can use the sort -V command to get a natural (version-aware) ordering of the filenames, and then invoke the copy command on each file one by one, or on a batch of files at a time.
ls | sort -V
If you're on a GNU system, you can also use ls -v. This won't work on macOS.
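If you want to script the batching, a rough sketch along these lines might work (untested; it assumes the keys contain no spaces, and BATCH is just an illustrative batch size you would tune to what fits on the EBS volume):
# Copy the first $BATCH my_file_0_1_* objects in natural (version) order.
# Re-run with an adjusted pattern or skip the already-moved keys for later batches.
BATCH=50
aws s3 ls s3://my_source_bucket_name/my_folder/ --profile source_user |
awk '{print $4}' |
grep '^my_file_0_1_' |
sort -V |
head -n "$BATCH" |
while read -r key; do
  aws s3 cp "s3://my_source_bucket_name/my_folder/$key" . --profile source_user
done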

how to list all objects in an S3 bucket having a specific character in the key using a shell script

I have an S3 bucket, and below is the directory structure.
bucketname/uid=/year=/month=/day=/files.parquet
In some cases, inside the year directory, I have some temporary objects created by Athena. Ex:
month=11_$folder$
I want to remove all of these objects whose key contains month=11_$folder$.
Currently I am doing this in a loop over all uid values. Is there any faster way to do that?
Using the AWS CLI's list-objects-v2, you can search for patterns:
aws s3api list-objects-v2 \
--bucket my-bucket \
--query 'Contents[?contains(Key, `month=11_$folder$`)]'
Note that this still lists all of your objects and only filters what is returned, so if you have more than 1,000 objects in your bucket, you'll need to paginate.
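If the goal is to delete the matches rather than just list them, one possible follow-up is the sketch below (it assumes the keys contain no spaces): pull out only the matching keys with --output text, then remove them one by one.
aws s3api list-objects-v2 \
--bucket my-bucket \
--query 'Contents[?contains(Key, `month=11_$folder$`)].Key' \
--output text |
tr '\t' '\n' |
while read -r key; do
  aws s3 rm "s3://my-bucket/$key"
done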

Bash script to delete all the files in an AWS S3 bucket using the date contained in their names

I'm trying to remove only the files that are older than 5 days, based on the date in file names containing "DITN1_" and "DITS1_", using a bash script against the AWS S3 bucket, but the issue is that all the files I'm trying to delete look like the following:
DITN1_2016.12.01_373,
DITS1_2012.10.10_141,
DITN1_2016.12.01_3732,
DITS1_2012.10.10_1412
If someone can help me out with the code, that would be nice.
Thanks in advance.
You can use the AWS CLI to delete objects from a bash script as follows:
aws s3 rm s3://mybucket/ --recursive --exclude "*" --include "DITN1_*"
However, the filters do not support timestamps.
For details see the aws s3 CLI documentation.
Is it important to use the name of the objects instead of metadata? You could get a list of objects in the bucket using the s3api:
aws s3api list-objects --bucket example --no-paginate # this last option will avoid pagination, don't use it if you have thousands of objects
Adding
--query Contents[]
will give you back the listing entry for every object, including a LastModified field, which tells you when the object was last modified, for example "2016-12-16T13:56:23.000Z".
http://docs.aws.amazon.com/cli/latest/reference/s3api/list-objects.html
You could convert this timestamp to epoch seconds using
date "+%s" -d "put the timestamp here"
And compare it with the current time minus 5 days.
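Putting that together, a rough sketch of the LastModified approach might look like this (it assumes GNU date, keys without whitespace, and uses the placeholder bucket name example from above):
cutoff=$(date -d "5 days ago" +%s)
aws s3api list-objects --bucket example \
--query 'Contents[].[Key,LastModified]' --output text |
while read -r key modified; do
  # delete anything last modified more than 5 days ago
  if [ "$(date -d "$modified" +%s)" -lt "$cutoff" ]; then
    aws s3 rm "s3://example/$key"
  fi
done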
OR if you really want to delete objects based on name, you could loop over the keys like this:
for key in $(aws s3api list-objects --bucket example --no-paginate --query 'Contents[].Key' --output text)
And add logic to determine the date. Something like this might work, judging by your examples:
key_without_prefix=${key#*_}
key_without_suffix=${key_without_prefix%_*}
Then you have your date, which you can compare with the current time - 5 days.
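For the name-based variant, something along these lines might work as a sketch (again assuming GNU date, the DITN1_YYYY.MM.DD_NNN / DITS1_YYYY.MM.DD_NNN naming from the question, and example as a placeholder bucket):
cutoff=$(date -d "5 days ago" +%s)
for key in $(aws s3api list-objects --bucket example --no-paginate --query 'Contents[].Key' --output text); do
  key_without_prefix=${key#*_}          # drop "DITN1_" / "DITS1_"
  file_date=${key_without_prefix%_*}    # drop the trailing "_NNN", leaving e.g. 2016.12.01
  # convert 2016.12.01 -> 2016-12-01 so GNU date can parse it
  if [ "$(date -d "${file_date//./-}" +%s)" -lt "$cutoff" ]; then
    aws s3 rm "s3://example/$key"
  fi
done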

How can multiple files be specified with "-files" in the CLI of Amazon for EMR?

I am trying to start an Amazon EMR cluster via the AWS CLI, but I am a little bit confused about how I should specify multiple files. My current call is as follows:
aws emr create-cluster --steps Type=STREAMING,Name='Intra country development',ActionOnFailure=CONTINUE,Args=[-files,s3://betaestimationtest/mapper.py,-files,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra] \
--ami-version 3.1.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--auto-terminate \
--log-uri s3://betaestimationtest/logs
However, Hadoop now complains that it cannot find the reducer file:
Caused by: java.io.IOException: Cannot run program "reducer.py": error=2, No such file or directory
What am I doing wrong? The file does exist in the folder I specify.
For passing multiple files in a streaming step, you need to use file:// to pass the steps as a JSON file.
The AWS CLI shorthand syntax uses a comma as the delimiter to separate a list of args. So when we try to pass in parameters like "-files","s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py", the shorthand syntax parser will treat mapper.py and reducer.py as two separate parameters.
The workaround is to use the json format. Please see the examples below.
aws emr create-cluster --steps file://./mysteps.json --ami-version 3.1.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate --log-uri s3://betaestimationtest/logs
mysteps.json looks like:
[
  {
    "Name": "Intra country development",
    "Type": "STREAMING",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-files",
      "s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py",
      "-mapper",
      "mapper.py",
      "-reducer",
      "reducer.py",
      "-input",
      "s3://betaestimationtest/output_0_inter",
      "-output",
      "s3://betaestimationtest/output_1_intra"
    ]
  }
]
You can also find examples here: https://github.com/aws/aws-cli/blob/develop/awscli/examples/emr/create-cluster-examples.rst. See example 13.
Hope it helps!
You are specifying -files twice; you only need to specify it once. I forget whether the CLI needs the separator to be a space or a comma for multiple values, but you can try that out.
You should replace:
Args=[-files,s3://betaestimationtest/mapper.py,-files,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
with:
Args=[-files,s3://betaestimationtest/mapper.py s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
or if that fails, with:
Args=[-files,s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
Add an escape for comma separating files:
Args=[-files,s3://betaestimationtest/mapper.py\\,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]

Amazon Elastic MapReduce: Output directory

I'm running through Amazon's example of running Elastic MapReduce and keep getting hit with the following error:
Error launching job , Output path already exists.
Here is the command to run the job that I am using:
C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --create --stream \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
--input s3://elasticmapreduce/samples/wordcount/input \
--output [A path to a bucket you own on Amazon S3, such as, s3n://myawsbucket] \
--reducer aggregate
Here is where the example comes from.
I'm following Amazon's directions for the output directory. The bucket name is s3n://mp.maptester321mark/. I've looked through all their suggestions for problems on this URL.
Here is my credentials.json info:
{
  "access_id": "1234123412",
  "private_key": "1234123412",
  "keypair": "markkeypair",
  "key-pair-file": "C:/Ruby/elastic-mapreduce-cli/markkeypair",
  "log_uri": "s3n://mp-mapreduce/",
  "region": "us-west-2"
}
Hadoop jobs won't clobber directories that already exist. You just need to run:
hadoop fs -rmr <output_dir>
before your job, or just use the AWS console to remove the directory.
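If the output lives on S3 rather than HDFS, a roughly equivalent cleanup with the AWS CLI might be (a sketch; the output prefix below is only illustrative, so double-check it before deleting recursively):
aws s3 rm s3://mp.maptester321mark/output --recursive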
Use:
--output s3n://mp.maptester321mark/output
instead of:
--output s3n://mp.maptester321mark/
I suppose EMR creates the output bucket before running, which means you already have your output directory / if you specify --output s3n://mp.maptester321mark/, and that might be the reason why you get this error.
---> If the folder (bucket) already exists, then remove it.
---> If you delete it and you still get the above error, make sure your output is like this:
s3n://some_bucket_name/your_output_bucket, not like this: s3n://your_output_bucket/.
It's an issue with EMR!! I think it first creates the bucket on the path (some_bucket_name) and then tries to create the (your_output_bucket).
Thanks
Hari
