How to set a watermark when downloading from an S3 bucket - bash

I have an S3 bucket that contains zip files.
I have a script that downloads the zip files to my local server, unzips them and uploads them to another S3 bucket.
How can I set a watermark so I know which file I downloaded last, and won't need to keep all the files locally or download all of them each time the script runs?
I'm using the aws s3 sync command which, as far as I understand, should copy only new files. Am I right?
aws s3 sync $gcs3$gcRegion/$gcTech/$gcPrinterFamily/$gcPrinterType/$gcPrinterName/ $dir

The AWS Command-Line Interface (CLI) aws s3 sync command will copy any files that are not already present in the destination (it decides by comparing file size and last-modified time).
So, you either need to keep all previously-downloaded files, or you need another way to keep track of the files that were downloaded.
Instead, I would recommend writing your own program that:
1. Downloads all files from the S3 bucket with a LastModified timestamp after a stored timestamp
2. Stores the current time
3. Unzips the files and copies them to the other S3 bucket
4. Deletes the zip files and unzipped files
So, the program will need to remember the last time it downloaded files, but it will not need to remember which files it downloaded. Be careful — S3 stores time in UTC, so you'll need to convert your timezones. Or, simply remember the highest LastModified value of the files you downloaded.
To obtain a list of files since a certain LastModified date, you could use the AWS CLI:
aws s3api list-objects --bucket jstack-a --query "Contents[?LastModified>='2019-04-11'].[Key]" --output text
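Putting those steps together, a minimal bash sketch might look like the following. The bucket names, staging directory and watermark file are placeholders, and the script simply keeps the highest LastModified value it has seen, as suggested above.
#!/bin/bash
# Sketch only: download zips newer than the stored watermark, unzip them,
# re-upload, then advance the watermark. The names below are assumptions.
set -euo pipefail

src_bucket="jstack-a"               # source bucket (name reused from the example above)
dst_bucket="my-destination-bucket"  # hypothetical destination bucket
dir=$(mktemp -d)
watermark_file="$HOME/.s3_watermark"
mkdir -p "$dir/unzipped"

# Read the stored watermark; fall back to a very old timestamp on the first run.
last_run=$(cat "$watermark_file" 2>/dev/null || echo "1970-01-01T00:00:00.000Z")
newmark="$last_run"

# List only the objects modified after the watermark, with their LastModified values.
aws s3api list-objects --bucket "$src_bucket" \
    --query "Contents[?LastModified>'${last_run}'].[LastModified,Key]" \
    --output text |
while IFS=$'\t' read -r modified key; do
    if [ -z "$key" ] || [ "$modified" = "None" ]; then
        continue    # nothing new since the last run
    fi
    aws s3 cp "s3://$src_bucket/$key" "$dir/"
    unzip -o "$dir/$(basename "$key")" -d "$dir/unzipped"
    # Keep the highest LastModified value seen so far; the loop runs in a
    # subshell, so persist it straight to the watermark file.
    if [[ "$modified" > "$newmark" ]]; then
        newmark="$modified"
        echo "$newmark" > "$watermark_file"
    fi
done

# Upload the unzipped files to the other bucket, then clean up locally.
if [ -n "$(ls -A "$dir/unzipped")" ]; then
    aws s3 cp "$dir/unzipped" "s3://$dst_bucket/" --recursive
fi
rm -rf "$dir"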
However, I would recommend writing a Python program for the above activities, since it would be easier than writing command-line scripts.

Related

How to execute a bash file containing curl instructions within a Google storage bucket and directly copy the contents to the bucket?

I have a bash script file where each line is a curl command to download a file. This bash file is in a Google bucket.
I would like to execute the file either directly from the storage and copy its downloaded contents there or execute it locally and directly copy its content to the bucket.
Basically, I do not want to have these files on my local machine. I have tried things along these lines, but it either failed or simply downloaded everything locally.
gsutil cp gs://bucket/my_file.sh - | bash gs://bucket/folder_to_copy_to/
Thank you!
To do so, the bucket needs to be mounted on the pod (the pod would see it as a directory).
If the bucket supports NFS, you would be able to mount it as shown here.
Also, there is another way as shown in this question.
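For illustration only, one mount-style option is Cloud Storage FUSE (gcsfuse); whether that is what the linked question describes is an assumption on my part, and the mount point is a placeholder.
# Mount the bucket so it appears as a local directory (gcsfuse must be installed).
mkdir -p /mnt/bucket
gcsfuse --implicit-dirs bucket /mnt/bucket

# Run the script from inside the target folder so its curl downloads
# land directly in the bucket.
mkdir -p /mnt/bucket/folder_to_copy_to
cd /mnt/bucket/folder_to_copy_to
bash /mnt/bucket/my_file.sh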
Otherwise, you would need to copy the script to the pod, run it, then upload the generated files to the bucket, and lastly clean everything up.
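A rough sketch of that fallback, reusing the bucket paths from the question (the temporary working directory is an assumption):
#!/bin/bash
# Copy the script down, run it in a scratch directory, upload the results,
# then clean up. Sketch only.
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Fetch and run the script; its curl commands download into the current directory.
gsutil cp gs://bucket/my_file.sh .
bash my_file.sh
rm my_file.sh

# Upload everything that was downloaded, then remove the local copies.
gsutil -m cp -r ./* gs://bucket/folder_to_copy_to/
cd /
rm -rf "$workdir"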
A better option is to use Filestore, which can be easily mounted using CSI drivers as mentioned here.

Is there a way to transfer all zip files from an AWS S3 bucket to another computer?

I have a module where I need to transfer all zip files from an S3 bucket to my network computers by just connecting to each IP address \xx.xx.xx.xxx. Right now I'm using Laravel.
exec('aws s3 cp s3://compexp/"11-10-2019"/"01150exp.zip"');
I have a bucket named compexp. Inside the bucket there are folders named by date, for example 11-10-2019, and inside the dated folder there are zip files (see the attached image of the zip files for reference).
Currently this is my reference, but I can't see how I can transfer the files to my network computers:
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
You can use the --recursive flag together with --exclude "*" --include "*.zip" to copy only the *.zip files in the folder "11-10-2019".
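For example, something along these lines should copy only the zip files from that dated folder (the local destination ./zips is just an assumption):
aws s3 cp s3://compexp/11-10-2019/ ./zips/ --recursive --exclude "*" --include "*.zip"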
Regarding the network computers, do you mean your internal network computers?
You can't pass another server as the destination to copy the contents. You need to copy the files locally first and then transfer them to the other server. A good way to sync the S3 bucket contents down to a local directory is to use the sync command, as mentioned here:
aws s3 sync s3://mybucket yourLocalDir
Once you have all the contents synced into your local directory, you can copy them to a different computer using the scp command, as mentioned here:
scp -r yourLocalDir anotherHost:/directory
The -r option of scp ensures you copy all the subdirectories recursively.

aws s3 cp to a local file was not replacing file

I have a shell script that runs aws s3 cp s3://s3file /home/usr/localfile. The file already exists in that directory, so the cp command is essentially fetching the latest version from S3.
However, I noticed today that the file was not the latest version; it didn't match the file on S3. Looking at the shell script's stdout from the last two runs, it looks like the command ran - the output is: download: s3://s3file to usr/localfile. But when I compared the copies, they didn't match. The modified timestamp on the file, when I view it on the local machine via WinSCP (a file transfer client), didn't change either.
I manually ran the command in a shell just now and it copied the file from S3 to the local machine and successfully got the latest copy.
Do I need to add a specific option for this, or is it typical behavior for aws s3 cp to not overwrite an existing file?

Remove directory level when transferring from HDFS to S3 using S3DistCp

I have a Pig script (using a slightly modified MultiStorage) that transforms some data. Once the script runs, I have data in the following format on HDFS:
/tmp/data/identifier1/indentifier1-0,0001
/tmp/data/identifier1/indentifier1-0,0002
/tmp/data/identifier2/indentifier2-0,0001
/tmp/data/identifier3/indentifier3-0,0001
I'm attempting to use S3DistCp to copy these files to S3. I am using the --groupBy .*(identifier[0-9]).* option to combine files based on the identifier. The combination works, but when copying to S3, the folders are also copied. The end output is:
/s3bucket/identifier1/identifier1
/s3bucket/identifier2/identifier2
/s3bucket/identifier3/identifier3
Is there a way to copy these files without that first folder? Ideally, my output in S3 would look like:
/s3bucket/identifier1
/s3bucket/identifier2
/s3bucket/identifier3
Another solution I've considered is to use HDFS commands to pull those files out of their directories before copying to S3. Is that a reasonable solution?
Thanks!
The solution I arrived at is to use distcp to bring these files out of the directories before using s3distcp:
hadoop distcp -update /tmp/data/** /tmp/grouped
Then, I changed the s3distcp script to move data from /tmp/grouped into my S3 bucket.
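For reference, if you run it as the s3-dist-cp command on an EMR node, that second step could look roughly like this (a sketch rather than the exact script, with the bucket name taken from the desired output above):
s3-dist-cp --src hdfs:///tmp/grouped --dest s3://s3bucket/ --groupBy '.*(identifier[0-9]).*'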
Using distcp before s3distcp is really expensive. Another option is to create a manifest file listing all your files and give its path to s3distcp. In the manifest you can define the "base name" of each file. If you need an example of a manifest file, just run s3distcp on any folder with the --outputManifest argument.
More information can be found here.
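A rough sketch of the manifest route, assuming the s3-dist-cp command on EMR (the paths and manifest name are placeholders; double-check the manifest-related flags against the EMR documentation):
# Run once with --outputManifest to produce an example manifest you can edit.
s3-dist-cp --src hdfs:///tmp/data --dest s3://s3bucket/staging/ --outputManifest=manifest.gz

# After adjusting the "base name" entries, copy again using that manifest.
s3-dist-cp --src hdfs:///tmp/data --dest s3://s3bucket/ \
    --copyFromManifest --previousManifest=s3://s3bucket/staging/manifest.gz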

s3cmd put --preserve flag does not preserve file creation/modified date when copied to S3 bucket

I am copying files from AWS EC2 to an AWS S3 bucket with the --preserve flag to preserve the file creation and modification dates, but once a file is copied to the S3 bucket, the "s3cmd ls s3://bucket-name/" command lists the upload time as the file's time; it does not preserve the original file creation date-time. I am using the following command (s3cmd put --preserve xyz.log s3://bucket-name/) to copy the file. Though the s3cmd help lists --preserve or -p as something you can use to preserve the date, it does not seem to be working.
Has anybody run into this kind of issue and can point out what I am doing wrong?
I also tried s3cmd sync, but the sync command behaves the same way, though I would prefer to use put.
s3cmd put --preserve xyz.log s3://bucket-name/
Thanks,
Please try the current upstream github.com/s3tools/s3cmd master branch. This is resolved there. Going round trip now restores the atime and mtime values as stored during sync upload:
s3cmd sync --preserve file s3://bucket/
rm file
s3cmd sync --preserve s3://bucket/file .
