How to compare the content of a local folder with Amazon S3? - shell

I have created a script to upload files to an S3 bucket, but I got a timeout error, so I am not sure whether all the files made it onto the bucket. I wrote another function to check for differences, but it does not seem to work because of how the local folder is listed:
If I run a find like find $FOLDER -type f | cut -d/ -f2- | sort, I get the whole path, like /home/sop/path/to/folder/.... It seems that cut -d/ -f2- does nothing...
If I do an ls -LR, I do not get a list that I can compare with the aws s3api list-objects ... output.
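(For reference: cut -d/ -f2- only removes the empty field before the leading /, so the rest of the absolute path survives. Stripping the $FOLDER prefix itself could look like the sketch below, demonstrated on a throwaway directory rather than the real path.)

```shell
FOLDER=$(mktemp -d)                 # stand-in for /home/sop/path/to/folder
mkdir -p "$FOLDER/sub"
touch "$FOLDER/a.txt" "$FOLDER/sub/b.txt"

# Print paths relative to $FOLDER so they line up with S3 object keys
find "$FOLDER" -type f | sed "s|^$FOLDER/||" | sort
# → a.txt
# → sub/b.txt
```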

The AWS Command-Line Interface (CLI) has a useful aws s3 sync command that can replicate files from a local directory to an Amazon S3 bucket (or vice versa, or between buckets).
It will only copy new/changed files, so it's a great way to make sure files have been uploaded.
See: AWS CLI S3 sync command documentation
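A minimal invocation sketch (the bucket name is a placeholder; --dryrun makes sync print what it would copy without transferring anything, which is exactly the "what is missing from the bucket" check):

```
aws s3 sync /home/sop/path/to/folder s3://my-bucket --dryrun
```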

Related

Best way to run bash script on Google Cloud to bulk download to Bucket

I am very new to using Google cloud and cloud servers, and I am stuck on a very basic question.
I would like to bulk download some ~60,000 csv.gz files from an internet server (with permission). I compiled a bunch of curl commands that pipe into gsutil cp uploads to my bucket, in an .sh file that looks like the following.
curl http://internet.address/csvs/file1.csv.gz | gsutil cp - gs://my_bucket/file1.csv.gz
curl http://internet.address/csvs/file2.csv.gz | gsutil cp - gs://my_bucket/file2.csv.gz
...
curl http://internet.address/csvs/file60000.csv.gz | gsutil cp - gs://my_bucket/file60000.csv.gz
However, this will take ~10 days if I run it from my machine, so I'd like to run it from the cloud directly. I do not know the best way to do this: the process is too long to run in Cloud Shell directly, and I'm not sure which other Google Cloud service is the best place to run an .sh script that downloads to a Cloud Storage bucket, or whether this kind of .sh script is even the most efficient way to bulk-download files from the internet on Google Cloud.
I've seen some advice to use the SDK, which I've installed on my local machine, but I don't even know where to start with that.
Any help with this is greatly appreciated!
Gcloud and Cloud Storage don't offer the ability to grab objects from the internet and copy them directly to a bucket without an intermediary (a computer, server, or cloud application).
Regarding which Cloud service can run a bash script for you: you can use a GCE always-free f1-micro VM instance (one free instance per billing account).
To speed up the uploads to the bucket, you can use GNU parallel to run multiple curl commands at the same time and reduce the time needed to complete the task.
To install parallel on Ubuntu/Debian, run this command:
sudo apt-get install parallel
For example, you can create a file called downloads containing the commands you want to parallelize (write every curl command in this file):
downloads file:
curl http://internet.address/csvs/file1.csv.gz | gsutil cp - gs://my_bucket/file1.csv.gz
curl http://internet.address/csvs/file2.csv.gz | gsutil cp - gs://my_bucket/file2.csv.gz
curl http://internet.address/csvs/file3.csv.gz | gsutil cp - gs://my_bucket/file3.csv.gz
curl http://internet.address/csvs/file4.csv.gz | gsutil cp - gs://my_bucket/file4.csv.gz
curl http://internet.address/csvs/file5.csv.gz | gsutil cp - gs://my_bucket/file5.csv.gz
curl http://internet.address/csvs/file6.csv.gz | gsutil cp - gs://my_bucket/file6.csv.gz
After that, you simply need to run the following command:
parallel --jobs 2 < downloads
This will run up to 2 curl commands in parallel until every command in the file has been executed.
Another improvement you can apply to your routine is to use gsutil mv instead of gsutil cp; the mv command deletes the local file after a successful upload, which can help you save space on your hard drive.
If you have the MD5 hashes of each CSV file, you could use the Storage Transfer Service, which supports copying a list of files (that must be publicly accessible via HTTP[S] URLs) to your desired GCS bucket. See the Transfer Service docs on URL lists.
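For illustration, a URL list is a TSV file whose first line is the format identifier, followed by one publicly accessible URL per line, optionally with the object size in bytes and a base64-encoded MD5 (the sizes and hashes below are made-up placeholders):

```
TsvHttpData-1.0
http://internet.address/csvs/file1.csv.gz	1357	wHENa08V36iPYAsOa2JAdw==
http://internet.address/csvs/file2.csv.gz	2468	R9acAaveoPd2y8nniLUYbw==
```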

Bash SFTP batch file equivalent for Amazon S3

I need to execute a series of remove-file commands (listed below) against an Amazon S3 bucket from a pipe, line by line, using the AWS Command Line Interface (v2), and I can't figure out how to do this. SFTP has a built-in batch utility to read a series of remove commands from a text file, but S3 doesn't. What's my best option here? (The list was produced with a > pipe to a text file, and I'm using BSD Terminal on a Mac.)
aws s3 rm s3://xxx.yyy/FILE_25Min_PGM_05-11-2020.mp3
aws s3 rm s3://xxx.yyy/FILE_25Min_PGM_05-12-2020.mp3
aws s3 rm s3://xxx.yyy/FILE_25Min_PGM_05-13-2020.mp3
aws s3 rm s3://xxx.yyy/FILE_25Min_PGM_05-14-2020.mp3
aws s3 rm s3://xxx.yyy/FILE_25Min_PGM_05-15-2020.mp3
Here's the solution: | awk '{ system("aws s3 rm s3://xxx.yyy/" $4) }'
$4 is the field that contains the filename. The AWK manual doesn't make it very clear that this effectively runs a loop: rather than printing a single command, it executes one system command per input row, using the specified field. Very powerful if you have a list of files, either in a text file or piped into AWK, that you want to run a system command on.
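A self-contained way to see the mechanics, with echo standing in for aws so it runs without credentials (the ls-style input lines are made-up samples):

```shell
# Each input row triggers one system() call; $4 is the filename field.
printf '%s\n' \
  '-rw-r--r-- 1 user FILE_25Min_PGM_05-11-2020.mp3' \
  '-rw-r--r-- 1 user FILE_25Min_PGM_05-12-2020.mp3' |
awk '{ system("echo aws s3 rm s3://xxx.yyy/" $4) }'
# → aws s3 rm s3://xxx.yyy/FILE_25Min_PGM_05-11-2020.mp3
# → aws s3 rm s3://xxx.yyy/FILE_25Min_PGM_05-12-2020.mp3
```

Dropping the echo runs the real deletions.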

How to fix data import to Windows EC2 from S3 bucket

I have a Windows Server 2012 R2 EC2 instance and am failing to import .txt files from an S3 bucket.
I want to set up a regular data import from an S3 bucket to the EC2 instance using the aws-cli. To test the command, I opened the command prompt with administrator rights, navigated to the directory where I want to import the files, and ran the following command.
aws s3 cp s3://mybuckt/ . --recursive
Then I get an error like the following for every file in the bucket:
download failed: s3://mybuckt/filename.txt to .\filename.txt [Error 87] The parameter is incorrect
I end up with a list of empty files in my directory; the list matches the bucket's contents, but the text files themselves are completely empty.
When I try the command without --recursive, nothing happens: no error messages, no files copied.
aws s3 cp s3://mybuckt/ .
Here are my questions:
Why is the recursive option wrong when I import the files?
What can I check in the configuration of the EC2 instance, to verify that it is correctly set up for the data import?
Without --recursive (your second command), no individual file is specified, so nothing is copied. To copy everything, use:
aws s3 cp s3://mybuckt/ . --recursive
Or use sync, which is recursive by default:
aws s3 sync s3://mybuckt/ .
The solution to my specific problem was that I had to specify the file name on the bucket as well as the target name on my EC2 instance.
aws s3 cp s3://mybuckt/file.txt nameOnMyEC2.txt

Unable to upload file to S3 bucket

I am trying to send an AD backup folder to an AWS S3 bucket from a Windows Server 2016 machine, via the command line.
aws s3 cp "D:\WindowsImageBackup" s3://ad-backup/
However I get the below error.
Invalid length for parameter Key, value: 0, valid range: 1-inf
The folder I am trying to upload contains some large files, so I'm not sure if it's too big. I have tested the bucket, and smaller files work.
Thanks
You have to use the --recursive option to upload a folder:
aws s3 cp --recursive "D:\WindowsImageBackup" s3://ad-backup/
Or pack the folder into a single file and upload that file with a plain aws s3 cp.
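A sketch of the pack-then-upload route, demonstrated on a scratch directory (the bucket name comes from the question; swap in the real D:\WindowsImageBackup path for actual use, and note the final aws line needs credentials):

```shell
# Scratch folder standing in for D:\WindowsImageBackup
src=$(mktemp -d)
echo demo > "$src/file.txt"

# Pack the whole folder into one archive...
tar -czf backup.tar.gz -C "$src" .
ls -l backup.tar.gz

# ...then upload the single object:
# aws s3 cp backup.tar.gz s3://ad-backup/
```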

AWS S3: Remove Object Prefix From Thousands of Files in Complex Directory Structure

I am using the AWS CLI to manage files/objects in S3. I have thousands of objects buried in a complex system of nested folders (subfolders), and I want to elevate all of them into one folder at the root of the bucket (s3://bucket/folder/file.txt).
I've tried using this command:
aws s3 mv s3://bucket-a/folder-a s3://bucket-a --recursive --exclude "*" --include "*.txt"
When I use the mv command, it carries over the prefixes (directory paths) of each object, resulting in the same nested folder system. Here is what I want to accomplish:
Desired Result:
Where:
s3://bucket-a/folder-a/file-1.txt
s3://bucket-a/folder-b/folder-b1/file-2.txt
s3://bucket-a/folder-c/folder-c1/folder-c2/file-3.txt
Output:
s3://bucket-a/file-1.txt
s3://bucket-a/file-2.txt
s3://bucket-a/file-3.txt
I have been told that I need a bash script to accomplish my desired result. Here is a sample script that was provided to me:
#!/bin/bash
#BASH Script to move objects without directory structure
bucketname='my-bucket'
for key in $(aws s3api list-objects --bucket "${my-bucket}" --query "Contents[].{Object:Key}" --output text) ;
do
echo "$key"
FILENAME=$($key | awk '{print $NF}' FS=/)
aws s3 cp s3://$my-bucket/$key s3://$my-bucket/my-folder/$FILENAME
done
When I run this bash script, I get an error:
A client error (AccessDenied) occurred when calling the ListObjects operation: Access Denied
I tested the connection with another aws s3 command and confirmed that it works. I added policies to the user to include all privileges on S3; I have no idea what I am doing wrong here.
Any help would be greatly appreciated.
That script looks messed up: it sets a variable called bucketname but then tries to use another one called my-bucket. What happens if you try this?
#!/bin/bash
# BASH script to move objects without the directory structure
bucketname='my-bucket'
for key in $(aws s3api list-objects --bucket "${bucketname}" --query "Contents[].{Object:Key}" --output text) ;
do
echo "$key"
FILENAME=$(echo "$key" | awk -F/ '{print $NF}')
aws s3 cp "s3://$bucketname/$key" "s3://$bucketname/my-folder/$FILENAME"
done
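As an aside, the filename extraction can be done with bash parameter expansion instead of awk, avoiding one subprocess per object (the sample key below is illustrative):

```shell
# "${key##*/}" deletes the longest prefix ending in "/", i.e. a basename
key='folder-c/folder-c1/folder-c2/file-3.txt'
FILENAME=${key##*/}
echo "$FILENAME"
# → file-3.txt
```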
