I need to execute a series of remove-file commands (listed below) against an Amazon S3 bucket, fed from a pipe. I need to run them line by line using the AWS Command Line Interface (v2), and I can't figure out how to do this. SFTP has a built-in batch facility that reads a series of remove commands from a text file, but the S3 CLI doesn't have that ability. What's my best option here? (The list below was captured from a pipe redirected (>) to a text file, and I'm using the BSD terminal utilities on a Mac.)
aws s3 rm s3://xxx.yyy/FILE_25Min_PGM_05-11-2020.mp3
aws s3 rm s3://xxx.yyy/FILE_25Min_PGM_05-12-2020.mp3
aws s3 rm s3://xxx.yyy/FILE_25Min_PGM_05-13-2020.mp3
aws s3 rm s3://xxx.yyy/FILE_25Min_PGM_05-14-2020.mp3
aws s3 rm s3://xxx.yyy/FILE_25Min_PGM_05-15-2020.mp3
Here's the solution: | awk '{ system("aws s3 rm s3://xxx.yyy/" $4) }'
$4 is the field that contains the filename. It's not obvious from the AWK manual that this effectively runs a loop: instead of printing a single command, it issues one system command per input row, using the specified field. Very powerful if you have a list of files, either in a text file or piped into AWK, that you want to run a system command on.
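A minimal sketch of the full invocation, assuming a hypothetical file deletions.txt whose fourth whitespace-separated field holds the object name (adjust $4 to whatever field your listing uses):
# runs one "aws s3 rm" per input line; field 4 is the S3 key
awk '{ system("aws s3 rm s3://xxx.yyy/" $4) }' deletions.txt
The same command works at the end of a pipe; just drop the filename and pipe the listing straight into awk.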
I am very new to using Google cloud and cloud servers, and I am stuck on a very basic question.
I would like to bulk download some ~60,000 csv.gz files from an internet server (with permission). I compiled a bunch of curl commands that pipe into gsutil, which uploads to my bucket, and put them in an .sh file that looks like the following.
curl http://internet.address/csvs/file1.csv.gz | gsutil cp - gs://my_bucket/file1.csv.gz
curl http://internet.address/csvs/file2.csv.gz | gsutil cp - gs://my_bucket/file2.csv.gz
...
curl http://internet.address/csvs/file60000.csv.gz | gsutil cp - gs://my_bucket/file60000.csv.gz
However, this will take ~10 days if I run it from my machine, so I'd like to run it from the cloud directly. I don't know the best way to do this. The process is too long to run in Cloud Shell directly, and I'm not sure which other Google Cloud service is the best place to run an .sh script that downloads into a Cloud Storage bucket, or whether this kind of .sh script is even the most efficient way to bulk download files from the internet using Google Cloud.
I've seen some advice to use the Cloud SDK, which I've installed on my local machine, but I don't even know where to start with that.
Any help with this is greatly appreciated!
gcloud and Cloud Storage don't offer a way to grab objects from the internet and copy them directly into a bucket without an intermediary (a computer, server, or cloud application).
Regarding which Cloud service can help you run a bash script, you can use a GCE always-free f1-micro VM instance (one free instance per billing account).
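As an aside, a minimal sketch of creating such an instance with the Cloud SDK; the instance name curl-worker and the zone us-central1-a are placeholders (the always-free tier is limited to certain US regions):
# create a small VM to run the download script on
gcloud compute instances create curl-worker --machine-type=f1-micro --zone=us-central1-a
# open a shell on it
gcloud compute ssh curl-worker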
To speed up uploading the files to a bucket, you can use GNU parallel to run multiple curl commands at the same time and reduce the time needed to complete the task.
To install parallel on Ubuntu/Debian, run this command:
sudo apt-get install parallel
For example, you can create a file called downloads containing the commands that you want to parallelize (write all of the curl commands into that file).
downloads file
curl http://internet.address/csvs/file1.csv.gz | gsutil cp - gs://my_bucket/file1.csv.gz
curl http://internet.address/csvs/file2.csv.gz | gsutil cp - gs://my_bucket/file2.csv.gz
curl http://internet.address/csvs/file3.csv.gz | gsutil cp - gs://my_bucket/file3.csv.gz
curl http://internet.address/csvs/file4.csv.gz | gsutil cp - gs://my_bucket/file4.csv.gz
curl http://internet.address/csvs/file5.csv.gz | gsutil cp - gs://my_bucket/file5.csv.gz
curl http://internet.address/csvs/file6.csv.gz | gsutil cp - gs://my_bucket/file6.csv.gz
After that, you simply need to run the following command:
parallel --jobs 2 < downloads
This will run up to 2 of the curl commands at a time until all the commands in the file have been executed.
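As a side note, rather than writing 60,000 lines by hand, here is a minimal sketch for generating the downloads file, assuming the files really are numbered file1 through file60000 (adjust the URL pattern to your real names):
# emit one curl-to-gsutil pipeline per file
for i in $(seq 1 60000); do
  echo "curl http://internet.address/csvs/file${i}.csv.gz | gsutil cp - gs://my_bucket/file${i}.csv.gz"
done > downloads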
Another improvement you can apply to your routine is to use gsutil mv instead of gsutil cp; the mv command deletes the local file after a successful upload, which can help you save space on your hard drive.
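Note that this only applies if you download to disk first instead of piping straight from curl; a minimal sketch, with file1.csv.gz as a placeholder name:
# download to a local file, then move it into the bucket; the local copy is removed after a successful upload
curl -o file1.csv.gz http://internet.address/csvs/file1.csv.gz
gsutil mv file1.csv.gz gs://my_bucket/file1.csv.gz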
If you have the MD5 hashes of each CSV file, you could use the Storage Transfer Service, which supports copying a list of files (that must be publicly accessible via HTTP[S] URLs) to your desired GCS bucket. See the Transfer Service docs on URL lists.
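For reference, a sketch of what such a URL list might look like, assuming the tab-separated TsvHttpData-1.0 format described in those docs; the byte counts and base64 MD5 values below are placeholders:
TsvHttpData-1.0
http://internet.address/csvs/file1.csv.gz	1357	wHENa08Vb6iPYAsOa2JAdw==
http://internet.address/csvs/file2.csv.gz	2468	q1B2c3D4e5F6g7H8i9J0kA==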
Currently, I have Bash commands redirecting output to a log file, and then a separate CLI aws s3 cp call to copy the log file up to S3.
I was wondering if there's a way to redirect output straight to S3 without the extra command/step. I tried pointing aws s3 cp at an https URL, but that doesn't seem to work, since those URLs are for objects that already exist on S3.
I never tested it, but check whether this is reasonable:
aws s3 cp <(/path/command arg1 arg2) s3://mybucket/mykey
Here /path/command arg1 arg2 stands for your "Bash commands redirecting output to a log file", except that you must not redirect the output to a file; leave it on stdout.
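If process substitution turns out not to work (aws s3 cp can be picky about non-regular files), an alternative sketch that relies on the CLI's documented support for streaming from standard input via -:
# pipe the command's stdout straight into the S3 object
/path/command arg1 arg2 | aws s3 cp - s3://mybucket/mykey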
Not sure whether it's overkill given the gravity of your scenario, but with an AWS File Gateway you can write the files to a mounted share and they will be synced to S3 automatically.
I am using the AWS CLI to manage files/objects in S3. I have thousands of objects buried in a complex system of nested folders (subfolders), and I want to move them all up into a single folder at the root of the bucket (s3://bucket/folder/file.txt).
I've tried using this command:
aws s3 mv s3://bucket-a/folder-a s3://bucket-a --recursive --exclude "*" --include "*.txt"
When I use the mv command, it carries over the prefixes (directory paths) of each object resulting in the same nested folder system. Here is what I want to accomplish:
Desired Result:
Where:
s3://bucket-a/folder-a/file-1.txt
s3://bucket-a/folder-b/folder-b1/file-2.txt
s3://bucket-a/folder-c/folder-c1/folder-c2/file-3.txt
Output:
s3://bucket-a/file-1.txt
s3://bucket-a/file-2.txt
s3://bucket-a/file-3.txt
I have been told, that I need to use a bash script to accomplish my desired result. Here is a sample script that was provided to me:
#!/bin/bash
#BASH Script to move objects without directory structure
bucketname='my-bucket'
for key in $(aws s3api list-objects --bucket "${my-bucket}" --query "Contents[].{Object:Key}" --output text) ;
do
echo "$key"
FILENAME=$($key | awk '{print $NF}' FS=/)
aws s3 cp s3://$my-bucket/$key s3://$my-bucket/my-folder/$FILENAME
done
When I run this bash script, I get an error:
A client error (AccessDenied) occurred when calling the ListObjects operation: Access Denied
I tested the connection with another aws s3 command and confirmed that it works. I added policies to the user granting all privileges on S3. I have no idea what I am doing wrong here.
Any help would be greatly appreciated.
That script looks messed up: it makes no sense to set a variable called bucketname and then try to use another one called my-bucket. What happens if you try this?
#!/bin/bash
#BASH Script to move objects without directory structure
bucketname='my-bucket'
for key in $(aws s3api list-objects --bucket "${bucketname}" --query "Contents[].{Object:Key}" --output text) ;
do
echo "$key"
FILENAME=$(echo "$key" | awk -F/ '{print $NF}')
aws s3 cp s3://$bucketname/$key s3://$bucketname/my-folder/$FILENAME
done
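Alternatively, a minimal sketch of a variant (assumptions: same my-folder destination, keys without embedded newlines) that reads keys line by line so names containing spaces survive, and strips the prefix with shell parameter expansion instead of awk:
#!/bin/bash
# copy every object into my-folder/ keeping only the base name
bucketname='my-bucket'
aws s3api list-objects --bucket "$bucketname" --query "Contents[].{Object:Key}" --output text |
while IFS= read -r key ; do
    filename="${key##*/}"   # everything after the last slash
    aws s3 cp "s3://$bucketname/$key" "s3://$bucketname/my-folder/$filename"
done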
I have created a script to upload files to an S3 bucket, and I got a timeout error, so I am not sure whether all the files made it to the bucket. I have written another function to check the differences, but it doesn't seem to work because of how the local folder listing comes out:
If I do a find, such as find $FOLDER -type f | cut -d/ -f2- | sort, I get the whole path, like /home/sop/path/to/folder/.... It seems that cut -d/ -f2- does nothing...
If I do an ls -LR, I don't get a listing I can compare with the result of aws s3api list-objects ...
The AWS Command-Line Interface (CLI) has a useful aws s3 sync command that can replicate files from a local directory to an Amazon S3 bucket (or vice versa, or between buckets).
It will only copy new/changed files, so it's a great way to make sure files have been uploaded.
See: AWS CLI S3 sync command documentation
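A minimal usage sketch, with the local path and destination prefix as placeholders:
# copies only new/changed files; --dryrun previews the transfers without performing them
aws s3 sync /home/sop/path/to/folder s3://my-bucket/my-prefix/ --dryrun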
I've created a bash script to migrate sites and databases from one server to another. Algorithm:
Parse .pgpass file to create individual dumps for all the specified Postgres db's.
Upload said dumps to another server via rsync.
Upload a bunch of folders related to each db to the other server, also via rsync.
Since databases and folders have the same name, the script can predict the location of the folders if it knows the db name. The problem I'm facing is that the loop is only executing once (only the first line of .pgpass is being completed).
This is my script, to be run in the source server:
#!/bin/bash
# Read each line of the input file, parse args separated by semicolon (:)
while IFS=: read host port db user pswd ; do
# Create the dump. No need to enter the password as we're using .pgpass
pg_dump -U $user -h $host -f "$db.sql" $db
# Create a dir in the destination server to copy the files into
ssh user@destination.server mkdir -p webapps/$db/static/media
# Copy the dump to the destination server
rsync -azhr $db.sql user@destination:/home/user
# Copy the website files and folders to the destination server
rsync -azhr --exclude "*.thumbnails*" webapps/$db/static/media/ user@destination.server:/home/user/webapps/$db/static/media
# At this point I expect the script to continue to the next line, but it exits after the first one
done < $1
This is .pgpass, the file to parse:
localhost:*:db_name1:db_user1:db_pass1
localhost:*:db_name2:db_user2:db_pass2
localhost:*:db_name3:db_user3:db_pass3
# Many more...
And this is how I'm calling it:
./my_script.sh .pgpass
At this point everything works: the first dump is created, and it is transferred to the destination server along with the related files and folders. The problem is that the script finishes there and won't parse the remaining lines of .pgpass. I've commented out all the lines related to rsync (so the script only creates the dumps), and then it works correctly, executing once for each line in the file. How can I get the script to not exit after running rsync?
BTW, I'm using key based ssh auth to connect the servers, so the script is completely prompt-less.
Let's ask shellcheck:
$ shellcheck yourscript
In yourscript line 4:
while IFS=: read host port db user pswd ; do
^-- SC2095: ssh may swallow stdin, preventing this loop from working properly.
In yourscript line 8:
ssh user@destination.server mkdir -p webapps/$db/static/media
^-- SC2095: Add < /dev/null to prevent ssh from swallowing stdin.
And there you go.
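A sketch of the fix shellcheck is suggesting, leaving the rest of the loop unchanged: redirect ssh's stdin away from the loop's input so it can't consume the remaining lines of .pgpass (ssh -n has the same effect):
# Create a dir on the destination server without letting ssh eat the loop's stdin
ssh user@destination.server mkdir -p webapps/$db/static/media < /dev/null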