I am very new to using Google cloud and cloud servers, and I am stuck on a very basic question.
I would like to bulk download some ~60,000 csv.gz files from an internet server (with permission). I compiled a bunch of curl scripts that pipe into a gsutil that uploads to my bucket into an .sh file that looks like the following.
curl http://internet.address/csvs/file1.csv.gz | gsutil cp - gs://my_bucket/file1.csv.gz
curl http://internet.address/csvs/file2.csv.gz | gsutil cp - gs://my_bucket/file2.csv.gz
...
curl http://internet.address/csvs/file60000.csv.gz | gsutil cp - gs://my_bucket/file60000.csv.gz
However this will take ~10 days if I run from my machine, so I'd like to run it from the cloud directly. I do not know the best way to do this. This is too long of a process to use the Cloud Shell directly, and I'm not sure what other app on the Cloud is the best way to run an .sh script that downloads to a Cloud Bucket, or if this type of .sh script is the most efficient method to go about bulk downloading files from the internet using the apps on Google Cloud.
I've seen some advice to use SDK, which I've installed on my local machine, but I don't even know where to start with that.
Any help with this is greatly appreciated!
Gcloud and Cloud Storage doesn't offer the possibility to grab objects from internet and copy these directly on a bucket without intermediary (computer,server or cloud application).
Regarding which Cloud service can help you for run a bash script, you can use a GCE always free F1-micro instance VM (1 instance free per billing account)
To improve the upload files to a bucket, you can use GNU parrallel to run multiple Curl Commands at the same time and improve the time to complete this task.
To install parallel on ubuntu/debian run this command:
sudo apt-get install parallel
For example you can create a file called downloads with the commands that you want to parallelize (you must write all curl commands in the file)
downloads file
curl http://internet.address/csvs/file1.csv.gz | gsutil cp - gs://my_bucket/file1.csv.gz
curl http://internet.address/csvs/file2.csv.gz | gsutil cp - gs://my_bucket/file2.csv.gz
curl http://internet.address/csvs/file3.csv.gz | gsutil cp - gs://my_bucket/file3.csv.gz
curl http://internet.address/csvs/file4.csv.gz | gsutil cp - gs://my_bucket/file4.csv.gz
curl http://internet.address/csvs/file5.csv.gz | gsutil cp - gs://my_bucket/file5.csv.gz
curl http://internet.address/csvs/file6.csv.gz | gsutil cp - gs://my_bucket/file6.csv.gz
After that, you simply need to run the following command
parallel --job 2 < downloads
This command will run up to 2 parallel curl commands until all the commands in the file have been executed.
Another improvement you can apply to your routine is to use gsutil mv instead gsutil cp, mv command will delete the file after success upload, this can help you to save space on your hard drive.
If you have the MD5 hashes of each CSV file, you could use the Storage Transfer Service, which supports copying a list of files (that must be publicly accessible via HTTP[S] URLs) to your desired GCS bucket. See the Transfer Service docs on URL lists.
For example: upon launching my EC2 instance, I would like to automatically run
docker login
so I can pull a private image from dockerhub and run it. To login to dockerhub I need to input a username and password, and this is what I would like to automate but haven't been able to figure out how.
I do know that you can pass in a script to be ran on launch via User Data. The issue is that my script expects input and I would like to automate entering that input.
Thanks in advance!
If just entering a password for docker login is your problem then I would suggest searching for a manual for docker login. 30 secs on Google gave me this link:
https://docs.docker.com/engine/reference/commandline/login/
It suggests something of the form
docker login --username foo --password-stdin < ~/my_password.txt
Which will read the password from a file my_password.txt in the current users home directory.
Seems like the easiest solution for you here is to modify your script to accept command line parameters, and pass those in with the UserData string.
Keep in mind that this will require you to change your launch configs every time your password changes.
The better solution here is to store your containers in ECS, and let AWS handle the authentication for you (as far as pulling the correct containers from a repo).
Your UserData then turns into something along:
#!/bin/bash
mkdir -p /etc/ecs
rm -f /etc/ecs/ecs.config # cleans up any old files on this instance
echo ECS_LOGFILE=/log/ecs-agent.log >> /etc/ecs/ecs.config
echo ECS_LOGLEVEL=info >> /etc/ecs/ecs.config
echo ECS_DATADIR=/data >> /etc/ecs/ecs.config
echo ECS_CONTAINER_STOP_TIMEOUT=5m >> /etc/ecs/ecs.config
echo ECS_CLUSTER=<your-cluster-goes-here> >> /etc/ecs/ecs.config
docker pull amazon/amazon-ecs-agent
docker run --name ecs-agent --detach=true --restart=on-failure:10 --volume=/var/run/docker.sock:/var/run/docker.sock --volume=/var/log/ecs/:/log --volume=/var/lib/ecs/data:/data --volume=/sys/fs/cgroup:/sys/fs/cgroup:ro --volume=/var/run/docker/execdriver/native:/var/lib/docker/execdriver/native:ro --publish=127.0.0.1:51678:51678 --env-file=/etc/ecs/ecs.config amazon/amazon-ecs-agent:latest
You may or may not need all the volumes specified above.
This setup lets the AWS ecs-agent handle your container orchestration for you.
Below is what I could suggest at this moment -
Create a S3 bucket i.e mybucket.
Put a text file(doc_pass.txt) with your password into that S3 bucket
Create a IAM policy which has GET access to just that particular S3 bucket & add this policy to the EC2 instance role.
Put below script in you user data -
aws s3 cp s3://mybucket/doc_pass.txt doc_pass.txt
cat doc_pass.txt | docker login --username=YOUR_USERNAME --password-stdin
This way you just need to make your S3 bucket secure, no secrets gets displayed in the user data.
I have an EC2 ASG on AWS and I'm interested in storing the shell script that's used to instantiate any given instance in an S3 bucket and have it downloaded and run upon instantiation, but it all feels a little rickety even though I'm using an IAM Instance Role, transferring via HTTPS, and encrypting the script itself while at rest in the S3 bucket using KMS using S3 Server Side Encryption (because the KMS method was throwing an 'Unknown' error).
The Setup
Created an IAM Instance Role that gets assigned to any instance in my ASG upon instantiation, resulting in my AWS creds being baked into the instance as ENV vars
Uploaded and encrypted my Instance-Init.sh script to S3 resulting in a private endpoint like so : https://s3.amazonaws.com/super-secret-bucket/Instance-Init.sh
In The User-Data Field
I input the following into the User Data field when creating the Launch Configuration I want my ASG to use:
#!/bin/bash
apt-get update
apt-get -y install python-pip
apt-get -y install awscli
cd /home/ubuntu
aws s3 cp s3://super-secret-bucket/Instance-Init.sh . --region us-east-1
chmod +x Instance-Init.sh
. Instance-Init.sh
shred -u -z -n 27 Instance-Init.sh
The above does the following:
Updates package lists
Installs Python (required to run aws-cli)
Installs aws-cli
Changes to the /home/ubuntu user directory
Uses the aws-cli to download the Instance-Init.sh file from S3. Due to the IAM Role assigned to my instance, my AWS creds are automagically discovered by aws-cli. The IAM Role also grants my instance the permissions necessary to decrypt the file.
Makes it executable
Runs the script
Deletes the script after it's completed.
The Instance-Init.sh Script
The script itself will do stuff like setting env vars and docker run the containers that I need deployed on my instance. Kinda like so:
#!/bin/bash
export MONGO_USER='MyMongoUserName'
export MONGO_PASS='Top-Secret-Dont-Tell-Anyone'
docker login -u <username> -p <password> -e <email>
docker run - e MONGO_USER=${MONGO_USER} -e MONGO_PASS=${MONGO_PASS} --name MyContainerName quay.io/myQuayNameSpace/MyAppName:latest
Very Handy
This creates a very handy way to update User-Data scripts without the need to create a new Launch Config every time you need to make a minor change. And it does a great job of getting env vars out of your codebase and into a narrow, controllable space (the Instance-Init.sh script itself).
But it all feels a little insecure. The idea of putting my master DB creds into a file on S3 is unsettling to say the least.
The Questions
Is this a common practice or am I dreaming up a bad idea here?
Does the fact that the file is downloaded and stored (albeit briefly) on the fresh instance constitute a vulnerability at all?
Is there a better method for deleting the file in a more secure way?
Does it even matter whether the file is deleted after it's run? Considering the secrets are being transferred to env vars it almost seems redundant to delete the Instance-Init.sh file.
Is there something that I'm missing in my nascent days of ops?
Thanks for any help in advance.
What you are describing is almost exactly what we are using to instantiate Docker containers from our registry (we now use v2 self-hosted/private, s3-backed docker-registry instead of Quay) into production. FWIW, I had the same "this feels rickety" feeling that you describe when first treading this path, but after almost a year now of doing it -- and compared to the alternative of storing this sensitive configuration data in a repo or baked into the image -- I'm confident it's one of the better ways of handling this data. Now, that being said, we are currently looking at using Hashicorp's new Vault software for deploying configuration secrets to replace this "shared" encrypted secret shell script container (say that five times fast). We are thinking that Vault will be the equivalent of outsourcing crypto to the open source community (where it belongs), but for configuration storage.
In fewer words, we haven't run across many problems with a very similar situation we've been using for about a year, but we are now looking at using an external open source project (Hashicorp's Vault) to replace our homegrown method. Good luck!
An alternative to Vault is to use credstash, which leverages AWS KMS and DynamoDB to achieve a similar goal.
I actually use credstash to dynamically import sensitive configuration data at container startup via a simple entrypoint script - this way the sensitive data is not exposed via docker inspect or in docker logs etc.
Here's a sample entrypoint script (for a Python application) - the beauty here is you can still pass in credentials via environment variables for non-AWS/dev environments.
#!/bin/bash
set -e
# Activate virtual environment
. /app/venv/bin/activate
# Pull sensitive credentials from AWS credstash if CREDENTIAL_STORE is set with a little help from jq
# AWS_DEFAULT_REGION must also be set
# Note values are Base64 encoded in this example
if [[ -n $CREDENTIAL_STORE ]]; then
items=$(credstash -t $CREDENTIAL_STORE getall -f json | jq 'to_entries | .[]' -r)
keys=$(echo $items | jq .key -r)
for key in $keys
do
export $key=$(echo $items | jq 'select(.key=="'$key'") | .value' -r | base64 --decode)
done
fi
exec $#
I've uncommented the backup_bucket: line in rubber.yml, and now my db gets backed up both locally and to my S3 bucket. I would like to have my db only backing up to S3. Is there a way to disable local backup, while still keeping S3 backup?
The only way I was able to do this, was add the following to the db backup crontab job (config/rubber/role/db/crontab):
&& rm -rf /mnt/db_backups/*
I ran a ruby script from Heroku bash that generates a CSV file on the server that I want to download. I tried moving it to the public folder to download, but that didn't work. I figured out that after every session in the Heroku bash console, the files delete. Is there a command to download directly from the Heroku bash console?
If you manage to create the file from heroku run bash, you could use transfer.sh.
You can even encrypt the file before you transfer it.
cat <file_name> | gpg -ac -o- | curl -X PUT -T "-" https://transfer.sh/<file_name>.gpg
And then download and decrypt it on the target machine
curl https://transfer.sh/<hash>/<file_name>.gpg | gpg -o- > <file_name>
There is heroku ps:copy:
#$ heroku help ps:copy
Copy a file from a dyno to the local filesystem
USAGE
$ heroku ps:copy FILE
OPTIONS
-a, --app=app (required) app to run command against
-d, --dyno=dyno specify the dyno to connect to
-o, --output=output the name of the output file
-r, --remote=remote git remote of app to use
DESCRIPTION
Example:
$ heroku ps:copy FILENAME --app murmuring-headland-14719
Example run:
#$ heroku ps:copy app.json --app=app-example-prod --output=app.json.from-heroku
Copying app.json to app.json.from-heroku
Establishing credentials... done
Connecting to web.1 on ⬢ app-example-prod...
Downloading... ████████████████████████▏ 100% 00:00
Caveat
This seems not to run with dynos that are run via heroku run.
Example
#$ heroku ps:copy tmp/some.log --app app-example-prod --dyno run.6039 --output=tmp/some.heroku.log
Copying tmp/some.log to tmp/some.heroku.log
Establishing credentials... error
▸ Could not connect to dyno!
▸ Check if the dyno is running with `heroku ps'
It is! Prove:
#$ heroku ps --app app-example-prod
=== run: one-off processes (1)
run.6039 (Standard-1X): up 2019/08/29 12:09:13 +0200 (~ 16m ago): bash
=== web (Standard-2X): elixir --sname dyno -S mix phx.server --no-compile (2)
web.1: up 2019/08/29 10:41:35 +0200 (~ 1h ago)
web.2: up 2019/08/29 10:41:39 +0200 (~ 1h ago)
I could connect to web.1 though:
#$ heroku ps:copy tmp/some.log --app app-example-prod --dyno web.1 --output=tmp/some.heroku.log
Copying tmp/some.log to tmp/some.heroku.log
Establishing credentials... done
Connecting to web.1 on ⬢ app-example-prod...
▸ ERROR: Could not transfer the file!
▸ Make sure the filename is correct.
So I fallen back to using SCP scp -P PORT tmp/some.log user#host:/path/some.heroku.log from the run.6039 dyno command line.
Now that https://transfer.sh is defunct, https://file.io is an alternative. To upload myfile.csv:
$ curl -F "file=#myfile.csv" https://file.io
The response will include a link you can access the file at:
{"success":true,"key":"2ojE41","link":"https://file.io/2ojE41","expiry":"14 days"}
I can't vouch for the security of file.io, so using encryption as described in other answers could be a good idea.
Heroku dyno filesystems are ephemeral, non-persistant and not shared between dynos. So when you do heroku run bash, you actually get a new dyno with a fresh deployment of you app without any of the changes made to ephemeral filesystems in other dynos.
If you want to do something like this, you should probably either do it all in a heroku run bash session or all in a request to a web app running on Heroku that responds with the CSV file you want.
I did as the following:
First I entered heroku bash with this command:
heroku run 'sh'
Then made a directory and moved the file to that
Made a git repository and commited the file
Finally I pushed this repository to github
Before commiting, git will ask you for your name and email. Give it something fake!
If you have files bigger than 100 Mg, push to gitlab.
If there is an easier way please let me know!
Sorry for my bad english.
Another way of doing this (that doesn't involve any third server) is to use Patrick's method but first compress the file into a format that only uses visible ASCII charaters. That should make it work for any file, regardless of any whitespace characters or unusual encodings. I'd recommend base64 to do this.
Here's how I've done it:
Log onto your heroku instance using heroku run bash
Use base64 to print the contents of your file: base64 <your-file>
Select the base64 text in your terminal and copy it
On your local machine decompress this text using base64 straight into a new file (on a mac I'd do pbpaste | base64 --decode -o <your-file>)
I agree that most probably your need means a change in your application architecture, something like a worker dyno.
But by executing the following steps you can transfer the file, since heroku one-off dyno can run scp:
create vm in a cloud provider, e.g. digital ocean;
run heroku one-off dyno and create your file;
scp file from heroku one-off dyno to that vm server;
scp file from vm server to your local machine;
delete cloud vm and stop heroku one-off dyno.
I see that these answers are much older, so I'm assuming this is a new feature. For all those like me who are looking for an easier solution than the excellent answers already here, Heroku now has the capability to copy files quite easily with the following command: heroku ps:copy <filename>
Note that this works with relative paths, as you'd expect. (Tested on a heroku-18 stack, downloading files at "path/to/file.ext"
For reference: Heroku docs
Heroku dyno's come with sftp pre-installed. I tried git but was too many steps (had to generate a new ssh cert and add it to github every time), so now I am using sftp and it works great.
You'll need to have another host (like dreamhost, hostgator, godaddy, etc) - but if you do, you can:
sftp username#ftp.yourhostname.com
Accept the server fingerprint/hash, then enter your password.
Once on the server, navigate to the folder you want to upload to (using cd and ls commands).
Then use the command put filename.csv and it will upload it to your web host.
To retrieve your file: Use an ftp client like filezilla or hit the url if you uploaded to a folder in the www or website folder path.
This is great because it also works with multiple files and binaries as well as text files.
For small/quick transfers that fit comfortably in the clipboard:
Open a terminal on your local device
Run heroku run bash
(Inside your remote connection, on the dyno) Run cat filename
Select the lines in your local terminal and copy them to your clipboard.
Check to ensure proper newlines when pasting them.
Now i created shell script to upload some files from to git backup repo (for example, my app.db sqlite file is gitignored and every deploy kills it)
## upload dyno files to git via SSH session
## https://devcenter.heroku.com/changelog-items/1112
# heroku ps:exec
git config --global user.email 'dmitry.cheva#gmail.com'
git config --global user.name 'Dmitry Cheva'
rm -rf ./.gitignore
git init
## add each file separately (-f to add git ignored files)
git add app.db -f
git commit -m "backup on `date +'%Y-%m-%d %H:%M:%S'`"
git remote add origin https://bitbucket.org/cheva/appbackup.git
git push -u origin master -f
The git will reboot after the deploy and does not store the environment, you need to perform the first 3 commands.
Then you need to add files (-f for ignored ones) and push into repo (-f, because the git will require pull)