Downloading and Transferring Files Simultaneously with Limited EC2 Storage - bash

I am using an EC2 instance in AWS to run a bash script that downloads files from a server using a CLI while simultaneously moving them into S3 using the AWS CLI (aws s3 mv). However, I usually run out of storage before I can do this because the download speeds are faster than the transfer speeds to S3. The files, which are downloaded daily, are usually hundreds of GB and I do not want to upgrade storage capacity if at all possible.
The CLI I am using for the downloads runs continuously until it succeeds or fails, and it prints status messages to the console as it goes (when I run it from the command line instead of the .sh). I am looking for a way to run this script within these constraints. My most recent attempt was something along the lines of:
until (CLI_is_downloading) | grep -m 1 "download complete"; do aws s3 mv --recursive path/local_memory s3://path/s3; done
But that ran out of storage and the download failed well before the move was finished.
One possible solution I thought of is to run the download CLI until available storage drops below a certain threshold, switch to the transfer, and then alternate back and forth. Also, I am not too experienced with AWS, so I am not sure this would work, but could I limit the download speed to match the transfer speed (like network throttling)? Any advice on the practicality of these ideas, or other suggestions on how to implement this, would be greatly appreciated.
EDIT: I checked my console output again and it seems that aws s3 mv --recursive only moved the files that were already there when the command was first invoked, and then stopped. I believe that if I called it repeatedly until I got the "files downloaded" message from my other CLI command, it might work. I am not sure exactly how to do this yet, so suggestions would still be appreciated; otherwise, this seems like a job for tomorrow.
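Here is a minimal sketch of that repeated-move idea, assuming the download CLI can be started in the background from the script; the command name download_cli, the log file, and the 30-second interval are placeholders, and the local/S3 paths are the ones from the question:

#!/usr/bin/env bash
# Hypothetical downloader command; replace with the real CLI and its arguments.
download_cli > download.log 2>&1 &
download_pid=$!

# While the download is still running, keep sweeping finished files into S3.
# Caveat: this may pick up files that are still being written unless the
# downloader writes to a temporary name/directory and renames on completion.
while kill -0 "$download_pid" 2>/dev/null; do
    aws s3 mv --recursive path/local_memory s3://path/s3
    sleep 30
done

# One final pass for anything written after the last sweep.
aws s3 mv --recursive path/local_memory s3://path/s3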

Related

How to correctly dockerize and continuously integrate 20GB raw data?

I have an application that uses about 20GB of raw data. The raw data consists of binaries.
The files rarely - if ever - change. Changes only happen if there are errors within the files that need to be resolved.
The simplest way to handle this would be to put the files in their own git repository and create a base image from that, then build the application on top of that raw-data image.
Having a 20GB base image for a CI pipeline is not something I have tried and does not seem to be the optimal way to handle this situation.
The main reason for my approach here is to avoid extra deployment complexity.
Is there a best practice, "correct" or more sensible way to do this?
Huge, mostly-static data blocks like this are, to me, probably the one big exception to the “Docker images should be self-contained” rule. I’d suggest keeping this data somewhere else and downloading it separately from the core docker run workflow.
I have had trouble in the past with multi-gigabyte images. Operations like docker push and docker pull in particular are prone to hanging up on the second gigabyte of individual layers. If, as you say, this static content changes rarely, there’s also a question of where to put it in the linear sequence of layers. It’s tempting to write something like
FROM ubuntu:18.04
ADD really-big-content.tar.gz /data
...
But even the ubuntu:18.04 image changes regularly (it gets security updates fairly frequently; your CI pipeline should explicitly docker pull it), and when it does, a new build will have to transfer this entire unchanged 20 GB block again.
Instead I would put the data somewhere like an AWS S3 bucket or similar object storage. (This is a poor match for source control systems, which (a) want to keep old content forever and (b) tend to be optimized for text rather than binary files.) Then I’d have a script that runs on the host, downloads that content, and mounts the corresponding host directory into the containers that need it.
curl -LO http://downloads.example.com/really-big-content.tar.gz
tar xzf really-big-content.tar.gz
docker run -v "$PWD/really-big-content:/data" ...
(In Kubernetes or another distributed world, I’d probably need to write a dedicated Job to download the content into a Persistent Volume and run that as part of my cluster bring-up. You could do the same thing in plain Docker to download the content into a named volume.)
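As a rough sketch of that plain-Docker variant, reusing the hypothetical download URL from above and a named volume called really-big-content:

# Create the named volume once, then fill it with a throwaway container.
docker volume create really-big-content
docker run --rm -v really-big-content:/data alpine sh -c \
    'wget -O - http://downloads.example.com/really-big-content.tar.gz | tar -xz -C /data'

# Later, mount the same volume into the containers that need the data.
docker run -v really-big-content:/data ...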

How to enable networking before User Data scripts are run in AWS Windows instances

I have been struggling with bootstrapping my Windows instances in AWS. I need to download some things from S3 and other places when the instance starts up and execute them.
This seems to be really straightforward for Linux instances, but not so much for Windows instances.
I have a user data script that works when I run it after logging into the instance, but it doesn't work when it runs as part of the EC2Launch/EC2Config described here: https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ec2-windows-user-data.html
I've found that the reason it doesn't work is that I am unable to download things from the internet due to: "The operation being requested was not performed because the user has not logged on to the network. The specified service does not exist."
Really what I'm trying to do is the following:
Download AWS_CLI Installer
Right now I'm using bitsadmin to try to download the installer from https://s3.amazonaws.com/aws-cli/AWSCLI64.msi
Install AWS_CLI
I am able to install the CLI using msiexec with the /qn flag once the installer is on the box
Pull items from S3
This should be able to be done with aws s3 cp or sync
Install Python (installer stored in S3)
I am able to run this installer with msiexec once it's on the box, similar to the CLI installer
Execute (python) scripts pulled from S3
Does anyone know of a better way to do this? This all works after I RDP in, but not as part of the Launch Script.
TL;DR - I need to download things using the EC2Launch User Data script provided to my EC2 Windows instances but the network doesn't seem to be available when it runs.
Thanks in advance for your help!
I ran into a similar issue: I was using BITS to download an installer in a "user data" script and was getting the same error: "The operation being requested was not performed because the user has not logged on to the network. The specified service does not exist."
This is not due to network issues but to how BITS works: the user starting the download job must be logged on to the machine, which is not the case for "user data" execution; the script runs as the local Administrator without that account being logged on in a way that BITS recognizes. From the BITS documentation:
For BITS to detect that a user is logged on, the user must use one of the following interactive logon options:
Log on through the Welcome screen.
Log on to a terminal client.
Use fast user switching.
Starting with a later version of Windows 10, log on from another device using Remote PowerShell.
During the script's execution the network is up and ready to use, so you can download your files using other methods that do not rely on BITS (e.g. Invoke-WebRequest, which is rather slow for bigger files, or System.Net.WebClient, both in PowerShell).

How to email time-out notice from Google Cloud Storage?

I have a gsutil script that periodically backs up data to Google Cloud Storage.
The gsutil backup script runs on my local box.
I would like to run a script (or service) on Google Cloud Storage that emails a warning to the administrator when no backup has been made in 24 hours.
I am new to cloud services. Please point me in the right direction.
Where would such a script be located? Is there a similar example script?
Thank you.
There's no built-in feature that accomplishes this. However, you could do it with a separate monitoring program.
For example, I might edit my backup script so that, after successfully completing a backup, it writes the current time to a "last_successful_backup.txt" file. Then I'd add a cron job wherever I keep my monitoring and alerting systems that checks the "last_successful_backup.txt" file every few hours and raises an alarm if the time it contains is older than 24 hours.
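A minimal sketch of that check, assuming the backup script writes a Unix timestamp (date +%s) to last_successful_backup.txt and that mail is configured on the monitoring host; the file path and address are placeholders:

#!/usr/bin/env bash
# Alert if the last successful backup is older than 24 hours.
STAMP_FILE=/var/backups/last_successful_backup.txt
MAX_AGE=$((24 * 60 * 60))

last=$(cat "$STAMP_FILE" 2>/dev/null || echo 0)
now=$(date +%s)

if (( now - last > MAX_AGE )); then
    echo "No successful backup in the last 24 hours" | mail -s "Backup alert" admin@example.com
fi

Run it from cron every few hours, e.g. 0 */4 * * * /usr/local/bin/check_backup.sh.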
What about spinning up a Google Compute Engine VM and sending the emails from that instance, using, say, SendGrid, Mailgun, or Mailjet?

Update extension for multiple files at once on Amazon S3

I have about 1 million files in my S3 bucket, and unfortunately these files were uploaded with the wrong extension. I need to add a '.gz' extension to every file in that bucket.
I can manage to do that using the AWS CLI:
aws s3 mv s3://bucket_name/name_1 s3://bucket_name/name_1.gz
This works fine, but the script runs very slowly since it moves the files one by one; by my calculation it will take up to a week, which is not acceptable.
Is there a better, faster way to achieve this?
You can try S3 Browser, which supports multi-threaded operations.
http://s3browser.com/
I suspect other tools can do multi-threaded transfers as well, but the CLI doesn't.
There's no rename operation for S3 objects, so you do need to move (copy and delete) each file. If the files are big, that can indeed be a bit slow.
However, nothing forces you to wait for one request to complete before starting the "rename" of the next file in your list; you can issue many of these requests in parallel.
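As a rough sketch of that parallel approach with the plain CLI (the bucket name and the degree of parallelism are placeholders, and keys are assumed not to contain spaces):

# List every key, skip ones already ending in .gz, and run up to 16 moves at a time.
aws s3 ls s3://bucket_name --recursive | awk '{print $4}' | grep -v '\.gz$' | \
    xargs -P 16 -I {} aws s3 mv "s3://bucket_name/{}" "s3://bucket_name/{}.gz"

This makes the job bounded by how many requests you run concurrently rather than by one-at-a-time round trips; it is worth testing on a small prefix first.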

Trouble Uploading Large Files to RStudio using Louis Aslett's AMI on EC2

After following this simple tutorial http://www.louisaslett.com/RStudio_AMI/ and video guide http://www.louisaslett.com/RStudio_AMI/video_guide.html I have setup an RStudio environment on EC2.
The only problem is, I can't upload large files (> 1GB).
I can upload small files just fine.
When I try to upload a file via RStudio, it gives me the following error:
Unexpected empty response from server
Does anyone know how I can upload these large files for use in RStudio? This is the whole reason I am using EC2 in the first place (to work with big data).
OK, so I had the same problem myself and it was incredibly frustrating, but eventually I realised what was going on. The default home directory size for AWS is less than 8-10 GB regardless of the size of your instance. Because RStudio was trying to upload into the home directory, there was not enough room. An experienced Linux user would not have fallen into this trap, but hopefully any other Windows users new to this who come across this problem will see this. The problem can be solved by uploading onto a different drive (volume) on the instance. Because the Louis Aslett RStudio AMI lives in this 8-10 GB space, you will have to set your working directory outside the home directory, which is not intuitively apparent from the RStudio Server interface. While this is an advanced forum and this is a rookie error, I am hoping no one deletes this question, because I spent months on this and I think someone else will too. I hope this makes sense to you.
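A minimal sketch of checking where the space actually is and pointing your work at a larger volume; the /mnt/data mount point and the rstudio user name are assumptions for illustration, not specifics of the AMI:

# See which filesystems have room; the home directory will show as nearly full.
df -h

# Hypothetical example: create a project directory on a larger attached volume.
sudo mkdir -p /mnt/data/project
sudo chown rstudio:rstudio /mnt/data/project

Then, inside RStudio, set the working directory to that path (e.g. setwd("/mnt/data/project")) and upload or write your large files there.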
Don't you have shell access to your Amazon server? Don't rely on RStudio's upload (which may reasonably have a 2 GB limit); use proper Unix tools instead:
rsync -avz myHugeFile.dat amazonusername@my.amazon.host.ip:
Run that from your local PC's command line (install Cygwin or another Unix compatibility layer if you're on Windows). It will transfer your huge file to your Amazon server, resume from where it left off if interrupted, and compress the data in transit too.
For a Windows GUI for something like this, WinSCP is what we used in the bad old days before Linux.
This could also have something to do with your web server. Are you using nginx or Apache as your web server? If you are running nginx in front of the application, I would recommend the following fix in your nginx.conf file:
http {
...
client_max_body_size 100M;
}
https://www.tecmint.com/limit-file-upload-size-in-nginx/
I had a similar problem with a 5 GB file. What worked for me was to use SQLite to create a database from the csv file that I needed, then use a function in RStudio to communicate with that local database. That way, I was able to bring in the csv file. I can track down the R code that I used if you like.
