Improving Data Transfer Rate on Amazon EC2

I've got a 1 GB EBS volume mounted to an EC2 instance.
I am copying 600 MB of binary data from a local hard drive (over a Remote Desktop / RDS connection),
and the copy progress window is showing 10 hours remaining,
even though I have a high-speed connection (100+ Mbps).
Whatever the data volume, the transfer rate is about 1 minute per MB (i.e. ~16 KB/s).
I am hesitating between reading Moby Dick in front of my workstation and just taking a day off.
Are there any reasonable options to speed up this transfer rate? (Ideally 512 KB/s at the minimum.)
I am very open to ANY solution that shortens the upload/download time to/from an EC2 instance.
Thanks in advance.
EDIT :
I just stumbled upon the [AWS Import/Export service][1]:
"AWS Import/Export accelerates transferring large amounts of data between the AWS cloud and portable storage devices that you mail to us." By "mail to us", they literally mean physically shipping your storage device to Amazon. Don't say this is the Stone Age, this is BRAND new TECHNOLOGY, Dude! :-)
EDIT2 :
This sounded great: [Aspera for AWS][2]. But unfortunately it is way too expensive; tailored for Fortune 500 companies with big needs and big cash.

Sometimes, if you want something done, it's better to do it yourself :-)
I did not find anything satisfying on the net, so I spent the evening doing a rather complete benchmark of my own.
I tested a few alternatives/scenarios, and here are the results:
The FTP server installed on EC2 was FileZilla Server (getting the configuration right is touchy).
The FTP client used for this benchmark was WinSCP (FileZilla Client didn't work; see this other post).
Legend: [HC] stands for Home Connection (100 Mbps)
Upload Bandwidth
RDS upload: 15 KB/s => worst ever
FTP upload [FTP server installed on EC2]: 100 KB/s
Upload to S3 from the AWS Management Console, from HC: 60 KB/s
Upload to S3 using the AWS Console interface, from EC2: 145 KB/s
Upload to S3 using S3 Browser, from HC: 120 KB/s
Upload to S3 using S3 Browser, from EC2: 2,000 KB/s
Download Bandwidth
RDS download: 15 KB/s => worst ever
FTP download [FTP server installed on EC2]: 360 KB/s
Download from S3 using the AWS Console interface, from EC2: 350 KB/s
Download from S3 using S3 Browser, from HC: 380 to 620 KB/s
Download from S3 using S3 Browser, from EC2: 3,000 KB/s
Conclusions:
So, as of now, Amazon S3 combined with S3 Browser gives the best results.
(S3 Browser is just a layer over S3; I don't get why its upload rate is so much better.)
However, one should keep in mind that an FTP server on an EC2 instance has the great advantage of transferring directly into a directory on the instance. Unlike S3, there is only one transfer involved: S3 requires two transfers (from the local resource to S3, then from S3 to EC2, and the other way round), while FTP access is immediate because it shortens the transfer cycle. Besides, it spares the cost of S3 buckets.
It is also worth mentioning that an EC2 instance's bandwidth is really strong, so it is, needless to say, far more interesting to use protocols that really take advantage of it, such as S3 or FTP, rather than RDS.
I hope this will be useful to other people facing the same issue and will spare them precious time.
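For anyone who would rather script the same "go through S3" route than click through S3 Browser, here is a minimal sketch using the AWS CLI (not something the benchmark above tested; the bucket name and paths are placeholders):

```bash
#!/bin/bash
# Sketch: reproduce the "local -> S3 -> EC2" route with the AWS CLI.
# Bucket name and paths are placeholders; the CLI must be installed and
# credentials configured on both the workstation and the instance.

# Raise multipart concurrency so large files use several parallel connections.
aws configure set default.s3.max_concurrent_requests 20
aws configure set default.s3.multipart_chunksize 16MB

# 1) On the local workstation: push the file to S3.
aws s3 cp ./data.bin s3://my-transfer-bucket/data.bin

# 2) On the EC2 instance: pull the file from S3 onto the EBS volume.
aws s3 cp s3://my-transfer-bucket/data.bin /mnt/ebs/data.bin
```

That kind of parallelism is most likely also why S3 Browser beats the single-stream RDS and FTP copies above.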

Use Aspera or Tsunami UDP to move the 600 MB to a jump box in your Amazon EC2 infrastructure, then copy it internally from the jump box to the Windows EC2 instance. My benchmarks show Tsunami UDP is quite a bit faster than the traditional modes.

Related

Fastest way to transfer files to EC2 over Session Manager

I regularly need to move large files to and from an EC2 instance connected via Session Manager. File transfers within AWS are fast, as are transfers between local machines and non-AWS assets over our fiber connection.
However, upstream and downstream speeds with EC2 over Session Manager are really slow -- around 1 MB/s. I proxy SSH over Session Manager, which allows me to use regular utilities to move things around. Is this a Session Manager thing, a function of how I'm using it, or something else?
If this is the best I can do, I'll have to deal with it, but I'd love to use a better way if there's one available.
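For reference, proxying SSH over Session Manager as described above is typically set up with an SSH config entry like the following sketch; it assumes the Session Manager plugin is installed and uses the AWS-provided AWS-StartSSHSession document, and the instance ID below is a placeholder:

```bash
# Append the usual Session Manager proxy rule to ~/.ssh/config so ssh,
# scp and rsync to instance IDs are tunnelled through SSM.
cat >> ~/.ssh/config <<'EOF'
Host i-* mi-*
    ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
EOF

# Example transfer over the tunnel (instance ID is a placeholder):
scp ./bigfile.tar.gz ec2-user@i-0123456789abcdef0:/tmp/
```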
I discovered exactly the same issue when using rsync and other file transfer tools via SSM. Upload speeds to an EC2 instance that were ~15 MB/s when connecting directly (using its public IP, not SSM) appeared limited to between 300 and 800 KB/s when going via SSM.
I contacted AWS support for clarifications, and their response included:
"After discussing this situation with our SSM service team, they have mentioned that there will be some delay in SCP over Session Manager compared to direct SCP as there are extra hops in communication in SCP via SSM. Apart from the extra hops, there are other limits imposed in this feature which controls the rate of packet transfer and size of packet. These restrictions are placed to prevent misuse on the feature.
Therefore, there is not a way to mitigate this speed limitation you have encountered due to this."
This GitHub issue from 2019 on the aws-ssm-agent repo reports slow performance that they claimed was resolved, but it seems they do not expect users to move large file uploads/downloads over SSM.

Laravel queue on multiple servers

I am going to build a UGC project. I have a main server (application), 4 encoder servers (for converting videos), and a storage server (for hosting videos).
I want to use the database driver for the Laravel queue, and my target table is jobs. For each uploaded video I have 5 jobs that convert the video to 240p, 360p, 480p, 720p and 1080p.
But a job does not specify which encoder server it belongs to. For example, a video is uploaded to Encoder Server #4, but Encoder Server #2 tries to start the job and fails because the files are on Encoder Server #4.
How can I solve this challenge?
As #apokryfos says, upload the file to shared storage like Amazon S3, Google Cloud Storage, DigitalOcean Spaces... whatever.
Have the processing job download it from the central store, process it, and upload the result to another central store.
If you bind a single job to a single worker (encoder server, as you call it), it does not make any sense: it will never be scalable, and you are pretty much doomed to run into issues.
Doing it this way, you can just scale up the number of workers once you need to; you could even auto-scale them.
Consider using a Kubernetes deployment to allow easy (auto) scaling.
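As a rough sketch of what one worker run looks like under that design (shell pseudocode only; the bucket names, paths, and ffmpeg options are assumptions, and in Laravel the same steps would live inside the queued job's handle() method):

```bash
#!/bin/bash
# Sketch of a single encoding job: pull the source from shared storage,
# transcode one rendition locally, push the result back to shared storage.
set -euo pipefail

VIDEO_ID="$1"   # identifier passed to the queued job (placeholder)
HEIGHT="$2"     # 240, 360, 480, 720 or 1080

# Any worker can pick up the job because the source lives in S3,
# not on the encoder server that received the upload.
aws s3 cp "s3://uploads-bucket/${VIDEO_ID}.mp4" "/tmp/${VIDEO_ID}-src.mp4"

# scale=-2:HEIGHT keeps the aspect ratio with an even width; audio is copied.
ffmpeg -i "/tmp/${VIDEO_ID}-src.mp4" -vf "scale=-2:${HEIGHT}" -c:a copy \
    "/tmp/${VIDEO_ID}-${HEIGHT}p.mp4"

aws s3 cp "/tmp/${VIDEO_ID}-${HEIGHT}p.mp4" \
    "s3://renditions-bucket/${VIDEO_ID}/${HEIGHT}p.mp4"
```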

Transfer large number of large files to s3

I am transferring around 31 TB of data, consisting of 4,500 files ranging from 69 MB to 25 GB, from a remote server to an S3 bucket. I am using s4cmd put and run it from a bash script, upload.sh:
#!/bin/bash
FILES="/path/to/*.fastq.gz"
for i in $FILES
do
    echo "$i"
    s4cmd put --sync-check -c 10 "$i" s3://bucket-name/directory/
done
Then I use qsub to submit the job:
qsub -cwd -e error.txt -o output.txt -l h_vmem=10G -l mem_free=8G -l m_mem_free=8G -pe smp 10 upload.sh
This is taking way too long - it took 10 hours to upload ~20 files. Can someone suggest alternatives or modifications to my command?
Thanks!
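Before resorting to shipping disks (see the answer below), one modification worth trying is file-level parallelism on top of s4cmd's per-file concurrency. This is only a sketch, not benchmarked on this workload; adjust -P to whatever your network and the remote server tolerate:

```bash
#!/bin/bash
# Sketch: upload 4 files at a time instead of one after another.
# s4cmd's -c 10 already splits each file into concurrent parts;
# xargs -P adds parallelism across files.
printf '%s\n' /path/to/*.fastq.gz |
    xargs -P 4 -I {} s4cmd put --sync-check -c 10 {} s3://bucket-name/directory/
```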
Your case may be one of those situations where copying the data onto physical media and shipping it by regular mail is faster and cheaper than transferring it over the internet. AWS supports such a "protocol" and has a special name for it: AWS Snowball.
Snowball is a petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data into and out of the AWS cloud. Using Snowball addresses common challenges with large-scale data transfers including high network costs, long transfer times, and security concerns. Transferring data with Snowball is simple, fast, secure, and can be as little as one-fifth the cost of high-speed Internet.
With Snowball, you don’t need to write any code or purchase any hardware to transfer your data. Simply create a job in the AWS Management Console and a Snowball appliance will be automatically shipped to you*. Once it arrives, attach the appliance to your local network, download and run the Snowball client to establish a connection, and then use the client to select the file directories that you want to transfer to the appliance. The client will then encrypt and transfer the files to the appliance at high speed. Once the transfer is complete and the appliance is ready to be returned, the E Ink shipping label will automatically update and you can track the job status via Amazon Simple Notification Service (SNS), text messages, or directly in the Console.
* Snowball is currently available in select regions. Your location will be verified once a job has been created in the AWS Management Console.
The capacity of their smaller device is 50TB, a good fit for your case.
There is also a similar service, AWS Import/Export Disk, where you ship your own hardware (hard drives) instead of their special device:
To use AWS Import/Export Disk:
1. Prepare a portable storage device (see the Product Details page for supported devices).
2. Submit a Create Job request. You’ll get a job ID with a digital signature used to authenticate your device.
3. Print out your pre-paid shipping label.
4. Securely identify and authenticate your device. For Amazon S3, place the signature file on the root directory of your device. For Amazon EBS or Amazon Glacier, tape the signature barcode to the exterior of the device.
5. Attach your pre-paid shipping label to the shipping container and ship your device along with its interface connectors and power supply to AWS.
When your package arrives, it will be processed and securely transferred to an AWS data center, where your device will be attached to an AWS Import/Export station. After the data load completes, the device will be returned to you.

Azure - How do I increase performance on the same blob for 3,000 - 18,000 users at the same time?

Azure - How do I increase performance on the same single blob download for 3,000 - 18,000 clients all downloading within a 5-minute range? (We can't use a CDN because we need the files to be private with SAS.)
Requirements:
We can't use a CDN because the file or "blob" needs to be private. We'll generate SAS keys for all the simultaneous download requests.
The files/blobs are the encrypted exams, uploaded 24 or 48 hours before an exam start time.
3,000 - 18,000 downloads at the same start time, in a 5-10 minute window before the exam start time.
172 - 1,000 blobs, sized 53 KB - 10 MB.
We have a web service that verifies the student's info, PIN, and exam date/time are correct. If correct, it generates the URI & SAS.
The Azure site says only 480 Mbit/s for a single blob.
But another part of the Azure site mentions as high as 20,000 transactions/sec at 20 Mbit/sec.
Ideas?
Would a snapshot of the blob help?
I thought a snapshot is only helpful if the source blob is being updated during a download.
Would Premium Storage help?
I read Premium just means it's stored on an SSD (for more $), but we need more bandwidth and many clients hitting the same blob.
Would creating, say, 50 copies of the same exam help?
Then rotate each client browser through each copy of the file.
Listed on the Azure forums:
https://social.msdn.microsoft.com/Forums/azure/en-US/7e5e4739-b7e8-43a9-b6b7-daaea8a0ae40/how-do-i-increase-performance-on-the-same-single-blob-download-for-3000-18000-clients-all?forum=windowsazuredata
I would cache the blobs in memory using a Redis cache instead of using the blobs as the source. In Azure you can launch a Redis cache of the appropriate size for your volume. Then you are not limited by the blob service.
When a file is first requested:
1. Check the Redis cache for the file.
   a. Found: serve the file from the cache.
   b. Not found: get the file from the blob, put it in the cache, and serve the file.
The next request will use the file from the cache, freeing up Azure Blob Storage.
This is better than duplicating the file in blob storage, since you can set an expiry time in the Redis cache and it will clean itself up.
https://azure.microsoft.com/en-us/documentation/articles/cache-configure/
Duplication: rather than rotating, though, give the client a list and have them pick randomly. That will also let them fall back to another copy if the first request fails.
You can use SAS keys with the CDN, assuming that you will be using the same SAS key for all users and that you aren't generating a unique SAS for each user. If you are expecting the users to come within a 5-10 minute window then you could generate a single 15 minute SAS and use that with the CDN. Just make sure you also set the cache TTL on the blob to the same duration that the SAS specifies because the CDN won't actually validate the SAS permissions (blob storage will validate it any time the CDN has to fetch the object from origin). See Using Azure CDN with Shared Access Signatures for more information.
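A sketch of that setup using the Azure CLI (the tooling and parameter names are assumptions, not taken from the answer above; account, container, and blob names are placeholders, and it assumes you are already authenticated, e.g. via az login or an account key):

```bash
#!/bin/bash
# Sketch: one short-lived, read-only SAS shared by every student, plus a
# Cache-Control header that keeps the CDN/browser cache aligned with it.
EXPIRY=$(date -u -d '+15 minutes' '+%Y-%m-%dT%H:%MZ')

az storage blob generate-sas \
    --account-name examstorage \
    --container-name exams \
    --name exam-package.bin \
    --permissions r \
    --expiry "$EXPIRY" \
    --output tsv

# Cache lifetime matches the 15-minute SAS window (900 seconds).
az storage blob update \
    --account-name examstorage \
    --container-name exams \
    --name exam-package.bin \
    --content-cache-control "public, max-age=900"
```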
Jason's suggestion of using multiple blobs is also a good solution.
You could also spin up several Webrole instances and host the file locally on those instances, then instead of sending users a SAS URL (which could be used by non-authorized users) you could actually authenticate the user and serve the file directly from the Webrole. If all of your traffic will be within a ~10 minute window then you could spin up hundreds of instances and still keep the cost very low.

Why is a download manager required to utilize the full download speed available via ISP from a computer in California accessing an EC2 instance in Virginia?

So far I get an average of 700 kilobytes per second for downloads via Chrome hitting an EC2 instance in Virginia (us-east region). If I download directly from S3 in Virginia (us-east region), I get 2 megabytes per second.
I've simplified this way down to simply running Apache and reading a file from a mounted EBS volume. Less than one percent of the time have I seen the download hit around 1,800 kilobytes per second.
I also tried nginx -- no difference. I also tried running a large instance with 7 GB of RAM. I tried allocating 6 GB of RAM to the JVM and running Tomcat, streaming the files in memory from S3 to avoid the disk. I tried enabling sendfile in Apache. None of this helps.
When I run Apache reading from the file system and use a download manager such as DownThemAll, I always get 2 megabytes per second when downloading from an EC2 instance in Virginia (us-east region). It's as if my Apache were configured to only allow 700 kilobytes per second per connection. I don't see any configuration options relating to this, though.
What am I missing here? I also benchmarked Dropbox downloads, as they use EC2 as well, and I noticed I get roughly 700 kilobytes per second there too, which is just as slow. I imagine they must host their EC2 instances in the Virginia / us-east region as well, based on the speed. If I use a download manager to download files from Dropbox, I get 2 megabytes a second as well.
Is this just the case with TCP, where if you are far away from the server you have to split transfers into chunks and download them in parallel to saturate your network connection?
I think your last sentence is right: your 700 KB/s is probably a limitation of a single TCP connection... maybe a throttle imposed by EC2, or perhaps your ISP, or the browser, or a router along the way -- dunno. Download managers likely split the request over multiple connections (I think this is called "multi-source"), gluing things together in the right order after they arrive. Whether this is the case depends on the software you're using, of course.
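To see the effect without a download manager, you can emulate the multi-source behaviour with parallel HTTP range requests; a rough sketch (the URL is a placeholder, the byte ranges assume a 100 MB file, and the server must honour Range headers):

```bash
#!/bin/bash
# Sketch: fetch a ~100 MB file as four parallel range requests, then
# stitch the parts back together in order.
URL="https://example-bucket.s3.amazonaws.com/bigfile.bin"

curl -s -r 0-26214399        -o part1 "$URL" &
curl -s -r 26214400-52428799 -o part2 "$URL" &
curl -s -r 52428800-78643199 -o part3 "$URL" &
curl -s -r 78643200-         -o part4 "$URL" &
wait

cat part1 part2 part3 part4 > bigfile.bin
rm -f part1 part2 part3 part4
```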
