Download a file from the Internet directly to my S3 bucket - hadoop

I'm working with EMR (Elastic MapReduce) on AWS infrastructure and the default way to provide input files (large datasets) for programs is to upload them to an S3 bucket and reference those buckets from within EMR.
Usually I download the datasets to my local development machine and then upload them to S3, but this is getting harder to do with larger files, as upload speeds are generally much lower than download speeds.
My question is: is there a way to download files from the internet (given their URL) directly into S3, so I don't have to download them to my local machine and then manually upload them?

No. You need an intermediary; typically an EC2 instance is used rather than your local machine, for speed.
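If you go the EC2 route, you can stream the download straight into the bucket without ever writing the whole file to the instance's disk. Here is a minimal sketch using requests and boto3, where the URL, bucket, and key are placeholder names:

import boto3
import requests

# Hypothetical source URL and destination -- replace with your own.
SOURCE_URL = "https://example.com/large-dataset.csv.gz"
BUCKET = "my-emr-input-bucket"
KEY = "datasets/large-dataset.csv.gz"

s3 = boto3.client("s3")

# Stream the HTTP response body directly into a managed S3 upload,
# so the file never has to fit on local disk.
with requests.get(SOURCE_URL, stream=True) as resp:
    resp.raise_for_status()
    s3.upload_fileobj(resp.raw, BUCKET, KEY)

Run this on the EC2 instance itself so both the download and the upload happen over AWS's bandwidth rather than yours.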

Related

AWS temporary files before uploading to S3?

My Laravel app allows users to upload images. Currently, when the user uploads their images, they are stored in a temporary location on the server. A cron job then modifies the uploaded images (compresses them, etc.), and uploads them to S3. Any temporary files older than 48 hours that failed to upload to S3 are deleted by another cron job.
I've set up an Elastic Beanstalk environment, but it's occurred to me that storing uploaded images in a temporary directory on an instance is risky because instances can be created and destroyed when necessary.
How and where, then, would I store these temporary files so that they're not at risk of being deleted by an instance?
As discussed in the comments, I think that uploading the file to S3 is the best option. As far as I know, it's not possible to stop Elastic Beanstalk from destroying an EC2 instance unless you want to give up all of the scaling and instance failure/auto-replacement features.
One option I don't know much about is AWS EBS. "Amazon Elastic Block Store (Amazon EBS) provides persistent block storage volumes for use with Amazon EC2 instances in the AWS Cloud." I don't have any direct experience with EBS; the overriding question, of course, is whether an EBS volume truly persists after the EC2 instance is destroyed. And since EBS has its own costs and you are already using S3, S3 seems like the way to go.
S3 has a feature called object lifecycle management that you can use to have files deleted automatically by setting them to expire 2 days after they're uploaded.
You can either:
A) Prefix the temporary files to put them in an S3 pseudo-folder (e.g., Temp/), apply the lifecycle expiration rule to that specific prefix (or "folder"), and use the files in there as the source of truth for the new files derived from them post-manipulation (see the sketch after this list).
or
B) Create an S3 bucket specifically for temporary files. Manipulate the files from there and copy to the production bucket.
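For option A, the expiration rule only needs to be set once on the bucket. A minimal sketch with boto3, where the bucket name and the Temp/ prefix are placeholders:

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name -- replace with your own.
BUCKET = "my-app-uploads"

# Expire anything under the Temp/ prefix two days after it is uploaded.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-temp-uploads",
                "Filter": {"Prefix": "Temp/"},
                "Status": "Enabled",
                "Expiration": {"Days": 2},
            }
        ]
    },
)

After that, S3 deletes the expired objects on its own, which replaces the 48-hour cleanup cron job.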

sync EBS volumes via S3

I am looking to have multiple Amazon EC2 instances use the same data store. Amazon does not condone mounting an S3 bucket as a file system, so I am trying to avoid that solution. Is there a way to synchronize an EBS volume with S3, or would it be best to use rsync and cron?
Do you really have to have the files locally available from within EBS? What if instead you served them to yourself via CloudFront, and restricted the permissions so that only your instances (or only your Security Group) could see the files?
Come Fall 2015 you'll be able to use Elastic File System (EFS) for this. But until then, I would suppose the next best thing is to use the AWS CLI to sync down from S3 to your volume:
aws s3 sync s3://my-bucket/my/folder /mnt/my-ebs/
After the initial run, that sync command is surprisingly fast. So from there you could just cron it to run hourly or so?
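If you want to keep the schedule alongside your own tooling, here is a minimal sketch that wraps that same command in a Python script cron can call; the bucket path and mount point are the same placeholders as above:

import subprocess

# Placeholder bucket/prefix and mount point -- replace with your own.
BUCKET_PATH = "s3://my-bucket/my/folder"
LOCAL_PATH = "/mnt/my-ebs/"

def sync_from_s3() -> None:
    # Shell out to the AWS CLI, which already handles incremental syncs well.
    subprocess.run(["aws", "s3", "sync", BUCKET_PATH, LOCAL_PATH], check=True)

if __name__ == "__main__":
    # Example crontab entry to run this hourly:
    #   0 * * * * /usr/bin/python3 /opt/scripts/sync_from_s3.py
    sync_from_s3()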

Download speed from S3 and EC2 storage, requests from Node.js and browser

My team and I are creating a mobile game that includes a map. We store JSON information in multiple files; each file represents a tile on the map. To render the map, we download the files and process them to create streets, buildings, etc.
I want to choose the best way to download the tile files to mobile devices, but I couldn't run this test on the devices themselves, so I used a browser and Node.js scripts.
I used a 100 KB JSON file, uploaded it to an S3 bucket and to EC2 local storage, and wrote a few Node.js scripts to connect to S3 or EC2:
GET request from a local Node.js script to the S3 bucket (bucket.zone.amazonaws.com/file) - ~650ms
GET request from a local Node.js script to a Node.js server running on the EC2 instance, which connects to S3 - ~1032ms
GET request from a local Node.js script to a Node.js server running on the EC2 instance, which loads the file from local storage - ~833ms
The difference between the last two values is essentially the extra time the EC2 instance needs to fetch the file from the bucket. The reason for making the request to S3 from EC2 is that I know connections between AWS services are really fast.
The other test I made was from the browser (Firefox):
Directly accessed the S3 bucket (bucket.zone.amazonaws.com/file) - ~624ms with values between 400ms and 1000ms
Through the Apache server on EC2 (domain/file) - ~875ms with values between 649ms and 1090ms
Through a Node.js server which connects to the S3 bucket (run on EC2) (domain:port) - ~1014ms with values between 680ms and 1700ms
Through a Node.js server which loads the file from local storage (run on EC2) (domain:port) - ~965ms with values between 600ms and 1700ms
My question is: why is there such a big difference between accessing the file from the browser and accessing it through the Node.js scripts?
To log the times, I made each request 10 times and took the mean.
The EC2 instance is a micro, in Ireland. The bucket is in Ireland too.
I can propose a couple of clues that may help you profile this.
Cache: when you use a script to get the JSON data, no cache mechanism is involved. In a browser, the request honors the cache headers and may be served from cache, which reduces the measured time.
Gzip: I suspect you did not enable gzip compression in your Node.js server, and I am not sure whether it is configured on Apache. For a 100 KB JSON file, compression would certainly reduce the transfer time.
Thanks
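As a hedged illustration of the gzip point for the S3-direct case (the bucket, key, and tile payload below are placeholders): the tiles can be uploaded pre-compressed so that S3 serves them with a Content-Encoding: gzip header, which browsers decompress transparently, while a plain Node.js HTTP client would have to decompress them itself.

import gzip
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket, key, and tile payload -- replace with your own.
BUCKET = "my-tile-bucket"
KEY = "tiles/0_0.json"
tile = {"streets": [], "buildings": []}

# Gzip the JSON before uploading and record the encoding on the object,
# so S3 serves "Content-Encoding: gzip" alongside the compressed bytes.
body = gzip.compress(json.dumps(tile).encode("utf-8"))
s3.put_object(
    Bucket=BUCKET,
    Key=KEY,
    Body=body,
    ContentType="application/json",
    ContentEncoding="gzip",
)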
So I think this question no longer makes sense, since the times are much more even after hard-refreshing the page.

Amazon S3 multipart upload often fails

I'm trying to upload a 32 GB file to an S3 bucket using the s3cmd CLI. It does a multipart upload and often fails. I'm doing this from a server with 1000 Mbps of bandwidth to play with, but the upload is still VERY slow. Is there something I can do to speed this up?
Alternatively, the file is on HDFS on the server I mentioned. Is there a way to point the Amazon Elastic MapReduce job at this HDFS? It would still be an upload, but the job would be executing at the same time, so the overall process would be much quicker.
First I'll admit that I've never used the multipart feature of s3cmd, so I can't speak to that. However, I have used boto in the past to upload large (10-15 GB) files to S3 with a good deal of success. In fact, it became such a common task for me that I wrote a little utility to make it easier.
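If you want to try that route today, here is a minimal sketch with the current boto3 library (the file path, bucket, and key are placeholders); its transfer manager splits the file into parts and retries failed parts instead of restarting the whole upload:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Hypothetical file path and destination -- replace with your own.
LOCAL_FILE = "/data/big-file.bin"
BUCKET = "my-bucket"
KEY = "uploads/big-file.bin"

# Tune multipart behaviour: 64 MB parts, several parts in flight at once.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=8,
)

s3.upload_file(LOCAL_FILE, BUCKET, KEY, Config=config)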
As for your HDFS question, you can always reference an HDFS path with a fully qualified URI, e.g., hdfs://{namenode}:{port}/path/to/files. This assumes your EMR cluster can access the external HDFS cluster (you might have to play with security group settings).
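As a hedged illustration (the cluster ID, jar, namenode address, and paths are all placeholders), the hdfs:// URI is simply passed to the EMR step as its input argument, for example via boto3:

import boto3

emr = boto3.client("emr")

# Hypothetical cluster ID, jar, and paths -- replace with your own.
CLUSTER_ID = "j-XXXXXXXXXXXXX"

emr.add_job_flow_steps(
    JobFlowId=CLUSTER_ID,
    Steps=[
        {
            "Name": "process-from-external-hdfs",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "s3://my-bucket/jars/my-job.jar",
                # Input is read straight from the external HDFS cluster.
                "Args": [
                    "hdfs://namenode.example.com:8020/path/to/files",
                    "s3://my-bucket/output/",
                ],
            },
        }
    ],
)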

How to rsync to all Amazon EC2 servers?

I have a Scalr EC2 cluster, and want an easy way to synchronize files across all instances.
For example, I have a bunch of files in /var/www on one instance; I want to be able to identify all of the other hosts and then rsync to each of those hosts to update their files.
ls /etc/aws/hosts/app/
returns the IP addresses of all of the other instances
10.1.2.3
10.1.33.2
10.166.23.1
Ideas?
As Zach said, you could use S3.
You could download one of the many clients out there for mapping drives to S3 (search for S3 and WebDAV).
If I were going to go this route, I would set up an S3 bucket with all my shared files and use JetS3t in a cron job to sync each node's local drive with the bucket (pulling down S3 bucket updates). Then, since I normally use Eclipse and Ant for building, I would create an Ant job for deploying updates to the S3 bucket (pushing updates up to the bucket).
From http://jets3t.s3.amazonaws.com/applications/synchronize.html
Usage: Synchronize [options] UP <S3Path> <File/Directory>
(...)
or: Synchronize [options] DOWN
UP : Synchronize the contents of the Local Directory with S3.
DOWN : Synchronize the contents of S3 with the Local Directory
...
I would recommend the above solution, if you don't need cross-node file locking. It's easy and every system can just pull data from a central location.
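If you'd rather script the pull-down yourself instead of using JetS3t, here is a minimal boto3 sketch each node could run from cron; the bucket name and target directory are placeholders:

import pathlib

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and local target -- replace with your own.
BUCKET = "my-shared-files"
TARGET = pathlib.Path("/var/www")

def pull_bucket_updates() -> None:
    # Walk every object in the bucket and download anything missing or out of date.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/"):
                continue  # skip folder marker objects
            dest = TARGET / obj["Key"]
            dest.parent.mkdir(parents=True, exist_ok=True)
            # Re-download if the local copy is missing or a different size.
            if not dest.exists() or dest.stat().st_size != obj["Size"]:
                s3.download_file(BUCKET, obj["Key"], str(dest))

if __name__ == "__main__":
    pull_bucket_updates()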
If you need more cross-node locking:
An ideal solution would be to use IBM's GPFS, but IBM doesn't just give it away (at least not yet). Even though it's designed for high-performance interconnects, it can also be used over slower connections. We used it as a replacement for NFS and it was amazingly fast (about 3 times faster than NFS). There may be something similar that is open source, but I don't know of one. EDIT: OpenAFS may work well for building a clustered filesystem over many EC2 instances.
Have you evaluated using NFS? Maybe you could dedicate one instance as an NFS host.
