Caching for I/O-intensive S3?

I am writing a video messaging service, and the videos will be stored on Amazon S3. The nature of video messaging involves a lot of writing to and reading from S3: basically, as soon as a video is written, it will be read by another client. I am worried that S3 cannot keep up with this speed and will delay message delivery. I already have CloudFront CDN + S3 set up; I wonder if CloudFront is enough to serve as a cache, or do I need to set up some sort of memcached layer on top of S3?

CloudFront + S3 should be enough, but do test your assumptions: use multipart upload and measure it all, as this author did: http://improve.dk/pushing-the-limits-of-amazon-s3-upload-performance/
At the top, I was pushing more than one gigabyte of data to S3 every second - 1,117.9 megs/sec to be precise. That is an awful lot of data, all coming from a single machine. Now imagine you scale this out over multiple machines, and you have the network infrastructure to support it - you can really send a lot of data.
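If you go the multipart route, the AWS SDKs will handle the part management for you. Here is a minimal sketch using Python's boto3 transfer manager; the bucket and key names are placeholders, not anything from the question:

```python
# Minimal multipart-upload sketch with boto3's transfer manager.
# "my-video-bucket" and the file paths are hypothetical examples.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # switch to multipart above 8 MB
    multipart_chunksize=8 * 1024 * 1024,  # upload in 8 MB parts
    max_concurrency=10,                   # parts uploaded in parallel threads
)

# upload_file transparently splits the file into parts and retries failed ones.
s3.upload_file("video.mp4", "my-video-bucket", "messages/abc123/video.mp4",
               Config=config)
```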

Related

Where to store thousands of images in S3 or EFS of AWS?

I will be starting a project in the not-too-distant future, a project in which we will store many thousands of images over time. I'm facing a hard decision about whether to use Amazon S3 or EFS to store those images. Both, I think, are very good options, but my question is: which would be the better service, or what would be best practice?
My application will be done with Laravel and I already did the integration of both services.
The main characteristics of the project are:
About 95% of the files I will store will be photos.
Approximately 1.5k photos would be stored daily.
The photos are very large (from professional cameras).
Traffic to the application will be modest, approx. 100 users at a time.
Each user would view about 100 photos per day.
What do you recommend?
S3 is absolutely the right answer and practice. I have built numerous applications like the one you describe, some with hundreds of millions of images, and S3 is superior. It also allows for flexibility, such as having your API return the images as pre-signed URLs (see the sketch after the links below), which reduces load on your servers; images can be linked directly via static web hosting; and it provides lifecycle policies to archive less-used data. Additionally, further integration with other AWS services is easy using event triggers.
As for storing/uploading, S3 multipart upload is very useful for both increasing performance and increasing reliability.
EFS would make sense for your type of scenario if you were doing some intensive processing where you had a cluster of servers that needed lower latency with a shared file system - think HPC. EFS also comes at a higher cost and doesn't provide as many extensibility options or built-in features as S3. Your scenario doesn't sound like it requires EFS.
http://docs.aws.amazon.com/AmazonS3/latest/dev/WebsiteHosting.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/ShareObjectPreSignedURL.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
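For concreteness, here is a minimal sketch of the pre-signed URL idea, shown in Python with boto3 (the AWS SDK for PHP used by Laravel has an equivalent call); the bucket and key are made-up examples:

```python
# Return a time-limited, pre-signed S3 URL instead of proxying the image
# bytes through your web servers. Bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-photo-bucket", "Key": "photos/2024/shoot-001.jpg"},
    ExpiresIn=3600,  # the link stays valid for one hour
)
# Hand `url` to the client; the browser downloads straight from S3,
# so your app servers never touch the image bytes.
```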
For the scenario you proposed, AWS S3 is the choice. Why?
Since images are mostly just added (written once, then read), it costs roughly one tenth of EFS.
Less overhead on your web servers, since files can be uploaded and downloaded directly against S3.
You can leverage event-driven processing with Lambda, e.g. generating thumbnails or applying image-processing filters from an S3 Lambda trigger (see the sketch after this list).
A higher SLA for availability and durability.
Built-in lifecycle management to archive objects and reduce cost.
AWS EFS can also be an option if you happen to modify the images frequently (EBS is also an option there).
You can also consider using AWS CloudFront in front of either option to cache images.
Note: in the end it's not about using a single service. Based on your upcoming requirements, you can choose either one of them, or both.
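As a rough illustration of the Lambda trigger mentioned above, here is a hedged sketch of a thumbnailing handler, assuming Python, boto3, and Pillow in the Lambda runtime; the bucket name and thumbnail size are invented for the example:

```python
# Sketch of an S3 ObjectCreated -> Lambda thumbnailer. "my-photo-thumbs"
# and the 400x400 size are hypothetical choices, not project requirements.
import io

import boto3
from PIL import Image

s3 = boto3.client("s3")
THUMB_BUCKET = "my-photo-thumbs"

def handler(event, context):
    # S3 puts one or more records in each event it delivers to Lambda.
    # Note: keys with special characters arrive URL-encoded; decode if needed.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Pull the original, resize it in memory, and write the thumbnail out.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        img = Image.open(io.BytesIO(body)).convert("RGB")
        img.thumbnail((400, 400))  # fits within 400x400, keeps aspect ratio

        out = io.BytesIO()
        img.save(out, format="JPEG", quality=85)
        s3.put_object(Bucket=THUMB_BUCKET, Key=key, Body=out.getvalue())
```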

Redis for caching image files?

I am using Amazon S3 for storing and retrieving images for an image storing website.
The trouble is that multiple users retrieve the same image multiple times.
Is it advisable to use Redis or memcached to cache image files by storing them directly in the cache?
Amazon S3's pricing for data transfer is much higher than serving images from a Redis cache. But storing image files directly in Redis seems like a bad proposition, because I read somewhere that Redis is not good at operating on large data files. Also, if Redis stores data in memory, I don't understand how it would hold so many images (unless I run many, many instances).
Is it advisable to store image files directly in Redis, or is there an alternative for caching these images?
Do Pinterest and Imgur use Redis and memcached for storing images directly? If not, why do they have so many instances?
You get credit for creativity, but you have not found a loophole here.
First, it's entirely inappropriate to try to serve images from ElastiCache. It's a cache. It's volatile by definition.
Second, it's not a web server.
Third, it's not intended to be exposed to the Internet.
But even if these aren't persuasive, your question seems premised on a misunderstanding of the pricing structure on several levels.
There is no Amazon ElastiCache Data Transfer charge for traffic in or out of the Amazon ElastiCache Node itself.
https://aws.amazon.com/elasticache/pricing/
Technically, this is accurate, but it is not helpful.
This is only relevant to the transfer from ElastiCache to your EC2 instance. You still have to return the data to the browser, across the Internet, and that costs the same whether you return it from/through EC2 or from S3.
Data Transfer OUT From Amazon EC2 To Internet
Up to 10 TB / month $0.09 per GB
https://aws.amazon.com/ec2/pricing/
...or...
Data Transfer OUT From Amazon S3 To Internet
Up to 10 TB / month $0.090 per GB
https://aws.amazon.com/s3/pricing/
Meanwhile, CloudFront is $0.085/GB for traffic sent to browsers that are accessing edge locations in the lowest price class, US and Europe. And you control which edge locations are available when you select a price class other than the global one:
If you choose a price class that does not include all edge locations ... you're charged the rate for the least expensive region in your selected price class.
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/PriceClass.html
That's $0.085 if configured correctly.
There is no charge for transfer from S3 to CloudFront or from EC2 to CloudFront. Only the charge from CloudFront to the Internet.
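To make the arithmetic concrete, here's a tiny worked example using the per-GB rates quoted above; the 5 TB monthly volume is an invented figure:

```python
# Worked example of the first-10-TB transfer rates quoted above.
monthly_gb = 5_000  # hypothetical: 5 TB served to browsers per month

s3_direct = monthly_gb * 0.090   # S3 (or EC2) -> Internet
cloudfront = monthly_gb * 0.085  # CloudFront -> Internet, cheapest price class

# S3 -> CloudFront transfer is free, so the CDN path is strictly cheaper.
print(f"S3 direct:  ${s3_direct:,.2f}")   # S3 direct:  $450.00
print(f"CloudFront: ${cloudfront:,.2f}")  # CloudFront: $425.00
```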

Which technology is recommended for an audio streaming web server?

I have client-side software that streams audio directly from a file server exposing public multimedia files.
I'm using web services like AWS S3, and I'm trying to keep file hosting costs as low as possible (currently $0), so any paid solution for data storage has already been reviewed.
The file collection is growing quickly; it might approach 10 TB of files within the next 12 months.
For now I manage around 250 GB of mp3 files of varying quality, plus images.
I would like to implement a server for streaming multimedia files, and I would like advice on which server architecture/technology to use for this purpose (Hadoop, Nginx, ...).
The first requirements might be:
good I/O management
handling many persistent and durable connections for streaming.
The file security is not an issue in this question
Any help is welcome.
There's nothing special about audio files vs. any others for this use case. Any web server will do.
You're already using S3, just use that. S3 can serve your files directly, but with any decent load you're going to want to use CloudFront in front of your S3 bucket. CloudFront is a CDN that will distribute your media files from geographically distributed points, keeping things nice and fast for your users. It's also often cheaper to use CloudFront than S3 directly, when you have more traffic.
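One detail worth knowing: audio players seek by issuing HTTP Range requests, and both S3 and CloudFront honor those out of the box, which is a big part of why no special streaming server is needed. A minimal sketch, assuming Python's requests library and a made-up CloudFront URL:

```python
# Fetch only a byte range of an mp3, the way an audio player seeks.
# The CloudFront domain and path are placeholders.
import requests

url = "https://d1234example.cloudfront.net/tracks/song.mp3"

# Ask for just the first 64 KB, e.g. to start playback immediately.
resp = requests.get(url, headers={"Range": "bytes=0-65535"}, timeout=10)

print(resp.status_code)                   # 206 Partial Content if ranges work
print(resp.headers.get("Content-Range"))  # e.g. "bytes 0-65535/5242880"
```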

How can I improve the performance of this architecture?

I'm running a website that is CPU-heavy due to a lot of image thumbnailing.
This is how I currently do things:
User uploads image to server
Server keeps a copy, and stores the image on Amazon S3
When a thumbnail is requested, the server uses the local copy to generate it, then stores it on S3 and gives the S3 URL to the client
Subsequent requests are optimized like this: the server caches the S3 URL in memcached, so it won't do the work again; the server never regenerates a thumbnail if the file exists; and the server uses mid-sized thumbnails to generate small-sized ones, so it doesn't have to work with large files when not necessary
Now, I'm hosting on a Linode 4G instance (8 cores with 4x priority, 4 GB RAM), and despite my optimizations and a memcached hit ratio of 70%, my average CPU is at 170%. I'm constantly seeing all 8 CPUs working, with frequent simultaneous spikes to 100% on many of them.
I'm using nginx and gunicorn to serve a Django application, and the thumbnails are generated with PIL.
How can I improve this architecture?
I was thinking about a few possibilities:
#1. Easiest: add a second identical server with a load balancer in front, so that they'd share the load.
The problem with this is that the two servers would not share the local image cache. Could I solve this by placing such a share on a network drive, or would the latency ultimately cancel out the gains?
#2. A little harder: split the thumbnailing code out of my app into a separate web service running on a second server. This way the main application and database would not suffer from high CPU usage, and the web pages would be served fast. The thumbnails are already served asynchronously with JavaScript anyway.
Can anyone recommend some other solution?
Are you sure your performance problems come from thumbnails? OK, I suppose you've checked that.
You can downsize and upload the two thumbnails to S3 immediately (or shortly) after the user uploads the image (a sketch follows below). This way you save the CPU load you're currently wasting on every HTTP request to check for those thumbnails and do IPC with memcached.
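A minimal sketch of that upload-time approach, assuming Pillow and boto3 (the question already uses Django and PIL); the sizes and bucket name are illustrative only:

```python
# Generate both thumbnail sizes once, at upload time, so request handlers
# never resize anything. "mid"/"small" sizes and the key layout are made up.
import io

import boto3
from PIL import Image

s3 = boto3.client("s3")
SIZES = {"mid": (800, 800), "small": (200, 200)}

def upload_with_thumbnails(path, bucket, key):
    s3.upload_file(path, bucket, key)  # store the original
    img = Image.open(path).convert("RGB")
    for name, size in SIZES.items():
        thumb = img.copy()
        thumb.thumbnail(size)  # resizes in place, preserves aspect ratio
        buf = io.BytesIO()
        thumb.save(buf, format="JPEG", quality=85)
        s3.put_object(Bucket=bucket, Key=f"{name}/{key}", Body=buf.getvalue())
```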
In a way your problem is a "good" problem to have (or at least it could have been a lot worse), in that there are no dependencies between separate image resizing tasks, so you can trivially distribute them over multiple servers. A few comments:
Have you checked to see if there is anything you can do to make the image resizing operations faster? (Google brought this up, don't know if it's any help: http://dmmartins.appspot.com/blog/speeding-up-image-resizing-with-python-and-pil) Even if you still find you need to add more servers, anything you can do to make each resize operation more efficient will make each server go farther.
If your user base keeps growing, you will eventually need to "scale out", but in the short term you could possibly solve the problem simply by paying another $80 for the next "tier" of service (8 cores at 8x priority).
Is image resizing really your app's only bottleneck? If image resizing was "free", how much further can you scale on your existing server before rendering pages, running DB queries, etc. would limit throughput? If you don't know, it would be good to do some simulated load testing and find out. I ask because if rendering pages, DB queries, etc. are also bottlenecks, or are soon to become bottlenecks, you are going to have to distribute the app anyways. In that case, you might as well keep thumbnailing in the main app, and distribute it right now, rather than making your thumbnailing run as a web service on a 2nd server.
Regardless of whether you distribute the main app, or split out thumbnailing into a separate app on a different server, you need some kind of authoritative store to keep track of where each thumbnail is kept on S3. You can keep that information in memcached, in a database, or wherever you want. It doesn't really matter. Even if you keep it in memcached, that doesn't mean you can't share the cache between 2 servers -- 1 server can connect to a memcached instance running on the other server.
You asked if "the latency" of checking a cache which is held on a different server will "hinder the gains". I don't think you need to worry about that. Your problem is throughput, not latency. Those high-latency network operations parallelize very well. So if you just service more requests in parallel, you can still make full use of your CPUs (which is the resource bottleneck right now).

Is Amazon S3 ever unavailable independent of EC2?

Currently, we upload all of our user-generated content to a medium-sized EC2 instance, and from there we run a cron job to sync all of the uploaded content to S3. We have backend code (run every time any uploaded file is accessed) that checks whether the resource has been moved to S3 or is only available on our uploads instance.
This seems a little wasteful, but it does provide redundancy -- if S3 is down, we have some JavaScript code in place that forces the files to be served from our upload box. The actual file uploads are stored on EBS, not on the instance.
We've got about 150 GB worth of files in the S3 bucket right now, which makes performing a separate backup of the bucket extremely time-consuming and nearly impossible to run on any sort of regular basis.
So, my question is, is this even necessary? Can anyone point me to some uptime statistics between S3 and EC2? Does it ever happen that S3 is down, but EC2 is available? It seems like it might be simpler to just upload everything directly to S3 and trust that it is up.... On the other hand, we could just store everything in EBS and forget S3 completely, which seems like it makes more sense.
It's much more likely that your EC2 instance will be down than S3 will be down. For one, you have a single instance running on a single host with a single network connection in a single availability zone. Past that, on a platform level, EC2 (particularly involving EBS) has had several protracted outages, whereas S3 has not had a significant availability event since 2008.
S3 is a distributed system spread all across your region of choice. Operating at the object level with eventual consistency guarantees is frankly a lot simpler than the problems addressed by EBS and EC2, all of which add additional consistency guarantees (and thus ways to fail) by design.
I generally make upload processes treat S3 as a backing store -- upload to S3 directly, or upload via an EC2 instance in a write-through fashion -- and accept that if S3 is down, then I can't handle uploads. Doing it this way introduces a failure mode where your app is running but S3 is not, but it significantly reduces the potential for data loss, which is usually a more serious problem than unavailability. This also allows you to simultaneously handle uploads via different EC2 instances in different availability zones, hedging against EC2 failures, as well as via instance-store instances, hedging against EBS failures.
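A minimal sketch of that write-through pattern, assuming Python and boto3; the bucket name and function shape are invented for illustration:

```python
# Treat S3 as the backing store: write uploads straight through, and fail
# the request if S3 is unreachable. "my-ugc-bucket" is a placeholder.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

s3 = boto3.client("s3")

def handle_upload(fileobj, key, bucket="my-ugc-bucket"):
    try:
        s3.upload_fileobj(fileobj, bucket, key)
        return True   # durable in S3; safe to acknowledge the upload
    except (ClientError, EndpointConnectionError):
        # If S3 is down, reject the upload rather than park the file on
        # a single instance and risk losing it.
        return False
```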
