I am using Amazon S3 for storing and retrieving images for an image storing website.
The trouble is that multiple users have to retrieve the same image multiple times.
Is it advisable to use Redis or memcached to cache image files by storing them directly in the cache?
Amazon S3 pricing for data transfer is much higher than the cost of serving images from a Redis cache. But storing image files directly in Redis seems like a bad idea, because I read somewhere that Redis is not good at handling large data files. I also don't understand how Redis, which stores data in memory, could hold so many images (unless I run many, many instances).
Is it advisable to store image files directly in Redis, or is there an alternative for caching these images?
Do Pinterest and Imgur use Redis and memcached to store images directly? If not, why do they have so many instances?
You get credit for creativity, but you have not found a loophole here.
First, it's entirely inappropriate to try to serve images from ElastiCache. It's a cache. It's volatile by definition.
Second, it's not a web server.
Third, it's not intended to be exposed to the Internet.
But even if these aren't persuasive, your question seems premised on a misunderstanding of the pricing structure on several levels.
There is no Amazon ElastiCache Data Transfer charge for traffic in or out of the Amazon ElastiCache Node itself.
https://aws.amazon.com/elasticache/pricing/
Technically, this is accurate, but it is not helpful.
This is only relevant to the transfer from ElastiCache to your EC2 instance. You still have to return the data to the browser, across the Internet, and this costs the same whether you return it from or through EC2, or from S3.
Data Transfer OUT From Amazon EC2 To Internet
Up to 10 TB / month $0.09 per GB
https://aws.amazon.com/ec2/pricing/
...or...
Data Transfer OUT From Amazon S3 To Internet
Up to 10 TB / month $0.090 per GB
https://aws.amazon.com/s3/pricing/
Meanwhile, CloudFront is $0.085/GB for traffic sent to browsers that are accessing edge locations in the lowest price class, US and Europe. And you control which edge locations are available when you select a price class other than the global one:
If you choose a price class that does not include all edge locations ... you're charged the rate for the least expensive region in your selected price class.
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/PriceClass.html
That's $0.085 if configured correctly.
There is no charge for transfer from S3 to CloudFront or from EC2 to CloudFront. Only the charge from CloudFront to the Internet.
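To put the quoted rates side by side, here is a minimal sketch of the arithmetic. The 500 GB monthly volume is a made-up figure for illustration, and the rates are the first-tier prices quoted above, which may have changed since.

```python
# Per-GB rates as quoted above (first pricing tier; check current pricing).
RATE_S3_TO_INTERNET = 0.090
RATE_EC2_TO_INTERNET = 0.090    # same rate, with or without a cache behind EC2
RATE_CLOUDFRONT_US_EU = 0.085   # lowest price class, US and Europe


def monthly_transfer_cost(gb_out: float, rate_per_gb: float) -> float:
    """Return the data-transfer-out cost in USD for one month."""
    return round(gb_out * rate_per_gb, 2)


if __name__ == "__main__":
    gb = 500  # hypothetical monthly volume, not from the question
    for name, rate in [
        ("S3 -> Internet", RATE_S3_TO_INTERNET),
        ("EC2 -> Internet", RATE_EC2_TO_INTERNET),
        ("CloudFront -> Internet", RATE_CLOUDFRONT_US_EU),
    ]:
        print(f"{name}: ${monthly_transfer_cost(gb, rate):.2f}")
```

Under these assumptions, putting a cache behind EC2 does not change the outbound bill at all, while CloudFront is slightly cheaper per GB.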
Related
I will be starting a project in the not too distant future in which we will store thousands upon thousands of images over time. I'm facing a hard decision on whether to use Amazon S3 or EFS to store those images. I think both are very good options, but my question is which service would be best, or what the best practice would be.
My application will be done with Laravel and I already did the integration of both services.
Most of the characteristics of the project are:
About 95% of the files I will store will be photos.
Approximately 1.5k photos would be stored daily.
The photos are very large (professional cameras).
Traffic to the application will not be much, approx. 100 users at a time.
Each user would consult about 100 photos per day.
What do you recommend?
S3 is absolutely the right answer and practice. I have built numerous applications like you describe, some with 100s of millions of images, and S3 is superior. It also allows for flexibility such as your API returning the images as pre-signed URLs which will reduce load to your servers, images can be linked directly via static web hosting, and it provides lifecycle policies to archive less used data. Additionally, further integration with other AWS services is easy using event triggers.
As for storing/uploading, S3 multi-part upload is very useful to both increase performance and increase reliability.
EFS would make sense for your type of scenario if you were doing some intensive processing where you had a cluster of servers that needed lower latency with a shared file system (think HPC). EFS would also come at a higher cost and doesn't provide as many extensibility options or built-in features as S3. Your scenario doesn't sound like it requires EFS.
http://docs.aws.amazon.com/AmazonS3/latest/dev/WebsiteHosting.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/ShareObjectPreSignedURL.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
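As a rough illustration of the lifecycle policies mentioned above, here is a sketch of a rule that moves older photos to cheaper storage classes. The rule ID, the `photos/` prefix, and the day counts are hypothetical; the dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument, but treat it as a starting point rather than a recommended policy.

```python
import json

# Hypothetical rule: the ID, prefix, and day counts are illustrative only.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-old-photos",
            "Status": "Enabled",
            "Filter": {"Prefix": "photos/"},
            "Transitions": [
                # Move to infrequent-access storage after 30 days...
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                # ...and to archival storage after 180 days.
                {"Days": 180, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

if __name__ == "__main__":
    print(json.dumps(lifecycle_configuration, indent=2))
```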
For the scenario you proposed AWS S3 is the choice. Why?
Since images are mostly added rather than modified, S3 costs roughly one tenth as much as EFS.
Less overhead on your web servers, since files can be uploaded to and downloaded from S3 directly.
You can leverage event-driven processing with Lambda, e.g. generating thumbnails or applying image processing filters via an S3 Lambda trigger.
Higher level of SLA for availability and durability.
Built-in lifecycle management for archiving data and reducing cost.
AWS EFS can also be an option if the images are modified frequently (in which case EBS is also an option).
You can also consider using AWS CloudFront with either option to cache images.
Note: In the end, it's not about using a single service. Based on your upcoming requirements, you can choose either one of them or both.
I am writing a video messaging service and the videos will be stored on amazon S3. The nature of video messaging will involve a lot of writing and reading from the S3 storage. Basically as soon as it's written it will be read by another client. I am worried that S3 cannot keep up with the speed and will delay the message delivery time. I already have CloudFront CDN + S3 setup, I wonder if CloudFront is enough to serve as a cache or do I need to setup some sort of memcaching layer above the S3?
CloudFront + S3 should be enough, but do test your assumptions, use multipart upload and measure it all, as this guy did: http://improve.dk/pushing-the-limits-of-amazon-s3-upload-performance/
At the top, I was pushing more than one gigabyte of data to S3 every second - 1117,9 megs/sec to be precise. That is an awful lot of data, all coming from a single machine. Now imagine you scale this out over multiple machines, and you have the network infrastructure to support it - you can really send a lot of data.
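For a sense of what multipart upload involves on the client side, here is a minimal sketch that only plans the byte ranges for each part; it does not talk to S3. The function name and the 8 MiB default are assumptions for illustration. The two limits it respects (5 MiB minimum for every part except the last, at most 10,000 parts per upload) are S3's documented constraints.

```python
MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum for every part except the last
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


def plan_parts(total_size: int, part_size: int = 8 * 1024 * 1024):
    """Return (part_number, start_offset, length) tuples covering the file."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum")
    # Grow the part size if needed so the part count stays within MAX_PARTS.
    smallest_allowed = -(-total_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, smallest_allowed)

    parts = []
    offset = 0
    number = 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


if __name__ == "__main__":
    for part in plan_parts(20 * 1024 * 1024):
        print(part)
```

Each planned range can then be uploaded independently and in parallel, which is where the throughput and retry benefits come from.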
Current situation
I have a Java Tomcat application running on Elastic Beanstalk. The application is a webservice that receives search queries and returns the results in XML format. The webservice is only updated with new data once a month, so any query sent at the end of the month will return results identical to those returned at the start of the month.
We take advantage of EB's load balancing so that usually just one EC2 instance is running, but at times of peak usage another EC2 instance may get started.
To allow deployment of new versions on Elastic Beanstalk we have a domain name on Route53 and a subdomain mapped to the EB application; customers use this subdomain to access the webservice.
This is working reasonably well, except that peak usage can be somewhat higher than normal usage, which means more instances have to be started, increasing cost, and the response rate is still slower even with the extra machine.
Should I use CloudFront
I was wondering if I could use CloudFront to cache these responses. I'm making these assumptions:
There would be fewer peaks and troughs on EB.
It would save me money, assuming CloudFront requests are cheaper than the extra load on EB.
It would improve response rate for customers not near my EB server, i.e. the EB server is based in the EU but I have many US customers.
If so how do I do it
I tried to create a CloudFront distribution, but the Origin Domain Name field only listed my S3 buckets, not my EB domain, so I haven't gone any further.
I always put CloudFront in front of any solution I deliver on AWS. In response to your specific questions:
Most likely yes, it would offload some of the work that would otherwise go to an EC2 instance, so it might sometimes prevent an extra instance from spinning up.
Maybe, maybe not. It might save you money, but it's also possible that it could end up costing you a fortune. CloudFront can be abused by an attacker, if for no other reason than to give you a huge bill, so you may want to add a billing alert so you are not surprised by this.
Yes, in all likelihood it will improve the responsiveness of your website. That's the prime reason I always use it.
CloudFront does allow you to serve dynamic content: http://aws.amazon.com/cloudfront/dynamic-content/ however from reading there it seems that it would cache query results based on a URL pattern. Would that be compatible with your site's use?
Information on how to specify an EC2 as a CloudFront origin can be found here: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/CustomOriginBestPractices.html
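Since the data only changes once a month, one way to let CloudFront cache the responses is for the origin to send a `Cache-Control` header that expires at the next update. The sketch below assumes the update lands at 00:00 UTC on the first of each month, which is a guess, not something stated in the question; the function names are hypothetical, and the TTL settings on the distribution can still override what the origin sends.

```python
from datetime import datetime, timezone


def seconds_until_next_month(now: datetime) -> int:
    """Seconds from `now` (timezone-aware, UTC) to the first of next month."""
    if now.month == 12:
        next_update = datetime(now.year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        next_update = datetime(now.year, now.month + 1, 1, tzinfo=timezone.utc)
    return int((next_update - now).total_seconds())


def cache_control_header(now: datetime) -> str:
    """Build a Cache-Control value that expires at the assumed monthly update."""
    return f"public, max-age={seconds_until_next_month(now)}"


if __name__ == "__main__":
    print(cache_control_header(datetime.now(timezone.utc)))
```

The webservice would attach this value to each XML response. Because the cache key is based on the URL, equivalent queries should also produce identical URLs for the cache to be effective.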
I'm using an Azure CDN endpoint on a hosted service (meaning, not a Blob Storage CDN endpoint).
The service is lazy rendering images, and once they are rendered, they are practically static (I can safely use Cache-Control:public, max-age=31536000).
In the naive implementation, there will be up to** X misses (the service will render an image X times), where X is the number of CDN nodes around the world.
There are two workarounds, as I see it:
The lazy created images are stored in Blob Storage, and later pulled from there.
Implement a cache in the Cloud Service.
Is there a way to propagate files to all nodes? Is there a better solution than having two caching layers (Cloud Service Cache / Blob Storage + CDN)?
** "Up to", depending on the geographical location of web requests. In my case, all around the world.
There is currently not a way for you to push something to one of the remote CDN nodes. This is a feature that many folks have asked Microsoft for in the CDN product.
Both workarounds would work. The first one has the benefit that the images never have to be recreated for the other CDN nodes, and it reduces the load on the server, since the server no longer serves the images after rendering them the first time. However, if the request ends up hitting the server anyway, only to be redirected to a version already in Blob Storage, then you could just as easily return the cached image from the server. I think it depends on how many images you are talking about. If you have a LOT of them, I'd lean more toward the first option.
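The first workaround boils down to a render-once pattern: check the store, render only on a miss, and save the result so later CDN misses never reach the renderer. Here is a minimal sketch in which an in-memory dictionary stands in for Blob Storage; the class and its `render` callback are hypothetical and are not part of any Azure SDK.

```python
class RenderOnceStore:
    """Render an image on first request, then serve the stored copy."""

    def __init__(self, render):
        self._render = render      # function: key -> image bytes
        self._blobs = {}           # stand-in for Blob Storage
        self.render_count = 0      # how many times the renderer actually ran

    def get(self, key: str) -> bytes:
        if key not in self._blobs:
            # Miss: render once and persist the result.
            self._blobs[key] = self._render(key)
            self.render_count += 1
        return self._blobs[key]


if __name__ == "__main__":
    store = RenderOnceStore(lambda key: f"image:{key}".encode())
    # Simulate three CDN nodes each missing on the same image.
    for _ in range(3):
        store.get("chart-42")
    print(store.render_count)
```

With a real blob store, two nodes missing at the same moment could still trigger two renders; that race is ignored here.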
Currently, we are uploading all of our user-generated-content to a medium-size EC2 Instance, and then from there we run a cron job to sync all of the uploaded content to S3. We have some code that runs on the backend (every time you need to access any uploaded file) that checks to see whether or not the resource has been moved to S3, or if it is just available on our uploads instance.
This seems a little wasteful, but it does provide redundancy -- if S3 is down, we have some JavaScript code in place that forces the files to be served from our upload box. The actual file uploads are stored in EBS, not on the instance.
We've got about 150GB worth of files in the S3 bucket right now; which makes performing a separate backup of the S3 Bucket extremely time consuming and nearly impossible to run on any sort of regular basis.
So, my question is, is this even necessary? Can anyone point me to some uptime statistics between S3 and EC2? Does it ever happen that S3 is down, but EC2 is available? It seems like it might be simpler to just upload everything directly to S3 and trust that it is up.... On the other hand, we could just store everything in EBS and forget S3 completely, which seems like it makes more sense.
It's much more likely that your EC2 instance will be down than S3 will be down. For one, you have a single instance running on a single host with a single network connection in a single availability zone. Past that, on a platform level, EC2 (particularly involving EBS) has had several protracted outages, whereas S3 has not had a significant availability event since 2008.
S3 is a distributed system spread all across your region of choice. Operating at the object level with eventual consistency guarantees is frankly a lot simpler than the problems addressed by EBS and EC2, all of which add additional consistency guarantees (and thus ways to fail) by design.
I generally make upload processes treat S3 as a backing store -- upload to S3 directly, or upload via an EC2 instance in a write-through fashion -- and accept that if S3 is down, then I can't handle uploads. Doing it this way introduces a failure mode where your app is running but S3 is not, but it significantly reduces the potential for data loss, which is usually a more serious problem than unavailability. This also allows you to simultaneously handle uploads via different EC2 instances in different availability zones, hedging against EC2 failures, as well as via instance-store instances, hedging against EBS failures.
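The write-through approach described above can be sketched as follows: the upload is only acknowledged once the backing store has accepted it, and a local copy is kept purely as a cache. The class name and the `backing_put` callback are hypothetical stand-ins; in practice `backing_put` would wrap an S3 upload call.

```python
class WriteThroughUploader:
    """Accept an upload only if the backing store accepts it first."""

    def __init__(self, backing_put):
        self._backing_put = backing_put  # function: (key, data) -> None, raises on failure
        self.local_cache = {}            # optional fast copy, never the source of truth

    def upload(self, key: str, data: bytes) -> None:
        # Write to the durable store first; if it fails, the exception
        # propagates and nothing is cached, so no data exists only locally.
        self._backing_put(key, data)
        self.local_cache[key] = data


if __name__ == "__main__":
    durable = {}  # stand-in for S3
    uploader = WriteThroughUploader(durable.__setitem__)
    uploader.upload("avatar.png", b"\x89PNG")
    print(sorted(durable))
```

The trade-off is the one named above: if the backing store is down, uploads fail, but no file ever exists only on a single instance.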