Can Azure CDN propagate changes to all nodes with just one "miss"? - performance

I'm using an Azure CDN endpoint on a hosted service (meaning, not a Blob Storage CDN endpoint).
The service is lazy rendering images, and once they are rendered, they are practically static (I can safely use Cache-Control:public, max-age=31536000).
In the naive implementation, there will be up to** X misses (the service will render an image X times), where X is the number of CDN nodes around the world.
There are two workarounds, as I see it:
The lazy created images are stored in Blob Storage, and later pulled from there.
Implement a cache in the Cloud Service.
Is there a way to propagate files to all nodes? Is there a better solution than having two caching layers (Cloud Service Cache / Blob Storage + CDN)?
** "Up to", depending on the geographical location of web requests. In my case, all around the world.

There is currently no way for you to push something to the remote CDN nodes. This is a feature that many folks have asked Microsoft for in the CDN product.
Both workarounds would work. The first one has the benefit that the images never have to be rendered again for the other CDN nodes, and it reduces the load on the server, since the server no longer serves them itself after rendering them the first time. However, if every request ends up reaching the server anyway, only to be redirected to a version already in Blob Storage, then you could just as easily return the cached image from the service as well. I think it depends on how many images you are talking about. If you have a LOT of them, I'd lean more toward the first option.
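The first workaround can be sketched roughly as follows. This is only an illustration, not Azure SDK code: `BlobStore` is a hypothetical in-memory stand-in for Blob Storage, and the `render` callable is a placeholder for the service's real rendering step. The point is that the expensive render runs once per image, no matter how many CDN nodes miss.

```python
from typing import Callable, Dict, Optional, Tuple

# Header value taken from the question: rendered images are effectively static.
CACHE_CONTROL = "public, max-age=31536000"


class BlobStore:
    """Hypothetical in-memory stand-in for Blob Storage (not a real SDK class)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def get(self, name: str) -> Optional[bytes]:
        return self._blobs.get(name)

    def put(self, name: str, data: bytes) -> None:
        self._blobs[name] = data


def serve_image(
    name: str,
    store: BlobStore,
    render: Callable[[str], bytes],
) -> Tuple[bytes, Dict[str, str]]:
    """Return the image bytes and response headers, rendering at most once."""
    data = store.get(name)
    if data is None:
        # First miss from any CDN node: render and persist the result.
        data = render(name)
        store.put(name, data)
    # Later misses from other CDN nodes are answered from the store.
    return data, {"Cache-Control": CACHE_CONTROL}
```

In a real deployment the CDN endpoint could also point straight at the storage account once the image exists, which is what takes the service out of the request path.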

Related

Why it is recommended to store images on remote server?

Sorry for such a question, but I cannot find any article on the web about the cons of this. My guess is that it is about asynchronous uploading and downloading, but it's just a guess. Is there detailed information somewhere?
It's mostly about specialization, data locality, and concurrency.
Servers that are specialized at serving static content typically do so much faster than dynamic web servers (the web servers are optimized for the specific use-case).
You also have the advantage of storing your content in many zones to achieve better performance (the content is physically closer to the person requesting it), whereas a web application typically should be near its other dependencies, such as databases.
Lastly, browsers (for HTTP/1 at least) only allow a fixed number of connections per server, so if your images and API calls are on separate servers, one cannot influence the other in terms of request scheduling.
There are a lot of other reasons I'm sure, but these are just off the top of my head.

Are CDNs that don't cache useful?

Is there any benefit to (HTTP-) serving a non-cacheable resource over a CDN?
(my use case: I'm serving a static Single Page App and I'd like to improve its load time, but I don't want index.html to get cached, because I want every new release to be reflected immediately. Specifically, this static site is hosted on AWS S3, and the CDN is AWS CloudFront.)
I assume that most of the performance benefits of CDNs are achieved through caching, but I could imagine other benefits due to, say, privileged network infrastructure. As I don't know the first thing about networks, this may sound like a silly question.
Yes, it can be useful by moving the content closer to the user. Most CDNs will serve your static file from a geographical location as close to the user as possible, typically providing better latency.
Of course, you need to have users across the globe for this to make sense to you.
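For the use case in the question (new releases visible immediately, everything else fast), one common approach is to make only index.html revalidate on every load and let fingerprinted assets be cached for a long time. The sketch below is an assumption-laden illustration, not CloudFront or S3 API: the function name and the file-naming pattern (a hex content hash such as `app.3f9a1c2b.js`) are hypothetical.

```python
import re

# Assumption: the build embeds a hex content hash in asset file names,
# e.g. "app.3f9a1c2b.js". Adjust the pattern to your own build output.
FINGERPRINTED = re.compile(r"\.[0-9a-f]{8,}\.(js|css|png|jpg|svg|woff2)$")


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value for a file of a static single-page app."""
    if FINGERPRINTED.search(path):
        # The name changes whenever the content changes, so cache "forever".
        return "public, max-age=31536000, immutable"
    # index.html and other unfingerprinted files: always revalidate,
    # so a new release is picked up on the next page load.
    return "no-cache"
```

With this split, index.html stays small and always fresh, while the bulk of the download (scripts, styles, images) still benefits from CDN caching.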

A better solution to host static files besides Amazon S3

I made a mobile application in static HTML that mirrors my WordPress site.
The first version was completely static; all texts were in the mobile HTML application.
Today, I updated my application to pull data from WordPress with AJAX.
The problem is that now, with so many requests being made, the S3 bucket is no longer enough.
The size has decreased (from 6kb to 83kb), but it is still slower because of AJAX.
Is it possible to host static applications on some other service from Amazon?
For the static content, you should probably be looking at AWS CloudFront instead of S3. As per the page itself:
Amazon CloudFront is a content delivery web service. It integrates with other Amazon Web Services products to give developers and businesses an easy way to distribute content to end users with low latency, high data transfer speeds, and no minimum usage commitments.
Another thing you can leverage is AJAX caching. That will make your webpage load much faster on subsequent visits. You may also want to use nginx on your server for caching (this will reduce your server load).

Amazon AWS and usage model for S3 storage

There is this example on Amazon of a high-traffic web application. I noticed that they are using S3 as their content delivery method. I was wondering if I need to have a Web Server for the content delivery and a Web App for my application. I don't understand why they have 2 web servers and 2 web apps in the diagram.
And what is the best way to set up a website that serves images and static content through S3 and the rest of the content through the regular storage?
My last question is, can I consider S3 as a main storage, reliable enough that I can keep my static content only there and not have a normal storage as a backup?
That is a very general diagram, specific diagrams will vary depending on the specifics of the overall architecture.
Having said that, I believe the Web Server represents something like Apache or Nginx and the App Server represents something like Rails, Rack Server, Unicorn, Gunicorn, Django, Sinatra, Flask, Jetty, Tomcat, etc. In some cases you can merge the Web Server and the App Server together, for example by deploying Apache with mod_wsgi to run your Django app. (So it depends on the architecture.)
what is the best way to set up a website that serves images and static
contents through S3 and the rest of the content through the regular
storage.
There's really no best way other than pointing your dynamic content to your databases (SQL and NoSQL) and pointing your static files (images, CSS, jQuery code, etc.) to an S3 bucket. You can also use third-party modules depending on your application stack. For example, you can accomplish this in Django with the django-storages module. You can find similar modules for other app stacks like Rails.
My last question is, can I consider S3 as a main storage, reliable
enough that I can only keep my static content there and don't have a
normal storage as a backup ?
S3 is pretty reliable: it is designed for 99.999999999% durability of your data. That goes down if you use RRS (Reduced Redundancy Storage), but if you want to use it you probably want to back up your data in a non-RRS bucket anyway. In any case, if it's extremely critical data, you are free to back it up somewhere else as well, just in case.
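To put the 99.999999999% figure in perspective, here is a back-of-the-envelope calculation. It assumes the figure is an annual, per-object probability of not losing the object and that losses are independent; both are simplifying assumptions for illustration, and the function name is made up.

```python
# Assumption: 99.999999999% ("eleven nines") is the yearly probability that a
# single stored object survives, so the yearly loss probability is 1e-11.
ANNUAL_LOSS_PROBABILITY = 1e-11


def expected_annual_losses(num_objects: int) -> float:
    """Expected number of objects lost per year under the assumptions above."""
    return num_objects * ANNUAL_LOSS_PROBABILITY


# With ten million stored objects, the expected loss is about 0.0001 objects
# per year, i.e. on the order of one object every 10,000 years.
losses_per_year = expected_annual_losses(10_000_000)
years_per_lost_object = 1.0 / losses_per_year
```

This only covers losing data inside the storage service; it says nothing about accidental deletes or overwrites by your own application, which is where a separate backup still helps.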
Notice in the diagram that they also recommend using CloudFront for your static files and this is especially useful if your users will be accessing your application from different geographical areas.
Hope this helps.

How can I improve the performance of this architecture?

I'm running a website that is CPU heavy due to a lot of thumbnailing of images.
This is how I currently do things:
User uploads image to server
Server keeps a copy, and stores the image on Amazon S3
When a thumbnail is requested, server uses the local copy to generate it, and then stores it on S3; then gives the S3 URL to the client
Subsequent requests are optimized like this: Server caches S3 URL in memcached, so it won't do the work again; server never generates a thumbnail again if the file exists; the server uses mid-sized thumbnails to generate small-sized ones, so as not to work with large files when not necessary
Now, I'm hosting on a Linode 4G instance (8 cores with 4x priority, 4GB RAM), and despite my optimizations and having a memcached hit ratio of 70%, my average CPU is 170%. I'm constantly seeing all 8 CPUs working with frequent spikes of 100% for many of them at the same time.
I'm using nginx and gunicorn to serve a Django application, and the thumbnails are generated with PIL.
How can I improve this architecture?
I was thinking about a few possibilities:
#1. Easiest: add a second identical server with a load balancer in front, so that they'd share the load.
The problem with this is that the two servers would not share the local image cache. Could I solve this by placing the cache on a shared network drive, or would the latency ultimately hinder the gains?
#2. A little harder: split the thumbnailing code out of my app, as a separate webservice, that would run on a second server. This way the main application and database would not suffer from high CPU usage, and the web pages would be served fast. The thumbnails are anyway already served asynchronously with JavaScript.
Can anyone recommend some other solution?
Are you sure your performance problems come from thumbnails? OK, I suppose you've checked that.
You can downsize and upload the 2 thumbnails to S3 immediately (or shortly) after the user uploads the image. This way you should be able to save the CPU load you're now wasting on every HTTP request for checking those thumbnails and doing IPC with memcached.
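A rough sketch of generating the thumbnails at upload time is shown below. The sizes, the key layout and the function name are hypothetical, and `resize` and `upload` are injected callables standing in for the real PIL resize and S3 upload steps, so nothing here is an actual PIL or S3 API call. It also keeps the question's trick of deriving the small thumbnail from the mid-sized one.

```python
from typing import Callable, Dict, Tuple

Size = Tuple[int, int]

# Hypothetical thumbnail sizes, largest first; the question gives no real ones.
THUMB_SIZES: Dict[str, Size] = {"mid": (640, 640), "small": (128, 128)}


def make_thumbnails_on_upload(
    image_id: str,
    original: bytes,
    resize: Callable[[bytes, Size], bytes],
    upload: Callable[[str, bytes], str],
) -> Dict[str, str]:
    """Create every thumbnail once, right after upload, and return their URLs."""
    urls: Dict[str, str] = {}
    source = original
    for label, size in THUMB_SIZES.items():
        thumb = resize(source, size)
        urls[label] = upload(f"{image_id}/{label}.jpg", thumb)
        # Derive the next (smaller) size from this thumbnail, not the original.
        source = thumb
    return urls
```

Because the URLs are known as soon as the upload finishes, they can be stored alongside the image record, and page requests no longer need to check whether a thumbnail exists.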
In a way your problem is a "good" problem to have (or at least it could have been a lot worse), in that there are no dependencies between separate image resizing tasks, so you can trivially distribute them over multiple servers. A few comments:
Have you checked to see if there is anything you can do to make the image resizing operations faster? (Google brought this up, don't know if it's any help: http://dmmartins.appspot.com/blog/speeding-up-image-resizing-with-python-and-pil) Even if you still find you need to add more servers, anything you can do to make each resize operation more efficient will make each server go farther.
If your user base keeps growing, you will eventually need to "scale out", but for the short term, it is possible you could solve the problem simply by paying another $80 for the next "tier" of service (8 cores at 8x priority).
Is image resizing really your app's only bottleneck? If image resizing were "free", how much further could you scale on your existing server before rendering pages, running DB queries, etc. would limit throughput? If you don't know, it would be good to do some simulated load testing and find out. I ask because if rendering pages, DB queries, etc. are also bottlenecks, or are soon to become bottlenecks, you are going to have to distribute the app anyway. In that case, you might as well keep thumbnailing in the main app, and distribute it right now, rather than making your thumbnailing run as a web service on a 2nd server.
Regardless of whether you distribute the main app, or split out thumbnailing into a separate app on a different server, you need some kind of authoritative store to keep track of where each thumbnail is kept on S3. You can keep that information in memcached, in a database, or wherever you want. It doesn't really matter. Even if you keep it in memcached, that doesn't mean you can't share the cache between 2 servers -- 1 server can connect to a memcached instance running on the other server.
You asked if "the latency" of checking a cache which is held on a different server will "hinder the gains". I don't think you need to worry about that. Your problem is throughput, not latency. Those high-latency network operations parallelize very well. So if you just service more requests in parallel, you can still make full use of your CPUs (which is the resource bottleneck right now).
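The throughput-versus-latency point can be illustrated with a small sketch. `remote_cache_lookup` is a hypothetical stand-in for a memcached lookup on another server, with the network round trip simulated by a sleep; the URL format and function names are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def remote_cache_lookup(key: str) -> str:
    """Hypothetical stand-in for a cache lookup on another server."""
    time.sleep(0.05)  # simulated network round trip
    return f"https://bucket.example/{key}"


def lookup_many(keys: List[str], workers: int = 16) -> List[str]:
    """Run many high-latency lookups concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(remote_cache_lookup, keys))
```

Each individual lookup still takes a full round trip, but because the waiting overlaps, many of them finish in roughly the time of a few. The CPU is idle while a request waits on the network, so it remains free for resizing work.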

Resources