client-side file caching - caching

If I understand correctly, a browser caches images, JS files, etc. based on the file name. So there's a danger that if one such file is updated (on the server), the browser will use the cached copy instead.
A workaround for this problem is to rename all files (as part of the build), such that the file name includes an MD5 hash of its contents, e.g.
foo.js -> foo_AS577688BC87654.js
me.png -> me_32126A88BC3456BB.png
However, in addition to renaming the files themselves, all references to these files must be changed. For example, a tag such as <img src="me.png"/> should be changed to <img src="me_32126A88BC3456BB.png"/>.
Obviously this can get pretty complicated, particularly when you consider that references to these files may be dynamically created within server-side code.
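To make the renaming step concrete, here is a minimal sketch of what such a build step might look like, written in Python purely for illustration. The names hashed_name and build_manifest, the 16-character digest length, and the sample file contents are all assumptions, not part of any particular build tool. The manifest is what templates or server-side code would consult to turn "me.png" into its hashed name.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_name(filename: str, content: bytes, digest_len: int = 16) -> str:
    """Insert a hex MD5 digest of the content before the file extension."""
    digest = hashlib.md5(content).hexdigest()[:digest_len]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}_{digest}{path.suffix}"))


def build_manifest(files: dict) -> dict:
    """Map each original name to its hashed name for lookup at render time."""
    return {name: hashed_name(name, data) for name, data in files.items()}


# Hypothetical inputs: in a real build these bytes would be read from disk.
manifest = build_manifest({"foo.js": b"alert(1);", "me.png": b"\x89PNG..."})
print(manifest)
```

Because the name depends only on the content, an unchanged file keeps its name across builds and stays cached, while any edit produces a new name.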
Of course, one solution is to completely disable caching on the browser (and any caches between the server and the browser) using HTTP headers. However, having no caching will create its own set of problems.
Is there a better solution?
Thanks,
Don

The best solution seems to be to version filenames by appending the last-modified time.
You can do it this way: add a rewrite rule to your Apache configuration, like so:
RewriteRule ^(.+)\.(.+)\.(js|css|jpg|png|gif)$ $1.$3
This will redirect any "versioned" URL to the "normal" one. The idea is to keep your filenames the same, but still benefit from the cache. Appending a parameter to the URL instead is not optimal, because some proxies don't cache URLs with query parameters.
Then, instead of writing:
<img src="image.png" />
Just call a PHP function:
<img src="<?php versionFile('image.png'); ?>" />
With versionFile() looking like this:
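function versionFile($file){
    // Treat $file as a root-relative URL path, e.g. 'image.png' or '/img/image.png'.
    $file = '/'.ltrim($file, '/');
    $path = pathinfo($file);
    // Last-modified time of the file on disk, used as the version.
    $ver = filemtime(rtrim($_SERVER['DOCUMENT_ROOT'], '/').$file);
    // Insert the version before the extension only: image.png -> image.123456789.png
    echo rtrim($path['dirname'], '/\\').'/'.$path['filename'].'.'.$ver.'.'.$path['extension'];
}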
And that's it! The browser will ask for image.123456789.png, Apache will redirect this to image.png, so you will benefit from cache in all cases and won't have any out-of-date issue, while not having to bother with filename versioning.
You can see a detailed explanation of this technique here: http://particletree.com/notebook/automatically-version-your-css-and-javascript-files/
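One thing worth checking before deploying a rule like this is how the pattern behaves on file names that already contain more than one dot. The sketch below replays the pattern from the RewriteRule with Python's re module (Python is used only to experiment with the regex; Apache itself uses PCRE, which behaves the same way for this pattern). The STRICT variant is a hypothetical alternative, not part of the original technique, that only accepts an all-digit version segment.

```python
import re

# The pattern from the RewriteRule above.
LOOSE = re.compile(r"^(.+)\.(.+)\.(js|css|jpg|png|gif)$")
# Hypothetical stricter variant: the version segment must be all digits.
STRICT = re.compile(r"^(.+)\.\d+\.(js|css|jpg|png|gif)$")


def strip_loose(url_path: str) -> str:
    """Apply the original pattern: drop the second-to-last dotted segment."""
    return LOOSE.sub(r"\1.\3", url_path)


def strip_strict(url_path: str) -> str:
    """Apply the stricter pattern: drop the segment only if it is numeric."""
    return STRICT.sub(r"\1.\2", url_path)


print(strip_loose("image.123456789.png"))        # image.png
print(strip_loose("jquery.min.js"))              # jquery.js (unversioned name is mangled)
print(strip_strict("jquery.min.js"))             # jquery.min.js
print(strip_strict("jquery.min.123456789.js"))   # jquery.min.js
```

If every URL goes through versionFile() the loose pattern is fine, but any unversioned request for a multi-dot file would be rewritten to the wrong name, which is why restricting the version segment to digits may be safer.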

Why not just add a querystring "version" number and update the version each time?
foo.js -> foo.js?version=5
There still is a bit of work during the build to update the version numbers but filenames don't need to change.

Renaming your resources is the way to go, although we use a build number and embed that into the file name instead of an MD5 hash:
foo.js -> foo.123.js
as it means that all your resources can be renamed in a deterministic fashion and resolved at runtime.
We then use custom controls to generate links to resources on page load, based on the build number, which is stored in an app setting.

We followed a similar pattern to PJP, using Rails and Nginx.
We wanted user avatar images to be browser cached, but on an avatar's change we needed the cache to be invalidated ASAP.
We added a method to the avatar model to append a timestamp to the file name:
return "/images/#{sourcedir}/#{user.login}-#{self.updated_at.to_s(:flat_string)}.png"
In all places in the code where avatars were used, we referenced this method rather than a URL. In the Nginx configuration, we added this rewrite:
rewrite "^/images/avatars/(.+)-\d{12}\.png$" /images/avatars/$1.png;
rewrite "^/images/small-avatars/(.+)-\d{12}\.png$" /images/small-avatars/$1.png;
This meant if a file changed, its URL in the HTML changed, so the user's browser made a new request for the file. When the request reached Nginx, it got rewritten to the simple name of the file.

I would suggest using caching by ETags in this situation, see http://en.wikipedia.org/wiki/HTTP_ETag. You can then use the hash as the etag. A request will still be submitted for each resource, but the browser will only download items that have changed since last download.
Read up on your web server / platform docs on how to use etags properly, most decent platforms have built-in support.
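As a rough illustration of what the server does with an ETag, here is a minimal sketch in Python. The function names and the choice of an MD5 digest as the tag are assumptions for the example; real servers generate and compare ETags for you, and this version simplifies the comparison rules.

```python
import hashlib
from typing import Optional


def make_etag(content: bytes) -> str:
    """Build an ETag as a quoted hex digest of the response body."""
    return '"' + hashlib.md5(content).hexdigest() + '"'


def respond(content: bytes, if_none_match: Optional[str]):
    """Return (status, headers, body) for a GET, honouring If-None-Match."""
    etag = make_etag(content)
    # "no-cache" lets the browser store the file but makes it revalidate each time.
    headers = {"ETag": etag, "Cache-Control": "no-cache"}
    if if_none_match is not None:
        tags = [t.strip() for t in if_none_match.split(",")]
        # Weak validators ("W/" prefix) are compared by their opaque part.
        tags = [t[2:] if t.startswith("W/") else t for t in tags]
        if "*" in tags or etag in tags:
            return 304, headers, b""  # client copy is current: no body sent
    return 200, headers, content


status, headers, body = respond(b"v1 of script", None)
print(status, headers["ETag"])
print(respond(b"v1 of script", headers["ETag"])[0])
```

The second call shows the saving: the request still happens, but the unchanged file is answered with a bodiless 304.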

Most modern browsers send an If-Modified-Since header when they revalidate a cacheable resource they already hold. However, not all browsers support the If-Modified-Since header.
There are three ways to "force" the browser to load a fresh copy of a cached resource.
Option 1 Create a query string with a version number: src="script.js?ver=21". The downside is that many proxy servers won't cache a resource with a query string. It also requires site-wide updating for changes.
Option 2 Create a naming system for your files: src="script083010.js". As with option 1, the downside is that this requires site-wide updates whenever a file changes.
Option 3 Perhaps the most elegant solution: simply set up the caching headers Last-Modified and Expires on your server. The main downside is that users may have to re-download resources because they expired even though they never changed. Additionally, the Last-Modified header does not work well when content is being served from multiple servers.
Here are a few resources to check out: Yahoo, Google, AskApache.com

This is really only an issue if your web server sets a far-future "Expires" header (setting something like ExpiresDefault "access plus 10 years" in your Apache config). Otherwise, a browser will make a conditional GET, based on the modified time and/or the Etag. You can verify what is happening on your site by using a web proxy or an extension like Firebug (on the Net panel). Your question doesn't mention how your web server is configured, and what headers it is sending with static files.
If you're not setting a far-future Expires header, there's nothing special you need to do. Your web server will usually handle conditional GETs for static files based on last modified time just fine. If you are setting a far-future Expires header then yes, you need to add some sort of version to the file name like your question and the other answers have mentioned already.
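For readers who have not watched a conditional GET in the Net panel, this is roughly the decision the web server makes for a static file. It is a minimal sketch in Python with a hypothetical function name; real servers implement this themselves, so it is only meant to show the logic.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def conditional_get(last_modified: datetime, if_modified_since: Optional[str]) -> int:
    """Return 304 if the client's copy is still current, otherwise 200."""
    # HTTP dates have one-second resolution, so drop microseconds first.
    last_modified = last_modified.replace(microsecond=0)
    if if_modified_since:
        try:
            client_time = parsedate_to_datetime(if_modified_since)
            if last_modified <= client_time:
                return 304
        except (TypeError, ValueError):
            pass  # unparseable or incomparable header: send the full response
    return 200


mtime = datetime(2010, 8, 30, 12, 0, 0, tzinfo=timezone.utc)
header = format_datetime(mtime, usegmt=True)  # what the server sent as Last-Modified
newer = mtime + timedelta(hours=1)            # the file was edited afterwards

print(conditional_get(mtime, None))    # first visit, nothing cached
print(conditional_get(mtime, header))  # revalidation, file unchanged
print(conditional_get(newer, header))  # revalidation, file changed
```

With no far-future Expires header, this round trip is all that stands between an updated file and the user, which is why no renaming is needed in that configuration.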

I have also been thinking about this for a site I support where it would be a big job to change all references. I have two ideas:
1. Set distant cache expiry headers and apply the changes you suggest for the most commonly downloaded files. For other files, set the headers so they expire after a very short time, e.g. 10 minutes. Then if you have a 10-minute downtime when updating the application, caches will be refreshed by the time users go to the site. General site navigation should be improved, as the files will only need downloading every 10 minutes, not on every click.
2. Deploy each new version of the application to a different context that contains the version number, e.g. www.site.com/app_2_6_0/. I'm not really sure about this, as users' bookmarks would be broken on each update.

I believe that a combination of solutions works best:
Setting cache expiry dates for each type of resource (image, page, etc.) appropriately for that resource, for example:
Your static "About", "Contact", etc. pages probably aren't going to change more than a few times a year, so you could easily put a cache time of a month on these pages.
Images used in these pages could have eternal cache times, as you are more likely to replace an image than to change one.
Avatar images might have an expiry time of a day.
Some resources need modified dates in their names. For example avatars, generated images, and the like.
Some things should never be cached: new pages, user content, etc. In these cases you should cache on the server, but never on the client side.
In the end you need to carefully consider each type of resource to determine what cache time to instruct the browser to use, and always be conservative if you are unsure. You can increase the time later, but it's much more painful to uncache something.
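One way to keep such per-type decisions in one place is a small policy table. The sketch below is in Python and everything in it is an assumption for illustration: the resource type names, the one-year stand-in for "eternal", and the conservative fallback for types that have not been classified yet.

```python
# Hypothetical policy table following the examples above; tune values per site.
DAY = 24 * 60 * 60

POLICIES = {
    "static_page": f"public, max-age={30 * DAY}",   # About, Contact: about a month
    "page_image": f"public, max-age={365 * DAY}",   # "eternal": replaced, not edited
    "avatar": f"public, max-age={DAY}",             # one day
    "user_content": "no-store",                     # never cached on the client
}


def cache_control(resource_type: str) -> str:
    """Return the Cache-Control value for a type; revalidate if unclassified."""
    return POLICIES.get(resource_type, "no-cache")


for kind in ("static_page", "avatar", "something_new"):
    print(kind, "->", cache_control(kind))
```

Falling back to "no-cache" for unknown types follows the advice to be conservative: a lifetime can be lengthened later, but a long one already sent to browsers cannot be recalled.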

You might want to check out the approach taken by the grails "uiperformance" plugin, which you can find here. It does a lot of the things you mention, but automates them (set expiry time to a long time, then increments version numbers when files change).
So if you're using grails, you get this stuff for free. If you are not - maybe you can borrow the techniques employed.
Also - borrowed from the ui-performance page - read the following 14 rules.

ETags seemingly provide a solution for this...
As per http://httpd.apache.org/docs/2.0/mod/core.html#fileetag, we can set the server to generate ETags based on file size (instead of time/inode/etc). This generation should be constant across multiple server deployments.
Just enable it in (/etc/apache2/apache2.conf)
FileETag Size
& you should be good!
That way, you can simply reference your images as <img src='/path/to/foo.png' /> and still use all the goodness of HTTP caching.
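One caveat with a size-only validator is worth seeing in miniature: an edit that leaves the byte count unchanged produces the same tag, so the browser would be told its copy is still current. The Python sketch below uses a made-up tag format (it is not Apache's exact ETag format) just to show the collision.

```python
def size_etag(content: bytes) -> str:
    """Illustrative size-only validator; not Apache's exact ETag format."""
    return f'"{len(content):x}"'


old = b"body { color: red; }"
new = b"body { color: tan; }"  # different content, same length

print(size_etag(old), size_etag(new))
print(size_etag(old) == size_etag(new))
```

Such same-size edits are uncommon for images but quite plausible for small CSS or JS tweaks, so this setting trades a little correctness for consistency across servers.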

Related

Send an entire web app as 1 HTTP response (html, js, css, images, ...)

Traditionally a browser will parse HTML and then send further requests to the server for all related data. This seems inefficient to me, since it might require a large number of requests, even though my server already knows that a browser that wants to use this web application will need all of its resources.
I know that js and css could be inlined, but that complicates server side code and img data as base64 bloats the size of the data... I'm aware as well that rendering can start before all assets are downloaded, which would potentially no longer work (depending on the implementation). I still feel that streaming an entire application in one go should be faster on slow connections than making tens of requests separately.
Ideally I would like the server to stream an entire directory into one HTTP response.
Does any model for this exist?
Does the reasoning make sense?
ps: If browser support for this is completely lacking, I'm wondering about a 2 step approach. Download a small JavaScript which downloads a compressed web app file, extracts it and plugs the resources into the page. Is anyone already doing something like this?
Update
I found one: http://blog.another-d-mention.ro/programming/read-load-files-from-zip-in-javascript/
I started to research related issues in order to find the way to get the best results with what seems possible without changing web standards, and I wondered about caching. If I could send the last modified date of every subresource of a page along with the initial HTML page, a browser could avoid sending If-Modified-Since requests once it has loaded every resource at least once. This would in effect be better than sending all resources with the initial request, since that would be beneficial only on the first load, and detrimental on subsequent loads, since it would be better for browsers to use their cache (as Barmar pointed out).
Now it turns out that even with a web extension you can not get hold of the if-modified-since header and so you surely can't tell the browser to use the cached version instead of contacting the server.
I then found this post from Facebook on how they tried to reduce traffic by hashing their static files and giving them a 1 year expiry date. This would mean that the URL guarantees the content of the file. They still saw plenty of unnecessary if-modified-since requests and they managed to convince Firefox and Chrome to change the behaviour of their reload buttons to no longer reload static resources. For Firefox this requires a new cache-control: immutable header, for Chrome it doesn't.
I then remembered that I had seen something like that before and it turns out there is a solution for this problem which is more convenient than hashing the contents of resources and serving them from a database for at least ten years. It is to just put a new version number in the filename. The even more convenient solution would be to just add a version query string, but it turns out that that doesn't always work.
Admittedly, changing your filenames all the time is a nuisance, because files referencing these files also need to change. However the files don't actually need to change. If you control the server it might be as simple as writing a redirect rule to make sure that logo.vXXXX.png will be redirected to logo.png (where XXXX is the last modified timestamp in seconds since epoch)[1]. Now let your template system automatically generate the timestamp, like in wordpress' wp_enqueue_script. WordPress actually satisfies itself with the query string technique. Now you can set the expiration date to a far future and use the immutable cache header. If browsers respect the cache control, you can now safely ignore etags and if-modified-since headers, since they are now completely redundant.
This solution guarantees the browser shall never ask for cache validation and yet you shall never see a stale resource, without having to decide on the expiry date in advance.
It doesn't answer the original question here about how to avoid having to do multiple requests to fetch the resources on the same page on a clean cache, but ever after (as long as the browser cache doesn't get cleared), you're good! I suppose that's good enough for me.
[1] You can even avoid the server overhead of checking the timestamp on every resource every time a page references it by using the version number of your application. In debug mode, for development, one can use the timestamp to avoid having to bump the version on every modification of the file.
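The naming scheme and the footnote can be sketched together as follows. This is a Python illustration under stated assumptions: pick_version, versioned_url, the sample version "42" and the sample timestamp are all hypothetical, and in real use the mtime argument would come from something like os.path.getmtime on the file.

```python
from pathlib import PurePosixPath


def pick_version(app_version: str, mtime: float, debug: bool) -> str:
    """Use the file's mtime while developing, the release version otherwise."""
    return str(int(mtime)) if debug else app_version


def versioned_url(url_path: str, version: str) -> str:
    """Insert '.v<version>' before the extension: logo.png -> logo.v42.png."""
    p = PurePosixPath(url_path)
    return str(p.with_name(f"{p.stem}.v{version}{p.suffix}"))


# Headers that suit a URL whose content never changes under the same name.
IMMUTABLE_HEADERS = {"Cache-Control": "public, max-age=31536000, immutable"}

release = pick_version("42", 1700000000.0, debug=False)
develop = pick_version("42", 1700000000.0, debug=True)
print(versioned_url("/img/logo.png", release))  # /img/logo.v42.png
print(versioned_url("/img/logo.png", develop))  # /img/logo.v1700000000.png
```

In release mode no file system call is needed per reference, which is the overhead the footnote is trying to avoid.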

How do I set caching headers for my CSS/JS but ensure visitors always have the latest versions?

I'd like to speed up my site's loading time in part by ensuring all CSS/JS is being cached by the browser, as recommended by Google's PageSpeed tool. But I'd like to ensure that visitors have the latest CSS/JS files, if they are updated and the cache now contains old code.
From my research so far, appending something like "?459454" to the end of the CSS/JS url is popular. But wouldn't that force the visitor's browser to re-download the CSS/JS file every time?
Is there a way to set the files to be cached by the browser, but ensure the browser knows about updated versions of the cached files?
If you're using Apache, you can use mod_pagespeed (mentioned earlier by symcbean) to do this automatically.
It would work best if you also use the ModPagespeedLoadFromFile directive since that will create a new URL as soon as it detects that the resource has changed on disk, however it will work fine without that (it will use the cache expiry time returned when it fetches the resource to rewrite it).
If you're using nginx, you could use ngx_pagespeed.
If you're using IIS, you could use IISpeed, which is not a Google product and I don't know its full feature set.
Version numbers will work, but you can also append a hash of the file to the filename with your web framework or asset build script:
<script src="script-5054a101c8b164cbfa570d97fe23cc0d.js"></script>
That way, once your HTML changes to reflect this new version, browsers will just download and cache the updated version of your script.
As you say, append a query string to the URL of the asset, but only change it if the content is different, or change it when you deploy a new version.
appending something like "?459454" to the end of the CSS/JS url is popular. But wouldn't that force the visitor's browser to re-download the CSS/JS file every time?
No, it won't force them to download each time. However, there are a lot of intermediate proxies out there which ignore query strings on cacheable content - hence many tools (including mod_pagespeed, which does automatic URL rewriting based on file contents, and content merging on the fly along with lots of other cool tricks) move the version information into the path / filename.
If you've only got .htaccess type access then you can strip the version information out to map direct to a file, or use a scripted 404 redirector (but this is probably only a good idea if you're behind a caching reverse proxy).

Versioning and caching static files: CSS, JS, images -- What to consider

For an (enterprise) web project I want to keep previous versions of the static files so that projects can decide for themselves when they are ready to implement design changes. My initial plan is to provide folders for static content like so:
company.com/static/1.0.0/
company.com/static/1.0.0/css/
company.com/static/1.0.0/js/
company.com/static/1.0.0/images/
company.com/static/2.0.0/
company.com/static/2.0.0/css/
company.com/static/2.0.0/js/
company.com/static/2.0.0/images/
Each file in these folders should then have a cache-policy to cache "forever" -- one year at least. I also plan to concatenate css files and js files into one, in order to minimize number of requests.
Then I would also provide a current folder (which symlinks to the latest released version):
company.com/static/current/
company.com/static/current/css/
company.com/static/current/js/
company.com/static/current/images/
This will solve my first problem (that projects and sub websites can lock their code to a certain version and can upgrade whenever they are ready).
But then I can see a caching issue. Now I cannot "just" cache the current folder, since it will change for each release. What should my caching policies be on that folder?
Also, for each release, most of the static files will never change anyway. Is it relevant to cache them forever, and rename if there are changes?
I am looking for advice here, since I want to know about your best trade-off between caching and changing the files.
Beware of HTTP caching. I looked into this some time ago.
my blog article on the HTTP caching
There are three approaches you can select from:
Use resource's path as a cache key, i.e. when it changed - the browser will have to download new version of your resources. In this case you don't need /current folder at all, you just need to avoid .html page caching and put appropriate path to your resources in it.
You can point the browser to /current folders only and add an ETag to your resources. In this case another server request will be made from the client, but it will be a conditional request (i.e. with an If-None-Match header), so you can return a 304 response (with no resource body) until your customer decides to migrate to another version. Another drawback of such a solution (if you have several customers who use different versions) is that the /current folder will contain only a single version of the design.
As you're going to concatenate resources into single files, you can specify resource version as part of url: /current/js/combined.js?version=1.0.0.0 But this is not much different from first approach.
Hope this helps.
It might be worth your while looking at how Google, Microsoft etc. have implemented the caching policies for their jQuery CDNs
Your policy of caching forever is OK for the versioned URLs.
For the current URLs you're obviously going to need a shorter expiry time.
Couple of things to consider:
How are the applications going to be able to test against /current/ i.e. if they use it how do you know a change isn't going to break an existing application?
Caching forever is only really about reducing requests during the 'current session', as most browser caches aren't big enough to hold files for a long time (they get removed as people browse other sites).
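The split between versioned and current URLs could be expressed as a single rule keyed on the request path. This is a minimal Python sketch: the function name, the one-year lifetime for versioned folders and the five-minute lifetime for /static/current/ are assumptions picked for the example, not recommendations from the answers above.

```python
import re

# Matches the versioned folders from the question, e.g. /static/1.0.0/css/site.css
VERSIONED = re.compile(r"^/static/\d+\.\d+\.\d+/")


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value from the URL path alone."""
    if VERSIONED.match(path):
        return "public, max-age=31536000"  # one year: content never changes here
    if path.startswith("/static/current/"):
        return "public, max-age=300"       # hypothetical five-minute lifetime
    return "no-cache"                      # everything else: revalidate each time


for p in ("/static/2.0.0/js/all.js", "/static/current/js/all.js", "/index.html"):
    print(p, "->", cache_control_for(p))
```

The short lifetime on /static/current/ bounds how long a client can lag behind a release, at the cost of more frequent revalidation.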

Lazy HTTP caching

I have a website which is displayed to visitors via a kiosk. People can interact with it. However, since the website is not locally hosted, and uses an internet connection - the page loads are slow.
I would like to implement some kind of lazy caching mechanism such that as and when people browse the pages - the pages and the resources referenced by the pages get cached, so that subsequent loads of the same page are instant.
I considered using HTML5 offline caching - but it requires me to specify all the resources in the manifest file, and this is not feasible for me, as the website is pretty large.
Is there any other way to implement this? Perhaps using HTTP caching headers? I would also need some way to invalidate the cache at some point to "push" the new changes to the browser...
The usual approach to handling problems like this is with HTTP caching headers, combined with smart construction of URLs for resources referenced by your pages.
The general idea is this: every resource loaded by your page (images, scripts, CSS files, etc.) should have a unique, versioned URL. For example, instead of loading /images/button.png, you'd load /images/button_v123.png and when you change that file its URL changes to /images/button_v124.png. Typically this is handled by URL rewriting over static file URLs, so that, for example, the web server knows that /images/button_v124.png should really load the /images/button.png file from the web server's file system. Creating the version numbers can be done by appending a build number, using a CRC of file contents, or many other ways.
Then you need to make sure that, wherever URLs are constructed in the parent page, they refer to the versioned URL. This obviously requires dynamic code used to construct all URLs, which can be accomplished either by adjusting the code used to generate your pages or by server-wide plugins which affect all text/html requests.
Then you set the Expires header for all resource requests (images, scripts, CSS files, etc.) to a date far in the future (e.g. 10 years from now). This effectively caches them forever. This means that all resources loaded by each of your pages will always be fetched from cache; cache invalidation never happens, which is OK because when the underlying resource changes, the parent page will use a new URL to find it.
Finally, you need to figure out how you want to cache your "parent" pages. How you do this is a judgement call. You can use ETag/If-None-Match HTTP headers to check for a new version of the page every time, which will very quickly load the page from cache if the server reports that it hasn't changed. Or you can use Expires (and/or Max-Age) to reload the parent page from cache for a given period of time before checking the server.
If you want to do something even more sophisticated, you can always put a custom proxy server on the kiosk-- in that case you'd have total, centralized control over how caching is done.
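To illustrate the "server-wide plugin" idea of rewriting URLs in outgoing pages, here is a minimal sketch in Python. The manifest contents and the function name are hypothetical, and a regular expression is used only to keep the example short; a real filter would more likely hook into the template system or use an HTML parser.

```python
import re

# Hypothetical manifest: plain path -> versioned path, produced at build time.
MANIFEST = {
    "/images/button.png": "/images/button_v124.png",
    "/js/app.js": "/js/app_v87.js",
}

# Matches src="..." or href='...' and captures the attribute, quote and URL.
ATTR = re.compile(r'''(\b(?:src|href)=)(["'])(.*?)\2''')


def version_urls(html: str, manifest: dict) -> str:
    """Replace src/href values found in the manifest with versioned URLs."""
    def swap(match: re.Match) -> str:
        prefix, quote, url = match.groups()
        return f"{prefix}{quote}{manifest.get(url, url)}{quote}"
    return ATTR.sub(swap, html)


page = '<img src="/images/button.png"><a href="/about">About</a>'
print(version_urls(page, MANIFEST))
```

URLs that are not in the manifest, such as links to other pages, pass through untouched, so parent pages keep their plain addresses and only static resources are versioned.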

Question about using subdomains to force caching

I haven't had a huge opportunity to research the subject but I figure I'll just ask the question and see if we can create a knowledge base on the subject here.
1) Using subdomains will force a client side cache, is this by default or is there an easy way for a client to disable it? More curious about what kind of a percentage of users I should be expecting to affect.
2) What all will be cached? Images? Stylesheets? Flash SWFs? Javascripts? Everything?
3) I remember reading that you must use a subdomain or www in your URL for this to work, is this correct? (and does this mean SO won't allow it?)
I plan on integrating this onto all of my websites eventually but first I am going to try to do it for a network of flash game websites so I am thinking www.example.com for the website will remain the same but instead of using www.example.com/images, www.example.com/stylesheets, www.example.com/javascript, & www.example.com/swfs I will just create subdomains that point to them (img.example.com, css.example.com, js.example.com & swf.example.com respectively) -- is this the best course of action?
Using subdomains for content elements isn't so much to force caching, but to trick a browser into opening more connections than it might otherwise do. This can speed up page load time.
Caching of those elements is entirely down to the HTTP headers delivered with that content.
For static files like CSS, JS etc, a server will typically tell the client when the file was modified, which allows a browser to ask for the file "If-Modified-Since" that timestamp. Specifics of how to improve on this by adding some extra caching headers would depend on which webserver you use. For example, with Apache you can use the mod_expires module to set the Expires header, or the Header directive to output other types of cache control headers.
As an example, if you had a subdirectory with your css files in, and wanted to ensure they were cached for at least an hour, you could place a .htaccess in that directory with these contents
ExpiresActive On
ExpiresDefault "access plus 1 hours"
Check out YSlow's documentation. YSlow is a plugin for Firebug, the amazing Firefox web development plugin. There is lots of good info on a number of ways to speed up your page loads, one of which is using one or more subdomains to encourage the browser to do more parallel object loads.
One thing I've done on two Django sites is to use a custom template tag to create pseudo-paths to images, css, etc. The path contains the time-last-modified as a pseudo directory. This path component is stripped out by an Apache .htaccess mod_rewrite rule. The object is then given a 10 year time-to-live (ExpiresDefault "now plus 10 years") so the browser will only load it once. If the object changes, the pseudo path changes and the browser will fetch the updated object.

Resources