Leverage browser caching: Expires or max-age, Last-Modified or ETag?

I'm having difficulty finding a clear-cut, practical explanation of the correct way to leverage browser caching to increase page speed.
According to this site:
It is important to specify one of Expires or Cache-Control max-age,
and one of Last-Modified or ETag, for all cacheable resources. It is
redundant to specify both Expires and Cache-Control: max-age, or to
specify both Last-Modified and ETag.
Is this correct? If so, should I use Expires or max-age? I think I have a general understanding of what both of those are, but I don't know which is usually best to use.
If I also have to use Last-Modified or ETag, which one of those? I think I understand Last-Modified, but I'm still very fuzzy on the ETag concept.
Also, which files should I enable browser caching for?

Is this correct?
Yes. Expires and Cache-Control: max-age do the same job in two different ways, and the same goes for Last-Modified and ETag.
If so, should I use Expires or max-age?
Expires depends on the accuracy of the user's clock, so it's mostly a bad choice. Since virtually all browsers support HTTP/1.1 now, use max-age to tell the browser the file is good for that many seconds. For example, a one-day cache would be:
Cache-Control: max-age=86400
Note that when both Cache-Control: max-age and Expires are present, Cache-Control takes precedence.
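To make that concrete, here's a minimal sketch of setting the header server-side, using Python's standard http.server purely for illustration (in practice your web server or framework would set this):
import http.server

class CachingHandler(http.server.SimpleHTTPRequestHandler):
    def end_headers(self):
        # Tell the browser this response is fresh for one day (86400 seconds)
        self.send_header('Cache-Control', 'max-age=86400')
        super().end_headers()

http.server.HTTPServer(('', 8000), CachingHandler).serve_forever()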
If I also have to use Last-Modified or ETag, which one of those? I think I understand Last-Modified
You're right, Last-Modified is the better choice. Although it's a time, it's generated by the server, so there's no issue with the user's clock; that's the same reason Last-Modified beats Expires.
The browser sends back the Last-Modified value the server sent the last time it asked for the file, and if the file hasn't changed, the server answers with an empty «304 Not Modified» response.
Also note that ETag can be useful too, because Last-Modified has a resolution of one second, so you can't distinguish two different versions of a resource that share the same Last-Modified value. [2]
ETag requires somewhat more computation than Last-Modified, as it's a signature of the current state of the file (similar to an MD5 sum or a CRC32).
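To see validation at work, here's a minimal sketch using the Python requests library (the URL is a placeholder): the first response carries the validators, and replaying them should get a 304 if nothing changed.
import requests

url = 'http://example.com/image.png'  # placeholder resource
first = requests.get(url)

# Replay the validators the server gave us; 304 means our copy is still good
headers = {}
if 'Last-Modified' in first.headers:
    headers['If-Modified-Since'] = first.headers['Last-Modified']
if 'ETag' in first.headers:
    headers['If-None-Match'] = first.headers['ETag']

second = requests.get(url, headers=headers)
print(second.status_code)  # 304 if unchanged, 200 with a full body otherwise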
Also, which files should I enable browser caching for?
All files can benefit from caching. You've got two different approaches:
with max-age: useful for files that rarely or never change (images, CSS, JavaScript). For as long as the max-age value, the browser won't send any request to the server at all, so you'll see the page load really fast the second time. If you need to update such a file, change its URL: append a query string with the date of the change (for example /image.png?20110602) or, for better proxy caching, put the version in the path or filename (/20110602/image.png or /image.20110602.png). This way you can make files "expire" when it's urgent (remember that the browser almost never hits the server once it has a max-age file). The main use is to speed things up and limit the requests sent to the server. A sketch of this versioning trick follows the next item.
with Last-Modified: can be set on all files (including those with max-age). Even on dynamic pages, you may not change the content for a while (even if it's only 10 minutes), so it can be useful there too. The main use here is to tell the browser «keep asking me for this file; if it's new, I'll send you the new one». A request is sent on each page load, but the answer is empty if the file is still good (304 Not Modified), so you save bandwidth.
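Here is a minimal Python sketch of the versioning trick from the first approach (the hashing scheme and filename layout are just one possible convention, not a fixed rule):
import hashlib
import pathlib

def versioned_name(path):
    # Fingerprint the file contents so the URL changes whenever the file does
    digest = hashlib.md5(pathlib.Path(path).read_bytes()).hexdigest()[:8]
    stem, ext = path.rsplit('.', 1)
    return f'{stem}.{digest}.{ext}'

print(versioned_name('image.png'))  # e.g. image.3f2a9c1b.png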
The more you cache, the faster your pages will show up. But flushing caches is a difficult task, so use this with care.
A good place to learn all this with many explanations: http://www.mnot.net/cache_docs/
[2]: RFC 7232, ETag: https://www.rfc-editor.org/rfc/rfc7232#section-2.3

Related

RefreshHit from cloudfront even with cache-control: max-age=0, no-store

CloudFront is getting a RefreshHit for a request that is not supposed to be cached at all.
It shouldn't be cached because:
It has cache-control: max-age=0, no-store;
The Minimum TTL is 0; and
I've created multiple invalidations (on /*) so this cached resource isn't from some historical deploy
Any idea why I'm getting RefreshHits?
I also tried modifying Cache-Control to cache-control: no-store, stale-if-error=0 and creating a new invalidation on /*, and now I'm seeing a cache hit (this time in Firefox).
After talking extensively with support, they explained what's going on.
So, if you have no-store and a Minimum TTL of 0, then CloudFront will indeed not store your resources. However, if your origin takes a long time to respond (so it's likely under heavy load), then while CloudFront waits for the response, any identical request that arrives (identical with respect to the cache key) is served that same single response. This is done to lighten the load on the server (see docs).
Support was calling these "collapse hits" although I don't see that in the docs.
So, it seems you can't have a single Behavior serving some pages that must have a unique response per request while serving other pages that are cached. Support said:
I just confirmed that, with min TTL 0 and cache-control: no-store, we cannot disable collapse hit. If you do need to fully disable cloudfront cache, you can use cache policy CachingDisabled
We'll be making a behavior for every path prefix that we need caching on. It seems there was no better way than this for our use case (transitioning our website one page at a time from a non-cacheable, backend-rendered jinja2/jQuery stack to a cacheable, client-side rendered React/Next.js one).
It's probably too late for OP's project, but I would personally handle this with a simple origin-response Lambda@Edge function and a single cache behavior for /* with a single cache policy. You can write all of the filtering/caching logic in the origin-response function. That way you only manage one bit of function code in one place, instead of a bunch of individual cache behaviors (and possibly a bunch of cache policies).
For example, consider an origin-response function that looks for a cache-control response header coming from your origin: if it exists, pass it back to the client; if it doesn't exist (or if you want to overwrite it with something else), create the response header there. The edge doesn't care whether the cache-control header came from your origin or from an origin-response Lambda; to the edge, it is all the same.
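As a rough Python sketch of that origin-response handler (the fallback max-age value is an assumption for illustration, not a recommendation):
def lambda_handler(event, context):
    # CloudFront hands the origin's response to the origin-response trigger
    response = event['Records'][0]['cf']['response']
    headers = response['headers']
    # Pass through whatever cache-control the origin set;
    # otherwise inject a default so the edge and browser can cache
    if 'cache-control' not in headers:
        headers['cache-control'] = [{'key': 'Cache-Control', 'value': 'max-age=86400'}]
    return response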
Another trick you can use to avoid caching while keeping the default CloudFront behavior is to add a dummy, unused query parameter with a unique value on each request.
Python example:
import requests
import uuid

# Each call carries a unique query-string value, so each has a unique cache key
requests.get(f'http://my-test-server-x.com/my/path?nocache={uuid.uuid4()}')
requests.get(f'http://my-test-server-x.com/my/path?nocache={uuid.uuid4()}')
Note that both calls will reach the destination and will not be answered from the cache, since uuid.uuid4() always generates a unique value.
This works because, by default (unless defined otherwise in the Behavior section), the query parameters are part of the cache key.
Note: doing so bypasses the cache entirely, so your backend might be loaded with requests.

How to avoid AJAX caching in Internet Explorer 11 when additional query string parameters or using POST are not an option

I realize this question has been asked, but in modern REST practice neither the previous iterations of this question nor their answers are accurate or sufficient. A definitive answer to this question is needed.
The problem is well known: IE (even 11) caches AJAX requests, which is really, really dumb. Everyone understands this.
What is not well understood is that none of the previous answers are sufficient. Every previous instance of this question on SO is marked as sufficiently answered by either:
1) Using a unique query string parameter (such as a unix timestamp) on each request, so as to make each request URL unique, thereby preventing caching.
-- or --
2) using POST instead of GET, as IE does not cache POST requests except in certain unique circumstances.
-- or --
3) using 'cache-control' headers passed by the server.
IMO in many situations involving modern REST API practice, none of these answers are sufficient or practical. A REST API will have completely different handlers for POST and GET requests, with completely different behavior, so POST is typically not an appropriate or correct alternative to GET. As well, many APIs have strict validation around them, and for numerous reasons, will generate 500 or 400 errors when fed query string parameters that they aren't expecting. Lastly, often we are interfacing with 3rd-party or otherwise inflexible REST APIs where we do not have control over the headers provided by the server response, and adding cache control headers is not within our power.
So, the question is:
Is there really nothing that can be done on the client-side in this situation to prevent I.E. from caching the results of an AJAX GET request?
Caching is normally controlled through setting headers on the content when it is returned by the server. If you're already doing that and IE is ignoring them and caching anyway, the only way to get around it would be to use one of the cache busting techniques mentioned in your question. In the case of an API, it would likely be better to make sure you are using proper cache headers before attempting any of the cache busting techniques.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching_FAQ
Cache-control: no-cache
Cache-control: no-store
Pragma: no-cache
Expires: 0
If you don't control the API, you might be able to disable IE caching by adding request headers to the AJAX GETs:
'Cache-Control': 'no-cache, no-store, must-revalidate',
'Pragma': 'no-cache',
'Expires': '0'

How long should static file be cached?

I'd like to set browser caching for some Amazon S3 files. I plan to use this metadata:
Cache-Control: max-age=86400, must-revalidate
that's equal to one day.
Many of the examples I see look like this:
Cache-Control: max-age=3600
Why only 3600 and why not use must-revalidate?
For a file that I rarely change, how long should it be cached?
What happens if I update the file and need that update to be seen immediately, but its cache doesn't expire for another 5 days?
Why only 3600?
Presumably because the author of that particular example decided that one hour was an appropriate cache timeout for that page.
Why not use must-revalidate?
If the response does not contain information that strictly must be fresh, omitting must-revalidate could in theory let a few more requests be served from the cache. See this answer for details, the most relevant part being from the HTTP spec:
When a cache has a stale entry that it would like to use as a response
to a client's request, it first has to check with the origin server
(or possibly an intermediate cache with a fresh response) to see if
its cached entry is still usable.
For a file that I rarely change, how long should it be cached?
Much web performance advice says to set a cache expiration very far into the future, such as a few years. That way, the client browser will only download the data once, and subsequent visits will be served from the cache. This works well for "truly static" files, such as JavaScript or CSS.
On the other hand, if the data is dynamic but does not change too often, you should set an expiration time that is reasonable for your specific scenario. Do you need to get the newest version to the customer as soon as it's available, or is it okay to serve a stale version? Do you know when the data changes? Etc. An hour or a day is often an appropriate trade-off between server load, client performance, and data freshness, but it depends on your requirements.
What happens if I update the file and need that update to be seen immediately, but its cache doesn't expire for another 5 days?
Give the file a new name, or append a value to the query string. You will of course need to update all links. This is the general approach when static resources need to change.
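Since the question is about S3, note that the Cache-Control metadata can be set at upload time. Here's a sketch with boto3, where the bucket, file, and hashed key names are placeholders:
import boto3

s3 = boto3.client('s3')
# A far-future max-age is safe here because the key is content-versioned:
# a changed file gets a new name, so stale copies are never served
s3.upload_file(
    'style.css',
    'my-bucket',                    # placeholder bucket
    'static/style.3f2a9c1b.css',    # content-hashed key, per the renaming advice above
    ExtraArgs={'CacheControl': 'max-age=31536000'},
)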
Also, here is a nice overview of the cache control attributes available to you.

What's the default value of cache-control?

My problem is: sometimes the browser over-caches some resources even after I've modified them. But after F5, everything is fine.
I studied this case for a whole afternoon. Now I completely understand the point of "Last-Modified" and "Cache-Control", and I know how to solve my issue (just .js?version or an explicit max-age=xxxx). But the problem is still unsolved: how does the browser handle a response header without "Cache-Control", like this:
Content-Length: 49675
Content-Type: text/html
Last-Modified: Thu, 27 Dec 2012 03:03:50 GMT
Accept-Ranges: bytes
Etag: "0af7fcbdee3cd1:972"
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Date: Thu, 24 Jan 2013 07:46:16 GMT
The browser clearly caches these resources when I hit Enter in the address bar.
RFC 7234 details what browsers and proxies should do by default:
Although caching is an entirely OPTIONAL feature of HTTP, it can be
assumed that reusing a cached response is desirable and that such
reuse is the default behavior when no requirement or local
configuration prevents it. Therefore, HTTP cache requirements are
focused on preventing a cache from either storing a non-reusable
response or reusing a stored response inappropriately, rather than
mandating that caches always store and reuse particular responses. [https://www.rfc-editor.org/rfc/rfc7234#section-2]
Caching is usually enabled by default in browsers, so cache-control can be used either to customise this behaviour or to disable it.
The time the browser considers a cached response fresh is usually relative to when it was last modified:
Since origin servers do not always provide explicit expiration times, a cache MAY assign a heuristic expiration time when an explicit time is not specified, employing algorithms that use other header field values (such as the Last-Modified time)... If the response has a Last-Modified header field (Section 2.2 of [RFC7232]), caches are encouraged to use a heuristic expiration value that is no more than some fraction of the interval since that time. A typical setting of this fraction might be 10%. [https://www.rfc-editor.org/rfc/rfc7234#section-4.2.2]
This post has details of how the different browsers calculate that value.
The freshness lifetime is calculated based on several headers. If a "Cache-control: max-age=N" header is specified, then the freshness lifetime is equal to N. If this header is not present, which is very often the case, it is checked if an Expires header is present. If an Expires header exists, then its value minus the value of the Date header determines the freshness lifetime. Finally, if neither header is present, look for a Last-Modified header. If this header is present, then the cache's freshness lifetime is equal to the value of the Date header minus the value of the Last-modified header divided by 10.
Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching#Freshness
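As a rough Python sketch of that heuristic applied to the headers from the question (the 10% fraction is the typical value, not a guarantee for any particular browser):
from email.utils import parsedate_to_datetime

date = parsedate_to_datetime('Thu, 24 Jan 2013 07:46:16 GMT')
last_modified = parsedate_to_datetime('Thu, 27 Dec 2012 03:03:50 GMT')

# Heuristic freshness: 10% of the time since the resource last changed
freshness = (date - last_modified) / 10
print(freshness)  # roughly 2 days 19 hours for these headers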
The default cache-control header is: private
A cache mechanism may cache this page in a private cache and resend it only to a single client. This is the default value. Most proxy servers will not cache pages with this setting.
Please see http://msdn.microsoft.com/en-us/library/ms524721%28v=vs.90%29.aspx
Without a cache-control header, the browser requests the resource every time it loads a new(?) page. By hitting F5 you invalidate (or even logically remove) any cached item within that page, forcing a complete reload as if no local version were available. I am unsure whether the browser removes those resources from the cache before requesting them again.
The funny part is that some browsers have 'additional' settings that cause optimizations, like requesting a resource only once per page load. If you have an image that changes on every request (like a counter), you will see only one version of it, even if you use it multiple times on the page.
The next one is that the browser reuses images that are not explicitly marked no-cache by applying some sort of local 'preferred' caching. If you want a request every time, you need to set must-revalidate and an Expires value of -1 or something like that.
So, depending on the resource, specifying nothing often triggers defaults that are not what you would expect from reading the specs.
There might also be different behaviour depending on whether the source appears to be local, a drive, or a real distant internet server. Sadly, browsers do not all act the same, and my ability to test this is quite limited.
What helps is to check out www.google.com and look at the tracking pixels their page requests (two 1x1 pixels requested from metrics.gstats.com, with a random part in the subdomain).
If you use Firebug to check the headers, you'll see that they specify the no-cache directives in every fashion possible. The headers read like this:
Alternate-Protocol 443:quic
Cache-Control no-cache, must-revalidate
Content-Length 35
Content-Type image/gif
Date Mon, 25 Nov 2013 14:33:30 GMT
Expires Fri, 01 Jan 1990 00:00:00 GMT
Last-Modified Tue, 14 Aug 2012 10:47:46 GMT
Pragma no-cache
Server sffe
X-Content-Type-Options nosniff
X-Firefox-Spdy 3
X-XSS-Protection 1; mode=block
Try these as a setting and check whether they solve the issue of the browser not picking up your changed resources. The must-revalidate directive will cause even proxy caches to request a resource every time and check for 304 Not Modified replies.
I am currently experiencing something similar. I have a localhost connection that sets the ETag, and all that happens is that the cache never asks. I did not set any caching information or the like. Specifying an ETag alone seems to cause Firefox not to request the resource again. So I am seeing something similar to your problem.
In your case, you have ETag: "0af7fcbdee3cd1:972" in the response headers, so it's also cached.

How do you handle browser cache with login/logout?

To improve performance, I'd like to add a fairly long Cache-Control (up to 30 minutes) to each page, since they do not change often. However, each page also displays the name of the user logged in (like this website).
The problem is when the user logs in or logs out: the user name must change. How can I change the user name after each login/logout action while keeping a long Cache-Control?
Here are the solutions I can think of:
An Ajax request (not cached) to retrieve and display the user name. If I have 2 requests (/user?registered and /user?new), they could be cached as well. But I am afraid this extra request would nullify my caching performance-wise.
Add a unique URL variable (?time=) to make the URL different and cancel the cache. However, I would have to add this variable to all links on my webpage, which is not very convenient code-wise.
This problem becomes greater if I actually have more content that differs between registered users and new users.
Cache-Control: private
Is usually enough in practice. It's what SO uses.
In theory, if you needed to allow for the case of variable logins from the same client, you should probably set Vary on Cookie (assuming that's the mechanism you're using for login). However, this value of Vary (along with most others) messes up IE's caching completely, so it's generally avoided. Also, it's often desirable to allow the user to step through the back/forward list, including logged-in/out pages, without having to re-fetch.
For situations where enforcing proper logged-in-ness for every page is critical (such as banking), a full Cache-Control: no-cache is typically used instead.
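A minimal sketch of this per-page choice, using Flask purely for illustration (the routes and lifetimes are assumptions):
from flask import Flask, make_response

app = Flask(__name__)

@app.route('/')
def page():
    resp = make_response('Hello, <username>')
    # The browser may cache for 30 minutes, but shared caches must not store it
    resp.headers['Cache-Control'] = 'private, max-age=1800'
    return resp

@app.route('/account')
def account():
    resp = make_response('Account details')
    # Critical pages: force revalidation on every request
    resp.headers['Cache-Control'] = 'no-cache'
    return resp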
