RefreshHit from cloudfront even with cache-control: max-age=0, no-store - caching

Cloudfront is getting a RefreshHit for a request that is not supposed to be cached at all.
It shouldn't be cached because:
It has cache-control: max-age=0, no-store;
The Minimum TTL is 0; and
I've created multiple invalidations (on /*) so this cached resource isn't from some historical deploy
Any idea why I'm getting RefreshHits?
I also tried modifying Cache-Control to cache-control: no-store, stale-if-error=0, creating a new invalidation on /*, and now I'm seeing a cache hit (this time in Firefox).

After talking extensively with support, they explained what's going on.
So, if you have no-store and a Minimum TTL of 0, then CloudFront will indeed not store your resources. However, if your origin is slow to respond (likely because it's under heavy load), and CloudFront receives another identical request (identical with respect to the cache key) while it is still waiting for the first response, it will send the single response to both requests. This is done to lighten the load on the server. (see docs)
Support was calling these "collapse hits" although I don't see that in the docs.
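The collapsing behavior support described can be sketched in a few lines. This is my own toy illustration of the general technique, not CloudFront's implementation: while one request for a given cache key is in flight, identical requests wait for and share its response instead of hitting the origin again.

```python
import threading
import time

# Toy request collapsing: one in-flight fetch per cache key; concurrent
# identical requests block until the owner's response arrives, then share it.
_inflight = {}
_lock = threading.Lock()

def collapsed_fetch(cache_key, fetch_origin):
    with _lock:
        entry = _inflight.get(cache_key)
        if entry is None:
            # No identical request in flight: this caller becomes the owner.
            entry = {"done": threading.Event(), "response": None}
            _inflight[cache_key] = entry
            is_owner = True
        else:
            is_owner = False
    if is_owner:
        try:
            entry["response"] = fetch_origin()
        finally:
            with _lock:
                _inflight.pop(cache_key, None)
            entry["done"].set()
        return entry["response"]
    # Identical request already in flight: wait and reuse its response.
    entry["done"].wait()
    return entry["response"]
```

Note that this is exactly why collapsing is independent of no-store: the second response is shared before anything is ever written to a cache.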
So, it seems you can't have a single Behavior serving some pages that must have a unique response per request while serving other pages that are cached. Support said:
I just confirmed that, with min TTL 0 and cache-control: no-store, we cannot disable collapse hit. If you do need to fully disable cloudfront cache, you can use cache policy CachingDisabled
We'll be making a behavior for every path prefix that we need caching on. There seems to have been no better way for our use case (transitioning our website one page at a time from a non-cacheable, backend-rendered Jinja2/jQuery stack to a cacheable, client-side-rendered React/Next.js one).

It's probably too late for the OP's project, but I would personally handle this with a simple origin-response Lambda@Edge function and a single cache behavior (and cache policy) for /*. You can put all of the filtering/caching logic in the origin-response function. That way you only manage one bit of function code in one place, instead of a bunch of individual cache behaviors (and possibly a bunch of cache policies).
For example, write an origin-response function that looks for a Cache-Control response header coming from your origin. If it exists, pass it back to the client; if it doesn't exist (or if you want to overwrite it with something else), create the response header there. The edge doesn't care whether the Cache-Control header came from your origin or from an origin-response Lambda. To the edge, it is all the same.
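A minimal sketch of such a handler in Python (Lambda@Edge also supports Python runtimes). The fallback value "no-store" is an assumption for illustration; you'd choose whatever default fits your site:

```python
# Origin-response Lambda@Edge sketch: pass through the origin's
# Cache-Control header if present, otherwise inject a default.
# The "no-store" default here is an illustrative assumption.
def handler(event, context):
    response = event["Records"][0]["cf"]["response"]
    headers = response["headers"]
    # CloudFront lowercases header keys in the event structure.
    if "cache-control" not in headers:
        headers["cache-control"] = [
            {"key": "Cache-Control", "value": "no-store"}
        ]
    return response
```

Because the function runs on every origin response, per-path logic (e.g. longer max-age for /static/*) can live here instead of in separate cache behaviors.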

Another trick you can use to avoid caching while still using the default CloudFront behavior: add a dummy, unused query parameter set to a unique value on each request.
Python example:
import uuid
import requests
# "nocache" is a dummy parameter; its unique value changes the cache key each time
requests.get(f'http://my-test-server-x.com/my/path?nocache={uuid.uuid4()}')
requests.get(f'http://my-test-server-x.com/my/path?nocache={uuid.uuid4()}')
Note that both calls will reach the destination and will not be served from the cache, since uuid.uuid4() always generates a unique value.
This works because, by default (unless configured otherwise in the Behavior section), query parameters are part of the cache key.
Note: doing so bypasses the cache entirely, so your backend may have to absorb the full request load.

Related

How to avoid AJAX caching in Internet Explorer 11 when additional query string parameters or using POST are not an option

I realize this question has been asked before, but under modern REST practice neither the previous iterations of this question nor their answers are accurate or sufficient. A definitive answer to this question is needed.
The problem is well known, IE (even 11) caches AJAX requests, which is really really dumb. Everyone understands this.
What is not well understood is that none of the previous answers are sufficient. Every previous instance of this question on SO is marked as sufficiently answered by either:
1) Using a unique query string parameter (such as a unix timestamp) on each request, so as to make each request URL unique, thereby preventing caching.
-- or --
2) using POST instead of GET, as IE does not cache POST requests except in certain unique circumstances.
-- or --
3) using 'cache-control' headers passed by the server.
IMO in many situations involving modern REST API practice, none of these answers are sufficient or practical. A REST API will have completely different handlers for POST and GET requests, with completely different behavior, so POST is typically not an appropriate or correct alternative to GET. As well, many APIs have strict validation around them, and for numerous reasons, will generate 500 or 400 errors when fed query string parameters that they aren't expecting. Lastly, often we are interfacing with 3rd-party or otherwise inflexible REST APIs where we do not have control over the headers provided by the server response, and adding cache control headers is not within our power.
So, the question is:
Is there really nothing that can be done on the client-side in this situation to prevent I.E. from caching the results of an AJAX GET request?
Caching is normally controlled through setting headers on the content when it is returned by the server. If you're already doing that and IE is ignoring them and caching anyway, the only way to get around it would be to use one of the cache busting techniques mentioned in your question. In the case of an API, it would likely be better to make sure you are using proper cache headers before attempting any of the cache busting techniques.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching_FAQ
Cache-control: no-cache
Cache-control: no-store
Pragma: no-cache
Expires: 0
If you don't control the API, you might be able to disable IE caching by adding request headers on the ajax gets:
'Cache-Control': 'no-cache, no-store, must-revalidate', 'Pragma': 'no-cache', 'Expires': '0'

Cache Policy - caching only if request succeeded

I have enabled cache policies on a few resource endpoints. The system works quite well: responses are cached, subsequent requests hit the cache, and the cache is correctly refreshed when I set it to be refreshed.
My only concern is that sometimes a client makes a request that does not hit the cache (for example, because the cache must be refreshed), the server happens to return an error at that moment (it can happen; it's statistics...), and so the cached response is not a "normal" response (e.g. 2xx) but a 4xx or a 5xx.
I would like to know if it is possible to cache the response only if, for example, the server response code is 2xx.
I didn't find any example in the Apigee docs for doing this, although there is a cache policy parameter called "SkipCachePopulation" that I think I can use for this purpose.
Any suggestion?
Yes, you can use the SkipCachePopulation field of ResponseCache. It uses a condition to determine when the cache population will not occur. Here is an example:
<SkipCachePopulation>response.status.code >= 400</SkipCachePopulation>
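For context, the element sits inside the ResponseCache policy. A sketch of a full policy follows; the policy name, cache resource name, key fragment, and TTL here are illustrative placeholders, not values from the docs:

```xml
<ResponseCache name="ResponseCache-SkipOnError">
  <CacheResource>my_cache</CacheResource>
  <CacheKey>
    <KeyFragment ref="request.uri" />
  </CacheKey>
  <ExpirySettings>
    <TimeoutInSec>3600</TimeoutInSec>
  </ExpirySettings>
  <!-- Don't populate the cache when the backend returned an error -->
  <SkipCachePopulation>response.status.code >= 400</SkipCachePopulation>
</ResponseCache>
```

With this in place, 4xx/5xx responses are still returned to the client, but they never get written into the cache.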

OkHttpResponseCache returning garbled text

Has anyone used OkHttpResponseCache on an authenticated HTTPS connection successfully? I'm facing this strange behavior where, the second time content is served from the cache, it returns garbage instead of the correct content.
The service I'm using as an example is protected with HTTP basic auth and accessed over HTTPS. I added the must-revalidate response header to allow the cache to store the response. It uses ETags as a mechanism for cache validation.
It works perfectly for the first cached response:
1 - The first time I make a service call, the server returns 200 OK. Debugging the response cache source code, I could see the response is being passed to the put() method (which stores it in the file store).
2 - The second request is made. The server is hit and returns a 304 response. Checking the cache headers and step debugging confirmed that the server indeed returned 304. One strange behavior though: in the second response (now served by the cache), the ETag header contains the value duplicated (instead of a single value, it has the value in a list twice).
3 - I make a third request. Same behavior as above, same "duplication" of the ETag value, but when I read the input stream, instead of the correct JSON text I get garbage (a bunch of black diamonds with a question mark inside, i.e. replacement characters).
I don't know if I'm doing something wrong (when configuring the cache). I don't know if this is an encoding problem, or if the cache tried to update the file store and corrupted the input. I suspect that after the first cached response (on the second attempt), the presence of a second ETag value invalidates the headers and the cache attempts to store the response once again, corrupting it. I wasn't able to confirm this by step debugging yet.
I tried to "translate" the garbage text as UTF-8, UTF-16, ASCII, and ISO to no avail. I tried using HTTP instead of HTTPS and got the same behavior.
Did someone encounter something similar and was able to solve it? Is it a bug in the cache or could I be doing something wrong?
There was a bug in the way OkHttp updates the Content-Encoding: gzip header when doing conditional GETs. It's fixed in OkHttp 1.3.

How long should static file be cached?

I'd like to set browser caching for some Amazon S3 files. I plan to use this metadata:
Cache-Control: max-age=86400, must-revalidate
which equals one day.
Many of the examples I see look like this:
Cache-Control: max-age=3600
Why only 3600 and why not use must-revalidate?
For a file that I rarely change, how long should it be cached?
What happens if I update the file and need that update to be seen immediately, but its cache doesn't expire for another 5 days?
Why only 3600 ?
Presumably because the author of that particular example decided that one hour was an appropriate cache lifetime for that resource.
Why not use must-revalidate ?
If the content is not strictly required to follow the cache rules you set, omitting must-revalidate could in theory allow a few more requests to be served from the cache. See this answer for details; the most relevant part is from the HTTP spec:
When a cache has a stale entry that it would like to use as a response
to a client's request, it first has to check with the origin server
(or possibly an intermediate cache with a fresh response) to see if
its cached entry is still usable.
For a file that I rarely change, how long should it be cached?
Much web performance advice says to set a cache expiration very far in the future, such as a few years. That way, the client browser will only download the data once, and subsequent visits will be served from the cache. This works well for "truly static" files, such as JavaScript or CSS.
On the other hand, if the data is dynamic but does not change too often, you should set an expiration time that is reasonable for your specific scenario. Do you need to get the newest version to the customer as soon as it's available, or is it okay to serve a stale version? Do you know when the data changes? An hour or a day are often appropriate trade-offs between server load, client performance, and data freshness, but it depends on your requirements.
What happens if I update the file and need that update to be seen immediately, but its cache doesn't expire for another 5 days?
Give the file a new name, or append a value to the querystring. You will of course need to update all links. This is the general approach when static resources need to change.
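The rename approach can be automated by fingerprinting the file's content, so the name changes exactly when the content does. A minimal sketch (the function name and digest length are my own choices):

```python
import hashlib
import os

def fingerprint_name(filename: str, content: bytes, digest_len: int = 8) -> str:
    """Return a versioned filename like 'app.3f2a9c1b.css'.

    The digest is derived from the file's content, so any change to the
    content produces a new name and cached copies are naturally bypassed.
    """
    stem, ext = os.path.splitext(filename)
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    return f"{stem}.{digest}{ext}"
```

Most modern asset bundlers do exactly this, which is why far-future max-age values are safe for fingerprinted static files.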
Also, here is a nice overview of the cache control attributes available to you.

leverage browser caching - expires or max-age, last-modified or etag

I'm having difficulty finding a clear-cut, practical explanation of the correct way to leverage browser caching to increase page speed.
According to this site:
It is important to specify one of Expires or Cache-Control max-age,
and one of Last-Modified or ETag, for all cacheable resources. It is
redundant to specify both Expires and Cache-Control: max-age, or to
specify both Last-Modified and ETag.
Is this correct? If so, should I use Expires or max-age? I think I have a general understanding of what both of those are but don't know which is usually best to use.
If I have to also do Last-Modified or ETag, which one of those? I think I get Last-Modified but am still very fuzzy on this ETag concept.
Also, which files should I enable browser caching for?
Is this correct?
Yes. Expires and max-age do the same thing, but in two different ways. The same goes for Last-Modified and ETag.
If so, should I use Expires or max-age?
Expires depends on the accuracy of the user's clock, so it's mostly a bad choice (and most browsers support HTTP/1.1 anyway). Use max-age to tell the browser that the file is good for that many seconds. For example, a one-day cache would be:
Cache-Control: max-age=86400
Note that when both Cache-Control and Expires are present, Cache-Control takes precedence.
If I have to also do Last-Modified or ETag, which one of those? I think I get Last-Modified
You're right, Last-Modified is the better choice. Although it's a timestamp, it's generated and compared by the server, so there's no issue with the user's clock. That's why Last-Modified doesn't have the clock problem that Expires does.
The browser sends back the Last-Modified value the server provided the last time it asked for the file, and if it's unchanged, the server answers with an empty «304 Not Modified» response.
Also note that ETag can be useful too, because Last-Modified has a resolution of one second, so you can't distinguish two versions of a resource that share the same Last-Modified value. [2]
ETag requires more computation than Last-Modified, as it's a signature of the current state of the file (similar to an MD5 sum or a CRC32).
Also, which files should I enable browser caching for?
All files can benefit from caching. There are two different approaches:
with max-age: useful for files that never change (images, CSS, JavaScript). For as long as the max-age value, the browser won't send any request to the server, so you'll see the page load really fast on the second visit. If you need to update a file, just append a question mark and the date of change (for example /image.png?20110602, or, for better proxy caching, something like /20110602/image.png or /image.20110602.png). This way you can make files "expire" when it's urgent (remember that the browser almost never hits the server once it has a max-age file). The main use is to speed things up and limit the requests sent to the server.
with Last-Modified: can be set on all files (including those with max-age). Even if you have dynamic pages, you may not change the content of a file for a while (even if only 10 minutes), so this can be useful. The main use here is to tell the browser «keep asking me for this file; if it's new, I'll send you the new one». So a request is sent on each page load, but the answer is empty if the file is still good (304 Not Modified), so you save bandwidth.
The more you cache, the faster your pages will load. But flushing caches is difficult, so use this with care.
A good place to learn all this with many explanations: http://www.mnot.net/cache_docs/
[2]: rfc7232 Etag https://www.rfc-editor.org/rfc/rfc7232#section-2.3
