Varnish default grace behavior - caching

We have some API resources that are under heavy load, where responses are dynamic, and to offload the origin servers we are using Varnish as a caching layer in front. The API responds with Cache-Control headers ranging from max-age=5 to max-age=15. Since we are using a low cache TTL, a lot of requests still end up in a backend fetch. In that sense we are not sure we understand Varnish request coalescing correctly with regard to grace. We have not touched any grace settings, used grace from VCL, or sent stale-while-revalidate headers from the backend.
So the question is: after a resource expires from the cache, will all requests for that resource wait in Varnish until the resource is fresh in the cache again, to prevent the thundering herd problem? Or will the default grace settings prevent “waiting” requests, since they will be served “stale” content while the backend fetch completes? From the docs it is not clear to us how the defaults work.

The basics about the lifetime of an object in Varnish
The total lifetime of an object is the sum of the following items:
TTL + grace + keep
Let's break this down:
The TTL defines the freshness of the content
Grace is used for asynchronous revalidation of expired content
Keep is used for synchronous revalidation of expired content
Here's the order of execution:
An object is served from cache as long as the TTL hasn't expired
When the TTL is equal to or lower than zero, revalidation is required
As long as the sum of the remaining TTL (possibly below zero) and the grace time is more than zero, stale content can be served
If there is enough grace time, Varnish will asynchronously revalidate content while serving stale content
If both the TTL and the grace have expired, synchronous revalidation is required
Synchronous revalidation uses the waiting list and is subject to request coalescing
The remaining keep time will ensure the object is kept around so that conditional requests can take place
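To make the order of execution concrete, here is a small illustrative Python sketch of that decision, based purely on the rules listed above. It is a simplified model for reasoning about an object's lifetime, not Varnish's actual implementation, and the function name is made up:

def lifetime_decision(remaining_ttl, grace, keep):
    # Simplified model of the rules above; remaining_ttl may be negative once the TTL has expired.
    if remaining_ttl > 0:
        return "serve from cache (fresh), no revalidation"
    if remaining_ttl + grace > 0:
        return "serve stale from cache, revalidate asynchronously in the background"
    if remaining_ttl + grace + keep > 0:
        return "object only kept for conditional requests; revalidate synchronously (waiting list)"
    return "object is gone; fetch synchronously (waiting list / request coalescing)"

# Examples:
print(lifetime_decision(remaining_ttl=5, grace=10, keep=0))    # fresh
print(lifetime_decision(remaining_ttl=-3, grace=10, keep=0))   # within grace
print(lifetime_decision(remaining_ttl=-15, grace=10, keep=0))  # out of grace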
Default values
According to http://varnish-cache.org/docs/trunk/reference/varnishd.html#default-ttl the default TTL is set to 120 seconds
According to http://varnish-cache.org/docs/trunk/reference/varnishd.html#default-grace the default grace is set to 10 seconds
According to http://varnish-cache.org/docs/trunk/reference/varnishd.html#default-keep the default keep is set to 0 seconds
What about request coalescing?
The waiting list in Varnish that is used for request coalescing is only used for non-cached objects, or for expired objects that are past their grace time.
The following scenarios will not trigger request coalescing:
TTL > 0
TTL + grace > 0
When the object is fresh or within grace, there is no need to use the waiting list, because the content will still be served from cache. In the case of objects within grace, a single asynchronous backend request will be sent to the origin for revalidation.
When an object is not in cache or out of grace, a synchronous revalidation is required, which is a blocking action. To prevent this from becoming problematic when multiple clients are requesting the same object, a waiting list is used and these requests are coalesced into a single backend request.
In the end, all the queued requests are satisfied in parallel by the same backend response.
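To illustrate the coalescing idea itself, outside of Varnish, here is a hedged Python sketch in which concurrent requests for the same cache key share one in-flight backend fetch through a Future. The names and structure are made up for the example:

import threading
from concurrent.futures import Future

inflight = {}                    # cache key -> Future of the in-flight backend fetch
inflight_lock = threading.Lock()

def coalesced_fetch(key, backend_fetch):
    # The first request for a key becomes the "leader" and performs the backend fetch;
    # all later requests for the same key wait on the same Future.
    with inflight_lock:
        fut = inflight.get(key)
        leader = fut is None
        if leader:
            fut = Future()
            inflight[key] = fut

    if leader:
        try:
            fut.set_result(backend_fetch(key))   # single backend request
        except Exception as exc:
            fut.set_exception(exc)
        finally:
            with inflight_lock:
                inflight.pop(key, None)

    return fut.result()   # every queued request is satisfied by the same backend response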
Bypassing the waiting list
But here's an important remark about request coalescing:
Request coalescing only works for cacheable content. Stateful content that can never be satisfied by a coalesced response should bypass the waiting list. If not, serialization will take place.
Serialization is a bad thing. It means that queued requests cannot be satisfied by the response, and are handled serially. This head-of-line blocking can cause significant delays.
That's why stateful/uncacheable content should bypass the waiting list.
The decision to bypass the waiting list is made by the hit-for-miss cache. This mechanism caches the decision not to cache.
The following code is used for that:
set beresp.ttl = 120s;
set beresp.uncacheable = true;
It's the kind of VCL code you'll find in the built-in VCL of Varnish. It is triggered when a Set-Cookie header is found, or when the Cache-Control header contains private, no-cache or no-store.
This implies that for the next 2 minutes the object will be served from the origin, and the waiting list will be bypassed. If a later cache miss returns a cacheable response, the object is stored in cache anyway, and hit-for-miss no longer applies.
With that in mind, it is crucial not to set beresp.ttl to zero, because that would expire the hit-for-miss information and would still result in the next request ending up on the waiting list, even though we know the response will not be cacheable.

Related

How does stale-while-revalidate interact with s-maxage in Cache-Control header?

I just want to ask about the header that I have specified for my SSR pages: public, s-maxage=3600, stale-while-revalidate=59.
Please note that my stale-while-revalidate value is 59 seconds, which is way less than the s-maxage value of 1 hour. I want to know what exactly happens when the stale-while-revalidate value is smaller than s-maxage. Is the stale-while-revalidate directive ignored?
Setting the page's Cache-Control header to s-maxage=3600, stale-while-revalidate=59 means two things:
The page is considered fresh for 3600 seconds (s-maxage=3600);
The page will continue to be served stale for up to 59 seconds after that (stale-while-revalidate=59), while revalidation is done in the background.
stale-while-revalidate is not ignored; it determines the extra time window during which revalidation occurs after the page's s-maxage has passed (i.e. while the page is stale).
Here are the cache states across the three time windows (based on https://web.dev/stale-while-revalidate/#live-example):
0 to 3600s: the cached page is fresh and is used to serve the page. No revalidation.
3601s to 3659s: the cached page is stale but is still used to serve the page. Revalidation occurs in the background to repopulate the cache.
After 3659s: the cached page is stale and is not used at all. A new request is made to serve the page and populate the cache.
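As a worked example, here is a small Python sketch (the helper is hypothetical) that pulls the two directives out of the header above and computes the window boundaries:

def parse_cache_control(header):
    # Parse a Cache-Control header into a {directive: value} dict (value is None for bare directives).
    directives = {}
    for part in header.split(","):
        name, _, value = part.strip().partition("=")
        directives[name.lower()] = int(value) if value.isdigit() else (value or None)
    return directives

cc = parse_cache_control("public, s-maxage=3600, stale-while-revalidate=59")
fresh_for = cc["s-maxage"]                                           # fresh for the first 3600 seconds
stale_served_until = cc["s-maxage"] + cc["stale-while-revalidate"]   # served stale until 3659s, revalidated in the background
print(fresh_for, stale_served_until)                                 # beyond that, the cache must fetch before serving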
Excerpt from the HTTP Cache-Control Extensions for Stale Content spec:
Generally, servers will want to set the combination of max-age and stale-while-revalidate to the longest total potential freshness lifetime that they can tolerate. For example, with both set to 600, the server must be able to tolerate the response being served from cache for up to 20 minutes.
Since asynchronous validation will only happen if a request occurs after the response has become stale, but before the end of the stale-while-revalidate window, the size of that window and the likelihood of a request during it determines how likely it is that all requests will be served without delay. If the window is too small, or traffic is too sparse, some requests will fall outside of it, and block until the server can validate the cached response.

RefreshHit from cloudfront even with cache-control: max-age=0, no-store

CloudFront is getting a RefreshHit for a request that is not supposed to be cached at all.
It shouldn't be cached because:
It has cache-control: max-age=0, no-store;
The Minimum TTL is 0; and
I've created multiple invalidations (on /*) so this cached resource isn't from some historical deploy
Any idea why I'm getting RefreshHits?
I also tried modifying Cache-Control to cache-control: no-store, stale-if-error=0 and creating a new invalidation on /*, and now I'm seeing a cache hit (this time in Firefox).
After talking extensively with support, they explained what's going on.
So, if you have no-store and a Minimum TTL of 0, then CloudFront will indeed not store your resources. However, if your origin is taking a long time to respond (likely because it is under heavy load), and CloudFront receives another identical request (identical with respect to the cache key) while it is still waiting for the response, it will send the one response to both requests. This is done in order to lighten the load on the server (see the docs).
Support was calling these "collapse hits" although I don't see that in the docs.
So, it seems you can't have a single Behavior serving some pages that must have a unique response per request while serving other pages that are cached. Support said:
I just confirmed that, with min TTL 0 and cache-control: no-store, we cannot disable collapse hit. If you do need to fully disable cloudfront cache, you can use cache policy CachingDisabled
We'll be making a behavior for every path prefix that we need caching on. It seems there was no better way than this for our use-case (transitioning our website one page at a time from a non-cacheable, backend-rendered jinja2/jQuery to a cacheable, client-side rendered React/Next.js).
It's probably too late for OP's project, but I would personally handle this with a simple origin-response Lambda@Edge function, a single cache behavior for /*, and a single cache policy. You can write all of the filtering/caching logic in the origin-response function. That way you only manage one bit of function code in one place, instead of a bunch of individual cache behaviors (and possibly a bunch of cache policies).
For example, you could write an origin-response function that looks for a Cache-Control response header coming from your origin. If it exists, pass it back to the client. If it doesn't exist (or if you want to overwrite it with something else), you can create the response header there. The edge doesn't care whether the Cache-Control header came from your origin or from an origin-response Lambda; to the edge, it is all the same.
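Here is a minimal sketch of what such an origin-response function could look like in Python, assuming the standard Lambda@Edge origin-response event shape; the fallback header value is only an example of a policy you might choose:

def handler(event, context):
    # Lambda@Edge origin-response: forward the origin's Cache-Control, or supply a default.
    response = event["Records"][0]["cf"]["response"]
    headers = response["headers"]

    if "cache-control" not in headers:
        # The origin didn't set a policy, so decide one here (example value only).
        headers["cache-control"] = [{"key": "Cache-Control", "value": "no-store"}]

    return response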
Another trick you can use to avoid caching while still using the default CloudFront behavior is to add a dummy, unused query parameter that is set to a unique value for each request.
Python example:
import requests
import uuid
requests.get(f'http://my-test-server-x.com/my/path?nocache={uuid.uuid4()}')
requests.get(f'http://my-test-server-x.com/my/path?nocache={uuid.uuid4()}')
Note that both calls will reach the destination and will not get a response from the cache, since uuid.uuid4() always generates a unique value.
This works because, by default (unless defined otherwise in the Behavior section), the query parameters are part of the cache key.
Note: doing so avoids the cache entirely, so your backend might be flooded with requests.

What is it called when two requests are being served from the same cache?

I'm trying to find the technical term for the following (and potential solutions), in a distributed system with a shared cache:
request A comes in, cache miss, so we begin to generate the response for A
request B comes in with the same cache key; since A is not completed yet and hasn't written the result to cache, B is also a cache miss and begins to generate a response as well
request A completes and stores its value in the cache
request B completes and stores its value in the cache (overwriting request A's cache value)
You can see how this can be a problem at scale, if instead of two requests, you have many that all get a cache miss and attempt to generate a cache value as soon as the cache entry expires. Ideally, there would be a way for request B to know that request A is generating a value for the cache, and wait until that is complete and use that value.
I'd like to know the technical term for this phenomenon; it's a cache race of sorts.
It's a kind of thundering herd (also commonly called a cache stampede).
Solution: when the first request A comes in, it sets a flag; if request B comes in and finds the flag, it waits. After A has loaded the data into the cache, the flag is removed.
If all the other waiting requests are woken up at once by the cache-loaded event, that wake-up can itself become a thundering herd of threads, so the solution needs to take care of this as well.
For example, in the Linux kernel only one process is woken up, even if several processes are waiting on the same event.
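Here is a minimal Python sketch of that flag idea, assuming a single-process, threaded application with an in-process dict as the shared cache; in a distributed system the "flag" would typically be a lock key in the shared cache itself (for example an atomic set-if-not-exists with a short expiry). The names are made up:

import threading
import time

cache = {}                      # shared cache: key -> (value, expires_at)
locks = {}                      # per-key regeneration locks (the "flag"); cleanup omitted for brevity
locks_guard = threading.Lock()

def get(key, ttl, regenerate):
    # Return a cached value, letting only one caller regenerate it on a miss.
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                         # fresh hit, no locking needed

    with locks_guard:                           # find or create the per-key lock
        lock = locks.setdefault(key, threading.Lock())

    with lock:                                  # only one request regenerates; the rest wait here
        entry = cache.get(key)                  # re-check: a waiter may find the value already filled
        if entry and entry[1] > time.time():
            return entry[0]
        value = regenerate()                    # the expensive backend call
        cache[key] = (value, time.time() + ttl)
        return value

Because waiters queue on the per-key lock and are released one at a time, the wake-up itself stays orderly, which mirrors the "wake only one" behaviour mentioned above.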

Cache Policy - caching only if request succeeded

I have enabled some cache policies on a few resource endpoints. The system works quite well: the response is cached, subsequent requests hit the cache, and the cache is correctly refreshed when I set it to be refreshed.
My only concern is that sometimes a client makes a request that does not hit the cache (for example because the cache must be refreshed), the server happens to return an error at that moment (it can happen, statistically), and so the cached response is not a "normal" response (e.g. 2xx) but a 4xx or 5xx response.
I would like to know if it is possible to cache the response only if, for example, the server response code is 2xx.
I didn't find any example in the Apigee docs for doing this, although there is a parameter for the cache policy called "SkipCachePopulation" that I think I can use for this purpose.
Any suggestion?
Yes, you can use the SkipCachePopulation field of ResponseCache. It uses a condition to determine when the cache population will not occur. Here is an example:
<SkipCachePopulation>response.status.code >= 400</SkipCachePopulation>

What happens if I exceed my browser's ajax request limit?

Will it be smart and queue later requests for submission after earlier requests complete, or will it do something stupid like discard later requests that push it over its maximum number?
Is the answer the same across browsers, or does it vary?
Are you referring to the fact that browsers typically limit the number of simultaneous connections to a particular host (2 is recommended by the HTTP spec)? If so, then yes, all requests will be queued. It's really no different than loading a web page that has a lot of images in it -- the initial load will result in a bunch of new requests, but they may have to wait based on the connection limit. But all of your images do load.
I'm not aware of an ajax-specific request limit.

Resources