Cache-Control public or private by default? - caching

If I don't specify the public or private directive in the Cache-Control header, what's the default behavior? Can the response be cached by proxy servers or not?

Found an answer to this on webmasters.stackexchange.com. Quote:
See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.3:
The max-age directive on a response implies that the response is
cacheable (i.e., "public") unless some other, more restrictive cache
directive is also present.
It's conceivable (likely?) that there are
proxies in the wild which break this but since the only failure mode
could be treating a public resource as private the consequences should
be minimal beyond a modest performance hit. You'll have far more
problems with proxies which do things like cache resources far beyond
your specified max-age.
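The quoted rule can be sketched as a small check. This is only an illustration, not how any particular proxy decides: the function names are made up, the parser is naive (it does not handle quoted arguments that contain commas), and it only looks at explicit directives, ignoring heuristic caching and request headers such as Authorization.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into {directive: argument or None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def shared_cache_may_store(value: str) -> bool:
    """True if the explicit directives allow a shared (proxy) cache to store."""
    d = parse_cache_control(value)
    # More restrictive directives win over max-age.
    if "private" in d or "no-store" in d:
        return False
    # max-age (or s-maxage) alone implies "public".
    return "public" in d or "max-age" in d or "s-maxage" in d
```

With this sketch, "max-age=3600" is treated as cacheable by proxies, while "private, max-age=3600" is not.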

Related

RefreshHit from cloudfront even with cache-control: max-age=0, no-store

Cloudfront is getting a RefreshHit for a request that is not supposed to be cached at all.
It shouldn't be cached because:
It has cache-control: max-age=0, no-store;
The Minimum TTL is 0; and
I've created multiple invalidations (on /*) so this cached resource isn't from some historical deploy
Any idea why I'm getting RefreshHits?
I also tried changing the header to cache-control: no-store, stale-if-error=0 and creating a new invalidation on /*, and now I'm seeing a cache hit (this time in Firefox).
After talking extensively with support, they explained what's going on.
So, if you have no-store and a Minimum TTL of 0, then CloudFront will indeed not store your resources. However, if your Origin is taking a long time to respond (so likely under heavy load), while CloudFront waits for the response to the request, if it gets another identical request (identical with respect to the cache key), then it'll send the one response to both requests. This is in order to lighten the load on the server. (see docs)
Support was calling these "collapse hits" although I don't see that in the docs.
So, it seems you can't have a single Behavior serving some pages that must have a unique response per request while serving other pages that are cached. Support said:
I just confirmed that, with min TTL 0 and cache-control: no-store, we cannot disable collapse hit. If you do need to fully disable cloudfront cache, you can use cache policy CachingDisabled
We'll be making a behavior for every path prefix that we need caching on. It seems there was no better way than this for our use-case (transitioning our website one page at a time from a non-cacheable, backend-rendered jinja2/jQuery to a cacheable, client-side rendered React/Next.js).
It's probably too late for OP's project, but I would personally handle this with a simple origin-response Lambda@Edge function and a single cache behavior (with a single cache policy) for /*. You can write all of the filtering/caching logic in the origin-response function. That way you only manage one bit of function code in one place, instead of a bunch of individual cache behaviors (and possibly a bunch of cache policies).
For example, an origin-response function that looks for a cache-control response header coming from your origin. If it exists, pass it back to the client. However if it doesn't exist (or if you want to overwrite it with something else) then you can create the response header there. The edge doesn't care if the cache-control header came from your origin, or from an origin-response Lambda. To the edge, it is all the same.
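A minimal sketch of such an origin-response function in Python might look like the following. The fallback value in DEFAULT_CACHE_CONTROL is my assumption, not something from the question; pick whatever default suits your pages.

```python
# Assumed fallback for responses where the origin sent no Cache-Control header.
DEFAULT_CACHE_CONTROL = "no-store"


def lambda_handler(event, context):
    """Origin-response trigger: keep the origin's Cache-Control, else add a default."""
    response = event["Records"][0]["cf"]["response"]
    headers = response["headers"]

    # Lambda@Edge keys headers by lowercase name; each value is a list of
    # {"key": ..., "value": ...} dicts.
    if "cache-control" not in headers:
        headers["cache-control"] = [
            {"key": "Cache-Control", "value": DEFAULT_CACHE_CONTROL}
        ]

    return response
```

Any per-path logic (for example, different defaults for different URL prefixes) could go into the same function by inspecting event["Records"][0]["cf"]["request"]["uri"].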
Another trick you can use to avoid caching while still using the default CloudFront behavior: add a dummy, unused query parameter whose value is unique for each request.
Python example:
import requests
import uuid
requests.get(f'http://my-test-server-x.com/my/path?nocache={uuid.uuid4()}')
requests.get(f'http://my-test-server-x.com/my/path?nocache={uuid.uuid4()}')
Both calls will reach the destination and will not be answered from the cache, since uuid.uuid4() generates a new unique value each time.
This works because by default (unless configured otherwise in the Behavior section) the query parameters are part of the cache key.
Note: Doing so bypasses the cache entirely, so your backend will receive every request.

Spring JCache logging cache hits

I have a method to which I added caching via the @CacheResult annotation (I actually created a proxy because I can't change the original implementation of SomethingService):
@Service
public class SomethingServiceProxyImpl implements SomethingService {
    @Autowired
    @Qualifier("somethingService")
    SomethingService somethingService;

    @Override
    @CacheResult(cacheName = "somethingCache", exceptionCacheName = "somethingExceptionCache", cachedExceptions = { SomeException.class })
    public SomePojo someMethod(String someArg) {
        return somethingService.someMethod(someArg);
    }
}
What I need now is to be able to log cache hits, meaning cases where the returned result came from the cache. I've looked at Spring Cache, JCache and EHCache (the implementation I use), and I've only found a way to listen (with listeners) for the following events: CREATED, UPDATED, REMOVED, EVICTED, EXPIRED. None of them fires when the cache returns a (non-null) result.
I don't really want to change the implementation to use the cache programmatically instead of through annotations (I actually have a lot of services to change, not just this one). Is there a good way to log those events anyway?
Some thoughts on this topic. The first two are probably the most relevant:
Don't: The code that Spring and the cache execute on a cache hit is the most performance-critical path. That's why it is not a good idea to call additional code there, or even to offer an option for it. Wiring in a log will hurt performance considerably. Usually an application already logs everything that leads to a cache request (e.g. incoming web requests). To check whether the cache is working correctly, a hit counter is enough, and that is available via the JCache JMX statistics.
Logging adapter: Using Spring, you can write a Cache adapter that does the logging you need and wire it in via configuration. Rough idea: look at the CacheManager and Cache interfaces. Wrap the CacheManager's create-cache method and return a wrapped cache that logs.
Hack via ExpiryPolicy: When a custom ExpiryPolicy is specified, a JCache implementation calls the method getExpiryForAccess on every cache access. However, you don't get any information about the key being requested. I also recommend staying away from custom ExpiryPolicy implementations for performance reasons, so this is mentioned only for completeness.
Logging cache / log every access: If you specify multiple caches, Spring calls them one after another. You could wire in a dummy cache as the first cache, which just logs the access.

How long should static file be cached?

I'd like to set browser caching for some Amazon S3 files. I plan to use this meta data:
Cache-Control: max-age=86400, must-revalidate
that's equal to one day.
Many of the examples I see look like this:
Cache-Control: max-age=3600
Why only 3600 and why not use must-revalidate?
For a file that I rarely change, how long should it be cached?
What happens if I update the file and need that update to be seen immediately, but its cache doesn't expire for another 5 days?
Why only 3600?
Presumably because the author of that particular example decided that one hour was an appropriate cache timeout for that page.
Why not use must-revalidate?
If it is not critical that caches follow your cache rules to the letter, omitting must-revalidate could in theory let a few more requests be served from the cache. See this answer for details; the most relevant part is from the HTTP spec:
When a cache has a stale entry that it would like to use as a response
to a client's request, it first has to check with the origin server
(or possibly an intermediate cache with a fresh response) to see if
its cached entry is still usable.
For a file that I rarely change, how long should it be cached?
Much web performance advice says to set a cache expiration very far in the future, such as a few years. This way, the client browser only downloads the data once, and subsequent visits are served from the cache. This works well for "truly static" files, such as JavaScript or CSS.
On the other hand, if the data is dynamic but does not change too often, you should set an expiration time that is reasonable for your specific scenario. Do you need to get the newest version to the customer as soon as it's available, or is it okay to serve a stale version? Do you know when the data changes? And so on. An hour or a day is often an appropriate trade-off between server load, client performance, and data freshness, but it depends on your requirements.
What happens if I update the file and need that update to be seen immediately, but its cache doesn't expire for another 5 days?
Give the file a new name, or append a value to the querystring. You will of course need to update all links. This is the general approach when static resources need to change.
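One common way to "append a value to the querystring" is to derive it from the file's content, so the URL only changes when the file does. A rough sketch, where the function name versioned_url and the parameter name v are just illustrative choices:

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def versioned_url(url: str, content: bytes, param: str = "v") -> str:
    """Return url with a short content hash in the query string."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    parts = urlsplit(url)
    # Drop any previous version parameter, keep the other query parameters.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, digest))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Calling versioned_url("https://example.com/app.css", css_bytes) gives the same URL until the file's bytes change, at which point clients see a new URL and fetch it fresh. If a CDN sits in front, this only works when the query string is part of its cache key; renaming the file avoids that dependency.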
Also, here is a nice overview of the cache control attributes available to you.

Cloudfront private content + signed urls architecture

Let me start out with a quick introduction to the architecture of a system I'm considering migrating to S3+Cloudfront.
We have a number of entities ordered in a tree. The leaves of the tree have a number of resources (JPG images, to be specific), usually on the order of 20-5000, with an average of ~200. Each resource has a unique URL that is served through our colo setup today.
I could just transfer all of these resources to S3, setup Cloudfront on top of that and be done. If only I didn't have to protect the resources.
Most entities are public (that is, ~99%); the rest are protected in one of many ways (login, IP, time, etc.). Once an entity is protected, all of its resources must be protected too, and can only be accessed after a valid authorization has been performed.
I could solve this by creating two S3 buckets, one private and one public. For the private content I'd generate signed CloudFront URLs after the user was authorized. However, the state of an entity might change from public to private arbitrarily, and vice versa. An admin of the system might change an entity at any level of the entity tree, thus causing a cascading change throughout the tree. One change might affect ~20k entities which, multiplied by 200 resources each, comes to 4 million resources.
I could run a service in the background monitoring for state changes, but that would be cumbersome, and changing the ACLs of 4 million S3 items would take considerable time, and while that's happening we'll either have unprotected private content, or public content that we'd have to generate signed URLs for.
Another possibility would be to make all resources private by default. On each and every request made to an entity, we would generate a custom policy granting access, for that specific user, to all resources contained in the entity (by using wildcard URLs in the custom policy). This would require creating a policy for each visitor, per entity, which wouldn't be a problem. However, it would mean that our users can no longer cache anything, as the URL will change for each new session. While not a problem for private content, it would suck for us to ditch all caching for the ~99% of the entities that are public.
Yet another option would be to keep all content private and use the above approach for private entities. For public entities we could generate a single custom policy, per public entity, that all users would share. If we set a lifetime of 6 hours and made sure to generate a new policy after 5 hours, a user would be guaranteed a policy lifetime of at least one hour. This has the advantage of enabling caching for up to 6 hours, at the cost of allowing private content to possibly stay public for up to 6 hours after a state change. This would be acceptable, but I'm not sure it's worth it (I'm currently trying to work out the cache hit ratio of requests). Obviously we could tweak the 5/6 hour border to enable longer/shorter caching at the cost of longer/shorter exposure of private entities.
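To make the 5/6 hour arithmetic concrete, here is a rough sketch of building the shared policy document for a public entity. It only produces the policy JSON (signing it with the CloudFront key pair is left out), and the function name, the constants and the resource pattern are illustrative assumptions.

```python
import json

ROTATE_SECONDS = 5 * 3600    # issue a new shared policy every 5 hours
LIFETIME_SECONDS = 6 * 3600  # each policy stays valid for 6 hours


def shared_policy(resource_pattern: str, now: int) -> str:
    """Custom policy shared by all visitors of one public entity.

    now is the current Unix time in seconds. Everyone asking within the same
    5-hour window gets the identical policy, so the URLs stay cacheable.
    """
    issued_at = now - (now % ROTATE_SECONDS)
    expires_at = issued_at + LIFETIME_SECONDS
    policy = {
        "Statement": [
            {
                "Resource": resource_pattern,  # wildcard covers the entity
                "Condition": {"DateLessThan": {"AWS:EpochTime": expires_at}},
            }
        ]
    }
    return json.dumps(policy, separators=(",", ":"))
```

Whenever a visitor fetches the policy, it has been valid for less than 5 hours, so between 1 and 6 hours of lifetime remain. Shrinking either constant reduces both the cache window and the exposure after an entity turns private.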
Has anyone deployed a similar solution? Any AWS features I'm overlooking that might be of use? Any comments in general?
Based on popular request, I'm answering this question myself.
After gathering relevant metrics and doing some calculations, we ended up concluding we could live with less caching, offset by the faster object serving speed of CloudFront. The actual implementation is detailed on my blog: How to Set Up and Serve Private Content Using S3 and Amazon CloudFront
Assets in the same bucket can have different privacy policies.
So you can have public and private assets in the same bucket.
At upload time, just set the privacy setting.
Then just sign the URL to access the private assets.

How do you handle browser cache with login/logout?

To improve performance, I'd like to add a fairly long Cache-Control lifetime (up to 30 minutes) to each page, since they do not change often. However, each page also displays the name of the logged-in user (like this website).
The problem arises when the user logs in or out: the user name must change. How can I change the user name after each login/logout action while keeping a long Cache-Control lifetime?
Here are the solutions I can think of:
An Ajax request (not cached) to retrieve and display the user name. If I have 2 requests (/user?registered and /user?new), they could be cached as well. But I am afraid this extra request would cancel out the performance gain from caching.
Add a unique URL variable (?time=) to make the URL different and bypass the cache. However, I would have to add this variable to all links on my webpage, which is not very convenient code-wise.
This problem becomes greater if I actually have more content that differs between registered users and new users.
Cache-Control: private
Is usually enough in practice. It's what SO uses.
In theory, if you needed to allow for the case of variable logins from the same client you should probably set Vary on Cookie (assuming that's the mechanism you're using for login). However, this value of Vary (along with most others) messes up IE's caching completely so it's generally avoided. Also, it's often desirable to allow the user to step through the back/forward list including logged-in/out pages without having to re-fetch.
For situations where enforcing proper logged-in-ness for every page is critical (such as banking), a full Cache-Control: no-cache is typically used instead.
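As a rough sketch of how a server might choose between these headers, assuming a hypothetical helper called cache_headers and the 30-minute lifetime from the question:

```python
def cache_headers(personalized: bool, sensitive: bool = False,
                  max_age: int = 1800) -> dict:
    """Pick a Cache-Control header for a page.

    personalized: the page shows per-user content (e.g. the user name).
    sensitive: the logged-in state must be checked on every view (e.g. banking).
    """
    if sensitive:
        # Caches must revalidate with the server before reusing the page.
        return {"Cache-Control": "no-cache"}
    if personalized:
        # Only the user's own browser may cache the page, not shared proxies.
        return {"Cache-Control": f"private, max-age={max_age}"}
    return {"Cache-Control": f"public, max-age={max_age}"}
```

Note that with private and a non-zero max-age, a browser can still show a stale user name for up to max_age seconds after login or logout; that is the trade-off the answer accepts as "usually enough in practice".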
