I am trying to determine when newly set cache-control headers will be read by end-users who have previously cached a page.
Let's say a user loads a page that does not have any cache-control headers set. Then I add cache-control: no-cache, no-store header at the server level. Will it force even the users who had previously visited and cached the page to get the latest version? Or would their current version have to expire per their browsers rules since no headers were initially set?
The latter. Headers aren't pushed unless a user agent requests a resource. However, see this question. If a client makes a conditional request to validate its cache, those headers will also be sent in 304 responses. The spec says the cache MUST
use other header fields provided in the 304 (Not Modified)
response to replace all instances of the corresponding header
fields in the stored response.
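As a rough illustration of that rule, here is a minimal Python sketch of how a cache might fold the header fields of a 304 response into its stored copy. The function name `freshen` and the sample header values are hypothetical, and a real cache does more than this (it also has to honor directives such as no-store).

```python
def freshen(stored_headers: dict[str, str], headers_304: dict[str, str]) -> dict[str, str]:
    """Return the stored headers updated with the fields carried by a 304 response."""
    # Header names are case-insensitive, so normalise them before merging.
    merged = {name.lower(): value for name, value in stored_headers.items()}
    for name, value in headers_304.items():
        # A field present in the 304 replaces the corresponding stored field.
        merged[name.lower()] = value
    return merged


# Hypothetical stored response that was cached without any Cache-Control header.
stored = {"ETag": '"abc"', "Content-Type": "text/html"}
# Hypothetical 304 received later, after the server started sending Cache-Control.
not_modified = {"ETag": '"abc"', "Cache-Control": "no-cache, no-store"}
updated = freshen(stored, not_modified)
```

The point is only that the new header reaches the client when the client makes a request; nothing changes in its cache before then.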
Related
I have a question in regard to how CloudFront will use an S3 object's ETag to determine if it needs to send a refreshed object or not.
I know that the ETag will be part of the Request to the CloudFront distribution, in my case I'm seeing the "weak" (shortened) version:
if-none-match: W/"eabcdef4036c3b4f8fbf1e8aa81502542"
If this ETag being sent does not match the S3 Object's current ETag value, then the CloudFront will send the latest version.
I'm seeing this work as expected, but only after the CloudFront's cache policy has been reached. In my case it's been set to 20 mins.
CloudFront with a Cache Policy:
Minimum TTL: 1
Maximum TTL: 1200 <-- (20 mins)
Default TTL: 900
Origin Request Policy is not set
S3 Bucket:
Set to only allow access via its corresponding CloudFront
distribution above.
Bucket and objects not public
The test object (index.html) in this case has only one header set:
Content-Type = text/html
While I am using the CloudFront's Cache Policy, I've also tested
using the S3 Object header of Cache-Control = max-age=6000
This had no effect on the refresh of the "index.html" object in
regard to the ETag check I'm asking about.
The Scenario:
Upon first "putObject" to that S3 bucket, the "index.html" file has an ETag of:
eabcdef4036c3b4f8fbf1e8aa81502542
When I hit the URL (GET) for that "index.html" file, the cache of 20 mins is effectively started.
Subsequent hits to the "index.html" URL (GET) have the Request with the value
if-none-match: W/"eabcdef4036c3b4f8fbf1e8aa81502542"
I also see "x-cache: Hit from cloudfront" in the Response coming back.
Before the 20 mins is up, I'll make a change to the "index.html" file and re-upload via a "putObject" command in my code.
That will then change the ETag to:
exyzcde4099c3b4f8fuy1e8aa81501122
I would expect then that the next Request to CloudFront, before the 20-minute TTL and with the old "if-none-match" value, would then prompt the CloudFront to see the ETag is different and send the latest version.
But in all cases/tests it doesn't. CloudFront will seem to ignore the ETag difference and continue to send the older "index.html" version.
It's only after the 20 mins (cache TTL) is up that the CloudFront sends the latest version.
At that time the ETag in the Request changes/updates as well:
if-none-match: W/"exyzcde4099c3b4f8fuy1e8aa81501122"
Question (finally, huh?):
Is there a way to configure CloudFront to listen to the incoming ETag, and if needed, send the latest Object without having to wait for the Cache Policy TTL to expire?
UPDATE:
Kevin Henry's response explains it well:
"CloudFront doesn't know that you updated S3. You told it not to check with the origin until the TTL has expired. So it's just serving the old file until the TTL has expired and it sees the new one that you uploaded to S3. (Note that this doesn't have anything to do with ETags)."
So I decided to test how the ETag would be used if I turned the CloudFront Caching Policy to a TTL of 0 for all three CloudFront settings. I know that this defeats the purpose, and one of the strengths, of CloudFront, but I'm still wrapping my head around certain key aspects of CDN caching.
After setting the cache to 0, I'm seeing a continual "Miss from CloudFront" in the Response coming back.
I expected this, and in the first response I see a HTTP status of 200. Note the file size being returned is 128KB for this test.
Subsequent calls to this same file return a HTTP status of 304, with a file size being returned around 400B.
As soon as I update the "index.html" file in the S3 bucket, and call that same URL, the status code is 200 with a file size of 128KB.
Subsequent calls return a status of 304, again with an average of 400B in file size.
Looking again at the definition of an HTTP status of 304:
https://httpstatuses.com/304
"A conditional GET or HEAD request has been received and would have resulted in a 200 OK response if it were not for the fact that the condition evaluated to false.
In other words, there is no need for the server to transfer a representation of the target resource because the request indicates that the client, which made the request conditional, already has a valid representation; the server is therefore redirecting the client to make use of that stored representation as if it were the payload of a 200 OK response."
So am I correct in thinking that I'm using the Browser's cache at this point?
The calls to the CloudFront will now pass the requests to the Origin, where the ETag is used to verify if the resource has changed.
As it hasn't, then a 304 is returned and the Browser kicks in and returns its stored version of "index.html".
Would this be a correct assumption?
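For reference, the ETag check described here can be sketched in a few lines of Python. This is a simplified illustration, not what CloudFront or S3 actually run: the function names are made up, it assumes the resource exists, and it splits the header on commas, which is good enough for hex-style tags like the ones above. If-None-Match uses the weak comparison, so the W/ prefix is ignored when comparing.

```python
def opaque_tag(entity_tag: str) -> str:
    """Strip the optional weak prefix, leaving the quoted opaque tag."""
    tag = entity_tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def if_none_match_status(header_value: str, current_etag: str) -> int:
    """Return 304 when any listed tag weakly matches the current ETag, else 200."""
    if header_value.strip() == "*":
        # "*" matches any existing representation (assumed to exist here).
        return 304
    candidates = [opaque_tag(part) for part in header_value.split(",")]
    return 304 if opaque_tag(current_etag) in candidates else 200


# The browser still holds the old tag, the object has not changed yet.
unchanged = if_none_match_status(
    'W/"eabcdef4036c3b4f8fbf1e8aa81502542"', '"eabcdef4036c3b4f8fbf1e8aa81502542"'
)
# Same request after the object was re-uploaded and got a new ETag.
changed = if_none_match_status(
    'W/"eabcdef4036c3b4f8fbf1e8aa81502542"', '"exyzcde4099c3b4f8fuy1e8aa81501122"'
)
```

In the first case the browser reuses its stored copy; in the second it receives the full new body.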
In case you're wondering, I can't use the invalidation method for clearing cache, as my site could expect several thousand invalidations a day. I'm hosting a writing journal site, where the authors could update their files daily, therefore producing new versions of their work on S3.
I would also rather not use the versioning method, with a timestamp or other string added as a query to the page URL. SEO reasons for this one mainly.
My ideal scenario would be to serve the same version of the author's work until they've updated it, at which time the next call to that same page would show its latest version.
This research/exercise is helping me to learn and weigh my options.
Thanks again for the help/input.
Jon
"I would expect then that the next Request to CloudFront, before the 20-minute TTL and with the old if-none-match value, would then prompt the CloudFront to see the ETag is different and send the latest version."
That is a mistaken assumption. CloudFront doesn't know that you updated S3. You told it not to check with the origin until the TTL has expired. So it's just serving the old file until the TTL has expired and it sees the new one that you uploaded to S3. (Note that this doesn't have anything to do with ETags).
CloudFront does offer ways to invalidate the cache, and you can read more about how to combine that with S3 updates in these answers.
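To make the TTL behaviour concrete, here is a small Python simulation. It is only a toy model under stated assumptions: `Origin` and `EdgeCache` are hypothetical stand-ins for S3 and CloudFront, time is passed in as plain seconds, and a real CDN would revalidate with the origin conditionally rather than refetch the whole object.

```python
class Origin:
    """Stands in for the S3 bucket: one object body plus its ETag."""

    def __init__(self, body, etag):
        self.body, self.etag = body, etag
        self.requests = 0  # how many times the edge contacted the origin

    def fetch(self):
        self.requests += 1
        return self.body, self.etag


class EdgeCache:
    """Stands in for the CDN: serves its own copy until the TTL has elapsed."""

    def __init__(self, origin, ttl):
        self.origin, self.ttl = origin, ttl
        self.entry = None  # (body, etag, stored_at)

    def get(self, now, if_none_match=None):
        # The origin is contacted only when nothing is cached or the TTL is up.
        if self.entry is None or now - self.entry[2] >= self.ttl:
            body, etag = self.origin.fetch()
            self.entry = (body, etag, now)
        body, etag, _ = self.entry
        # The client's validator is compared with the cached copy only.
        if if_none_match == etag:
            return 304, None, etag
        return 200, body, etag


origin = Origin("v1", '"etag-v1"')
edge = EdgeCache(origin, ttl=1200)
first = edge.get(now=0)
origin.body, origin.etag = "v2", '"etag-v2"'  # re-upload before the TTL expires
during = edge.get(now=600, if_none_match='"etag-v1"')
after = edge.get(now=1200, if_none_match='"etag-v1"')
```

Within the TTL the edge answers from its own copy and never sees the new ETag; only the request at or after 1200 seconds triggers a second origin fetch and returns the new version.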
We can enable bucket versioning, and the object with the new ETag is then picked up by CloudFront.
See: https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching#Freshness
when the cache receives a request for a stale resource, it forwards this request with a If-None-Match to check if it is in fact still fresh. If so, the server returns a 304 (Not Modified) header without sending the body of the requested resource, saving some bandwidth.
Let's assume we have: a browser cache, proxy cache and an origin server:
The browser cache contains a stored stale resource with entity-tag "A".
The proxy cache contains a stored stale resource with entity-tag "B". The proxy cache can act as a client, and as a server.
This can for example be the case if you're just starting to use a proxy cache. What will happen in this case?
The browser will send a conditional request with If-None-Match: "A".
The proxy cache receives the conditional request.
The proxy cache will forward this request (according to the quote above). This is because the stored resource in proxy cache is stale.
The origin server receives the request with the entity-tag "A".
Let's say, the resource on the origin server contains entity-tag "A". Now the server will respond with a 304 Not Modified response.
At this point, I don't understand things anymore, so maybe I misunderstood something before? The 304 response is okay for the browser cache, because it contains the same resource as on the origin server (same entity-tag). However, the proxy cache contains an older resource (with a different Etag). If the proxy cache would receive the 304 response (and would update its metadata), then the proxy cache makes a resource valid again while it's an old resource.
This is not desirable, so I probably made a mistake somewhere. How does it actually work? How should I understand this process?
Have a look at the section 4.3 of the RFC7234 spec. Section 4.3.2 in particular says the following:
When a cache decides to revalidate its own stored responses for a
request that contains an If-None-Match list of entity-tags, the cache
MAY combine the received list with a list of entity-tags from its own
stored set of responses (fresh or stale) and send the union of the
two lists as a replacement If-None-Match header field value in the
forwarded request. If a stored response contains only partial
content, the cache MUST NOT include its entity-tag in the union
unless the request is for a range that would be fully satisfied by
that partial stored response. If the response to the forwarded
request is 304 (Not Modified) and has an ETag header field value with
an entity-tag that is not in the client's list, the cache MUST
generate a 200 (OK) response for the client by reusing its
corresponding stored response, as updated by the 304 response
metadata (Section 4.3.4).
So the proxy can send both entity tags (A and B) to the origin server for validation. If the resource representation hasn't changed, the origin server will send a 304 response. If the entity tag in that response is B, the proxy can freshen its stale, stored response and use it to send a 200 OK response to the client. Upon receiving this new response, the browser can update its cache with it.
Now, in the scenario you have specified, the 304 Not Modified response contains the entity tag A (can such a scenario even occur, given that you are accessing the resources through the proxy?). The spec doesn't seem to address this specific case explicitly, but I guess the proxy can just forward the 304 Not Modified response to the browser. Upon receiving it, the browser can freshen its stale response using the metadata in the 304.
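The two cases can be sketched in Python as follows. This is a simplified model with hypothetical function names; entity-tags are plain strings, the proxy's stored responses are a dict from entity-tag to body, and relaying the 304 when the tag is in the client's list is the guess made above rather than something quoted from the spec.

```python
def forwarded_if_none_match(client_tags: list[str], stored_tags: list[str]) -> str:
    """Union of client and proxy entity-tags, in order and without duplicates."""
    union = list(dict.fromkeys(client_tags + stored_tags))
    return ", ".join(union)


def proxy_reply(client_tags: list[str], stored: dict[str, str], etag_in_304: str):
    """Decide what the proxy returns to the client after the origin answers 304.

    `stored` maps entity-tag -> body for the proxy's own stored responses.
    """
    if etag_in_304 in client_tags:
        # The client's own copy was validated, so the 304 can be relayed.
        return 304, None
    if etag_in_304 in stored:
        # The proxy's copy was validated but the client does not have it:
        # the proxy must answer 200 with its stored (now freshened) response.
        return 200, stored[etag_in_304]
    raise ValueError("origin validated a tag that was never sent")


header = forwarded_if_none_match(['"A"'], ['"B"'])
origin_validated_b = proxy_reply(['"A"'], {'"B"': "body-B"}, '"B"')
origin_validated_a = proxy_reply(['"A"'], {'"B"': "body-B"}, '"A"')
```

Note that in the second case the proxy's stored response with tag B is not made valid again: the 304 names tag A, so only a response carrying that tag could be freshened.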
According to http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.10 clients must invalidate the cache associated with a URL after a POST, PUT, or DELETE request.
Is it possible to instruct a web browser to invalidate the cache of an arbitrary URL, without making an HTTP request to it?
For example:
PUT /companies/Nintendo creates a new company called "Nintendo"
GET /companies lists all companies
Every time I create a new company, I want to invalidate the cache associated with GET /companies. The browser doesn't do this automatically because the two operate on different URLs.
Is the Cache-Control mechanism inappropriate for this situation? Should I use no-cache along with ETag instead? What is the best-practice for this situation?
I know I can pass no-cache the next time I GET /companies but that requires the application to keep track URL invalidation instead of pushing the responsibility to the browser. Meaning, I want to invalidate the URL after step 1 as opposed to having to persist this information and applying it at step 2. Any ideas?
Yes, you can (within the same domain). From this answer (slightly paraphrased):
In response to a PUT or POST request, if the Content-Location header URI is different from the request URI, then the cache for the Content-Location URI is invalidated.
So in your case, include a Content-Location: /companies header in the response to your PUT request. This will invalidate the browser's cached version of /companies.
Note that this does not work for GET requests.
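For illustration, here is a Python sketch of the invalidation rule as I read it in RFC 7234, section 4.4: after a non-error response to an unsafe method, a cache invalidates the request URI plus any same-host URI named in Location or Content-Location. The function name, the example.com host, and the exact-case header lookup are assumptions made for the example, and whether a given browser actually implements this is something you would need to test.

```python
from urllib.parse import urljoin, urlsplit

UNSAFE_METHODS = {"POST", "PUT", "DELETE", "PATCH"}


def uris_to_invalidate(method: str, request_url: str, status: int, headers: dict[str, str]) -> set[str]:
    """URIs a cache would invalidate after receiving this response."""
    # Only non-error (2xx/3xx) responses to unsafe methods trigger invalidation.
    if method.upper() not in UNSAFE_METHODS or not 200 <= status < 400:
        return set()
    targets = {request_url}
    for name in ("Location", "Content-Location"):
        if name in headers:
            candidate = urljoin(request_url, headers[name])
            # URIs on a different host must not be invalidated this way.
            if urlsplit(candidate).netloc == urlsplit(request_url).netloc:
                targets.add(candidate)
    return targets


invalidated = uris_to_invalidate(
    "PUT",
    "https://example.com/companies/Nintendo",  # hypothetical host
    201,
    {"Content-Location": "/companies"},
)
```

Here `invalidated` contains both the /companies/Nintendo URL and the /companies URL, while the same call with GET returns an empty set.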
No, in HTTP/1.1 you may only invalidate a client's cache for a resource in a response to a request for that resource. It may be in response to a PUT, POST or DELETE rather than a GET (see RFC 7234, section 4.4 for details).
If you have a resource where you need clients to confirm that they have the latest version then no-cache and an entity tag is an ideal solution.
HTTP/2 allows for pushing a cache clear (Nine Things to Expect from HTTP/2 4. Cache Pushing).
The link you gave says: "the phrase "invalidate an entity" means that the cache will either remove all instances of that entity from its storage, or will mark these as "invalid" and in need of a mandatory revalidation before they can be returned in response to a subsequent request." Now the question is: where are the caches? I believe the cache the article is talking about is the server cache.
I have worked on a project in VC++ where the cache is updated whenever a model changes. This takes programming logic in the application. The article you mention rightly says "There is no way for the HTTP protocol to guarantee that all such cache entries are marked invalid"; the HTTP protocol cannot invalidate a cache on its own.
In our project we used a publish-subscribe mechanism. Whenever an object of class A is updated or inserted, it is published to a bus. The controllers register to listen for objects on the bus. Suppose controller A is interested in changes to object type A: it will not be called back when an object of type B is changed and published. When an object of type A is changed and published, controller A's listener function updates the cache with the latest changes. A subsequent GET /companies request will then get the latest data from the cache. There is a time gap between the change to object A and the cache being refreshed. To avoid anything going wrong in this gap, the object is marked dirty before it changes, so a request arriving in between waits for the dirty flag to be cleared.
There is also a browser cache. As far as I remember, ETags are used to validate it. An ETag is typically a checksum of the resource. The client has to keep the old ETag value and send it with its next request. If the checksum of the resource has changed, the new resource is sent with HTTP 200; otherwise HTTP 304 (use the local copy) is sent.
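A stripped-down Python sketch of that arrangement might look like the following. It is not the VC++ code from the project: the class names are invented, it is single-threaded, and where the real system made a request wait on the dirty flag, this sketch simply raises an error.

```python
from collections import defaultdict


class Bus:
    """Minimal publish-subscribe bus keyed by object type."""

    def __init__(self):
        self.listeners = defaultdict(list)

    def subscribe(self, object_type, callback):
        self.listeners[object_type].append(callback)

    def publish(self, object_type, payload):
        # Only listeners registered for this object type are called back.
        for callback in self.listeners[object_type]:
            callback(payload)


class CompanyCache:
    """Cache refreshed by bus events; a dirty flag marks the stale window."""

    def __init__(self, bus):
        self.companies = []
        self.dirty = False
        bus.subscribe("company", self.on_company_changed)

    def on_company_changed(self, companies):
        self.companies = list(companies)
        self.dirty = False  # refreshed, safe to serve again

    def read(self):
        if self.dirty:
            # Simplification: the real system waited here instead of failing.
            raise RuntimeError("cache is being refreshed")
        return self.companies


bus = Bus()
cache = CompanyCache(bus)
cache.dirty = True                       # mark dirty before the change
bus.publish("product", ["ignored"])      # other object types do not touch it
bus.publish("company", ["Nintendo"])     # listener refreshes the cache
latest = cache.read()
```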
[Update]
PUT /companies/Nintendo
GET /companies
are two different resources. When the PUT /companies/Nintendo request is executed, only the cache for /companies/Nintendo is expected to be updated, not the one for /companies (I am talking about the client-side cache). Suppose you call GET /companies/Nintendo next time: the response is returned based on the HTTP headers. GET /companies is a brand new request, as it points to a different resource.
Now the question is what the HTTP headers should be. That is purely application specific. If it were a stock quote, I would not cache it. If it were a news item, I would cache it for a certain time. Your reference link http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html has all the details of the caching HTTP headers. The only thing not covered much there is ETag usage. An ETag can hold a checksum of the resource. Check http://en.wikipedia.org/wiki/HTTP_ETag and also https://devcenter.heroku.com/articles/increasing-application-performance-with-http-cache-headers
We were wondering,
Header-wise, what does a proper ETag response look like?
Etag response, in the sense that an e-tagged request is made, and yes it matches the etag on our end, thus no content must be sent.
Does it need to contain a content-length header?
Do we use a 304 header response?
Clarification:
We want to handle ETags via PHP.
The flow is as follows:
a) Etagged request comes in.
b) PHP checks the etag to see if it meets what we think is a proper condition NOT to send back a full document body.
c) What do we manually send back via php to signal to the browser to use the cached content?
Thanks!
Answered my own question after reading the wiki, as lanzz suggested in the comment above:
"In this subsequent request, the server may now compare the client's
ETag with the ETag for the current version of the resource. If the
ETag values match, meaning that the resource has not changed, then the
server may send back a very short response with an HTTP 304 Not
Modified status. The 304 status tells the client that its cached
version is still good and that it should use that."
So yes, a 304 response is the correct way to answer an etagged request if you want the user agent to use the cached copy.
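As a sketch of what such a response could look like, here is the idea in Python rather than PHP; the same status and headers apply in any language. The function name and its parameters are made up for the example, and it compares a single tag by exact string match, so lists of tags and the W/ prefix are not handled. The 304 carries no body and should repeat the ETag (plus fields such as Cache-Control or Expires if the 200 would have had them); a Content-Length header is not needed on it.

```python
from email.utils import formatdate


def conditional_response(request_headers, current_etag, body, extra_headers=None):
    """Return (status, headers, payload) for a request that may carry If-None-Match."""
    headers = {"ETag": current_etag, "Date": formatdate(usegmt=True)}
    # Caching fields that a 200 would carry (e.g. Cache-Control) belong on the 304 too.
    headers.update(extra_headers or {})
    if request_headers.get("If-None-Match") == current_etag:
        # Tag matches: tell the client to reuse its cached copy, send no body.
        return 304, headers, b""
    headers["Content-Length"] = str(len(body))
    return 200, headers, body


status, headers, payload = conditional_response(
    {"If-None-Match": '"v1"'}, '"v1"', b"<html>", {"Cache-Control": "no-cache"}
)
```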
Assume browser default settings, and content is sent without expires headers.
user visits website, browser caches images etc.
user does not close browser, or refresh page.
user continues to surf site normally.
assume the browser doesn't dump the cache for any reason.
The browser will cache images etc. as the user surfs, but it's unclear when it will issue a conditional GET request to ask about content freshness (apart from refreshing the page). If this is a browser-specific setting, where can I see its value (for browsers like Safari, IE, Firefox, Chrome)?
[edit: yes - I understand that you should always send expires headers. However, this research is aimed at understanding how the browser works with content w/o expires headers.]
From the HTTP caching spec (section 13.4): Unless specifically constrained by a cache-control (section 14.9) directive, a caching system MAY always store a successful response (see section 13.8) as a cache entry, MAY return it without validation if it is fresh, and MAY return it after successful validation. This means that a user agent is free to do whatever it wants if no cache control header is sent. Most browsers use a combination of user settings and heuristics to determine whether (and how long) to cache in this situation.
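One heuristic the later HTTP caching spec (RFC 7234, section 4.2.2) mentions is to treat a response as fresh for some fraction of the time since its Last-Modified date, with 10% given as a typical value. A minimal Python sketch of that calculation is below; the function name is made up, the dates are invented, and actual browsers apply their own limits and settings on top.

```python
from datetime import datetime, timedelta, timezone


def heuristic_freshness(date: datetime, last_modified: datetime, fraction: float = 0.1) -> timedelta:
    """Freshness lifetime guessed from how long the resource has gone unchanged."""
    # A Last-Modified in the future is treated as "no age", giving zero lifetime.
    return max(date - last_modified, timedelta(0)) * fraction


# Hypothetical response: served on 11 Jan, last modified on 1 Jan.
served = datetime(2024, 1, 11, tzinfo=timezone.utc)
modified = datetime(2024, 1, 1, tzinfo=timezone.utc)
lifetime = heuristic_freshness(served, modified)  # about one day
```

Under this heuristic the browser would reuse the image for roughly a day before sending a conditional GET.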
HTTP/1.1 defines a selection of caching mechanisms; the expires header is merely one, there is also the cache-control header.
To directly answer your question: for a resource returned with no expires header, you must consider the returned cache-control directives.
HTTP/1.1 defines no caching behaviour for a resource served with no cache-related headers. If a resource is sent with no cache-control or expires headers you must assume the client will make a regular (non-conditional) request the next time the same resource is requested.
Any deviation from this behaviour qualifies the client as being not a fully conformant HTTP client, in which case the question becomes: what behaviour is to be expected from a non-conformant HTTP client? There is no way to answer that.
HTTP caching is complex; to fully understand what a conformant client should do in a given scenario, read and understand the HTTP caching spec.
Unless you send an expires header, most browsers will make a GET request for each subsequent refresh and will either get HTTP 200 OK (it will download the content again) or HTTP 304 Not Modified (and use the data in cache).