How does validation work with a browser cache, a proxy cache, and an origin server?

See: https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching#Freshness
When the cache receives a request for a stale resource, it forwards this request with an If-None-Match header to check whether it is in fact still fresh. If so, the server returns a 304 (Not Modified) header without sending the body of the requested resource, saving some bandwidth.
Let's assume we have: a browser cache, proxy cache and an origin server:
The browser cache contains a stored stale resource with entity-tag "A".
The proxy cache contains a stored stale resource with entity-tag "B". The proxy cache can act as a client, and as a server.
This can for example be the case if you're just starting to use a proxy cache. What will happen in this case?
The browser will send a conditional request with If-None-Match: "A".
The proxy cache receives the conditional request.
The proxy cache will forward this request (according to the quote above). This is because the stored resource in proxy cache is stale.
The origin server receives the request with the entity-tag "A".
Let's say, the resource on the origin server contains entity-tag "A". Now the server will respond with a 304 Not Modified response.
At this point I no longer understand what happens, so maybe I misunderstood something earlier. The 304 response is fine for the browser cache, because it holds the same resource as the origin server (same entity-tag). However, the proxy cache holds an older resource (with a different ETag). If the proxy cache receives the 304 response and updates its metadata, it marks an old resource as valid again.
This is not desirable, so I probably made a mistake somewhere. How does it actually work? How should I understand this process?

Have a look at section 4.3 of RFC 7234. Section 4.3.2 in particular says the following:
When a cache decides to revalidate its own stored responses for a
request that contains an If-None-Match list of entity-tags, the cache
MAY combine the received list with a list of entity-tags from its own
stored set of responses (fresh or stale) and send the union of the
two lists as a replacement If-None-Match header field value in the
forwarded request. If a stored response contains only partial
content, the cache MUST NOT include its entity-tag in the union
unless the request is for a range that would be fully satisfied by
that partial stored response. If the response to the forwarded
request is 304 (Not Modified) and has an ETag header field value with
an entity-tag that is not in the client's list, the cache MUST
generate a 200 (OK) response for the client by reusing its
corresponding stored response, as updated by the 304 response
metadata (Section 4.3.4).
So the proxy can send both entity tags (A and B) to the origin server for validation. If the resource representation hasn't changed, the origin server will send a 304 response. If the entity tag in that response is B, the proxy can freshen its stale, stored response and use it to send a 200 OK response to the client. Upon receiving this new response, the browser can update its cache with it.
Now, in the scenario you have specified, the 304 Not Modified response contains the entity tag A (can such a scenario even occur, given that you are accessing the resources through the proxy?). The spec doesn't seem to address this specific case explicitly, but I guess you can just forward the 304 Not Modified response to the browser. Upon receiving it, the browser can freshen its stale response using the metadata in the 304.
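The revalidation flow from the quoted section can be sketched in Python. This is a toy model, not a real proxy: the `Stored` class and the `origin` and `proxy_revalidate` functions are hypothetical names, ETags are compared as plain strings, and the origin is reduced to a function. It assumes the 304 carries an ETag header, and it follows the rule in section 4.3.4 that a cache only updates stored responses whose validator matches the one in the 304.

```python
from dataclasses import dataclass


@dataclass
class Stored:
    """A response held by the proxy cache (hypothetical minimal model)."""
    etag: str
    body: str
    fresh: bool = False


def origin(if_none_match, current_etag, current_body):
    """Toy origin: 304 if the current ETag is in the list, else a full 200."""
    if current_etag in if_none_match:
        return 304, current_etag, None
    return 200, current_etag, current_body


def proxy_revalidate(client_etags, stored, current_etag, current_body):
    """Return (status, etag, body) as the proxy would answer the client."""
    # Forward the union of the client's entity-tags and the proxy's own.
    forwarded = list(dict.fromkeys(client_etags + [stored.etag]))
    status, etag, body = origin(forwarded, current_etag, current_body)

    if status == 200:
        # The resource changed: replace the stored response and pass it on.
        stored.etag, stored.body, stored.fresh = etag, body, True
        return 200, etag, body
    if etag == stored.etag:
        # The 304 validates the proxy's own copy, so it may be freshened.
        stored.fresh = True
    if etag in client_etags:
        # The client's copy is current: forward the 304.
        return 304, etag, None
    # Only the proxy's copy is current: serve it to the client as a 200.
    return 200, etag, stored.body
```

In the question's scenario (browser holds "A", proxy holds "B", origin has "A"), `proxy_revalidate(['"A"'], Stored('"B"', "old"), '"A"', "new")` forwards the 304 and leaves the proxy's "B" entry stale, so the old resource is not made valid again.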

Related

CloudFront / S3 ETag: Possible for CloudFront to send updated S3 Object before the CF TTL has expired?

I have a question in regard to how CloudFront will use an S3 object's ETag to determine if it needs to send a refreshed object or not.
I know that the ETag will be part of the Request to the CloudFront distribution. In my case I'm seeing the "weak" (shortened) version:
if-none-match: W/"eabcdef4036c3b4f8fbf1e8aa81502542"
If this ETag being sent does not match the S3 Object's current ETag value, then the CloudFront will send the latest version.
I'm seeing this work as expected, but only after the CloudFront's cache policy has been reached. In my case it's been set to 20 mins.
CloudFront with a Cache Policy:
Minimum TTL: 1
Maximum TTL: 1200 <-- (20 mins)
Default TTL: 900
Origin Request Policy is not set
S3 Bucket:
Set to only allow access via its corresponding CloudFront
distribution above.
Bucket and objects not public
The test object (index.html) in this case has only one header set:
Content-Type = text/html
While I am using the CloudFront's Cache Policy, I've also tested
using the S3 Object header of Cache-Control = max-age=6000
This had no effect on the refresh of the "index.html" object in
regard to the ETag check I'm asking about.
The Scenario:
Upon first "putObject" to that S3 bucket, the "index.html" file has an ETag of:
eabcdef4036c3b4f8fbf1e8aa81502542
When I hit the URL (GET) for that "index.html" file, the cache of 20 mins is effectively started.
Subsequent hits to the "index.html" URL (GET) send a Request with the value
if-none-match: W/"eabcdef4036c3b4f8fbf1e8aa81502542"
I also see "x-cache: Hit from cloudfront" in the Response coming back.
Before the 20 mins is up, I'll make a change to the "index.html" file and re-upload via a "putObject" command in my code.
That will then change the ETag to:
exyzcde4099c3b4f8fuy1e8aa81501122
I would expect then that the next Request to CloudFront, before the 20-minute TTL and with the old "if-none-match" value, would then prompt the CloudFront to see the ETag is different and send the latest version.
But in all cases/tests it doesn't. CloudFront will seem to ignore the ETag difference and continue to send the older "index.html" version.
It's only after the 20 mins (cache TTL) is up that the CloudFront sends the latest version.
At that time the ETag in the Request changes/updates as well:
if-none-match: W/"exyzcde4099c3b4f8fuy1e8aa81501122"
Question (finally, huh?):
Is there a way to configure CloudFront to listen to the incoming ETag, and if needed, send the latest Object without having to wait for the Cache Policy TTL to expire?
UPDATE:
Kevin Henry's response explains it well:
"CloudFront doesn't know that you updated S3. You told it not to check with the origin until the TTL has expired. So it's just serving the old file until the TTL has expired and it sees the new one that you uploaded to S3. (Note that this doesn't have anything to do with ETags)."
So I decided to test how the ETag would be used if I turned the CloudFront Caching Policy to a TTL of 0 for all three CloudFront settings. I know that this defeats the purpose, and one of the strengths, of CloudFront, but I'm still wrapping my head around certain key aspects of CDN caching.
After setting the cache to 0, I'm seeing a continual "Miss from CloudFront" in the Response coming back.
I expected this, and in the first response I see an HTTP status of 200. Note the file size being returned is 128KB for this test.
Subsequent calls to this same file return an HTTP status of 304, with a returned size of around 400B.
As soon as I update the "index.html" file in the S3 bucket and call that same URL, the status code is 200 with a file size of 128KB.
Subsequent calls return a status of 304, again with an average returned size of 400B.
Looking again at the definition of an HTTP status of 304:
https://httpstatuses.com/304
"A conditional GET or HEAD request has been received and would have resulted in a 200 OK response if it were not for the fact that the condition evaluated to false.
In other words, there is no need for the server to transfer a representation of the target resource because the request indicates that the client, which made the request conditional, already has a valid representation; the server is therefore redirecting the client to make use of that stored representation as if it were the payload of a 200 OK response."
So am I correct in thinking that I'm using the Browser's cache at this point?
The calls to the CloudFront will now pass the requests to the Origin, where the ETag is used to verify if the resource has changed.
As it hasn't, then a 304 is returned and the Browser kicks in and returns its stored version of "index.html".
Would this be a correct assumption?
In case you're wondering, I can't use the invalidation method for clearing cache, as my site could expect several thousand invalidations a day. I'm hosting a writing journal site, where the authors could update their files daily, therefore producing new versions of their work on S3.
I would also rather not use the versioning method, with a timestamp or other string added as a query to the page URL. SEO reasons for this one mainly.
My ideal scenario would be to serve the same version of the author's work until they've updated it, at which time the next call to that same page would show its latest version.
This research/exercise is helping me to learn and weigh my options.
Thanks again for the help/input.
Jon
"I would expect then that the next Request to CloudFront, before the 20-minute TTL and with the old if-none-match value, would then prompt the CloudFront to see the ETag is different and send the latest version."
That is a mistaken assumption. CloudFront doesn't know that you updated S3. You told it not to check with the origin until the TTL has expired. So it's just serving the old file until the TTL has expired and it sees the new one that you uploaded to S3. (Note that this doesn't have anything to do with ETags).
CloudFront does offer ways to invalidate the cache, and you can read more about how to combine that with S3 updates in these answers.
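The behaviour described above can be illustrated with a toy TTL cache in Python. This is only a sketch of the general idea, not CloudFront's implementation: `TtlCache`, `fetch_origin`, and the explicit `now` argument are hypothetical, and on expiry the model simply refetches the object instead of sending a conditional request to the origin.

```python
class TtlCache:
    """Toy edge cache: serves its copy without contacting the origin until the TTL expires."""

    def __init__(self, ttl_seconds, fetch_origin):
        self.ttl = ttl_seconds
        self.fetch_origin = fetch_origin  # callable returning (etag, body)
        self.entry = None                 # (etag, body, stored_at)
        self.origin_calls = 0

    def get(self, now, if_none_match=None):
        # Contact the origin only when nothing is cached or the TTL has elapsed.
        if self.entry is None or now - self.entry[2] >= self.ttl:
            etag, body = self.fetch_origin()
            self.origin_calls += 1
            self.entry = (etag, body, now)
        etag, body, _ = self.entry
        # The client's If-None-Match is compared with the CACHED ETag only.
        if if_none_match == etag:
            return 304, etag, None
        return 200, etag, body


# Simulated origin object that is re-uploaded halfway through the TTL.
s3_object = {"etag": '"v1"', "body": "old page"}
cache = TtlCache(1200, lambda: (s3_object["etag"], s3_object["body"]))

first = cache.get(now=0)
s3_object.update(etag='"v2"', body="new page")
during_ttl = cache.get(now=600, if_none_match='"v1"')
after_ttl = cache.get(now=1200, if_none_match='"v1"')
```

During the TTL the request with the old ETag still gets a 304, because the cache never asks the origin and its own copy still carries the old ETag. With a TTL of 0 the model contacts the origin on every request, which matches the "Miss" responses described in the update.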
We can enable bucket versioning, and the object with the new ETag is then picked up by CloudFront.

HTTP Cache Validation

I read the HTTP spec, but I have a question and I hope someone can help me.
When a cache receives a request and has a stored response that must be validated (before being served to the received request), does the cache send the received request (adding the conditional header fields it needs for validation) to the next server OR does the cache generate a new request (with conditional header fields it needs for validation) and send the generated request to the next server?
Thank you very much! :)
I think the idea is that the client would issue the request with the key headers, and the server would either respond with the content or a 304 to use whatever was in the local cache.
This behavior should be the same for upstream caches along the network path all the way to the source of truth.
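As a rough illustration of the "forward the request with extra headers" reading, here is a Python sketch that copies the incoming request headers and adds validators taken from the stored response. The function name `build_validation_request` and the use of plain dicts with exact-case header names are assumptions made for the example; a real cache also handles case-insensitive names, Vary, and hop-by-hop headers.

```python
def build_validation_request(request_headers, stored_response_headers):
    """Copy the incoming request headers and add validators from the stored response."""
    forwarded = dict(request_headers)  # leave the caller's dict untouched

    etag = stored_response_headers.get("ETag")
    if etag is not None:
        existing = forwarded.get("If-None-Match")
        # Keep the client's entity-tags and append the cache's own.
        forwarded["If-None-Match"] = f"{existing}, {etag}" if existing else etag

    last_modified = stored_response_headers.get("Last-Modified")
    if last_modified is not None and "If-Modified-Since" not in forwarded:
        # Simplification: only add a date validator if the client sent none.
        forwarded["If-Modified-Since"] = last_modified

    return forwarded
```

For example, a request carrying `If-None-Match: "A"` combined with a stored response whose ETag is `"B"` is forwarded with `If-None-Match: "A", "B"`.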
"When a cache receives a request..."
A cache doesn't receive HTTP requests. It is the user agent (browser) that checks the cache to see whether any cache entry matches an HTTP request. The cache itself is just a bunch of data stored on disk or in memory.
"Does the cache send the received request...OR does the cache generate a new request..."
A cache doesn't send HTTP requests. It is the user agent's (browser's) job to send the request.
In summary, a cache is just bytes of data; it doesn't know when and where an HTTP request is sent. All cache validation logic (cache-related HTTP headers) is implemented by the user agent.

How to invalidate the cache of an arbitrary URL?

According to http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.10 clients must invalidate the cache associated with a URL after a POST, PUT, or DELETE request.
Is it possible to instruct a web browser to invalidate the cache of an arbitrary URL, without making an HTTP request to it?
For example:
PUT /companies/Nintendo creates a new company called "Nintendo"
GET /companies lists all companies
Every time I create a new company, I want to invalidate the cache associated with GET /companies. The browser doesn't do this automatically because the two operate on different URLs.
Is the Cache-Control mechanism inappropriate for this situation? Should I use no-cache along with ETag instead? What is the best-practice for this situation?
I know I can pass no-cache the next time I GET /companies, but that requires the application to keep track of URL invalidation instead of pushing the responsibility to the browser. Meaning, I want to invalidate the URL after step 1 as opposed to having to persist this information and apply it at step 2. Any ideas?
Yes, you can (within the same domain). From this answer (slightly paraphrased):
In response to a PUT or POST request, if the Content-Location header URI is different from the request URI, then the cache for the Content-Location URI is invalidated.
So in your case, include a Content-Location: /companies header in response to your PUT request. This will invalidate the browser's cached version of /companies.
Note that this does not work for GET requests.
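The invalidation rule from RFC 7234 section 4.4 can be written down as a small Python function. This is a sketch of the rule as the spec states it, with a hypothetical function name and example.com URLs; whether a particular browser actually applies it to Content-Location is a separate question that the code does not answer.

```python
from urllib.parse import urljoin, urlsplit

SAFE_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE"}


def uris_to_invalidate(method, request_url, status, headers):
    """Return the URLs a cache would invalidate after this response (sketch)."""
    # Only unsafe methods with a non-error (2xx or 3xx) response trigger invalidation.
    if method in SAFE_METHODS or not 200 <= status < 400:
        return []
    targets = [request_url]
    for name in ("Location", "Content-Location"):
        value = headers.get(name)
        if value is None:
            continue
        absolute = urljoin(request_url, value)
        # A URI on a different host must not be invalidated.
        same_host = urlsplit(absolute).netloc == urlsplit(request_url).netloc
        if same_host and absolute not in targets:
            targets.append(absolute)
    return targets


invalidated = uris_to_invalidate(
    "PUT",
    "https://example.com/companies/Nintendo",
    201,
    {"Content-Location": "/companies"},
)
```

Here `invalidated` contains both `https://example.com/companies/Nintendo` and `https://example.com/companies`, while the same headers on a GET response yield an empty list.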
No, in HTTP/1.1 you may only invalidate a client's cache for a resource in a response to a request for that resource. It may be in response to a PUT, POST or DELETE rather than a GET (see RFC 7234, section 4.4 for details).
If you have a resource where you need clients to confirm that they have the latest version then no-cache and an entity tag is an ideal solution.
HTTP/2 allows for pushing a cache clear (Nine Things to Expect from HTTP/2 4. Cache Pushing).
The link you gave says: "the phrase "invalidate an entity" means that the cache will either remove all instances of that entity from its storage, or will mark these as "invalid" and in need of a mandatory revalidation before they can be returned in response to a subsequent request." Now the question is: where are the caches? I believe the cache the article is talking about is the server cache.
I have worked on a project in VC++ where the cache is updated whenever a model changes. Achieving this requires implementing programming logic. The article you mention rightly says "There is no way for the HTTP protocol to guarantee that all such cache entries are marked invalid". The HTTP protocol cannot invalidate a cache on its own.
In our project we used a publish-subscribe mechanism. Whenever an object of class A is updated or inserted, it is published to a bus. The controllers register to listen for objects on the bus. Suppose controller A is interested in changes to object A: it will not be called back when an object of type B is changed and published. When an object of type A is changed and published, controller A's listener function updates the cache with the latest changes to object A. A subsequent GET /companies request will then get the latest data from the cache. There is a time gap between object A changing and the cache being refreshed with the latest changes. To avoid something going wrong in this gap, the object is marked dirty before object A changes, so a request arriving in between waits for the dirty flag to be cleared.
There is also a browser cache. As I remember, ETags are used to validate it. An ETag is typically a checksum of the resource. For this, the client has to keep the old ETag value somehow. If the checksum of the resource has changed, the new resource is sent with HTTP 200; otherwise HTTP 304 (use local copy) is sent.
[Update]
PUT /companies/Nintendo
GET /companies
are two different resources. When the PUT /companies/Nintendo request is executed, only the cache for /companies/Nintendo is expected to be updated, not the one for /companies (I am talking about the client-side cache). If you call GET /companies/Nintendo next time, the response is returned based on the HTTP headers. GET /companies is a brand new request, as it points to a different resource.
Now the question is: what should the HTTP headers be? That is purely application specific. If it were a stock quote, I would not cache it. If it were a news item, I would cache it for a certain time. Your reference link http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html has all the details of the cache-related HTTP headers. The only thing not covered much is ETag usage. An ETag can hold a checksum of the resource. Check http://en.wikipedia.org/wiki/HTTP_ETag and also check https://devcenter.heroku.com/articles/increasing-application-performance-with-http-cache-headers

Does a server have to carry out an operation before redirection?

In "Is HTTP 303 acceptable for other HTTP methods?" we established that HTTP 303 can be used for other HTTP methods.
The Post/Redirect/Get pattern requires the server to carry out an operation before returning HTTP 303. Is the same true for HTTP PUT and DELETE for this and other types of redirects? Is the server required to carry out the operation before redirection? Or can it assume that the client will repeat the request on the canonical URL as necessary?
This becomes even more interesting when you consider the fact that redirection is often used for load-balancing.
Quoting RESTful Web Services page 378:
303 ("See Other")
The request has been processed, but instead of the server sending a response document,
it’s sending the client the URI of a response document. This may be the URI to a static
status message, or the URI to some more interesting resource.
A few pages later...
307 (“Temporary Redirect”)
The request has not been processed, because the requested resource is not home: it’s
located at some other URI. The client should resubmit the request to another URI.
For GET requests, where the only thing being requested is that the server send a representation, this status code is identical to 303 (“See Other”). A typical case where 307 is a good response to a GET is when the server wants to send a client to a mirror site. But for POST, PUT, and DELETE requests, where the server is expected to take some
action in response to the request, this status code is significantly different from 303.
A 303 in response to a POST, PUT, or DELETE means that the operation has succeeded
but that the response entity-body is not being sent along with this request. If the client
wants the response entity-body, it needs to make a GET request to another URI.
A 307 in response to a POST, PUT, or DELETE means that the server has not even tried
to perform the operation. The client needs to resubmit the entire request to the URI in
the Location header.
An analogy may help. You go to a pharmacy with a prescription to be filled. A 303 is
the pharmacist saying “We’ve filled your prescription. Go to the next window to pick
up your medicine.” A 307 is the pharmacist saying “We can’t fill that prescription. Go
to the pharmacy next door.”
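The difference is visible in what a client does next. The following Python sketch models only the follow-up request; `follow_redirect` is a hypothetical helper and other redirect codes are deliberately left out.

```python
def follow_redirect(status, method, body, location):
    """Return the (method, url, body) of the follow-up request a client would send."""
    if status == 303:
        # The operation was already carried out: only retrieve the result document.
        follow_up_method = "HEAD" if method == "HEAD" else "GET"
        return (follow_up_method, location, None)
    if status == 307:
        # The operation was NOT carried out: repeat the same request at the new URI.
        return (method, location, body)
    raise ValueError(f"status {status} is not covered by this sketch")


after_303 = follow_redirect(303, "PUT", "payload", "/status/42")
after_307 = follow_redirect(307, "PUT", "payload", "/mirror/items/42")
```

After a 303 the PUT body is never sent again, so the server must already have acted on it; after a 307 the client resends the PUT with the same body, so the server must not have acted on it yet.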

What does a proper etag response look like?

We were wondering,
Header-wise, what does a proper ETag response look like?
Etag response, in the sense that an e-tagged request is made, and yes it matches the etag on our end, thus no content must be sent.
Does it need to contain a content-length header?
Do we use a 304 header response?
Clarification:
We want to handle ETags via PHP.
The flow is as follows:
a) Etagged request comes in.
b) PHP checks the etag to see if it meets what we think is a proper condition NOT to send back a full document body.
c) What do we manually send back via php to signal to the browser to use the cached content?
Thanks!
I answered my own question after reading the wiki, as lanzz suggested in the comment above:
"In this subsequent request, the server may now compare the client's
ETag with the ETag for the current version of the resource. If the
ETag values match, meaning that the resource has not changed, then the
server may send back a very short response with an HTTP 304 Not
Modified status. The 304 status tells the client that its cached
version is still good and that it should use that."
So yes, a 304 response is the correct way to answer an etagged request if you want the user agent to use the cached copy.
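The question asks about PHP, but the decision logic is the same in any language, so here is a minimal Python sketch of it. The names `make_etag` and `respond` and the hash-based ETag are assumptions for illustration, and the header parsing is simplified (it splits the list on commas). The 304 carries no body and does not need a Content-Length header, and it repeats the ETag so the client can update its stored metadata.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Strong ETag derived from the body; the hashing scheme is an arbitrary choice."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def _strip_weak(tag: str) -> str:
    """Drop a leading W/ so that tags can be compared weakly."""
    return tag[2:] if tag.startswith("W/") else tag


def respond(body: bytes, if_none_match=None):
    """Return (status, headers, payload) for a GET, honoring If-None-Match."""
    etag = make_etag(body)
    if if_none_match is not None:
        candidates = [t.strip() for t in if_none_match.split(",")]
        # If-None-Match uses weak comparison, so a W/ prefix is ignored.
        if "*" in candidates or etag in [_strip_weak(t) for t in candidates]:
            # The client's copy is current: send headers only, no body.
            return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag, "Content-Length": str(len(body))}, body


page = b"<html>hello</html>"
status, headers, payload = respond(page)
revalidation = respond(page, headers["ETag"])
```

The first call returns a 200 with the full body and an ETag; sending that ETag back in If-None-Match yields a 304 with an empty payload.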

Resources