Can Varnish generate ETags? - caching

Is there some way to have Varnish generate an ETag for a backend response it receives and add it to the response? I would prefer to have all ETag logic in Varnish instead of configuring this for all my backend nodes individually.
I'm using Varnish 4.0.0.

Etags are not currently implemented in Varnish (see the wiki).

You can create the etag header and its value in VCL if you wish.
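sub vcl_backend_response {
    if (!beresp.http.Etag) {
        # An entity tag must be a quoted string, so the long-string
        # form {"..."} is used here to embed the double quotes.
        set beresp.http.Etag = {"W/"foo""};
    }
}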
The main issue here is how to make the Etag reflect the body of the object. You'll have to know how your application works to do this safely. One approach could be to feed the Date response header along with the URL to libvmod-digest, and set the hash output as the Etag.
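To make the hashing idea concrete, here is a minimal sketch in Python rather than VCL. It is not libvmod-digest; the function name weak_etag, the use of SHA-256 and the 16-character truncation are assumptions made for the example.

```python
import hashlib


def weak_etag(url: str, date_header: str) -> str:
    """Build a weak ETag from the URL and the Date response header.

    Hypothetical helper: it only mirrors the hashing idea. The tag does
    not reflect the response body, which is why it is marked weak (W/).
    """
    digest = hashlib.sha256(f"{url}\n{date_header}".encode("utf-8")).hexdigest()
    # Entity tags are quoted strings; 16 hex characters keep the header short.
    return f'W/"{digest[:16]}"'


if __name__ == "__main__":
    print(weak_etag("/index.html", "Wed, 15 Nov 1995 04:58:08 GMT"))
```

The same URL and Date always give the same tag, and a new Date gives a new one, so the tag is only as trustworthy as the assumption that the Date header changes whenever the body does.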
In Varnish 4.0.0 you do have support (the wiki is outdated) for sending If-Modified-Since/If-None-Match to the backend, so if you choose to do this in VCL, remember to filter those headers in vcl_backend_fetch so you don't confuse your backend.
In general I'd advise against doing this in VCL. Adding it on the backend is usually just a matter of enabling a module. The actual change in VCL is simple, but this is one of the tricky parts of HTTP and it is easy to get wrong.

Related

How does Varnish know how long to cache each response for?

Does Varnish simply follow the Cache-Control header from the origin server?
And are there any other ways that you can control how long it caches a response for? For example, can you tell Varnish to cache a response “indefinitely” (i.e. “until further notice”) and then later explicitly instruct it to delete that object from the cache when you know the underlying data has changed?
(Please note: I've never used Varnish; I'm just trying to work out whether it would be a good fit for a forthcoming project.)
Those are fairly basic questions, so I think you should start by reading the docs at https://www.varnish-cache.org/docs/
To answer your question: it depends on how you configure Varnish.
You can leave the defaults, in which case the TTL is taken from the backend's caching headers (Cache-Control max-age or s-maxage, or Expires);
You can set a different TTL (time to live) per domain, backend, file type, cookie and so on.
If you set, for example, a one-year TTL, you can still remove an object from the cache early by purging a specific URL or a whole domain.
You can do so in two ways:
by PURGE HTTP Method if you have it configured in your vcl file
by using purge command in varnishadm/varnish console
https://www.varnish-cache.org/docs/2.1/tutorial/purging.html
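The "long TTL plus explicit purge" pattern can be illustrated with a toy in-memory cache. This is only a sketch of the concept in Python, not how Varnish is implemented; the class TtlCache and its method names are made up for the example.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Toy cache: entries expire after a TTL or when explicitly purged."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, bytes]] = {}

    def put(self, url: str, body: bytes, ttl_seconds: float) -> None:
        self._store[url] = (self._clock() + ttl_seconds, body)

    def get(self, url: str) -> Optional[bytes]:
        entry = self._store.get(url)
        if entry is None:
            return None
        expires_at, body = entry
        if self._clock() >= expires_at:
            del self._store[url]  # expired: behave like a cache miss
            return None
        return body

    def purge(self, url: str) -> bool:
        """Drop one URL before its TTL runs out; True if it was cached."""
        return self._store.pop(url, None) is not None
```

With a one-year TTL, put() followed by purge() behaves like "cache until further notice": the object stays available until the application decides it has changed.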

High-performance passive-access-optimised dynamic REST web pages

The following question is about a caching framework to be implemented or already existing for the REST-inspired behaviour described in the following.
The goal is that GET and HEAD requests should be handled as efficiently as requests to static pages.
In terms of technology, I think of Java Servlets and MySQL to implement the site. (But emergence of good reasons may still impact my choice of technology.)
The web pages should support GET, HEAD and POST; GET and HEAD being much more frequent than POST. The page content will not change with GET/HEAD, only with POST. Therefore, I want to serve GET and HEAD requests directly from the file system and only POST requests from the servlet.
A first (slightly incomplete) idea is that the POST request would pre-calculate the HTML for successive GET/HEAD requests and store it into the file system. GET/HEAD then would always obtain the file from there. I believe that this could easily be implemented in Apache with conditional URL rewriting.
The more refined approach is that GET would serve the HTML from the file system (and HEAD use it, too), if there is a pre-computed file, and otherwise would invoke the servlet machinery to generate it on the fly. POST in this case would not generate any HTML, but only update the database appropriately and delete the HTML file from the file system as a flag to have it generated anew with the next GET/HEAD. The advantage of this second approach is that it handles more gracefully the “initial phase” of the web pages, where no POST has been called yet. I believe that this lazy-generate-and-store approach could be implemented in Apache by providing an error-handler, which would invoke the servlet in case of “file-not-found-but-should-be-there”.
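The lazy-generate-and-store flow could look roughly like the following. It is a Python sketch rather than a servlet, and LazyPageStore, render and invalidate are hypothetical names; in the architecture described here the "file exists" branch would be handled by the web server itself, not by application code.

```python
from pathlib import Path
from typing import Callable


class LazyPageStore:
    """Serve pre-computed HTML from disk, generating it on first access."""

    def __init__(self, root: Path, render: Callable[[str], str]) -> None:
        self.root = root
        self.render = render  # stands in for the servlet and database work

    def _path(self, page_id: str) -> Path:
        # Assumes page_id is a safe file name (no slashes or "..").
        return self.root / f"{page_id}.html"

    def get(self, page_id: str) -> str:
        path = self._path(page_id)
        if path.exists():
            return path.read_text(encoding="utf-8")  # fast path: static file
        html = self.render(page_id)  # slow path: generate and store
        path.write_text(html, encoding="utf-8")
        return html

    def invalidate(self, page_id: str) -> None:
        """Call after a POST has updated the database."""
        self._path(page_id).unlink(missing_ok=True)
```

A production version would also have to write the file atomically (for example to a temporary name followed by a rename) so that a concurrent GET never sees a half-written page.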
In a later round of refinement, to save bandwidth, the cached HTML files should also be available in a gzip-ed version which is served when the client understands that. I believe that the basic mechanisms should be the same as for the uncompressed HTML files.
Since there will be many such REST-like pages, both approaches might occasionally need some mechanism to garbage-collect rarely used HTML files in order to save file space.
To summarise, I am confident that my GET/HEAD-optimised architecture can be cleanly implemented. I would like to have opinions on the idea as such in the first place (I believe it is good, but I may be wrong) and to hear whether somebody already has experience with such an architecture, or perhaps even knows a free framework implementing it.
Finally, I'd like to note that client caching is not the solution I am after, because multiple different clients will GET or HEAD the same page. Moreover, I want to absolutely avoid the servlet machinery during GET/HEAD requests in case the pre-computed file exists. It should not even be invoked to provide cache-related HTTP headers in GET/HEAD requests nor dump a file to output.
The questions are:
Are there better (standard) mechanisms available to reach the goal stated at the beginning?
If not, does anybody know about an existing framework like the one I consider?
I think that an HTTP cache does not reach my goal. As far as I understand, the HTTP cache would still need to invoke the servlet with a HEAD request in order to learn whether a POST has meanwhile changed the page. Since page changes will come at unpredictable points in time, an HTTP header stating an expiration time is not good enough.
Use Expires HTTP Header and/or HTTP conditional requests.
Expires
The Expires entity-header field gives the date/time after which the response is considered stale. A stale cache entry may not normally be returned by a cache (either a proxy cache or a user agent cache) unless it is first validated with the origin server (or with an intermediate cache that has a fresh copy of the entity). See section 13.2 for further discussion of the expiration model.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
Conditional Requests
Decorate cacheable responses with Expires, Last-Modified and/or ETag headers. Make requests conditional with the If-Modified-Since, If-None-Match and other If-* headers (see the RFC).
e.g.
Last response headers:
...
Expires: Wed, 15 Nov 1995 04:58:08 GMT
...
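Do not issue a new request for the resource before the expiration date (the Expires header); after that, perform a conditional request. The If-Modified-Since value is normally taken from the Last-Modified header of the cached response: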
...
If-Modified-Since: Wed, 15 Nov 1995 04:58:08 GMT
...
If the resource wasn't modified, a 304 Not Modified response without a body is returned. Otherwise a 200 OK response with a body is returned.
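For illustration, the server-side decision between 304 and 200 might be sketched like this in Python. The function name response_status is made up, header lookup is assumed to be case-exact, and splitting If-None-Match on commas is a simplification of the real grammar.

```python
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def _opaque(tag: str) -> str:
    """Strip whitespace and an optional weak prefix (W/) from an entity tag."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def response_status(
    request_headers: Mapping[str, str],
    etag: Optional[str],
    last_modified: Optional[str],
) -> int:
    """Return 304 if the client's copy is still valid, otherwise 200."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        candidates = [_opaque(t) for t in if_none_match.split(",")]
        matched = "*" in candidates or (
            etag is not None and _opaque(etag) in candidates
        )
        return 304 if matched else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None and last_modified is not None:
        try:
            if parsedate_to_datetime(last_modified) <= parsedate_to_datetime(
                if_modified_since
            ):
                return 304
        except (TypeError, ValueError):
            pass  # unparseable or incomparable dates: ignore the condition
    return 200
```

A request carrying the same date the resource was last modified gets 304, while a request with an older date, or with no conditional headers at all, gets 200 and the full body.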
Note: HTTP RFC also defines Cache-Control header
See Caching in HTTP
http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html

Asking Chrome to bypass local cache for XmlHttpRequest like it's possible in Firefox?

As some of you may already know, Firefox and Chrome have caching issues with requests initiated through the XMLHttpRequest object. The browser does not strictly follow the rules and does not go to the server for a new copy of, for example, an XSLT file. The response has no Expires header (for performance reasons we can't use it).
Firefox has additional parameter in the XHR object "channel" to which you put value Components.interfaces.nsIRequest.LOAD_BYPASS_CACHE to go to server explicitly.
Does something like that exist for Chrome?
Let me immediately stop anyone who would recommend adding a timestamp or a random integer as a GET parameter: I don't want the server to receive different URLs. I want it to get the original URL. The reason is that I want to protect the server from getting too many distinct requests for simple static files and from sending too much data to clients when it is not needed.
Requesting a static file with a generated GET parameter (like '?forcenew=12314') yields a 200 response the first time and a 304 for every following request with that same value. I want requests that always return 304 if the target static file is identical to the client's version. This is, by the way, how web browsers should work out of the box, but XHR objects tend not to go to the server at all to ask whether the file has changed.
In my main project at work I had the same exact problem. My solution was not to append random strings or timestamps to GET requests, but to append a specific string to GET requests.
If you have a revision number, e.g. a Subversion revision or the equivalent from Git, Mercurial or whatever you are using, append that. Static files will get 304 responses until the moment a new revision is released. When the new release happens, a single 200 response is returned and then it is back to happily generating 304 responses. :-)
This has the added bonus of being browser independent.
Should you be unlucky and not have a revision number, then make one up and increment it each time you make a release.
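As a rough sketch of the idea (in Python rather than in the browser), a helper could attach the release revision to every static URL. The function name with_revision and the parameter name v are assumptions for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_revision(url: str, revision: str, param: str = "v") -> str:
    """Return url with its revision parameter set to the given release."""
    parts = urlsplit(url)
    # Keep other query parameters, but replace any earlier revision value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, revision))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(with_revision("/static/style.xsl", "1042"))
```

Because the value only changes at release time, every client requests the same URL between releases, which is what keeps the server answering with 304.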
You should look into ETags. An ETag is a key that can be generated from the contents of the file, so once the file on the server changes, the server will send a new ETag. This is a server-side change, which you will need anyway given that you want one 200 followed by 304s. Chrome and Firefox should respect ETags, so you shouldn't need any crazy client-side hacks.
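Generating such a key from the file contents takes only a few lines. This is a minimal Python sketch; the name content_etag and the choice of a truncated SHA-256 digest are assumptions, and real servers often derive the tag from file metadata instead.

```python
import hashlib


def content_etag(body: bytes) -> str:
    """Return a strong ETag derived from the response body."""
    # Entity tags are quoted strings; the digest changes whenever the body does.
    return '"' + hashlib.sha256(body).hexdigest()[:32] + '"'


if __name__ == "__main__":
    print(content_etag(b"<xsl:stylesheet/>"))
```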
Chrome now supports Cache-Control: max-age=0 request HTTP header. You can set it after you open an XMLHttpRequest instance:
xhr.setRequestHeader( "Cache-Control", "max-age=0" );
This instructs Chrome not to use a cached response without revalidating it.
For more information check The State of Browser Caching, Revisited by Mark Nottingham and RFC 7234 Hypertext Transfer Protocol (HTTP/1.1): Caching.

Can I clear a specific URL in the browser's cache (using POST, or otherwise)?

The Problem
There's an item (foo.js) that rarely changes. I'd like this item to be stored in the browser's cache (using Expires header). However, when it does change, I'd like the browser to update to the newest version.
The Attempt
Foo.js is returned with a far future Expires header. It's cached on the browser and requires no round trip query to the server. Just the way I like it. Now, when it changes....
Let's assume I know that the user's version of foo.js is outdated. How can I force a fresh copy of it to be obtained? I use xhr to perform a POST to foo.js. This should, in theory, force the browser to get a newer version of foo.js.
Unfortunately, this only seems to work in Firefox. Other browsers will use their cached copy, even if other POST parameters are set.
WTF
First off, is there a way to do what I'm trying to do?
Second, why is there no sensible key/value type of cache in browsers? Why can I not simply include in the headers "Cache: some_key, some_expiration_time" and also specify "Clear-Cache: key1, key2, key3" (the keys must be domain specific, of course)? Instead, we're stuck with either expensive round-trips that ask "is content new?", or the ridiculous "guess how long it'll be before you modify something" Expires header.
Thanks
Any comments on this matter are greatly appreciated.
Edits
I realize that adding a version number to the file would solve this. However, in my case it is not possible -- the call to "foo.js" is hardcoded into a bookmarklet.
You can just add a query string to the end of the URL. The server can ignore it, but the browser can't: it must treat it as a new request:
http://www.site.com/foo.js?v=1.12345
Many people use this approach, SO uses a hash of some sort, I use the build number (so users get a new version each build). If either of these is an option, you get the benefit of long duration cache headers, but still force a fetch of a new copy when needed.
Why set your cache expiration so far in the future? If you set it to one day for instance, the only overhead that you will incur (once a day) is the browser revalidating that it is the same file. If you still have not changed it, then you will not re-download the file, the server will respond with a not-modified response.
All caches have a set of rules that they use to determine when to serve a representation from the cache, if it’s available. Some of these rules are set in the protocols (HTTP 1.0 and 1.1), and some are set by the administrator of the cache (either the user of the browser cache, or the proxy administrator).
Generally speaking, these are the most common rules that are followed (don’t worry if you don’t understand the details, it will be explained below):
If the response’s headers tell the cache not to keep it, it won’t.
If the request is authenticated or secure (i.e., HTTPS), it won’t be cached.
A cached representation is considered fresh (that is, able to be sent to a client without checking with the origin server) if:
* It has an expiry time or other age-controlling header set, and is still within the fresh period, or
* If the cache has seen the representation recently, and it was modified relatively long ago.
Fresh representations are served directly from the cache, without checking with the origin server.
If a representation is stale, the origin server will be asked to validate it, or tell the cache whether the copy that it has is still good.
Under certain circumstances (for example, when it’s disconnected from a network) a cache can serve stale responses without checking with the origin server.
If no validator (an ETag or Last-Modified header) is present on a response, and it doesn't have any explicit freshness information, it will usually, but not always, be considered uncacheable.
Together, freshness and validation are the most important ways that a cache works with content. A fresh representation will be available instantly from the cache, while a validated representation will avoid sending the entire representation over again if it hasn’t changed.
http://www.mnot.net/cache_docs/#BROWSER
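The freshness rules quoted above can be sketched as a small calculation. This is a simplified Python illustration, not any browser's actual algorithm: the function names are made up, all inputs are plain seconds, and the 10% heuristic fraction is the typical value mentioned in RFC 7234, which real caches are free to vary.

```python
from typing import Optional

HEURISTIC_FRACTION = 0.1  # assumed; RFC 7234 mentions 10% as a typical setting


def freshness_lifetime(
    max_age: Optional[float],
    expires_minus_date: Optional[float],
    date_minus_last_modified: Optional[float],
) -> Optional[float]:
    """Return how long a response stays fresh, or None if unknown."""
    if max_age is not None:
        return max_age  # explicit Cache-Control: max-age wins
    if expires_minus_date is not None:
        return max(expires_minus_date, 0.0)  # Expires in the past means stale
    if date_minus_last_modified is not None:
        # Heuristic: modified long ago, so probably not changing soon.
        return HEURISTIC_FRACTION * date_minus_last_modified
    return None


def is_fresh(current_age: float, lifetime: Optional[float]) -> bool:
    """A response is fresh while its age is below its freshness lifetime."""
    return lifetime is not None and current_age < lifetime
```

For a file last modified ten days before it was fetched and sent without any explicit expiry, this sketch treats the copy as fresh for one day, after which the cache would revalidate it.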
There is an excellent suggestion made in this thread: How can I make the browser see CSS and Javascript changes?
See the accepted answer by user, "grom".
The idea is to use the "modified" time stamp from the server to note when the file has been modified, and adding a version parameter to the end of the URL, making your CSS and JS files have URLs like this: my.js?version=12345678
This makes the browser think it is a new file, and so it does not refer to the cached version.
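A minimal sketch of that approach, written in Python instead of PHP, could look like this. The function name versioned_url is hypothetical, and the version parameter name simply follows the my.js?version=12345678 example above.

```python
import os


def versioned_url(url_path: str, file_path: str) -> str:
    """Append the file's last-modified time (in whole seconds) as a version."""
    mtime = int(os.path.getmtime(file_path))
    return f"{url_path}?version={mtime}"


if __name__ == "__main__":
    # Uses this script itself as the "static file" for demonstration.
    print(versioned_url("my.js", __file__))
```

The URL only changes when the file on disk changes, so browsers keep using their cached copy until a new version is actually deployed.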
I am using a similar method in my app. It works pretty well. Of course, this would assume you are using something like PHP to process your HTML.
Here is another link with a simpler implementation for WordPress: http://markjaquith.wordpress.com/2009/05/04/force-css-changes-to-go-live-immediately/
With these constraints I guess your only option is to use window.location.reload(true) and force the browser to refresh all the cached items. It's not pretty.
You can invalidate the cache for a specific URL using the Cache-Control HTTP header.
On your desired URL you can run (with xhr/ajax for instance) a request with following headers :
headers: {
    'Cache-Control': 'no-cache, no-store, must-revalidate, max-age=0',
    Pragma: 'no-cache',
    Expires: '0',
}
Your cache will be invalidated, and next GET requests will return a brand new result.

How do I set the don't cache header for an html file using apache?

I'm doing a little bit of ajax where I get a static html file that is actually changed on the disk from time to time. Of course IE has a problem where it wants to help out by caching the file which I don't want. I know how to fix this when grabbing a dynamic file: you just change the header in the dynamic file. But how do I do this for the static html file? Note that I am using apache.
Thanks
At the Apache level you can set the expiry date of the document using the mod_expires module.
From the documentation:
This module controls the setting of the Expires HTTP header and the max-age directive of the Cache-Control HTTP header in server responses. The expiration date can set to be relative to either the time the source file was last modified, or to the time of the client access.
These HTTP headers are an instruction to the client about the document's validity and persistence. If cached, the document may be fetched from the cache rather than from the source until this time has passed. After that, the cache copy is considered "expired" and invalid, and a new copy must be obtained from the source.
More details at http://httpd.apache.org/docs/2.0/mod/mod_expires.html
If you can't use mod_expires as Marcel suggested, you can always append a random request parameter.
For example, instead of requesting static_file.html you can request static_file.html?_=1231231231 and change that request parameter every time.
jQuery has a really simple way of doing this:
$.ajax({cache: false, url: "static_file.html"});
