Getting ETags right - performance

I’ve been reading a book and I have a particular question about the ETag chapter. The author says that ETags might harm performance and that you must tune them finely or disable them completely.
I already know what ETags are and understand the risks, but is it that hard to get ETags right?
I’ve just made an application that sends an ETag whose value is the MD5 hash of the response body. This is a simple solution, easy to achieve in many languages.
Is using MD5 hash of the response body as ETag wrong? If so, why?
Why does the author (who obviously outsmarts me by many orders of magnitude) not propose such a simple solution?
This last question is hard to answer unless you are the author :), so I’m trying to find the weak points of using an MD5 hash as an ETag.

ETag is similar to the Last-Modified header. It's a mechanism for the client to determine whether a resource has changed.
An ETag needs to be a unique value representing the state and specific format of a resource (a resource could have multiple formats that each need their own ETag). Not unique across the entire domain of resources, simply within the resource.
Now, technically, an ETag has "infinite" resolution compared to a Last-Modified header. Last-Modified only changes at a granularity of 1 second, whereas an ETag can be sub second.
You can implement both ETag and Last-Modified, or simply one or the other (or none, of course). If your Last-Modified is not sufficient, then consider an ETag.
Mind, I would not set ETag for "every" resource. Basically, I wouldn't set it for anything that has no expectation of being cached (dynamic content notably). There's no point in that case, just wasted work.
Edit: I see your edit, so let me clarify.
MD5 is fine. The only downside is calculating MD5 all the time. Running MD5 on, say, a 200K PDF file, is expensive. Running MD5 on a resource that has no expectation of being cached is simply wasteful (i.e. dynamic content).
The trick is simply that whatever mechanism you use, it should be as cheap as Last-Modified typically is. Last-Modified is, again, typically, a property of the resource, and usually very cheap to access.
ETags should be similarly cheap. If you are using MD5, and you can cache/store the association between the resource and the MD5 hash, then that's a fine solution. However, recalculating the MD5 each time the ETag is necessary, is basically counter to the idea of using ETags to improve overall server performance.
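To make that concrete, here is a minimal sketch, assuming a file-based resource (the path, helper name and sidecar-file scheme are all illustrative, not a prescribed implementation): the MD5 is stored next to the file and only recomputed when the resource's mtime changes, so serving the ETag is normally just a couple of cheap stat calls.
<?php
// Hypothetical helper: cache the resource-to-MD5 association in a sidecar
// file so the hash is only recomputed when the resource actually changes.
function cheap_etag($path) {
    $sidecar = $path . '.md5';
    if (!is_file($sidecar) || filemtime($sidecar) < filemtime($path)) {
        file_put_contents($sidecar, md5_file($path));    // pay the hashing cost once
    }
    return '"' . file_get_contents($sidecar) . '"';      // ETag values are quoted strings
}

$path = 'docs/report.pdf';                               // hypothetical resource
header('ETag: ' . cheap_etag($path));
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', filemtime($path)) . ' GMT');
?>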

We're using ETags for our dynamic content on instela.
Our strategy is: at the end of output generation, compute the MD5 hash of the content to be sent, and if an If-None-Match header exists, compare it with the generated hash. If the two values are the same, we send a 304 code and interrupt the request without returning any content.
It's true that we spend a bit of CPU to hash the content, but in the end we save a lot of bandwidth.
We have a Facebook-newsfeed-style main page which has different content for every user. As the newsfeed content changes only 3-4 times per hour, main page refreshes become very efficient for the client side. In the mobile era I think it's better to spend a bit more CPU time than to spend bandwidth. Bandwidth is still more expensive than CPU, and it makes for a better experience for the client.
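A rough sketch of that flow in PHP (the page-building step is represented by a placeholder echo; in reality it is the dynamic newsfeed):
<?php
// Buffer the generated page, hash it, and answer 304 if the client's
// If-None-Match matches the hash.
ob_start();
echo '<html><body>...newsfeed for the current user...</body></html>';
$body = ob_get_clean();

$etag = '"' . md5($body) . '"';
header('ETag: ' . $etag);

if (isset($_SERVER['HTTP_IF_NONE_MATCH']) &&
    trim($_SERVER['HTTP_IF_NONE_MATCH']) === $etag) {
    http_response_code(304);   // client already has this version: send headers only
    exit;
}

echo $body;                    // new or changed content: send the full page
?>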

Having not read the book, I can't speak on the author's precise concerns.
However, the generation of ETags should be such that an ETag is only generated once when a page has changed. Generating an MD5 hash of a web page costs processing power and time on the server; if you have many clients connecting, it could start to cause performance problems.
Thus, you need a good technique for generating ETags only when necessary and caching them on the server until the related page changes.

I think the perceived problem with ETags is probably that your browser has to issue and parse a (simple and small) request/response for every resource on your page to check whether the ETag value has changed server-side.
I personally find these extra small round trips to the server acceptable for often-changing images, CSS and JavaScript (the server does not need to resend the content if the browser's ETag is current), since the mechanism makes it quite easy to mark 'updated' content.

Related

What is the best way to generate an ETag based on the timestamp of the resource

So in one of my projects I have to create an HTTP cache to handle multiple API calls to the server. I read about this ETag header that can be used with a conditional GET to minimize server load and enable caching. However, I have a problem with generating the ETag. I can use the LAST_UPDATED_TIMESTAMP of the resource as the ETag, or hash it using some sort of hashing algorithm like MD5. But what would be the best way to do this? Are there any cons to using the raw timestamp as the ETag?
Any supportive answer is highly appreciated. Thanks in advance... Cheers!
If your timestamp has enough precision that you can guarantee it will change any time the resource changes, then you can use an encoding of the timestamp (the header value needs to be ASCII).
But bear in mind that ETag may not save you much. It's just a cache revalidation header, so you will still get as many requests from clients, just some will be conditional, and you may then be able to avoid sending payload back if the ETag didn't change, but you will still incur some work figuring that out (maybe a bunch less work, so could be worth it).
In fact, several versions of IIS used the file timestamp to generate an ETag. We tripped over that when building WinGate's cache module, when a whole bunch of files with the same timestamp ended up with the same ETag, and we learned that an ETag is only valid in the context of the request URI.
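A minimal sketch of the timestamp approach (the timestamp lookup is faked here; in reality it comes from your data layer for the requested resource):
<?php
// The ETag is just an ASCII encoding of the resource's last-updated
// timestamp; it only needs to be unique per request URI.
$lastUpdated = 1672531200;                       // e.g. SELECT last_updated FROM ...

$etag = '"' . dechex($lastUpdated) . '"';
header('ETag: ' . $etag);
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastUpdated) . ' GMT');

if (isset($_SERVER['HTTP_IF_NONE_MATCH']) &&
    trim($_SERVER['HTTP_IF_NONE_MATCH']) === $etag) {
    http_response_code(304);                     // client's copy is still current
    exit;
}
// ...otherwise build and send the response body as usual.
?>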

How to check for an image update on a website?

How can I check, from another website, whether an image file on a website has changed, and then store both the new and the old version?
I'm using this to log the images on the server.
This is just a quick sketch of the simplest approach. If you want more detail on some part, just ask in the comments.
Sketch of solution
Download the image, compute a hash for it and store the image in the file system, with the image ID + hash + file system path (and possibly other info such as the time of the request) in a database.
When checking for an update, get the last available info for the same ID from the database; if the hashes are the same, the image was not updated. If you use a cryptographic hash like MD5 or SHA1 and the hash changed, it is almost certain that the image changed too.
Set up a cron job to run the script periodically.
To download the image, you could use $img = file_get_contents($url);. MD5 can be computed via $hash = md5($img);, SHA1 via $hash = sha1($img);. For storing use file_put_contents($path, $img);.
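Putting those pieces together, here is a small sketch (the URL, paths and table layout are illustrative; the bookkeeping uses the pdo_sqlite extension, but any store would do):
<?php
$url  = 'https://example.com/logo.png';
$img  = file_get_contents($url);                 // download the image
$hash = md5($img);                               // or sha1($img)

$pdo = new PDO('sqlite:images.db');
$pdo->exec('CREATE TABLE IF NOT EXISTS images
            (id TEXT, hash TEXT, path TEXT, fetched_at TEXT)');

$stmt = $pdo->prepare('SELECT hash FROM images WHERE id = ?
                       ORDER BY fetched_at DESC LIMIT 1');
$stmt->execute(array($url));
$lastHash = $stmt->fetchColumn();

if ($lastHash !== $hash) {                       // first fetch or changed image
    if (!is_dir('archive')) {
        mkdir('archive', 0777, true);
    }
    $path = 'archive/' . $hash . '.png';
    file_put_contents($path, $img);              // keep this version on disk

    $ins = $pdo->prepare('INSERT INTO images VALUES (?, ?, ?, ?)');
    $ins->execute(array($url, $hash, $path, date('c')));
}
?>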
Optimization
There are several ways to optimize the job.
To cut memory consumption, download the file directly to the file system using copy($url, $path); (or file_put_contents($path, fopen($url, 'rb'));) and only after that compute the hash using $hash = md5_file($path); or $hash = sha1_file($path);. This is better for larger images. The downside is that you have to read the data back from the file system to compute the hash, so it might actually be slower.
Side note: Never optimize anything before you know that it really makes the code better. Always measure before, after and compare.
Another optimization could be done to save on data transfers from the remote server, if the server sends reliable caching headers. ETag is the best one because it should be based on the contents of the file: if it does not change, the file should be the same. If you want to check just the headers, use $headers = get_headers($url, 1);. To fetch really just the headers, you should issue the HTTP request via the HEAD method instead of GET; check the get_headers() manual for more info. To check the headers while also getting the response body, use file_get_contents() along with the $http_response_header special variable.
Issuing requests indicating that you cached the image on last visit (via If-Modified-Since et al.) could serve the same purpose.
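For example, a sketch of such a conditional request: send the ETag saved from the previous fetch and look at the status line to see whether anything changed ($url and $storedEtag are illustrative):
<?php
$url        = 'https://example.com/logo.png';
$storedEtag = '"abc123"';                        // whatever you saved last time

$context = stream_context_create(array('http' => array(
    'header'        => 'If-None-Match: ' . $storedEtag,
    'ignore_errors' => true,                     // so a 304 doesn't trigger a warning
)));

$body   = file_get_contents($url, false, $context);
$status = $http_response_header[0];              // e.g. "HTTP/1.1 304 Not Modified"

if (strpos($status, ' 304 ') !== false) {
    // Unchanged since the last fetch: nothing to hash or store.
} else {
    // Changed (or the server ignored the ETag): process $body as usual.
}
?>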
Etiquette and legal aspects
I told you how. Now I’ll tell you when (not).
Do not abuse the remote server. Remember that its owner has expenses to keep it up and running and definitely does not want it tied up by your scripts for more than a negligible amount of time, or to transfer much data to them. Adapt your polling period to the type of target server and the size of the image. Adapting it to the estimated frequency of change is not a bad idea either.
Be sure to have the consent of the image's rights holder when storing a copy. Licensing can be a messy thing; be careful, otherwise you can get into trouble.
If you plan to somehow crawl for the images, the robots.txt standard might be of interest to you. This file could tell you that you are not welcome, and you should respect that.
Related questions
Some are related more, some less. People mostly want to watch HTML pages; that has other specifics, which is also why I did not flag this question as a duplicate of one of these.
https://stackoverflow.com/q/11336431/2157640
https://stackoverflow.com/q/11182825/2157640
https://stackoverflow.com/q/13398512/2157640
https://stackoverflow.com/q/15207145/2157640
https://stackoverflow.com/q/1494488/2157640

Can HTTP headers be too big for browsers?

I am building an AJAX application that uses both the HTTP content and HTTP headers to send and receive data. Is there a point at which data in an HTTP header won't be read by the browser because it is too big? If yes, what is the limit, and is the behaviour the same in all browsers?
I know that theoretically there is no limit to the size of HTTP headers, but in practice, past what point could I run into problems on certain platforms, in certain browsers, or with certain software installed on the client machine? I am looking more for a guideline for the safe use of HTTP headers. In other words, to what extent can HTTP headers be used to transmit additional data without potential problems cropping up?
Thanks for all the input on this question; it was much appreciated and interesting. Thomas's answer got the bounty, but Jon Hanna's answer brought up a very good point about proxies.
Short answers:
Same behaviour: No
Lowest limit found in popular browsers:
10KB per header
256 KB for all headers in one response.
Test results from MacBook running Mac OS X 10.6.4:
Biggest response successfully loaded, all data in one header:
Opera 10: 150MB
Safari 5: 20MB
IE 6 via Wine: 10MB
Chrome 5: 250KB
Firefox 3.6: 10KB
Note
Those outrageously big headers in Opera, Safari and IE took minutes to load.
Note on Chrome:
The actual limit seems to be 256KB for the whole HTTP header.
The error message reads: "Error 325 (net::ERR_RESPONSE_HEADERS_TOO_BIG): Unknown error."
Note on Firefox:
When sending the data through multiple headers, 100MB worked fine, just split up over 10,000 headers.
My Conclusion:
If you want to support all popular browsers, 10KB per header seems to be the limit, and 256KB for all headers together.
My PHP Code used to generate those responses:
<?php
// Build a response with one very large custom header to probe the browser's limit.
ini_set('memory_limit', '1024M');
set_time_limit(90);

$bytes  = 256000;                      // header size to test, in bytes
$header = str_repeat('1', $bytes);     // filler value for the custom header

header("MyData: ".$header);

/* Firefox: send the data split across multiple headers instead
for ($i = 1; $i < 1000; $i++) {
    header("MyData".$i.": ".$header);
}
*/

echo "Length of header: ".($bytes / 1024).' kilobytes';
?>
In practice, while there are rules prohibiting proxies from dropping certain headers (indeed, quite clear rules on which can be modified, and even on how to inform a proxy whether it can modify a new header added by a later standard), this only applies to "transparent" proxies, and not all proxies are transparent. In particular, some wipe headers they don't understand as a deliberate security practice.
Also, in practice some do misbehave (though things are much better than they were).
So, beyond the obvious core headers, the amount of header information you can depend on being passed from server to client is zero.
This is just one of the reasons why you should never depend on headers being used well (e.g., be prepared for the client to repeat a request for something it should have cached, or for the server to send the whole entity when you request a range), barring the obvious case of authentication headers (under the fail-to-secure principle).
Two things.
First of all, why not just run a test that gives the browser progressively larger and larger headers and wait till it hits a number that doesn't work? Just run it once in each browser. That's the most surefire way to figure this out. Even if it's not entirely comprehensive, you at least have some practical numbers to go off of, and those numbers will likely cover a huge majority of your users.
Second, I agree with everyone saying that this is a bad idea. It should not be hard to find a different solution if you are really that concerned about hitting the limit. Even if you do test on every browser, there are still firewalls, etc to worry about, and there is absolutely no way you will be able to test every combination (and I'm almost positive that no one else has done this before you). You will not be able to get a hard limit for every case.
Though in theory, this should all work out fine, there might later be that one edge case that bites you in the butt if you decide to do this.
TL;DR: This is a bad idea. Save yourself the trouble and find a real solution instead of a workaround.
Edit: Since you mention that the requests can come from several types of sources, why not just specify the source in the request header and have the data contained entirely in the body? Have some kind of Source or ClientType field in the header that specifies where the request is coming from. If it's coming from a browser, include the HTML in the body; if it's coming from a PHP application, put some PHP-specific stuff in there; etc etc. If the field is empty, don't add any extra data at all.
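A rough sketch of that suggestion (the X-Client-Type header name and the response bodies are made up for illustration):
<?php
$clientType = isset($_SERVER['HTTP_X_CLIENT_TYPE']) ? $_SERVER['HTTP_X_CLIENT_TYPE'] : '';

if ($clientType === 'browser') {
    header('Content-Type: text/html');
    echo '<div>...HTML fragment for the browser...</div>';
} elseif ($clientType === 'php-app') {
    header('Content-Type: application/json');
    echo json_encode(array('status' => 'ok', 'data' => '...'));
} else {
    // Unknown or absent client type: no extra data in the response.
}
?>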
The RFC for HTTP/1.1 clearly does not limit the length of the headers or the body.
According to this page, modern browsers (Firefox, Safari, Opera), with the exception of IE, can handle very long URIs: https://web.archive.org/web/20191019132547/https://boutell.com/newfaq/misc/urllength.html. I know that is different from receiving headers, but it at least shows that they can create and send huge HTTP requests (possibly of unlimited length).
If there's any limit in the browsers, it would be something like the size of the available memory or the limit of a variable type, etc.
Theoretically, there's no limit to the amount of data that can be sent to the browser. It's almost like saying there's a limit to the amount of content that can be in the body of a web page.
If possible, try to transmit the data through the body of the document. To be on the safe side, consider splitting the data up, so that there are multiple passes for loading.

fastest etag algorithm

We want to make use of HTTP caching on our website - in particular, content validation.
Because our CMS constructs pages from smaller fragments of content, the last-modified date of the actual page is not always an accurate indicator that the page has changed. Hence we also want to make use of ETags. Because page construction is based on lots of other page fragments, we think the only real way to provide an accurate ETag is by performing some sort of digest on the content stream itself. This seems a little overcooked, as caching is supposed to take load off the servers, but a content digest is obviously CPU-intensive.
I'm looking for the fastest algorithm to create a unique ETag that is relevant to the content stream (an inode or the like is just a kludge and won't work). An MD5 hash is obviously going to give the best unique result, but is anybody making use of other, faster algorithms in a similar situation?
Sorry, forgot the important details: we're using Java Servlets, running in WebSphere 6.1 on Windows 2003.
I forgot to mention that there are also live database feeds (we're a bank and need to make sure interest rates are up to date) that can also change the content. So figuring out when the content has changed can be tricky.
I would generate a checksum for each fragment, but compute it when the fragment is changed, not when you render the page.
This way, you pay a one-time cost, which should be relatively small, unless we're talking hundreds of changes per second, and there is no additional cost per request.
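A sketch of that approach (the original question is about Java Servlets, but the idea is language-neutral; PHP is used here as elsewhere in this thread, and the table and function names are illustrative): each fragment's checksum is stored when the fragment is saved, so rendering a page only concatenates precomputed hashes.
<?php
// On fragment save: store the checksum alongside the content.
function save_fragment(PDO $pdo, $id, $content) {
    $stmt = $pdo->prepare('UPDATE fragments SET content = ?, checksum = ? WHERE id = ?');
    $stmt->execute(array($content, md5($content), $id));
}

// On page render: the page ETag is a hash of its fragments' checksums.
// Live data feeds can be covered the same way if feed rows carry checksums.
function page_etag(PDO $pdo, array $fragmentIds) {
    $in   = implode(',', array_fill(0, count($fragmentIds), '?'));
    $stmt = $pdo->prepare("SELECT checksum FROM fragments WHERE id IN ($in) ORDER BY id");
    $stmt->execute($fragmentIds);
    return '"' . md5(implode('|', $stmt->fetchAll(PDO::FETCH_COLUMN))) . '"';
}
?>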

JSONP and Cross-Domain queries - How to Update/Manipulate instead of just read

So I'm reading The Art & Science of Javascript, which is a good book, and it has a good section on JSONP. I've been reading all I can about it today, and even looking through every question here on StackOverflow. JSONP is a great idea, but it only seems to resolve the "Same Origin Problem" for getting data, but doesn't address it for changing data.
Did I just miss all the blogs that talked about this, or is JSONP not the solution I was hoping for?
JSONP results in a SCRIPT tag being generated that points at another server, with any parameters that might be required passed as a GET request, e.g.
<script src="http://myserver.com/getjson?customer=232&callback=jsonp543354" type="text/javascript">
</script>
There is technically nothing to stop this sort of request from altering data on the server, e.g. by specifying newName=Tony. Your response could then indicate whether the update succeeded or not. You will be limited by whatever you can fit on a query string. If you go with this approach, add some random element as a parameter so that proxies won't cache it.
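A server-side sketch of the endpoint behind such a SCRIPT tag (parameter names follow the example request above; the actual update is stubbed out, so treat this as an illustration rather than a recommended pattern):
<?php
$customer = isset($_GET['customer']) ? (int) $_GET['customer'] : 0;
$newName  = isset($_GET['newName'])  ? $_GET['newName']        : null;
$callback = isset($_GET['callback']) ? $_GET['callback']       : 'callback';
$callback = preg_replace('/[^\w.]/', '', $callback);   // don't echo back arbitrary JS

// ...perform the update here (e.g. an UPDATE against your customers table)...
$ok = ($customer > 0 && $newName !== null);             // stand-in for the real result

header('Content-Type: application/javascript');
header('Cache-Control: no-store');                      // belt and braces against caching
echo $callback . '(' . json_encode(array('success' => $ok)) . ');';
?>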
Some people may consider that this goes against the way GETs are supposed to work, i.e. they shouldn't cause data to change.
Yes, and honestly I would like to stick to that paradigm. However, I might bend the rule and say that, requests which do not alter/deal with CRUCIAL data will be accessible via GET calls... hm...
For instance, I am building a shopping cart system, and I think that allowing the adding/removing/etc of items to/from a cart could very easily be exposed via GETs, since even though you can change data, you cannot do anything critical with it. If someone maliciously added 1,000 flatscreen monitors to your shopping cart, there would be at least one verification step that would NOT be vulnerable to any attacks (a standard ASP.NET page at that point, with verification and all that jazz).
Is this a good/workable solution in anyones' opinion?
