OkHttpResponseCache returning garbaged text - okhttp

Has anyone used OkHttpResponseCache on an authenticated HTTPS connection successfully? I'm facing a strange behavior where, the second time content is served from the cache, it returns garbage instead of the correct content.
The service I'm using as an example is protected with HTTP basic auth and accessed over HTTPS. I added the must-revalidate response header to allow the cache to store the response. It uses ETags as a mechanism for cache validation.
It works perfectly for the first cached response:
1 - The first time I make a service call, the server returns 200 OK. Debugging the response cache source code, I could see the response being passed to the put() method (which stores it in the file store).
2 - The second request is made. The server is hit and returns a 304 response. Checking the cache headers and step debugging confirmed that the server did indeed return 304. One strange behavior though: on the second response (now served by the cache), the ETag header contains the value duplicated (instead of a single value, the list holds the same value twice).
3 - I make a third request. Same behavior as above, same "duplication" of the ETag value, but when I read the input stream, instead of the correct JSON text I get garbage (a bunch of black diamonds with a question mark inside).
I don't know if I'm doing something wrong when configuring the cache. I don't know if this is an encoding problem, or if the cache tried to update the file store and corrupted the input. I suspect that after the first cached response (the second attempt) the presence of a second ETag value invalidates the headers, so the cache attempts to store the response once again and ends up messing it up. I haven't been able to step debug through it to confirm this yet.
I tried to "translate" the garbage text with UTF-8, UTF-16, ASCII and ISO encodings, to no avail. I tried using HTTP instead of HTTPS and got the same behavior.
Has anyone encountered something similar and been able to solve it? Is it a bug in the cache, or could I be doing something wrong?

There was a bug in the way OkHttp updated the Content-Encoding: gzip header when doing conditional GETs. It's fixed in OkHttp 1.3.
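The answer does not spell out the mechanism, but the symptom in the question (black diamonds with a question mark) is what you typically see when a gzip-compressed body is decoded as if it were plain text. The Python sketch below only illustrates that failure mode. It is not OkHttp's code, and the header dictionaries and the read_body helper are assumptions made up for the example.

```python
import gzip

json_text = '{"status": "ok"}'
# What a cache might keep on disk when the server answered with gzip.
stored_body = gzip.compress(json_text.encode("utf-8"))


def read_body(headers, body):
    """Decode a stored body according to its Content-Encoding header."""
    if headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)
    # Invalid UTF-8 bytes become U+FFFD, the "black diamond" character.
    return body.decode("utf-8", errors="replace")


# Headers as stored with the first 200 response.
good_headers = {"Content-Encoding": "gzip", "ETag": '"abc"'}
# Hypothetical result of a faulty header update after a 304:
# the gzip marker is lost although the stored bytes are still compressed.
bad_headers = {"ETag": '"abc"'}

print(read_body(good_headers, stored_body))  # the original JSON text
print(read_body(bad_headers, stored_body))   # garbage with replacement characters
```

If the headers and the stored bytes ever disagree about the encoding, no choice of charset on the client side can recover the text, which matches the failed attempts with different encodings described in the question.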

Related

RefreshHit from cloudfront even with cache-control: max-age=0, no-store

Cloudfront is getting a RefreshHit for a request that is not supposed to be cached at all.
It shouldn't be cached because:
It has cache-control: max-age=0, no-store;
The Minimum TTL is 0; and
I've created multiple invalidations (on /*) so this cached resource isn't from some historical deploy
Any idea why I'm getting RefreshHits?
I also tried changing Cache-Control to cache-control: no-store, stale-if-error=0 and creating a new invalidation on /*, and now I'm seeing a cache hit (this time in Firefox).
After talking extensively with support, they explained what's going on.
So, if you have no-store and a Minimum TTL of 0, then CloudFront will indeed not store your resources. However, if your Origin is taking a long time to respond (so likely under heavy load), while CloudFront waits for the response to the request, if it gets another identical request (identical with respect to the cache key), then it'll send the one response to both requests. This is in order to lighten the load on the server. (see docs)
Support was calling these "collapse hits" although I don't see that in the docs.
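The behavior support described can be modeled in a few lines. The sketch below is a toy model of request collapsing, not CloudFront's implementation: the CollapsingProxy class, the fetch_origin callable and the collapsed counter are all hypothetical names. Nothing is stored once the origin has answered (as with no-store), yet requests that arrive while a fetch for the same cache key is still pending share its response.

```python
import threading
from concurrent.futures import Future


class CollapsingProxy:
    """Toy model of request collapsing; not CloudFront's actual implementation."""

    def __init__(self, fetch_origin):
        self._fetch_origin = fetch_origin  # hypothetical callable: cache_key -> response
        self._lock = threading.Lock()
        self._in_flight = {}               # cache_key -> Future of the pending origin fetch
        self.collapsed = 0                 # requests that reused another request's fetch

    def get(self, cache_key):
        with self._lock:
            future = self._in_flight.get(cache_key)
            is_leader = future is None
            if is_leader:
                # First request for this key: it will contact the origin.
                future = Future()
                self._in_flight[cache_key] = future
            else:
                # An identical request is already pending: wait for its response.
                self.collapsed += 1
        if is_leader:
            try:
                future.set_result(self._fetch_origin(cache_key))
            except Exception as exc:
                future.set_exception(exc)
            finally:
                # Nothing is kept after the response arrives (no-store).
                with self._lock:
                    del self._in_flight[cache_key]
        return future.result()
```

In this model a fast origin almost never produces a collapsed response, because the pending entry disappears as soon as the origin replies. A slow origin keeps the entry around longer, which is why the effect shows up under load.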
So, it seems you can't have a single Behavior serving some pages that must have a unique response per request while serving other pages that are cached. Support said:
I just confirmed that, with min TTL 0 and cache-control: no-store, we cannot disable collapse hit. If you do need to fully disable cloudfront cache, you can use cache policy CachingDisabled
We'll be making a behavior for every path prefix that we need caching on. It seems there was no better way than this for our use-case (transitioning our website one page at a time from a non-cacheable, backend-rendered jinja2/jQuery to a cacheable, client-side rendered React/Next.js).
It's probably too late for OP's project, but I would personally handle this with a simple origin-response Lambda@Edge function and a single cache behavior (for /*) with a single cache policy. You can write all of the filtering/caching logic in the origin-response function. That way you only manage one bit of function code in one place, instead of a bunch of individual cache behaviors (and possibly a bunch of cache policies).
For example, the origin-response function can look for a cache-control response header coming from your origin. If it exists, pass it back to the client. If it doesn't exist (or if you want to overwrite it with something else), you can create the response header there. The edge doesn't care whether the cache-control header came from your origin or from an origin-response Lambda. To the edge, it is all the same.
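A minimal sketch of such a function, assuming the Python Lambda runtime and the usual CloudFront origin-response event layout (lowercase header names mapping to lists of key/value pairs). The default value in DEFAULT_CACHE_CONTROL is a placeholder chosen for the example, not something the answer prescribes.

```python
# Placeholder default; pick whatever policy fits your pages (assumption).
DEFAULT_CACHE_CONTROL = "max-age=0, no-store"


def lambda_handler(event, context):
    """Origin-response handler: keep the origin's Cache-Control, else add a default."""
    response = event["Records"][0]["cf"]["response"]
    headers = response["headers"]

    # CloudFront keys the headers dict by the lowercase header name.
    if "cache-control" not in headers:
        headers["cache-control"] = [
            {"key": "Cache-Control", "value": DEFAULT_CACHE_CONTROL}
        ]

    # Returning the (possibly modified) response hands it back to the edge.
    return response
```

Path-specific rules could be added by inspecting event["Records"][0]["cf"]["request"]["uri"] before choosing the header value, which keeps all of the caching decisions in this one function.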
Another trick you can use to avoid caching while still using the default CloudFront behavior is to add a dummy, unused query parameter whose value is unique for each request.
Python example:
import requests
import uuid
# The "nocache" parameter is ignored by the server; it only makes each URL unique.
requests.get(f'http://my-test-server-x.com/my/path?nocache={uuid.uuid4()}')
requests.get(f'http://my-test-server-x.com/my/path?nocache={uuid.uuid4()}')
Note that both calls will reach the destination and will not be answered from the cache, since uuid.uuid4() generates a different value each time.
This works because, by default (unless configured otherwise in the Behavior section), the query parameters are part of the cache key.
Note: Doing so bypasses the cache entirely, so every request reaches your backend and may increase its load.

HTTP GET vs POST for Idempotent Reporting

I'm building a web-based reporting tool that queries but does not change large amounts of data.
In order to verify the reporting query, I am using a form for input validation.
I know the following about HTTP GET:
It should be used for idempotent requests
Repeated requests may be cached by the browser
What about the following situations?
The data being reported changes every minute and must not be cached?
The query string is very large and greater than the 2000 character URL limit?
I know I can easily just use POST and "break the rules", but are there definitive situations in which POST is recommended for idempotent requests?
Also, I'm submitting the form via AJAX and the framework is Python/Django, but I don't think that should change anything.
I think that using POST for this sort of situation is acceptable. Citing the HTTP 1.1 RFC:
The action performed by the POST method might not result in a
resource that can be identified by a URI. In this case, either 200
(OK) or 204 (No Content) is the appropriate response status,
depending on whether or not the response includes an entity that
describes the result.
In your case, a "search result" resource is created on the server, which fits the HTTP POST specification. You can either return the result resource in the response itself or return a separate URI pointing to the newly created resource, which may be deleted once it is no longer needed, i.e. after one minute (as you said, the data changes every minute).
The data being reported changes every minute
Based on that statement, every request you make creates a new resource.
Additionally, you can return a 201 status and a URL for retrieving the search result resource. I'm not sure you want this sort of behavior; I mention it only as a side note.
The second part of your first question says the results must not be cached. That is something you configure on the server: return the HTTP headers needed to stop intermediary proxies and clients from caching the result, for example Cache-Control and Expires.
Your second question is already answered: you have to use a POST request instead of a GET request because of the URL length limit.
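As a rough client-side sketch of the points above, the helper below falls back from GET to POST when the encoded query would exceed the roughly 2000-character limit mentioned in the question. The build_report_request function, the JSON body format and the example.com URL are assumptions for illustration. NO_CACHE_HEADERS shows the kind of response headers a server could send, not a specific Django API.

```python
import json
import urllib.parse
import urllib.request

URL_LENGTH_LIMIT = 2000  # approximate limit quoted in the question

# Headers a server could attach to report responses so they are not cached.
NO_CACHE_HEADERS = {"Cache-Control": "no-store, max-age=0"}


def build_report_request(base_url, params):
    """Use GET when the URL stays short, otherwise send the query in a POST body."""
    url = base_url + "?" + urllib.parse.urlencode(params)
    if len(url) <= URL_LENGTH_LIMIT:
        return urllib.request.Request(url, method="GET")
    # Too long for a URL: move the parameters into a JSON request body.
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(
        base_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_report_request("https://example.com/report", {"ids": "x" * 3000})
print(request.get_method())  # POST, because the query does not fit in the URL
```

The request object is only built here, not sent. Passing it to urllib.request.urlopen would perform the call.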

Cache Policy - caching only if request succeeded

I have enabled cache policies on a few resource endpoints. The system works quite well: the response is cached, subsequent requests hit the cache, and the cache is refreshed correctly when I set it to be refreshed.
My only concern is that sometimes a client makes a request that does not hit the cache (for example, because the cache must be refreshed) and the server returns an error at that moment (it can happen, statistically), so the cached response is not a "normal" response (e.g. 2xx) but a 4xx or 5xx response.
I would like to know if it is possible to cache the response only if, for example, the server response code is 2xx.
I didn't find any example of this in the Apigee docs, although the cache policy has a parameter called "SkipCachePopulation" that I think I could use for this purpose.
Any suggestion?
Yes, you can use the SkipCachePopulation field of ResponseCache. It uses a condition to determine when the cache population will not occur. Here is an example:
<SkipCachePopulation>response.status.code >= 400</SkipCachePopulation>

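The condition in the answer reads as "do not populate the cache when the status code is 400 or higher". The Python sketch below illustrates that rule outside Apigee. It is not Apigee code, and the StatusAwareCache class and its fetch callable are made-up names for the example.

```python
class StatusAwareCache:
    """Caches responses, but skips population for error statuses."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical callable: key -> (status_code, body)
        self._store = {}

    def get(self, key):
        if key in self._store:
            return self._store[key]
        status, body = self._fetch(key)
        # Same condition as in the policy: skip population when status >= 400.
        skip_population = status >= 400
        if not skip_population:
            self._store[key] = (status, body)
        return status, body
```

With this rule an error is still returned to the client that triggered it, but the next request goes back to the server instead of receiving the stored error.
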
How does adding a random number to the end of an AJAX server request prevent caching?

How exactly does adding a random number to the end of an AJAX server call prevent the database server or browser (not entirely sure which one is intended) from caching? Why does this work?
It is intended to prevent client-side (or reverse proxy) caching.
Since the cache will be keyed on the exact request, by adding a random element to the request, the exact request URL should never be seen twice; so it won't be used more than once, and an intelligent cache won't bother keeping around something that's never been seen more than once, at least, not for long.
It's to prevent your browser (and, to a reasonable extent, a web proxy) from caching requests. Typically, a query parameter such as ?rand=2024 is assumed to tell your application to behave differently, so the browser/proxy sends the request onward instead of reusing a stored response. That's why such parameters are useful for busting caches.
Your browser caches the web page keyed by the exact text of the URL, so adding a random-number parameter ensures that the URL is different every time - thus no real caching. Your browser doesn't know that the server is (hopefully) ignoring this parameter.
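The mechanism described in these answers can be shown with a toy cache keyed on the exact URL. This is a simplified model, not how any particular browser or proxy is implemented. The UrlKeyedCache class, the bust helper and the "_" parameter name are assumptions for the example.

```python
import uuid
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


class UrlKeyedCache:
    """Toy cache that, like a browser cache, is keyed on the exact URL text."""

    def __init__(self, fetch):
        self._fetch = fetch   # hypothetical callable: url -> response body
        self._store = {}
        self.origin_hits = 0  # how many requests actually reached the server

    def get(self, url):
        if url not in self._store:
            self.origin_hits += 1
            self._store[url] = self._fetch(url)
        return self._store[url]


def bust(url):
    """Append a unique dummy parameter so the URL has never been seen before."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("_", uuid.uuid4().hex))
    return urlunsplit(parts._replace(query=urlencode(query)))


cache = UrlKeyedCache(lambda url: "payload")
cache.get("https://example.com/data?id=1")
cache.get("https://example.com/data?id=1")        # served from the cache
cache.get(bust("https://example.com/data?id=1"))  # new key, goes to the server
print(cache.origin_hits)  # 2
```

The server is free to ignore the extra parameter. The cache cannot know that, so it treats every busted URL as a different resource.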

What are the advantages of using a GET request over a POST request?

Several of my AJAX applications in the past have used GET requests, but now I'm starting to use POST requests instead. POST requests seem to be slightly more secure and definitely more URL friendly/pretty. Thus, I'm wondering if there is any reason why I should use GET requests at all.
I generally frame the question thus: Does anything important change after the request? (Logging and the like notwithstanding.) If it does, it should be a POST request; if it doesn't, it should be a GET request.
I'm glad that you call POST requests "slightly" more secure, because that's pretty much what they are; it's trivial to fake a POST request by a user to a page. Making it a POST request, however, prevents web accelerators or reloads from re-triggering the action accidentally.
With AJAX, there is one more consideration: if you are returning JSON with callback support, be very careful not to include any sensitive data that you don't want other websites to be able to see. Wikipedia had a vulnerability along these lines where the user's anti-CSRF token was revealed via their JSON API.
All good points, however, in answer to the question, GET requests are more useful in certain scenarios over POST requests:
They can be bookmarked
They can be cached
They're faster
They have known consequences (assuming they don't change data), so visiting them multiple times is not a problem.
For the sake of posterity, updating this comment with the blog notes re: point #3 here, all credit to Omar AL Zabir (the author of the referenced blog post):
"Atlas by default makes HTTP POST for all AJAX calls. Http POST is
more expensive than Http GET. It transmits more bytes over the wire,
thus taking precious network time and it also makes ASP.NET do extra
processing on the server end. So, you should use Http Get as much as
possible. However, Http Get does not allow you to pass objects as
parameters. You can pass numeric, string and date only. When you make
a Http Get call, Atlas builds an encoded url and makes a hit to that
url. So, you must not pass too much content which makes the url become
larger than 2048 chars. As far as I know, that’s what is the max
length of any url.
Another evil thing about http post is, it’s actually 2 calls. First
browser sends the http post headers and server replies with “HTTP 100
Continue”. When browser receives this, it sends the actual body."
You should use GET where you're doing a request which has no side effects, e.g. just fetching some info. This request can:
Be repeated without any problem - if the browser detects an error it can silently retry
Have its result cached by the browser
Be cached by a proxy
These things are all good. Anything which is only retrieving data (particularly public data) should really be a GET. The server should send sensible Last-Modified: and Expires: headers to allow caching if required.
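As a small illustration of the headers mentioned above, the sketch below builds Last-Modified and Expires values in the HTTP date format using only the Python standard library. The caching_headers function and its parameters are hypothetical names; how the headers are attached to a response depends on your framework.

```python
import time
from email.utils import formatdate


def caching_headers(last_modified_ts, max_age_seconds, now=None):
    """Build Last-Modified and Expires headers from Unix timestamps."""
    now = time.time() if now is None else now
    return {
        # usegmt=True produces the "GMT" suffix that HTTP dates require.
        "Last-Modified": formatdate(last_modified_ts, usegmt=True),
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
    }


# Example: a resource last changed at the Unix epoch, cacheable for one hour.
print(caching_headers(last_modified_ts=0, max_age_seconds=3600, now=0))
```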
There is one other difference not mentioned by anyone.
GET requests are passed in the URL string and are therefore subject to a length limit usually dependent on the browser. It seems that most are around 2000 chars.
POST requests can be much much larger - in fact not limited really. So if you're needing to request data from a web server and you're passing in lots of parameter information then a POST request might be the only option.
So, as mentioned before really a GET request is for requesting data (no side effects) while a POST request is generally used for transmitting data back to the server to be stored (with side effects). e.g. Use POST to upload a file. GET to retrieve a file.
There was a time when IE, I believe, had a very short limit on the GET URL string. Some applications like Lotus Notes use large numbers of random characters to represent document IDs. I had the displeasure of using another product that generated random strings so the page URL was unique each time. The random string was HUGE... and, from memory, it didn't always work with IE6.
This might help you to decide where to use GET and where to use POST:
URIs, Addressability, and the use of HTTP GET and POST.
POST requests are just as insecure as GETs. The main difference is that POST is used to modify the state of the server application, while GET only requests data from it.
The difference matters when you use clean, "restful" URLs, where the URL itself specifies the resource, and the different methods trigger different actions on the server side.
Perhaps most importantly, GET is book-markable / viewable in url history, and searchable with Google.
POST is important where you don't want the event to be bookmarkable or able to be typed in as a URL - otherwise you (or Google crawling your URLS) could end up accidentally doing things like deleting users from your system, for example.
| GET | POST |
| --- | --- |
| Values are visible in the URL. | Values are not visible in the URL. |
| Has a limitation on the length of the values, generally 255 characters. | Has no limitation on the length of the values, since they are submitted in the body of the HTTP request. |
| Performs better than POST because the values are simply appended to the URL. | Performs worse than GET because of the time spent including the values in the HTTP body. |
| Supports only string data types. | Supports different data types, such as string, numeric, binary, etc. |
| Results can be bookmarked. | Results cannot be bookmarked. |
| Requests are often cacheable. | Requests are hardly cacheable. |
| Parameters remain in web browser history. | Parameters are not saved in web browser history. |
Source and more in depth analysis: https://www.guru99.com/difference-get-post-http.html
