Geo location / filtering and HTTP Caching - asp.net-web-api

I'm trying to add cache support (both HTTP and server) for a ASP.NET Web Api solution.
The solution is geo located, meaning that I can get different results based on the caller IP address.
The question can be trivially solved for the server side cache, using an approach similar to VaryByCustom (like this one). However that does not solve the problem with the client side HTTP caches. Here are the alternatives
I'm considering the following options:
Enforcing a must-revalidate in the cache
Keep the validation server side using the same algorithm to VaryByCustom, but include the extra cache revalidate calls on the server side with ETAGS or any mechanism that keep track of the originally cached value country of origin.
Creating country specific routes HTTP 302
In this scenario an application invoking
http://site/UK/content
Redirects to US version if originating from an US IP address when the cache has expired
http://site/US/content
It might present out-of-date contents that do not match the IP of origin local. That is not a serious problem if the cache expires is a small value (< 1 hour), since country changes are fairly uncommon.
What is the recommended solution?

I'm not sure I understand the problem.
For client caching, if you enable private caching then a user in UK will cache the UK version of http://site/content and the US user will cache the US version of http://site/content.
The only problem I can see is if a user travels from the US to the UK and accesses the content. Or if you allow public caching and some intermediary is shared by US and UK users.

After detailed evaluation first approach was chosen. Actual implementation is:
Create a cache key that depends on the country of origin IP address
Create a ETag for that cache key and store it in Server cache
Additional requests that include ETag If-None-Match header are evaluates in server for cache freshness:
If the country of origin is the same, the cache key will be the same and ETag is valid, returning a HTTP 304 not modified
If the country of origin is different, cache key will be different and such the ETag is not valid, returning a HTTP 200 and returning a new ETag.
Agree with Poul-Henning Kamp geolocation should be a transport level thing, but unfortunately is not, so this is the only way we could come up with to ensure cache freshness for a given country.
The disadvantage is that cannot have any infrastructure cache, e.g., all requests need to check the server for cache freshness.

Related

Caching with SSL certification

I read if the request is authenticated or secure, it won't be cached. We previously worked on our cache and now planning to purchase a SSL certificate.
If caching cannot be done with SSL connection then is that mean our work on caching is useless?
Reference: http://www.mnot.net/cache_docs/
Your reference is wrong. Content sent over https will be cached in modern browsers, but they obviously cannot be cached in intermediate proxies. See http://arstechnica.com/business/2011/03/https-is-great-here-is-why-everyone-needs-to-use-it-so-ars-can-too/ or https://blog.httpwatch.com/2011/01/28/top-7-myths-about-https/ for example.
You can use the Cache-Control: public header to allow a representation served over HTTPS to be cached.
While the document you refer to says "If the request is authenticated or secure (i.e., HTTPS), it won’t be cached.", it's within a paragraph starting with "Generally speaking, these are the most common rules that are followed [...]".
The same document goes into more details after this:
Useful Cache-Control response headers include:
public — marks authenticated responses as cacheable; normally, if HTTP authentication is required, responses are automatically private.
(What applies to HTTP with authentication also applies to HTTPS.)
Obviously, documents that actually contain sensitive information only aimed for the authenticated user should not be served with this header, since they really shouldn't be cached. However, using this header for items that are suitable for caching (e.g. common images and scripts) should improve the performance of your website (as expected for caching over plain HTTP).
What will never happen with HTTPS is the caching of resources by intermediate proxy servers (between the client and your web-server, at least the external part, if you have a load-balancer or similar). Some CDNs will serve content over HTTPS (assuming it's suitable for your system to trust these CDNs). In general, these proxy servers wouldn't fall under the control of your cache design anyway.

How to invalidate the cache of an arbitrary URL?

According to http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.10 clients must invalidate the cache associated with a URL after a POST, PUT, or DELETE request.
Is it possible to instruct a web browser to invalidate the cache of an arbitrary URL, without making an HTTP request to it?
For example:
PUT /companies/Nintendo creates a new company called "Nintendo"
GET /companies lists all companies
Every time I create a new company, I want to invalidate the cache associated with GET /companies. The browser doesn't do this automatically because the two operate on different URLs.
Is the Cache-Control mechanism inappropriate for this situation? Should I use no-cache along with ETag instead? What is the best-practice for this situation?
I know I can pass no-cache the next time I GET /companies but that requires the application to keep track URL invalidation instead of pushing the responsibility to the browser. Meaning, I want to invalidate the URL after step 1 as opposed to having to persist this information and applying it at step 2. Any ideas?
Yes, you can (within the same domain). From this answer (slightly paraphrased):
In response to a PUT or POST request, if the Content-Location header URI is different from the request URI, then the cache for the Content-Location URI is invalidated.
So in your case, include a Content-Location: /companies header in response to your POST request. This will invalidate the browser's cached version of /companies.
Note that this does not work for GET requests.
No, in HTTP/1.1 you may only invalidate a client's cache for a resource in a response to a request for that resource. It may be in response to a PUT, POST or DELETE rather than a GET (see RFC 7234, section 4.4 for details).
If you have a resource where you need clients to confirm that they have the latest version then no-cache and an entity tag is an ideal solution.
HTTP/2 allows for pushing a cache clear (Nine Things to Expect from HTTP/2 4. Cache Pushing).
In the link which you have given "the phrase "invalidate an entity" means that the cache will either remove all instances of that entity from its storage, or will mark these as "invalid" and in need of a mandatory revalidation before they can be returned in response to a subsequent request.". Now the question is where are the caches? I believe the Cache the article is talking about is the server cache.
I have worked on a project in VC++ where whenever a model changes the cache is updated. There is a programming logic implemention involved to achieve this. Your mentioned article rightly says "There is no way for the HTTP protocol to guarantee that all such cache entries are marked invalid" HTTP Protocol cannot invalidate cache on its own.
In our project example we used publish subscribe mechanism. Wheneven an Object of class A is updated/inserted it is published to a bus. The controllers register to listen to objects on the Bus. Suppose A Controller is interested in Object A changes, it will not be called back whenever Object Type B is changed and published. When Object Type A indeed is changed and published then Controller A Listener function updates the Cache with latest changes of Object A. The subsequent request of GET /companies will get the latest from the cache. Now there is a time gap between changing the object A and the Cache being refreshed with the latest changes. To avoid something wrong happening in this time gap Object is marked dirty before the Object A Changes. So a request coming inbetween of these times will wait for dirty flag being cleared.
There is also a browser cache. I remember ETAGS are used to validate this. ETAG is the checksum of the resource. For this Client should maintain old ETAG value somehow. If the checksum of resource has changed then the new resource with HTTP 200 is sent else HTTP 304 (use local copy) is sent.
[Update]
PUT /companies/Nintendo
GET /companies
are two different resources. Your the cache for /companies/Nintendo is only expected to be updated and not /companies (I am talking of client side cache) when PUT /companies/Nintendo request is executed. Suppose you call GET /companies/Nintendo next time, based on http headers the response is returned. GET /companies is a brand new request as it points to different resource.
Now question is what should be the http headers? It is purely application specific. Suppose it is stock quote I would not cache. Suppose it is NEWS item I would cache for certain time. Your reference link http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html has all the details of Cache http headers. Only thing not mentioned much is ETag usage. ETag can have checksum of resource. Check http://en.wikipedia.org/wiki/HTTP_ETag and also check https://devcenter.heroku.com/articles/increasing-application-performance-with-http-cache-headers

Amazon Cloudfront: private content but maximise local browser caching

For JPEG image delivery in my web app, I am considering using Amazon S3 (or Amazon Cloudfront
if it turns out to be the better option) but have two, possibly opposing,
requirements:
The images are private content; I want to use signed URLs with short expiration times.
The images are large; I want them cached long-term by the users' browser.
The approach I'm thinking is:
User requests www.myserver.com/the_image
Logic on my server determines the user is allowed to view the image. If they are allowed...
Redirect the browser (is HTTP 307 best ?) to a signed Cloudfront URL
Signed Cloudfront URL expires in 60 seconds but its response includes "Cache-Control max-age=31536000, private"
The problem I forsee is that the next time the page loads, the browser will be looking for
www.myserver.com/the_image but its cache will be for the signed Cloudfront URL. My server
will return a different signed Cloudfront URL the second time, due to very short
expiration times, so the browser won't know it can use its cache.
Is there a way round this without having my webserver proxy the image from Cloudfront (which obviously negates all the
benefits of using Cloudfront)?
Wondering if there may be something I could do with etag and HTTP 304 but can't quite join the dots...
To summarize, you have private images you'd like to serve through Amazon Cloudfront via signed urls with a very short expiration. However, while access by a particular url may be time limited, it is desirable that the client serve the image from cache on subsequent requests even after the url expiration.
Regardless of how the client arrives at the cloudfront url (directly or via some server redirect), the client cache of the image will only be associated with the particular url that was used to request the image (and not any other url).
For example, suppose your signed url is the following (expiry timestamp shortened for example purposes):
http://[domain].cloudfront.net/image.jpg?Expires=1000&Signature=[Signature]
If you'd like the client to benefit from caching, you have to send it to the same url. You cannot, for example, direct the client to the following url and expect the client to use a cached response from the first url:
http://[domain].cloudfront.net/image.jpg?Expires=5000&Signature=[Signature]
There are currently no cache control mechanisms to get around this, including ETag, Vary, etc. The nature of client caching on the web is that a resource in cache is associated with a url, and the purpose of the other mechanisms is to help the client determine when its cached version of a resource identified by a particular url is still fresh.
You're therefore stuck in a situation where, to benefit from a cached response, you have to send the client to the same url as the first request. There are potential ways to accomplish this (cookies, local storage, server scripting, etc.), and let's suppose that you have implemented one.
You next have to consider that caching is only just a suggestion and even then it isn't a guarantee. If you expect the client to have the image cached and serve it the original url to benefit from that caching, you run the risk of a cache miss. In the case of a cache miss after the url expiry time, the original url is no longer valid. The client is then left unable to display the image (from the cache or from the provided url).
The behavior you're looking for simply cannot be provided by conventional caching when the expiry time is in the url.
Since the desired behavior cannot be achieved, you might consider your next best options, each of which will require giving up on one aspect of your requirement. In the order I would consider them:
If you give up short expiry times, you could use longer expiry times and rotate urls. For example, you might set the url expiry to midnight and then serve that same url for all requests that day. Your client will benefit from caching for the day, which is likely better than none at all. Obvious disadvantage is that your urls are valid longer.
If you give up content delivery, you could serve the images from a server which checks for access with each request. Clients will be able to cache the resource for as long as you want, which may be better than content delivery depending on the frequency of cache hits. A variation of this is to trade Amazon CloudFront for another provider, since there may be other content delivery networks which support this behavior (although I don't know of any). The loss of the content delivery network may be a disadvantage or may not matter much depending on your specific visitors.
If you give up the simplicity of a single static HTTP request, you could use client side scripting to determine the request(s) that should be made. For example, in javascript you could attempt to retrieve the resource using the original url (to benefit from caching), and if it fails (due to a cache miss and lapsed expiry) request a new url to use for the resource. A variation of this is to use some caching mechanism other than the browser cache, such as local storage. The disadvantage here is increased complexity and compromised ability for the browser to prefetch.
Save a list of user+image+expiration time -> cloudfront links. If a user has an non-expired cloudfront link use it for an image and don't generate a new one.
It seems you already solved the issue. You said that your server is issuing a redirect http 307 to the cloudfront URL (signed URL) so the browser caches only the cloudfront URL not your URL(www.myserver.com/the_image). So the scenario is as follows :
Client 1 checks www.myserver.com/the_image -> is redirect to CloudFront URL -> content is cached
The CloudFront url now expires.
Client 1 checks again www.myserver.com/the_image -> is redirected to the same CloudFront URL-> retrieves the content from cache without to fetch again the cloudfront content.
Client 2 checks www.myserver.com/the_image -> is redirected to CloudFront URL which denies its accesss because the signature expired.

When does a browser send a conditional get

My understanding is a browser sends a conditional get if it is not sure if the compoonent it has is up to date. The question is what defines "not sure". I presume it varys on browser and maybe other conditions. I also presume it's not something you can control, i.e. I can do anything to make browser change the not sure criteria. I can't set something in the way I can set an expires header to what I want on a Http server. Is this correct?
Note:P if you can answer this question with just areally good link that's fine. I couldn't find one.
The HTTP has an expiration model. It defines how servers can specify their responses to expire, and how the age and freshness of a response can be determined by caches. Additionally to that, there are further Cache-Control directives that can modify the behavior for how responses are to be handled dependent or independent of their freshness.
To conclude, HTTP caching is quite complex and the actual behavior depends on multiple factors:
The cache-control directives can be broken down into these general categories:
Restrictions on what are cacheable; these may only be imposed by the origin server.
Restrictions on what may be stored by a cache; these may be imposed by either the origin server or the user agent.
Modifications of the basic expiration mechanism; these may be imposed by either the origin server or the user agent.
Controls over cache revalidation and reload; these may only be imposed by a user agent.
Control over transformation of entities.
But in the end, it all depends on the user agent’s obedience of these rules.

How to detect whether web page content is different from cached version

Hi guys
As you know checking process of web pages content is a little different from static pages or personal files on our machines because content of Dynamic web pages are changed on each request. So if we are going to use checksums to identifying changes, We'll fail! very simple example is when site owner are use Google Ads on him website; on each request Ads are different from previous. Also if we are going to cache only on period time, also We'll fail, because some web pages aren't updated every years but some are every minutes (if not seconds).
So what is better approach to solve this issue? (Thanks)
UPDATE
Another option is use of LastModified http-header! but this is strong approach?
Browsers do this automatically with the help of the several caching mechanisms that HTTP provides. The two mechanisms most obviously useful for determining whether a page has changed is the concept of Entity Tags and the Last Modified HTTP header. These mechanisms allow the browser to make conditional requests to a web site, eg. fetch a page only if it has been changed.
Quoting RFC 2616 on HTTP 1.1:
3.11 Entity Tags
Entity tags are used for comparing two or more entities from the same
requested resource. HTTP/1.1 uses entity tags in the ETag (section
14.19), If-Match (section 14.24), If-None-Match (section 14.26), and
If-Range (section 14.27) header fields. The definition of how they
are used and compared as cache validators is in section 13.3.3. An
entity tag consists of an opaque quoted string, possibly prefixed by
a weakness indicator.
The key point here is that the ETag is a cache validator. If a browser has a cached version of a page (called a resource in the RFC), it can use the ETag to determine whether the cached page is still valid, ie. if the page hasn't changed on the server.
And about the modification date:
14.25 If-Modified-Since
The If-Modified-Since request-header field is used with a method to
make it conditional: if the requested variant has not been modified
since the time specified in this field, an entity will not be
returned from the server; instead, a 304 (not modified) response will
be returned without any message-body.
The key point here is that the server may know when a page has been modified, and may then inform the client.
If you open a HTTP monitor (such as Fiddler for Windows) and watch your browser communicate with web sites, you'll see the use of these mechanisms first-hand when the browser makes conditional requests.
To specifically address your question about the Last Modified header, this header in itself won't work for the majority of pages you'll find. But in combination with the ETag it can get you started.

Resources