Why doesn't CloudFront GZIP my HTML/JSP pages - caching

I've configured CloudFront in front of my Elastic Beanstalk/load-balanced web application, and the static content rules (PNG images etc.) are being cached and served gzipped.
However, my JSP pages aren't being gzipped.
Please note that I have explicitly set my default rule to not cache by setting the min TTL to 0, but that's probably unnecessary because my origin server isn't returning a Content-Length header for JSP pages, so they will never be cached anyway.
CloudFront will only cache if...
Filetype is supported (text/html is)
Response is between 1,000 and 10,000,000 bytes (it is)
Content-Length header must be provided (it is NOT)
Content-Encoding must not be set (it is not)
So that explains why it's not being cached, fair enough.
But why don't my HTML pages get gzipped? FYI, my HTML and JSP file extensions are all processed through the JSP processor.

Looks like I was right: until my page was modified to return the Content-Length response header, CloudFront neither cached nor gzipped the content.

Related

Google Cloud CDN "Force Cache All Content" does NOT cache all content

I am using Google Cloud CDN for my WordPress website https://cdn.datanumen.com. I have enabled the "Force Cache All Content" option. However, the web pages, CSS files, and JavaScript files are still not cached. Only the images are cached.
For example, I test the page at https://cdn.datanumen.com/. I have used Ctrl + F5 to refresh the webpage many times, but I always get the same result.
When I load the web page, there is a "Cache-Control" field in the response header, but no "Age" field. Based on Google's documentation, if the cache hits and cached content is served, there will be an "Age" field. So the absence of "Age" means the file is not cached.
I also checked the log:
In the log, cacheFillBytes is 26776 and cacheLookup is true. It seems that Google CDN is trying to look up the cache and fill it with the content. But statusDetails shows "response_sent_by_backend", so the content is still served from the backend. Normally this should only happen the first time I visit the website, but in my case, even if I press Ctrl + F5 to refresh the website many times, I always get the same result; statusDetails never shows "response_sent_by_cache" for a page such as https://cdn.datanumen.com/.
Why?
Update:
I noticed there is a "Vary" field in the response header.
Based on https://cloud.google.com/cdn/docs/caching#non-cacheable_content, if the Vary header has a value other than Accept, Accept-Encoding, or Origin, the content will not be cached. Since in my case the "Vary" header is "Accept-Encoding,Cookie,User-Agent", it is not cached. But my question is: how do I deal with this issue and force the content to be cached?
Update 2
I have changed the site to a real WordPress site, since that is what I ultimately need. I plan to use Google Cloud CDN's paid support to see if they can help with this case.
According to Google Cloud CDN's documentation, the best way to solve your problem is actually to use the CACHE_ALL_STATIC cache mode:
CACHE_ALL_STATIC: Automatically caches static content that doesn't have the no-store or private directive. Origin responses that set valid caching directives are also cached. This is the default behavior for Cloud CDN-enabled backends created by using the gcloud command-line tool or the REST API.
USE_ORIGIN_HEADERS: Requires origin responses to set valid cache directives and valid caching headers. Responses without these directives are forwarded from the origin.
FORCE_CACHE_ALL: Unconditionally caches responses, overriding any cache directives set by the origin. This mode is not appropriate if the backend serves private, per-user content, such as dynamic HTML or API responses.
But in the case of the last cache mode, there are two warnings about its usage:
When you set the cache mode to FORCE_CACHE_ALL, the default time to live (TTL) for content caching is 3600 seconds (1 hour), unless you explicitly set a different TTL. Accepting the new default TTL of 1 hour might cause some entries that were previously considered fresh (due to having longer TTLs from origin headers) to now be considered stale.
The FORCE_CACHE_ALL mode overrides cache directives (Cache-Control and Expires) but does not override other origin response headers. In particular, a Vary header is still honored, and may suppress caching even in the presence of FORCE_CACHE_ALL. For more information, see Vary headers.
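Whether you can stop the origin from sending those extra Vary values depends on what adds them. As a heavily hedged sketch only (it assumes an Nginx + PHP-FPM origin, which the question doesn't state, and an illustrative socket path), the upstream Vary header could be stripped and replaced with a value Cloud CDN will cache against:

    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_pass unix:/run/php/php-fpm.sock;   # socket path illustrative
        fastcgi_hide_header Vary;                  # drop the Vary header WordPress/plugins emit
        add_header Vary "Accept-Encoding";         # re-add a value Cloud CDN treats as cacheable
    }

Note this is only safe if the HTML really is identical for every visitor; dropping Vary: Cookie on per-user content would let cached logged-in pages be served to anonymous users.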

Cache-Control Headers not respected on CloudFlare

I am trying to get some HTML pages to be cached, the same way images are automatically cached via CloudFlare, but I can't get CloudFlare to actually hit its cache for HTML.
According to the documentation (Ref: https://support.cloudflare.com/hc/en-us/articles/202775670-How-Do-I-Tell-CloudFlare-What-to-Cache-), it's possible to cache anything with Cache-Control set to public and a max-age greater than 0.
I've tried various combinations of headers on my origin Nginx server, from a simple Cache-Control: public, max-age=31536000 to more complex headers including s-maxage=31536000, Pragma: public, ETag: "569ff137-6", and Expires: Thu, 31 Dec 2037 23:55:55 GMT, without any results.
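For reference, the kind of Nginx location block that produces those headers looks roughly like this (a sketch only; the location pattern and values are illustrative):

    location ~* \.html$ {
        add_header Cache-Control "public, max-age=31536000, s-maxage=31536000";
        add_header Pragma "public";
        # "expires max;" would emit the far-future Expires header as well,
        # but note it also appends its own Cache-Control max-age directive.
    }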
Any ideas to force CloudFlare to serve the html pages from their cache?
PS: I am getting CF-Cache-Status: HIT on the images and it works fine, but on the HTML pages there's nothing, not even a CF-Cache-Status header. With a CloudFlare page rule for HTML pages it seems to work fine, but I want to avoid using one, mainly because it's too CloudFlare-specific. I am not serving cookies or anything dynamic from these pages.
It is now possible to get Cloudflare to respect your web server's headers instead of overriding them with the minimum described in the Browser Cache TTL setting.
First, navigate to the Caching tab in the Cloudflare dashboard.
Then scroll down to the "Browser Cache Expiration" setting and select the "Respect Existing Headers" option in the dropdown.
Further reading:
Does CloudFlare honor my Expires and Cache-Control headers for static content?
Caching Anonymous Page Views
How do I cache static HTML?
Note: If this setting isn't chosen, Cloudflare will apply a default 4-hour minimum to Cache-Control headers. Once this setting is set, Cloudflare will not touch your Cache-Control headers (even if they're low or not set at all).
I stumbled on this too. The page says:
Pro Tip: Sending cache directives from your origin for resources with extensions we don't cache by default will make absolutely no difference. To specify the caching duration at your origin for the extensions we don't cache by default, you'd have to create a Page Rule to "Cache Everything".
So it appears that you do have to set a page rule to use this for files that CloudFlare doesn't cache by default. This page describes it in more detail:
https://blog.cloudflare.com/edge-cache-expire-ttl-easiest-way-to-override/
That said, it still didn't work for me and appeared not to be supported. After contacting their support, they confirmed this: Respect Origin Header has been removed from all plan types. So if you have no page rules, they will respect the origin header.
This doesn't help with hitting their edge cache for HTML pages, however. For that you have to set up a page rule. Once that is done you can, I believe, set your max-age as low as your plan allows; any lower and it gets overwritten. That is to say, with no page rule you could send Cache-Control: max-age=30 and it would pass through. With a page rule that includes edge caching, your max-age becomes subject to the minimum time your plan allows, even if the page rule doesn't specify browser cache.
The CF documentation is very unclear. Go into "Page Rules" and define a rule that turns on caching based upon wildcards, and then it will work.

No Content-Length Header for App

I have a newsstand application that uses a bar to show download progress. It works by reading the Content-Length from the file download. This used to work on our development server; however, we use an Nginx server for production and it doesn't seem to be returning a Content-Length header.
Does anyone know why this would be or a better solution?
Thanks
The lack of a Content-Length header is likely caused by having compression enabled on your live server but not on your dev server. Because Nginx compresses data as it's sent, it's not possible to send a Content-Length header at the start of the response, as the server can't know what size the data will be after compression.
If you require a Content-Length header for download progress, the best option is to compress the content yourself, set the Content-Length header to the size of the compressed data, and then serve the compressed data.
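With Nginx the usual way to do that is the gzip_static module: pre-compress the files on disk and let Nginx serve the .gz variant, which is an ordinary static file and therefore gets a real Content-Length. A minimal sketch, assuming the build includes ngx_http_gzip_static_module (standard in most distribution packages) and that /downloads/ is the path in question:

    location /downloads/ {
        gzip off;         # no on-the-fly compression, which is what drops Content-Length
        gzip_static on;   # serve a pre-built file.ext.gz, with its exact size, when the client accepts gzip
    }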
Although this will be slightly slower for the first user to download that piece of content, you can use it as an effective caching mechanism if you use unique filenames for the compressed files, with the filename generated from the parameters in the user's request. You can also then use Nginx's x-sendfile ability to reduce the load on your server.
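The Nginx side of the x-sendfile approach is X-Accel-Redirect: the application checks the request, generates or locates the compressed file, and responds with an X-Accel-Redirect header pointing at an internal location, which Nginx then serves itself. A rough sketch (location name and path are illustrative):

    location /precompressed/ {
        internal;                          # only reachable via X-Accel-Redirect from the application
        alias /var/cache/precompressed/;   # where the pre-gzipped files are written
    }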
By the way, if you're using the Amazon CloudFront CDN (and probably others), you really ought to set the Content-Length header: they can serve partial (i.e. corrupt) files if there is no Content-Length header and the download from your server to CloudFront is interrupted during transfer.

Is there a way to remove some webpage headers in Apache or Nginx to reduce latency?

When a page is requested, the response includes headers such as the following:
Pragma
Cache-Control
Content-Type
Date
Content-Length
Is there a way to remove Date, for example? Or to remove most of them (except Pragma and some caching headers) for image files? Would we get a performance gain here? Should we do it at the web server layer?
The Date header is required by HTTP/1.1. Content-Type and Content-Length are also valuable and cheap to include, and you already mentioned that cache headers are important to you. So I think you are looking in the wrong place for optimization.
What you can do is make sure that images are served from a domain separate from the application, so that clients aren't sending Cookie headers when they request static images. Using a CDN to serve static content is also recommended.
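For example, a cookie-free host for static assets can be as simple as an extra server block on the same Nginx instance (names and paths are illustrative):

    server {
        listen 80;
        server_name static.example.com;   # a host that never sets cookies
        root /var/www/static;
        expires 30d;                      # long-lived Expires/Cache-Control for images and other assets
    }

Because no cookies are ever scoped to that host, image requests go out without a Cookie header.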

Does the cache work for partially loaded files?

This is not a "coding question", but more something like "how does it work?".
Let's say I want to show a heavy pic on page 2.
If I preload this pic on page 1 (not displayed) and click the page-2 link before it's fully loaded... what happens?
=> Does page 2 load with the rest of the heavy pic picking up where it left off, or does the cache not work for partially loaded files?
Thanks for your explanations,
CH
In theory it's very possible that part of the response gets cached, either by the web browser or by a proxy server between the end user and the web server. HTTP supports range requests, where the client can ask for a specific slice of the total resource (like an image). All the big-name web servers support range requests.
I really don't know offhand whether any web browsers cache a partially downloaded resource, although it would be a simple test: clear the browser's cache, hit a web page that loads a large external object, and stop loading midway through. Make sure the web server sends the following headers along with the response:
Cache-Control: max-age=10000
Accept-Ranges: bytes
Now make the request again, but look at the HTTP headers of the request for the browser asking for partial content, e.g. Range: bytes=100000-90000000. It would obviously only ask for partial content if it had partially cached the file.
The max-age directive tells the browser the file is cacheable for a while, and the Accept-Ranges header tells the browser the web server is capable of servicing partial-content requests.
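On the server side, those two headers fall out of a fairly ordinary Nginx static-file location; a sketch, with the path purely illustrative:

    location /images/ {
        add_header Cache-Control "max-age=10000";   # marks the files as cacheable for a while
        # Nginx sends "Accept-Ranges: bytes" for static files by default,
        # so byte-range requests such as "Range: bytes=100000-" are honoured.
    }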
