I have an Amazon EC2 Web Server instance which serves gzipped content when the Accept-Encoding header is set to gzip. But when I make the same request with the exact same header to a CloudFront CDN with the origin server as my Amazon EC2 instance, it doesn't send back a gzipped response.
I also tried creating a new CloudFront distribution (because I thought the old distribution might have the uncompressed response cached) and making the same request, but I still get an uncompressed response.
Can someone please tell me what I may be missing?
This has been marked as a possible duplicate of a question relating to S3. The question is around EC2 - not S3, so I don't think this is a duplicate.
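For reference, the check I'm doing looks roughly like this (a Python sketch using the requests library; both hostnames are placeholders):

# Compare the Content-Encoding returned by the origin and by CloudFront
# for the same Accept-Encoding: gzip request.
import requests

headers = {"Accept-Encoding": "gzip"}
for host in ("http://ec2-origin.example.com", "http://d1234abcd.cloudfront.net"):
    response = requests.get(f"{host}/index.html", headers=headers)
    print(host, "->", response.headers.get("Content-Encoding"))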
You're likely seeing this issue because CloudFront adds a 'Via' header to the requests made to your origin server - it's a known issue with IIS.
If you were to look at the incoming HTTP requests to your origin, you’d see something like this in your HTTP headers:
Via=1.1 9dc1db658f6cee1429b5ff20764c5b07.cloudfront.net (CloudFront)
X-Amz-Cf-Id=k7rFUA2mss4oJDdT7rA0HyjG_XV__XwBV14juZ8ZAQCrbfOrye438A==
X-Forwarded-For=121.125.239.19, 116.127.54.19
The addition of a 'Via' header is standard proxy server behaviour. When IIS sees this, it drops the gzip compression (I'm guessing due to an assumption that older proxy servers couldn't handle compressed content).
If you make the following changes to your applicationHost.config, you should rectify the issue:
<location path="Your Site">
  <system.webServer>
    <httpCompression noCompressionForHttp10="false" noCompressionForProxies="false" />
  </system.webServer>
</location>
The other issue to watch out for is that IIS doesn't always compress the first response it receives for a given resource; therefore, CloudFront may make a request to the origin, receive and cache an uncompressed version, and then serve it to subsequent visitors. Again, you can modify this behaviour using the serverRuntime settings in applicationHost.config:
<location path="Your Site">
  <system.webServer>
    <httpCompression noCompressionForHttp10="false" noCompressionForProxies="false" />
    <serverRuntime frequentHitThreshold="1" frequentHitTimePeriod="00:00:05" />
  </system.webServer>
</location>
More details on these settings here:
http://www.iis.net/configreference/system.webserver/serverruntime
http://www.iis.net/configreference/system.webserver/httpcompression
Credit to this blog post for explaining the issue:
http://codepolice.net/2012/06/26/problems-with-gzip-when-using-iis-7-5-as-an-origin-server-for-a-cdn/
Related
Somebody commented on this question about caching:
...using a Cache-Control value of: max-age=0, s-maxage=604800 seems to get my desired behavior of instant client updates on new page contents, but still caching at the CDN level
Will I really get caching at CDN level and instant updates for my users?
Does it make sense? How does that combination work?
Yes, it makes sense.
With the configuration mentioned in that comment, responses are immediately stale for your users, so their browsers will have to revalidate them on the next request. The CDN, however, will cache a valid response for 604800 seconds (one week). So repeated requests will mostly be served by the CDN instead of the origin server.
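As a minimal sketch (assuming a Flask origin; the route and body are placeholders), emitting that combination looks like this:

# max-age=0 forces browsers to revalidate on every request, while
# s-maxage=604800 lets shared caches (the CDN) keep the response for a week.
from flask import Flask, make_response

app = Flask(__name__)

@app.route("/page")
def page():
    response = make_response("<html>...</html>")
    response.headers["Cache-Control"] = "max-age=0, s-maxage=604800"
    return response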
But what if you update your app? What happens to the stale cache on the CDN?
After a new deployment, you need to make sure all of the stale cache on the CDN is purged/cleared.
For example, see Purging cached resources from Cloudflare: it gives you numerous options on how to do that.
Purge by single-file (by URL)
Purging by single-file through your Cloudflare dashboard
Purge everything
Purge cached resources through the API
etc
Firebase Hosting, for example, will clear all CDN cache after a new deployment:
Any requested static content is automatically cached on the CDN. If you redeploy your site's content, Firebase Hosting automatically clears all your cached static content across the CDN until the next request.
As for the setting suggested in the comment, I think Cache-Control: no-cache would do a better job.
From MDN - Cache Control:
no-cache
The response may be stored by any cache, even if the response is normally non-cacheable. However, the stored response MUST always go through validation with the origin server first before using it; therefore, you cannot use no-cache in conjunction with immutable. If you mean to not store the response in any cache, use no-store instead. This directive is not effective in preventing caches from storing your response.
When I navigate to web.mysite.com, a static SPA hosted in S3, it loads an iframe whose src is mysite.com/some/path, which is a Spring Boot MVC application in Elastic Beanstalk. Both are behind CloudFront distributions for HTTPS. This path is handled in the application by a custom resource resolver. It loads successfully, but inside the iframe content there is a script tag looking for mysite.com/some/path/thatsdifferent, handled by the same resolver.
This second request fails with a 403 and I cannot determine why. Navigating to the failing mysite.com/some/path/thatsdifferent directly in my browser or using Postman succeeds with a 200 status. The server is configured to allow requests from web.mysite.com through CORS configuration (and there is no CORS-related error message), and Spring Security is configured to permitAll any requests to /some/** regardless of authentication. There is no response body or error message beyond the header x-cache: Error from cloudfront.
If I navigate to the-beanstalk-env-url.com/some/path, it loads the html and then successfully loads the content from the-beanstalk-env-url.com/some/path/thatsdifferent.
Requests to a few different but similar paths succeed. Going to a path which definitely does not exist returns a 404.
The server logs show that the request is being handled successfully and that reasonable responses are being returned. The CloudFront logs simply report a 403, without any additional information.
Almost 100% of Cloudfront 403 error articles and questions involve S3, which is not the part which is failing here.
Changing the CloudFront distribution's Allowed Methods from GET, HEAD to GET, HEAD, OPTIONS causes the requests directly to mysite.com/some/path/thatsdifferent to begin failing with an invalid CORS request; this was fixed by whitelisting the Accept, Authorization, Host, Origin and Referer headers. It did not fix the underlying error.
Adjusting the logging for org.springframework.security doesn't log any extra information when a failing request occurs, so my application security configuration is not what is causing the error.
After replacing CloudFront with a load balancer on my environment in Route 53, the scenario works as expected, so the problem is definitely in CloudFront.
The solution was to switch the CloudFront Origin Protocol Policy from HTTP Only to HTTPS Only.
I don't know why this mattered for the script file and not the HTML file, but I decided to test it out when I discovered that connecting to the Beanstalk environment URL via HTTPS made Chrome warn me that the certificate being used was set up for the domain served by the troublesome CloudFront distribution.
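For anyone who wants to make the same change programmatically, here is a minimal sketch assuming boto3 (the distribution ID is a placeholder):

# Flip every custom origin of the distribution from "http-only" to "https-only".
import boto3

cloudfront = boto3.client("cloudfront")
DISTRIBUTION_ID = "E1EXAMPLE12345"  # placeholder

config_response = cloudfront.get_distribution_config(Id=DISTRIBUTION_ID)
config = config_response["DistributionConfig"]

for origin in config["Origins"]["Items"]:
    custom = origin.get("CustomOriginConfig")
    if custom and custom["OriginProtocolPolicy"] == "http-only":
        custom["OriginProtocolPolicy"] = "https-only"

# IfMatch must carry the ETag returned by get_distribution_config.
cloudfront.update_distribution(
    Id=DISTRIBUTION_ID,
    IfMatch=config_response["ETag"],
    DistributionConfig=config,
)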
We're using Azure CDN (Verizon Standard) to serve images to e-commerce sites; however, we're seeing an unreasonable amount of load on the origin: images which should have been cached in the CDN are requested from the origin again multiple times.
Images seem to stay in the cache if they're requested very frequently (a Pingdom page speed test, which runs every 30 minutes, doesn't show the problem).
Additionally, if I request an image (using the browser), the scaled image is requested from the origin and delivered, but the second request doesn't return a cached file from the CDN - the origin is called again. The third request returns from the CDN.
The origin is a web app which scales and delivers the requested images. All requests for images have the following headers which might affect caching:
cache-control: max-age=31536000, s-maxage=31536000
ETag: e7bac8d5-3433-4ce3-9b09-49412ac43c12?cache=Always&maxheight=3200&maxwidth=3200&width=406&quality=85
Since we want the CDN to cache the scaled image, the Azure CDN endpoint is configured to cache every unique URL and the caching behaviour is "Set if missing" (although all responses have the headers above).
Using the same origin with AWS CloudFront works perfectly (but since we have everything else in Azure, it would be nice to make this work). I haven't been able to find whether there's any limit or constraint on the ETag, but since it works with AWS, it seems like I'm missing something related to either Azure or Verizon.
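For what it's worth, this is roughly how I reproduce it (a Python sketch with the requests library; the URL is a placeholder): request the same image a few times and inspect the response headers to see whether the edge or the origin answered.

# Repeated requests for the same scaled image; the cache-status headers
# exposed vary by CDN, so just dump everything and compare across attempts.
import requests

url = "https://cdn.example.com/images/product.jpg?width=406&quality=85"
for attempt in range(3):
    response = requests.get(url)
    print(f"request {attempt + 1}:", dict(response.headers))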
My application is displaying images stored in AWS S3 (in a private bucket for security reasons).
To allow users to see the images from their browser I generate signed URLs like https://s3.eu-central-1.amazonaws.com/my.bucket/stuff/images/image.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=...&X-Amz-Date=20170701T195504Z&X-Amz-Expires=900&X-Amz-Signature=bbe277...3358e8&X-Amz-SignedHeaders=host.
This is working perfectly with <img src="S3URL" />: the images are correctly displayed.
I can even directly view the images in another tab by copy/pasting their URL.
I'm also generating PDFs that embed these images, which need to be transformed beforehand with a canvas: resized and watermarked.
But the library I use for resizing is having some trouble:
Failed to execute 'getImageData' on 'CanvasRenderingContext2D':
The canvas has been tainted by cross-origin data.
Indeed we are in a CORS context, but I've set up everything so that the images can be displayed to the user, and that is indeed working.
So I'm not sure I understand the reason for this error: is this another CORS security layer - the browser fears that I might alter the image for a malicious purpose?
I've tried to set a permissive CORS configuration on the S3 bucket:
<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <CORSRule>
    <AllowedOrigin>*</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
    <AllowedMethod>POST</AllowedMethod>
    <AllowedMethod>PUT</AllowedMethod>
    <MaxAgeSeconds>3000</MaxAgeSeconds>
    <AllowedHeader>*</AllowedHeader>
  </CORSRule>
</CORSConfiguration>
And img.crossOrigin = "" or img.crossOrigin = "Anonymous" on the client-side but then I get:
Access to Image at 'https://s3.eu-central-1.amazonaws.com/...'
from origin 'http://localhost:5000' has been blocked by CORS policy:
No 'Access-Control-Allow-Origin' header is present on the requested resource.
Origin 'http://localhost:5000' is therefore not allowed access.
Which AWS/S3-side and/or client-side configuration could be missing?
One workaround here is to prevent the browser from caching the downloaded object. This seems to stem from arguably incorrect behavior on the part of S3 that interacts with the way Chrome handles cached objects. I recently answered a similar question on Server Fault, and you can find additional details there.
The problem seems to arise when you fetch an object from S3 from simple HTML (like an <img> tag) and then fetch the same object again in a cross-origin context.
Chrome caches the result of the first request, and then uses that cached response instead of making a new request the second time. When it examines the cached object, there's no Access-Control-Allow-Origin header, because it was cached from a request that wasn't subject to CORS rules... so when that first request was made, the browser didn't send an Origin header. Because of that, S3 didn't respond with an Access-Control-Allow-Origin header (or any CORS-related headers).
The root of the problem seems related to the HTTP Vary: response header, which is related to caching.
A web server (S3 in this case) can use the Vary: response header to signal to the browser that the server is capable of producing more than one representation of the object being returned -- and if the browser would vary an attribute of the request, the response might differ. When the browser is considering using a cached object, it should check whether the object is valid in the new context, before concluding that the cached object is suited to the current need.
Indeed, when you send an Origin request header to S3, you get a response that includes Vary: Origin. This tells the browser that if the origin sent in the request had been a different value, the response might also have been different -- for example, because not all origins might be allowed.
The first part of the underlying problem is that S3 -- arguably -- should always return Vary: Origin whenever CORS is configured on the bucket, even if the browser didn't send an origin header, because Vary can be specified against a header you didn't actually include in the request, to tell you that if you had included it, the response might have differed. But, it doesn't do that, when Origin isn't present.
The second part of the problem is that Chrome -- when it consults its internal cache -- sees that it already has a copy of the object. The response that seeded the cache did not include Vary, so Chrome assumes this object is also perfectly valid for the CORS request. Clearly, it isn't, since when Chrome tries to use the object, it finds that the cross-origin response headers are missing. Presumably, had Chrome received a Vary: Origin response from S3 on the original request, it would have realized that its provisional request headers for the second request included Origin:, so it would correctly go and fetch a different copy of the object. If it did that, the problem would go away -- as we have illustrated by setting Cache-Control: no-cache on the object, preventing Chrome from caching it. But, it doesn't.
So, we work around this by setting Cache-Control: no-cache on the object in S3, so that Chrome won't cache the first one, and will make the correct CORS request for the second one, instead of trying to use the cached copy, which will fail.
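A minimal sketch of that workaround, assuming boto3 (bucket and key are placeholders): rewrite the object in place so S3 serves it with Cache-Control: no-cache.

# Copying an object onto itself with MetadataDirective="REPLACE" is the
# standard way to change its stored metadata, including Cache-Control.
import boto3

s3 = boto3.client("s3")
s3.copy_object(
    Bucket="my.bucket",
    Key="stuff/images/image.png",
    CopySource={"Bucket": "my.bucket", "Key": "stuff/images/image.png"},
    MetadataDirective="REPLACE",
    CacheControl="no-cache",
    ContentType="image/png",  # re-specify, since REPLACE discards stored metadata
)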
Note that if you want to avoid updating your objects in S3 to include the Cache-Control: no-cache response, there is another option for solving this without actually adding the header to the objects at rest in S3. Actually, there are two more options:
The S3 API respects a value passed in the query string of response-cache-control=no-cache. Adding this to the signed URL will direct S3 to add the header to the response, regardless of the Cache-Control metadata value stored with the object (or lack thereof). You can't simply append this to the query string -- you have to add it as part of the URL signing process. But once you add that to your code, your objects will be returned with Cache-Control: no-cache in the response headers.
Or, if you can generate these two signed URLs for the same object separately when rendering the page, simply change the expiration time on one of the signed URLs, relative to the other. Make it one minute longer, or something along those lines. Changing the expiration time from one to the other will force the two signed URLs to be different, and two different objects with two different query strings should be interpreted by Chrome as two separate objects, which should also eliminate the incorrect usage of the first cached object to serve the other request.
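Here is a minimal sketch of the first of those options, assuming boto3 (bucket and key are placeholders):

# ResponseCacheControl is signed into the URL as response-cache-control=no-cache,
# so S3 returns Cache-Control: no-cache regardless of the object's stored metadata.
import boto3

s3 = boto3.client("s3", region_name="eu-central-1")
url = s3.generate_presigned_url(
    "get_object",
    Params={
        "Bucket": "my.bucket",
        "Key": "stuff/images/image.png",
        "ResponseCacheControl": "no-cache",
    },
    ExpiresIn=900,
)
print(url)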
I already have allow-any-origin set on S3 even though I am only fetching from exactly one origin, yet this problem continued, so I don't see how this can actually be a CORS problem. As others have said, it is mostly a browser bug. The problem appears if you fetch the image from an <img> tag AND at any point later also do a JavaScript fetch for it... you'll get the cache bug. The easiest solution for me was to put query parameters at the end of the URL that would be used by JavaScript.
https://something.s3.amazonaws.com/something.jpg?stopGivingMeHeadaches=true
I am using a Cloudflare DNS subdomain which points to an Amazon S3 bucket. The problem I am facing is that Cloudflare caches 404 responses from Amazon S3. Even after I upload the image to Amazon S3, Cloudflare still responds with a 404 because the previous response is cached. I want to use the Cloudflare cache for performance reasons, but I don't want to manually clear the Cloudflare cache for 404 URLs.
It is obvious that if Amazon S3 responds with a 404, there is no point caching that URL.
Maybe I am missing some Cloudflare setting which does this.
CloudFlare actually caches 404s for about ten minutes (lightens the potential load on your server). Have you looked at purging your cache as one workaround?
I have this problem not only with S3, but also with web server responses.
Seems like the solution they give is to add a no-cache header on the server response for 404s.
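A minimal sketch of that, assuming a Flask origin (any framework with an equivalent hook would do):

# Mark every 404 as requiring revalidation so the CDN won't keep serving
# the stale "not found" after the object is finally uploaded.
from flask import Flask, make_response

app = Flask(__name__)

@app.errorhandler(404)
def not_found(error):
    response = make_response("Not Found", 404)
    response.headers["Cache-Control"] = "no-cache"
    return response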