How do I bypass robots.txt AND sitemap.xml in my Varnish configuration?

Google is having a hard time rendering my robots.txt file due to Varnish. When I try to visit the robots.txt file, I get a 503 Service Unavailable page.
I already bypass the cache for my sitemap in the following manner:
# Bypass sitemap
if (req.url ~ "/sitemap.xml") {
    return (pass);
}
Is the following the appropriate syntax to bypass both items:
# Bypass sitemap and robots.txt
if (req.url ~ "/sitemap.xml" || req.url ~ "/robots.txt") {
    return (pass);
}

The syntax is indeed correct. You could also combine the two checks into a single regular expression and match the patterns more tightly, anchoring them and escaping the dots.
Here's an example:
sub vcl_recv {
    if (req.url ~ "^/(sitemap\.xml|robots\.txt)(\?.*)?$") {
        return (pass);
    }
}
However, the fact that you get an HTTP 503 error means that Varnish cannot successfully fetch the content from the backend for these requests. In that case the problem has nothing to do with the VCL code.
As described in https://www.varnish-software.com/developers/tutorials/troubleshooting-varnish/#backend-errors, you can run the following varnishlog command to figure out why these errors are being returned:
sudo varnishlog -g request -q "VCL_call eq 'BACKEND_ERROR'"
You can also tailor the command to match the /sitemap.xml and /robots.txt URLs:
sudo varnishlog -g request -q "ReqUrl ~ '^/(sitemap\.xml|robots\.txt)(\?.*)?$'"
If you still need help figuring out the varnishlog output, don't hesitate to add the relevant log transactions to your original question and I'll help you figure it out.
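If the log output points at a sick or unreachable backend, a quick additional check (assuming you have shell access to the Varnish host) is to list the configured backends and their health state:
# list configured backends and their current health
sudo varnishadm backend.list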

Related

regular expression inside a cURL call

I have a cURL call like this:
curl --silent --max-filesize 500 --write-out "%{http_code}\t%{url_effective}\n" 'http://fmdl.filemaker.com/maint/107-85rel/fmpa_17.0.2.[200-210].dmg' -o /dev/null
This call generates a list of URLs with the HTTP code (200 or 404 normally) like this:
404 http://fmdl.filemaker.com/maint/107-85rel/fmpa_17.0.2.203.dmg
404 http://fmdl.filemaker.com/maint/107-85rel/fmpa_17.0.2.204.dmg
200 http://fmdl.filemaker.com/maint/107-85rel/fmpa_17.0.2.205.dmg
404 http://fmdl.filemaker.com/maint/107-85rel/fmpa_17.0.2.206.dmg
The only valid URLs are the ones preceded by the 200 HTTP code, so I would like to put a regular expression in the curl call so that it only downloads the URLs whose lines start with 200.
Any ideas on how to do this without writing a bash script?
Thank you in advance
You can use the following:
curl --silent -f --max-filesize 500 --write-out "%{http_code}\t%{url_effective}\n" -o '#1.dmg' 'http://fmdl.filemaker.com/maint/107-85rel/fmpa_17.0.2.[200-210].dmg'
This will try each URL in the range and, when the response is neither a 404 nor too large, download it into a file whose name is based on the index in the URL.
The -f flag makes curl fail silently instead of outputting the response body when the HTTP code isn't a success code, while the -o flag specifies an output file; #1 expands to the current value of your [200-210] range (adding other [] or {} groups would let you refer to other parts of the URL by their index).
Note that during my tests, the --max-filesize 500 flag (a 500-byte limit) prevented the download of the only URL that didn't end up in a 404, fmpa_17.0.2.205.dmg.
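If the downloads are expected to exceed 500 bytes (a .dmg certainly will), you can raise or drop the size limit; a minimal sketch with the limit removed, otherwise identical to the command above:
curl --silent -f --write-out "%{http_code}\t%{url_effective}\n" -o '#1.dmg' 'http://fmdl.filemaker.com/maint/107-85rel/fmpa_17.0.2.[200-210].dmg'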

Why would a web server reply with 301 and the exact location that was requested?

I'm trying to retrieve pages from a web server via HTTPS, using Lua with luasec. For most pages my script works as intended, but if the resource contains special characters (like 'é'), I'm being sent into a loop of 301 responses.
Let this code snippet illustrate my dilemma (actual server details redacted to protect the innocent):
local https = require "ssl.https"
local prefix = "https://www.example.com"
local suffix = "/S%C3%A9ance"
local body,code,headers,status = https.request(prefix .. suffix)
print(status .. " - GET was for \"" .. prefix .. suffix .. "\"")
print("headers are " .. myTostring(headers))
print("body is " .. myTostring(body))
if suffix == headers.location then
    print("equal")
else
    print("not equal")
end
local body,code,headers,status = https.request(prefix .. headers.location)
print(status .. " - GET was for \"" .. prefix .. suffix .. "\"")
which results in the paradoxical
HTTP/1.1 301 Moved Permanently - GET was for "https://www.example.com/S%C3%A9ance"
headers are { ["content-type"]="text/html; charset=UTF-8";["set-cookie"]="PHPSESSID=e80oo5dkouh8gh0ruit7mj28t6; path=/";["content-length"]="0";["connection"]="close";["date"]="Wed, 15 Mar 2017 19:31:24 GMT";["location"]="S%C3%A9ance";}
body is ""
equal
HTTP/1.1 301 Moved Permanently - GET was for "https://www.example.com/S%C3%A9ance"
How might one be able to retrieve the elusive pages, using Lua and as few additional dependencies as possible?
Obvious as it may seem, perhaps the requested URL does differ from the actual location.
If you have a similar problem, do check deep within your external libraries to make sure they do what you think they do.
In this case, luasocket URL-decoded and then re-encoded the URL, so the final request was not what it seemed to be.
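One quick, dependency-free way to see what the server actually does with the percent-encoded path (a sketch; the host below is the redacted example from the question) is to request the headers with curl and inspect the Location header:
# HEAD request; print only the Location header of the response
curl -sI 'https://www.example.com/S%C3%A9ance' | grep -i '^location'
If the server redirects to the very same path, the loop is happening on the client side (for example through re-encoding of the URL), not on the server.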

What is the fastest way to perform a HTTP request and check for 404?

Recently I needed to check for a huge list of filenames if they exist on a server. I did this by running a for loop which tried to wget each of those files. That was efficient enough, but took about 30 minutes in this case. I wonder if there is a faster way to check whether a file exists or not (since wget is for downloading files and not performing thousands of requests).
I don't know if that information is relevant, but it's an Apache server.
curl would be the best option in a for loop. Here is a straightforward, simple way; run this in your for loop:
curl -I --silent http://www.yoururl/linktodetect | grep -m 1 -c 404
This checks the HTTP response headers for a 404 on the link. If the file/link is missing and returns a 404, the command-line output will be 1; otherwise, if the file/link is valid and does not return a 404, the output will be 0.
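An alternative that avoids grepping the headers (a sketch using only standard curl options; substitute your own URL) is to have curl print just the response code:
# HEAD request; print only the HTTP status code (200, 404, ...)
curl -o /dev/null --silent --head --write-out '%{http_code}\n' http://www.yoururl/linktodetect
The printed code can then be compared directly in the loop, for example to skip a file when it is 404.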

curl -i and curl -I returning different results

My understanding was that curl -i and curl -I would return virtually the same results, except that curl -i would return the body along with the headers while curl -I would return only the headers -- the headers being the same in both cases. We've been doing some gzip and un-gzipped testing with Varnish and stumbled upon the oddity that curl -i shows X-Cache: HIT but curl -I returns X-Cache: MISS! How this is possible is precisely my question in this post.
Here are some more details that may or may not make a difference:
The URL is usually SSL-enforced (https), but both HTTP and HTTPS have been tested with the same results
The results are consistent
The "Is Varnish Running?" site says "Yes! Sort of"
curl sends a different HTTP request to the server (or Varnish in this case) when you use the -I option. Normally, curl sends a GET request, but when you specify -I, it sends HEAD instead (essentially telling the server to send just the headers, not the actual content). I'm not particularly familiar with Varnish, but it appears to normally cache both GET and HEAD requests -- in your case it might be configured to do something different, or the backend server may be triggering a difference. In any case, I'm pretty sure it's GET vs. HEAD that's making the cache respond differently to -i vs. -I.
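To see the two request methods side by side (a sketch; substitute the URL under test), you can dump the response headers of a GET and of a HEAD request and compare the X-Cache value in each:
# GET: discard the body, dump the response headers to stdout
curl -s -o /dev/null -D - https://www.example.com/ | grep -i '^x-cache'
# HEAD: request the headers only
curl -s -I https://www.example.com/ | grep -i '^x-cache'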
Did you check in different orders?
See http://anothersysadmin.wordpress.com/2008/04/22/x-cache-and-x-cache-lookup-headers-explained/ for some details on X-Cache.

Using CURL to download file and view headers and status code

I'm writing a Bash script to download image files from Snapito's web page snapshot API. The API can return a variety of responses indicated by different HTTP response codes and/or some custom headers. My script is intended to be run as an automated Cron job that pulls URLs from a MySQL database and saves the screenshots to local disk.
I am using curl. I'd like to do these three things using a single curl command:
Extract the HTTP response code
Extract the headers
Save the file locally (if the request was successful)
I could do this using multiple curl requests, but I want to minimize the number of times I hit Snapito's servers. Any curl experts out there?
Or if someone has a Bash script that can respond to the full documented set of Snapito API responses, that'd be awesome. Here's their API documentation.
Thanks!
Use the dump-header option:
curl -D /tmp/headers.txt http://server.com
Use curl -i (include HTTP header) - which will yield the headers, followed by a blank line, followed by the content.
You can then split out the headers / content (or use -D to save directly to file, as suggested above).
There are three relevant options: -i, -I, and -D.
> curl --help | egrep '^ +\-[iID]'
-D, --dump-header FILE Write the headers to FILE
-I, --head Show document info only
-i, --include Include protocol headers in the output (H/F)
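Putting the pieces together, here is a sketch of a single invocation that saves the body, dumps the headers to a file, and prints the HTTP status code on stdout (the URL and file names are placeholders, not Snapito's real endpoint):
# placeholder endpoint and paths; substitute the real Snapito API URL
curl --silent --write-out '%{http_code}\n' -D /tmp/headers.txt -o /tmp/snapshot.png 'https://api.example.com/snapshot?url=http://example.org'
The status code printed by --write-out can then be tested in the cron script to decide whether to keep or discard the downloaded file.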
