ManagedFusion Rewriter 404 if trailing slash is missing? - mod-rewrite

I'm using ManagedFusion Rewriter as a reverse proxy. The configuration is fairly simple:
RewriteRule ^/api/(.*)$1 [P]
This works for pretty much any URL. However, if the URL does not end with a trailing slash, it fails.
A request like this works fine: GET api/report/
2013-10-10T11:27:11 [Rewrite] Input: http://localhost:50070/api/report/
2013-10-10T11:27:11 [Rule 0] Input: /api/report/
2013-10-10T11:27:11 [Rule 0] Rule Pattern Matched
2013-10-10T11:27:11 [Rule 0] Output:
2013-10-10T11:27:11 [Rewrite] Proxy:
2013-10-10T11:27:11 **********************************************************************************
2013-10-10T11:27:11 [Proxy] Request:
2013-10-10T11:27:12 [Proxy] System.Net.HttpWebResponse
2013-10-10T11:27:12 [Proxy] Received '200 OK'
2013-10-10T11:27:12 [Proxy] Response: http://localhost:50070/api/report/
2013-10-10T11:27:12 [Proxy] Response is being buffered
2013-10-10T11:27:12 [Proxy] Responding '200 OK'
However, a request like this returns a 404 without the request ever being made to the proxied URL: GET api/report/1
2013-10-10T11:27:13 [Rewrite] Input: http://localhost:50070/api/report/1
2013-10-10T11:27:13 [Rule 0] Input: /api/report/1
2013-10-10T11:27:13 [Rule 0] Rule Pattern Matched
2013-10-10T11:27:13 [Rule 0] Output:
2013-10-10T11:27:13 [Rewrite] Proxy:
(the log file finishes right here)
This is my whole configuration file:
RewriteEngine On
RewriteLog "log.txt"
RewriteLogLevel 9
RewriteRule ^/api/(.*)$1 [P]
Any idea where I may be going wrong?

EDIT: My workaround has been accepted as the solution in the Rewriter codebase, so I'll make this the accepted answer. Please, still provide feedback on possible approaches to it.
Found a workaround, but I don't think this is the actual solution, so I'll answer my own question but won't accept it as an answer. (Unless I change my mind later. Fate is a fickle mistress.)
I downloaded the source code of ManagedFusion.Rewriter (apparently the latest version, from GitHub) and integrated it into my code base.
The class ManagedFusion.Rewriter.RewriterModule contains the following two methods:
private void context_PostResolveRequestCache(object sender, EventArgs e)
var context = new HttpContextWrapper(((HttpApplication)sender).Context);
// check to see if this is a proxy request
if (context.Items.Contains(Manager.ProxyHandlerStorageName))
private void context_PostMapRequestHandler(object sender, EventArgs e)
var context = new HttpContextWrapper(((HttpApplication)sender).Context);
// check to see if this is a proxy request
if (context.Items.Contains(Manager.ProxyHandlerStorageName))
var proxy = context.Items[Manager.ProxyHandlerStorageName] as IHttpProxyHandler;
context.RewritePath("~" + proxy.ResponseUrl.PathAndQuery);
context.Handler = proxy;
As the names imply, the first one is the handler of PostResolveRequestCache, while the second one is the handler for PostMapRequestHandler.
In both of my example requests, the PostResolveRequestCache handler was being invoked and working fine. However, for my failing request, PostMapRequestHandler was not being executed.
This made me think that, for some reason, using RewritePath to rewrite a resource that does not look like a directory into one that looks like a file was preventing the actual handler from being picked up, and therefore preventing PostMapRequestHandler from being raised.
As such, I upgraded the Rewriter project from .NET 3.5 to 4.5 and replaced these lines:
if (context.Items.Contains(Manager.ProxyHandlerStorageName))
with these:
if (context.Items.Contains(Manager.ProxyHandlerStorageName)) {
var proxyHandler = context.Items[Manager.ProxyHandlerStorageName] as IHttpHandler;
With this, all the requests were being properly picked up by the handler and started working.
As a side note, I had some mistakes in the original rule. Instead of
RewriteRule ^/api/(.*)$1 [P]
It should have been:
RewriteRule ^/api/(.*)$1 [QSA,P,NC]
QSA appends the query string of the original request.
NC makes the regex match case-insensitively.
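To make the effect of the two extra flags concrete, here is a rough Python sketch of what the rule does to a request path. It is only an illustration of the matching logic, not how ManagedFusion Rewriter is implemented; the backend address http://backend.example and the function name proxy_target are made up for the example.

```python
import re

# IGNORECASE plays the role of the [NC] flag.
API_RULE = re.compile(r"^/api/(.*)", re.IGNORECASE)


def proxy_target(path, query="", backend="http://backend.example"):
    """Return the proxied URL for a request path, or None if the rule does not match.

    backend is a hypothetical address standing in for the real proxy target.
    """
    match = API_RULE.match(path)
    if match is None:
        return None
    target = backend + "/" + match.group(1)  # $1 is the first capture group
    if query:
        target += "?" + query  # [QSA]: keep the original query string
    return target
```

With this sketch, /api/report/1 and /API/report/1 both match, and a query string such as page=2 is carried over to the target URL.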


Airflow SimpleHttpOperator for HTTPS

I'm trying to use SimpleHttpOperator to consume a RESTful API. But, as the name suggests, it only supports the HTTP protocol, whereas I need to consume an HTTPS URI. So for now I have to either use the "requests" library from Python or handle the invocation from within the application code, which may not be the standard way. I'm looking for any other options available to consume an HTTPS URI from within Airflow. Thanks.
I dove into this and am pretty sure that this behavior is a bug in Airflow. I have created a ticket for it.
For now, the best you can do is override SimpleHttpOperator as well as HttpHook in order to change the way that HttpHook.get_conn works (to accept https). I may end up doing this, and if I do I'll post some code.
Operator override:
from airflow.operators.http_operator import SimpleHttpOperator
from airflow.exceptions import AirflowException
from operators.https_support.https_hook import HttpsHook
class HttpsOperator(SimpleHttpOperator):
    def execute(self, context):
        # Same flow as SimpleHttpOperator.execute, but using the HTTPS-aware hook
        http = HttpsHook(self.method, http_conn_id=self.http_conn_id)
        self.log.info("Calling HTTP method")
        response = http.run(self.endpoint,
                            self.data,
                            self.headers,
                            self.extra_options)
        if self.response_check:
            if not self.response_check(response):
                raise AirflowException("Response check returned False.")
        if self.xcom_push_flag:
            return response.text
Hook override:
from airflow.hooks.http_hook import HttpHook
import requests
class HttpsHook(HttpHook):
    def get_conn(self, headers):
        """
        Returns http session for use with requests. Supports https.
        """
        conn = self.get_connection(self.http_conn_id)
        session = requests.Session()
        if "://" in conn.host:
            self.base_url = conn.host
        elif conn.schema:
            self.base_url = conn.schema + "://" + conn.host
        elif conn.conn_type:  # https support
            self.base_url = conn.conn_type + "://" + conn.host
        else:
            # schema defaults to HTTP
            self.base_url = "http://" + conn.host
        if conn.port:
            self.base_url = self.base_url + ":" + str(conn.port) + "/"
        if conn.login:
            session.auth = (conn.login, conn.password)
        if headers:
            session.headers.update(headers)
        return session
Drop-in replacement for SimpleHttpOperator.
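The order in which the hook picks the scheme is easier to see in isolation. The following is a small stand-alone sketch of that resolution logic only, assuming the values being compared are the connection's host, schema, conn_type and port; build_base_url is a name invented for the example and is not part of Airflow.

```python
def build_base_url(host, schema=None, conn_type=None, port=None):
    """Mirror the scheme resolution order used in the hook override above."""
    if "://" in host:
        base_url = host  # host already carries its own scheme
    elif schema:
        base_url = schema + "://" + host
    elif conn_type:  # this branch is what lets an "https" connection type work
        base_url = conn_type + "://" + host
    else:
        base_url = "http://" + host  # default to plain HTTP
    if port:
        base_url = base_url + ":" + str(port) + "/"
    return base_url
```

So a host that already contains a scheme wins, then the schema field, then the connection type, and plain HTTP is only the fallback.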
This is a couple of months old now, but for what it is worth I did not have any issue with making an HTTPS call on Airflow 1.10.2.
In my initial test I was making a request for templates from sendgrid, so the connection was set up like this:
Conn Id : sendgrid_templates_test
Conn Type : HTTP
Host :
Extra : { "authorization": "Bearer [my token]"}
and then in the dag code:
get_templates = SimpleHttpOperator(
http_conn_id = 'sendgrid_templates_test',
and that worked. Also notice that my request happens after a Branch Operator, so I needed to set the trigger rule appropriately (to "all_done" to make sure it fires even when one of the branches is skipped), which has nothing to do with the question, but I just wanted to point it out.
Now to be clear, I did get an InsecureRequestWarning, as I did not have certificate verification enabled. But you can see the resulting logs below:
[2019-02-21 16:15:01,333] {} INFO - Calling HTTP method
[2019-02-21 16:15:01,336] {} INFO - [2019-02-21 16:15:01,335] {} INFO - Using connection to: id: sendgrid_templates_test. Host:, Port: None, Schema: None, Login: None, Password: XXXXXXXX, extra: {'authorization': 'Bearer [my token]'}
[2019-02-21 16:15:01,338] {} INFO - [2019-02-21 16:15:01,337] {} INFO - Sending 'GET' to url:
[2019-02-21 16:15:01,956] {} WARNING - /home/csconnell/.pyenv/versions/airflow/lib/python3.6/site-packages/urllib3/ InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See:
[2019-02-21 16:15:05,242] {} INFO - [2019-02-21 16:15:05,241] {} INFO - Task exited with return code 0
I was having the same problem with HTTP/HTTPS when trying to set the connections using environment variables (although it works when I set the connection in the UI).
I've checked the issue @melchoir55 opened, and you don't need a custom operator for that. The problem is not that HttpHook or HttpOperator can't use HTTPS; the problem is the way get_hook parses the connection string when dealing with HTTP: it treats the first part (http:// or https://) as the connection type.
In summary, you don't need a custom operator, you can just set the connection in your env as the following:
Instead of:
Or set the connection on the UI.
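To illustrate the parsing issue described above, here is a sketch using only Python's urllib, not Airflow's own parser. It shows how a URI-style connection string splits into a scheme (which ends up as the connection type) and a host, and how a percent-encoded scheme inside the host part survives the split. The host example.com and the helper name split_conn_uri are made up, and whether a given Airflow version decodes the host this way is an assumption you should verify.

```python
from urllib.parse import unquote, urlsplit


def split_conn_uri(uri):
    """Split a connection URI into (conn_type, host), decoding %-escapes in the host."""
    parts = urlsplit(uri)
    conn_type = parts.scheme  # the leading http:// or https:// is read as the type
    host = unquote(parts.hostname or "")  # "https%3a%2f%2f..." becomes "https://..."
    return conn_type, host
```

With a plain https://example.com/ the scheme is consumed as the connection type and the host is left without one, whereas a host that carries an encoded https:// keeps its scheme after decoding.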
It is not an intuitive way to set up a connection, but I think they are working on a better way to parse connections for Airflow 2.0.
In Airflow 2.x you can use https URLs by passing https as the schema value when setting up your connection, and you can still use SimpleHttpOperator as shown below.
my_api = SimpleHttpOperator(
headers={"Content-Type": "application/json"},
Instead of implementing HttpsHook, we could just add one line to the HttpsOperator(SimpleHttpOperator) shown above, as follows:
self.extra_options['verify'] = True
response = http.run(self.endpoint, self.data, self.headers, self.extra_options)
In Airflow 2, the problem has been resolved.
Just check that:
the host name in the Connection UI form does not end with /
the 'endpoint' parameter of SimpleHttpOperator starts with /
I am using Airflow 2.1.0, and the following settings work for an HTTPS API.
In the connection UI, set the host name as usual; there is no need to specify 'https' in the schema field. Don't forget to set the login and password if your API server requires them.
Connection UI Setting
When constructing your task, add the extra_options parameter to SimpleHttpOperator and put the path to your CA bundle certificate file as the value for the key verify. If you don't have a certificate file, use False to skip verification.
Task definition
Reference: here

Apache: just rewrite if external resource exists

I use Apache as a reverse proxy. There is no web content on the dedicated server itself. If a client requests a resource on the local Apache server, Apache should determine on which remote (proxied) server the resource exists and do a proxy rewrite to that server.
A snippet (that currently does not work) should demonstrate what I want to do:
RewriteCond{REQUEST_URI} -U
RewriteRule ^(.*)$$1 [P]
I left out the rest of my configuration (ProxyPass, ProxyPassReverse, other RewriteCond, ...) to focus on my problem:
How could I check if an external resource exists / is available before rewriting?
The -U option for RewriteCond always returns true. The -F option always returns false. Is there a working solution for what I intend?
After searching for weeks for a solution, I have come to the conclusion that there is no reliable way for a RewriteRule to test whether an external resource exists.
You are much better off addressing each service behind the reverse proxy via its own subdomain. E.g. '' if you want to address a resource on your gitlab server behind your reverse proxy. That way the reverse proxy does not get confused if the resource lies in the root directory '/' of the gitlab server.
I had the same problem and, as far as I know, reached the same conclusion: it is not possible to do this using only Apache httpd directives (at least with version 2.2).
In my solution I did it using a RewriteMap and a PHP script able to check if the external resource exists.
In this example, when a new request comes in, the RewriteMap checks whether the requested path exists on Server A and, if it is found, the request is reverse proxied to that server.
On the other hand, if the requested path is not found on Server A, a second rewrite rule reverse proxies the request to Server B.
As said, I have used a RewriteMap with MapType prg: and a PHP script.
Here are the Apache directives:
# Please pay attention to RewriteLock
# this directive must be defined in server config context
RewriteLock /tmp/if_url_exists.lock
RewriteEngine On
ProxyPreserveHost Off
ProxyRequests Off
RewriteMap url_exists "prg:/usr/bin/php /opt/local/scripts/url_exists.php"
RewriteCond ${url_exists:http://serverA%{REQUEST_URI}} >0
RewriteRule . http://serverA%{REQUEST_URI} [P,L]
RewriteRule . http://serverB%{REQUEST_URI} [P,L]
Here comes the interesting and tricky part.
This is the url_exists.php script, executed by Apache. It waits on the standard input stream and writes to standard output.
The script returns 1 if the resource is found and readable, otherwise 0.
It is lightweight because it only issues an HTTP request using the HEAD method.
<?php
// Return true if a HEAD request to the given URL succeeds.
function check_if_url_exists($line) {
    $curl_inst = curl_init($line);
    curl_setopt($curl_inst, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($curl_inst, CURLOPT_LOW_SPEED_LIMIT, 1);
    curl_setopt($curl_inst, CURLOPT_LOW_SPEED_TIME, 180);
    curl_setopt($curl_inst, CURLOPT_HEADER, true);
    curl_setopt($curl_inst, CURLOPT_FAILONERROR, true);
    // Exclude the body from the output and request method is set to HEAD.
    curl_setopt($curl_inst, CURLOPT_NOBODY, true);
    curl_setopt($curl_inst, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl_inst, CURLOPT_RETURNTRANSFER, true);
    $raw = curl_exec($curl_inst);
    return ($raw != false) ? true : false;
}

// Apache writes one lookup key per line to stdin and reads the answer from stdout.
$keyboard = fopen("php://stdin", "r");
while (true) {
    $line = trim(fgets($keyboard));
    if (!empty($line)) {
        $str = (check_if_url_exists($line)) ? "1" : "0";
        echo $str."\n";
    }
}

how to skip some file type while crawling with scrapy?

I want to skip links to some file types (.exe, .zip, .pdf) while crawling with Scrapy, but I don't want to use a Rule with a specific URL regex. How?
Since it's hard to decide whether to follow a link based only on the Content-Type of the response when the body hasn't been downloaded, I switched to dropping URLs in a downloader middleware. Thanks, Peter and Leo.
If you go to within the Scrapy root directory, you will see the following:
Common code and definitions used by Link extractors (located in
# common file extensions that are not followed if they occur in links
# images
'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif',
'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg',
# audio
'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',
# video
'3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv',
# other
'css', 'pdf', 'doc', 'exe', 'bin', 'rss', 'zip', 'rar',
However, since this applies to the linkextractor (and you don't want to use Rules), I am not sure that this will solve your problem (I just realized you specified that you didn't want to use Rules. I thought you had asked how to change the file-extension restrictions without needing to specify directly in a rule).
The good news is, you can also build your own downloader middleware and drop any/all requests to URLs which have an undesirable extension. See Downloader Middleware.
You can get the requested URL by accessing the request object's url attribute as follows: request.url
Basically, check whether the URL ends with '.exe' or whatever extension you want to drop, and if it does, raise an IgnoreRequest exception, and the request will immediately be dropped.
In order to process the request prior to it being downloaded, you need to make sure you define the 'process_request' method within your custom downloader middleware.
According to the Scrapy documentation
This method is called for each request that goes through the download
process_request() should return either None, a Response object, or a
Request object.
If it returns None, Scrapy will continue processing this request,
executing all other middlewares until, finally, the appropriate
downloader handler is called the request performed (and its response
If it returns a Response object, Scrapy won’t bother calling ANY other
request or exception middleware, or the appropriate download
function; it’ll return that Response. Response middleware is always
called on every Response.
If it returns a Request object, the returned request will be
rescheduled (in the Scheduler) to be downloaded in the future. The
callback of the original request will always be called. If the new
request has a callback it will be called with the response
downloaded, and the output of that callback will then be passed to the
original callback. If the new request doesn’t have a callback, the
response downloaded will be just passed to the original request
If it returns an IgnoreRequest exception, the entire request will be
dropped completely and its callback never called.
So essentially, just create a downloader middleware class and add a process_request method, which takes a request object and a spider object as parameters. Then raise the IgnoreRequest exception if the URL contains an unwanted extension.
This should all occur prior to the page being downloaded. However, if you want to process the response headers instead, then a request will have to be made to the webpage.
You could always implement both a process_request and a process_response method in the downloader middleware, the idea being that obvious extensions are dropped immediately and then, if for some reason the URL did not contain the file extension, the request would be processed and caught in the process_response method (since you could verify the type in the headers).
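A minimal sketch of such a downloader middleware might look like the following. The class name, the helper has_blocked_extension and the extension set are made up for the example; the fallback IgnoreRequest class exists only so the snippet can be tried without Scrapy installed.

```python
import posixpath
from urllib.parse import urlparse

try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:  # fallback so the sketch runs without Scrapy
    class IgnoreRequest(Exception):
        pass

# Hypothetical list; extend it with whatever you want to skip.
BLOCKED_EXTENSIONS = {".exe", ".zip", ".pdf"}


def has_blocked_extension(url, blocked=BLOCKED_EXTENSIONS):
    """True if the path part of the URL ends in a blocked extension."""
    path = urlparse(url).path  # ignores the query string and fragment
    return posixpath.splitext(path)[1].lower() in blocked


class DropByExtensionMiddleware:
    def process_request(self, request, spider):
        if has_blocked_extension(request.url):
            raise IgnoreRequest("blocked extension: " + request.url)
        return None  # let Scrapy continue processing the request
```

Checking only the path, rather than the end of the whole URL string, keeps a link like setup.exe?x=1 from slipping through. The class would still need to be enabled in the project's downloader middleware settings.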
.zip and .pdf are ignored by scrapy by default.
As a general rule you can either configure a rule to include only urls that match your regexp (.htm* in this case):
rules = (Rule(SgmlLinkExtractor(allow=('\.htm')), callback='parse_page', follow=True, ), )
or exclude the ones that match a regexp:
rules = (Rule(SgmlLinkExtractor(allow=('.*'), deny=('\.pdf', '\.zip')), callback='parse_page', follow=True, ), )
Read the documentation for more information.
I built this Middleware to exclude any response type that isn't in a whitelist of regular expressions:
from scrapy.http.response.html import HtmlResponse
from scrapy.exceptions import IgnoreRequest
from scrapy import log
import re
class FilterResponses(object):
    """Limit the HTTP response types that Scrapy downloads."""

    @staticmethod
    def is_valid_response(type_whitelist, content_type_header):
        # True if any whitelist regex matches the Content-Type header
        for type_regex in type_whitelist:
            if re.search(type_regex, content_type_header):
                return True
        return False

    def process_response(self, request, response, spider):
        """
        Only allow HTTP response types that match the given list of
        filtering regexes.
        """
        # to specify on a per-spider basis
        # type_whitelist = getattr(spider, "response_type_whitelist", None)
        type_whitelist = (r'text', )
        content_type_header = response.headers.get('content-type', None)
        if not content_type_header or not type_whitelist:
            return response
        if self.is_valid_response(type_whitelist, content_type_header):
            return response
        msg = "Ignoring request {}, content-type was not in whitelist".format(response.url)
        log.msg(msg, level=log.INFO)
        raise IgnoreRequest()
To use it, add it to
'[project_name].middlewares.FilterResponses': 999,

`open_http': 403 Forbidden (OpenURI::HTTPError) for the string "Steve_Jobs" but not for any other string

I was going through the Ruby tutorials provided at and I encountered the following code:
require "open-uri"
remote_base_url = ""
r1 = "Steve_Wozniak"
r2 = "Steve_Jobs"
f1 = "my_copy_of-" + r1 + ".html"
f2 = "my_copy_of-" + r2 + ".html"
# read the first url
remote_full_url = remote_base_url + "/" + r1
rpage = open(remote_full_url).read
# write the first file to disk
file = open(f1, "w")
# read the second url
remote_full_url = remote_base_url + "/" + r2
rpage = open(remote_full_url).read
# write the second file to disk
file = open(f2, "w")
# open a new file:
compiled_file = open("apple-guys.html", "w")
# reopen the first and second files again
k1 = open(f1, "r")
k2 = open(f2, "r")
The code fails with the following trace:
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:277:in `open_http': 403 Forbidden (OpenURI::HTTPError)
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:616:in `buffer_open'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:162:in `catch'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:518:in `open'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:30:in `open'
from /Users/arkidmitra/tweetfetch/samecode.rb:11
My problem is not that the code fails but that whenever I change r2 to anything other than Steve_Jobs, it works. What is happening here?
Your code runs fine for me (Ruby MRI 1.9.3) when I request a wiki page that exists.
When I request a wiki page that does NOT exist, I get a MediaWiki 404 error code.
Steve_Jobs => success
Steve_Austin => success
Steve_Rogers => success
Steve_Foo => error
Wikipedia does a ton of caching, so if you see responses for "Steve_Jobs" that differ from those for other people who do exist, my best guess is that Wikipedia is caching the Steve Jobs article because he's famous, and potentially adding extra checks/verifications to protect the article from rapid changes, defacement, etc.
The solution for you: always open the url with a User Agent string.
rpage = open(remote_full_url, "User-Agent" => "Whatever you want here").read
Details from the Mediawiki docs: "When you make HTTP requests to the MediaWiki web service API, be sure to specify a User-Agent header that properly identifies your client. Don't use the default User-Agent provided by your client library, but make up a custom header that includes the name and the version number of your client: something like "MyCuteBot/0.1".
On Wikimedia wikis, if you don't supply a User-Agent header, or you supply an empty or generic one, your request will fail with an HTTP 403 error. See our User-Agent policy."
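The same idea carries over to other HTTP clients. As a comparison only, here is a Python sketch that attaches a custom User-Agent to a request with urllib; the helper name build_request is made up, and "MyCuteBot/0.1" is the placeholder client name from the MediaWiki text quoted above.

```python
import urllib.request


def build_request(url, user_agent="MyCuteBot/0.1"):
    """Build a request that identifies the client with a custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

# The page would then be fetched with:
#     urllib.request.urlopen(build_request(remote_full_url)).read()
```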
I think this happens for locked down entries like "Steve Jobs", "Al-Gore" etc. This is specified in the same book that you are referring to:
For some pages – such as Al Gore's locked-down entry – Wikipedia will
not respond to a web request if a User-Agent isn't specified. The
"User-Agent" typically refers to your browser, and you can see this by
inspecting the headers you send for any page request in your browser.
By providing a "User-Agent" key-value pair, (I basically use "Ruby"
and it seems to work), we can pass it as a hash (I use the constant
HEADERS_HASH in the example) as the second argument of the method
It is specified later at

htaccess internal and external request distinction

I have a problem with an .htaccess file. I've tried googling but could not find anything helpful.
I have an AJAX request loading pages into the index.php. The link triggering it is getting prepended by "#" via jquery. So if you click on the link (a wordpress permalink) you get in the browser and the content will get loaded via AJAX.
My problem is: Since these are blog posts, external links grab the real link (, so I want them to get redirected to (cause then ajax checks the hash and does its magic).
Example here.
The jquery code for the prepend is:
$allLinks.each(function() {
$(this).attr('href', '#' + this.pathname);
and then the script checks
if (hash) { //we know what we want, the url is not the home page!
hash = hash.substring(1);
URL = 'http://' + + hash;
var $link = $('a[href="' + URL + '"]'), // find the link of the url
Now I am trying to get the redirect to work with htaccess. I need to check if the request is external or internal
RewriteCond %{REMOTE_HOST} !^127\.0\.0\.1 #???
and whether the URI starts with "/#/", which is a problem because # starts a comment in .htaccess, and escaping it as \%23 does not seem to work either.
RewriteCond %{REQUEST_URI} !^/\%23/(.*)$ #???
How do I get this to work to simply redirect an external request from to without affecting the internal AJAX stuff?
I suppose your $allinks variable is assigned in a fashion similar to this:
$allinks = $('a');
Do this instead:
$allinks = $('a[href^="' + document.location.protocol + '//' + document.location.hostname + '"]');
This will transform internal links to your hash-y style only.
OK, I've done it with PHP. Here is the code:
$path = $_SERVER["REQUEST_URI"];
if(isset($_SERVER['HTTP_X_REQUESTED_WITH']) && strtolower($_SERVER['HTTP_X_REQUESTED_WITH']) == 'xmlhttprequest') {
    echo "It's ajax";
} else {
    if(strpos($path, '/#/') === false) {
        header("Location:".$path); //ONLY WORKS IF THERE IS NO BODY TAG
    }
}
There is surely a better solution, but this does the trick for now. Since the page /foo/bar does not, in my case, include header.php, there is no <body> tag and the PHP header() function works. If anyone knows the htaccess rules for this, I am keen to learn them.
