Squid ICAP Rewriting Content - proxy

I have a need for the following:
Web App at http://mything.com/.
Reverse Proxy Server at http://mything.com/proxy/
Reverse Proxy should proxy: http://mything.com/proxy/http://bbc.co.uk/ for example.
The Reverse Proxy should rewrite all links, src, href, and so forth in the source page (http://bbc.co.uk) to rebase them from http://mything.com/proxy/$1.
We have a 90% working implementation in Ruby, developed as a Rack app, a series of 8 middlewares that do various jobs:
Rewrite cookie domain
Inject a javascript chunk from our site
Rewrite links to fit the proxy.
Admittedly our Ruby version isn't bad, but it takes about one second to process a page, which is shockingly slow and simply won't do. It gets an order of magnitude worse when we bring SSL into the mix, and management are talking about stripping out SSL entirely and working only over HTTP. (Which is naturally short-sighted and incredibly foolish.)
This kind of problem simply has to have been more widely solved by tools such as Squid and protocol drafts such as ICAP.
I'd like to move away from our home grown solution as it's a source of constant pain, and it's slow.
I've just come to learn about Squid's Content Adaptation function, but haven't been able to find a decent way to make it work.
I can see the following problems which I haven't been able to solve:
How to extract the URL component from the original request URI?
i.e. how can I take .../proxy/(.*) and use that as the backend address?
How to rewrite cookies in an ICAP program?
i.e. does the ICAP content adaptation program receive the full request?
How to rewrite URLs, links, etc. with ICAP? (Squirm looks promising, but it seems things have to be hard-coded; see point #1.)
i.e. are there any smarter tools than simply using regexes, etc.?
I know this question is at risk of being flagged "demonstrates no evidence of research", but I'm at the limits of what I've been able to ascertain without investing a number of days (or so it would appear) to learn the intricacies of Squid, when the only thing I need is to rewrite some content!
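For what it's worth, the extraction and rebasing steps themselves are small; this is a minimal Python sketch (hostnames taken from the question above, and a naive regex standing in for whatever Squirm or an ICAP service would actually do), not a production rewriter:

```python
import re

PROXY_PREFIX = "/proxy/"

def extract_target(request_path):
    """Pull the backend URL out of a path like /proxy/http://bbc.co.uk/news."""
    if request_path.startswith(PROXY_PREFIX):
        return request_path[len(PROXY_PREFIX):]
    return None

def rebase_links(html, proxy_base="http://mything.com/proxy/"):
    """Rewrite absolute href/src attributes so they point back through the proxy."""
    return re.sub(
        r'(href|src)=(["\'])(https?://[^"\']+)\2',
        lambda m: f"{m.group(1)}={m.group(2)}{proxy_base}{m.group(3)}{m.group(2)}",
        html,
    )
```

A real implementation would also need to handle relative links and cookie domains, but the core transform is no more than this.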

Related

MVC Routing - Using NGINX to route instead of PHP (or Ruby, etc.)

I'm currently messing around with making a custom MVC framework for educational (and, if it's good, actual practical) use, and I like to investigate different scenarios for possible performance boosts.
When it comes to URI routing, I'm familiar with the standard URI format of
/controller/action/id
And parsing the data out of this to control the routing wouldn't be too difficult. Now, what I'm more curious about is the performance difference between having nginx parse this URI string into a query string of some type to pass to a controller directly, which would end up like
/foo/bar/12 => /application/foo.php?action=bar&id=12
instead of
/foo/bar/12 => /index.php?controller=foo&action=bar&id=12
or even
/foo/bar/12 => /index.php?uri=/foo/bar/12 (note that this would be encoded)
I'm aware that nginx passes the url, query string, and other things to php-fpm in other variables already, but this is just for illustrative purposes to show what I'm thinking.
Is this a stupid thing to do? I know that defining routes explicitly in nginx would mean needing to restart nginx every time I alter the routes in the config, which could be a downside.
So, to restate the question: When it comes to MVC routing, is there any worthwhile performance gain by having the actual webserver (in this case, nginx) itself handle the routing to the controller OR is using a standard landing script (like index.php in the root of the directory) and passing along the URI to be parsed for routing perfectly fine?
Thanks ahead of time. Also, I'm just learning about these things, so I wholeheartedly welcome suggestions on what I should be doing instead.
I wouldn't mix application logic (URL routing) into your HTTP server. Lots of PHP apps used to rely on Apache .htaccess files for this sort of thing. It ends up being a mess.
As you mention, it would require restarting Nginx to change routes, and it would also tie your application to Nginx, unless you wanted to rewrite all your rules for another HTTP server at some future date. Even worse, if you decide to scale your app out over more than one server, you'll have to repeat these rules for each upstream.
tl;dr Keep your layers separate.
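A minimal sketch of the usual alternative, where nginx acts only as a front controller and all routing stays in the application (the socket path and script name here are illustrative):

```nginx
# Hand every non-file request to a single entry point; the PHP app
# does its own routing from the request URI.
location / {
    try_files $uri $uri/ /index.php?$args;
}

location ~ \.php$ {
    include fastcgi_params;
    fastcgi_pass unix:/var/run/php-fpm.sock;  # illustrative socket path
}
```

With this layout, changing routes is a code deploy, not an nginx restart.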

Crawlers/SEO Friendly/Mod Rewrite/It doesn't make any sense

So I am attached to this rather annoying project where a client's client is all nitpicky about the little things, and he's giving my guy hell, who is gladly returning the favor by following the good old rule of shoving shi* down the chain of command.
Now my question. The application consists basically of 3 different mini projects. The backend interface for the administrator, backend interface for the client and the frontend for everyone.
I was specifically asked to apply MOD_REWRITE rules to make things SEO friendly. That was the ultimate aim, so this was basically an exercise in making things more search friendly rather than making the links aesthetically better looking.
So I worked on the frontend, which is basically the landing page for everyone. It looks beautiful; the links at worst contain a single slash.
My client's issue: he wants to know why the backend interfaces for the admin and user are still displaying those gigantic ugly links. And these are very, very ugly links; I am talking three to four slashes followed by various GET sequences and whatnot, so you can probably understand the complexity behind MOD_REWRITING something like this.
In the spur of the moment I said that I left it the way it was to make sure the backend interface wouldn't be sniffed up by any crawlers.
But I am not sure if that's necessarily true. Where do crawlers stop? When do they give up on trying to parse links? I know I can use a robots.txt file to specify rules. But, as indigenous creatures, what are their instincts?
I know this is more of a rant than anything and I am running a very high risk of having my first question rejected :| But hey, it feels good to have this off my chest.
Cheers!
Where do crawlers stop? When do they give up on trying to parse links?
Robots.txt does not work for all bots.
You can use basic authentication or limit access by IP to hide the back-end, if no back-end files are needed for the front-end.
If that's not practicable, try sending 404 or 401 headers for back-end files. But this is just an idea, no guarantee.
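As a sketch of the basic-auth suggestion, a hypothetical .htaccess for the back-end directory might look like this (the file path and network range are illustrative):

```apache
# Everyone -- crawlers included -- gets a 401 unless they authenticate.
AuthType Basic
AuthName "Admin area"
AuthUserFile /path/to/.htpasswd
Require valid-user

# Or restrict by address instead (Apache 2.2 syntax):
# Order deny,allow
# Deny from all
# Allow from 192.168.0.0/24
```

Unlike robots.txt, this actually blocks access rather than politely asking.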
But, as indigenous creatures, what are their instincts?
Hyperlinks, toolbars, and browser-side, pre-activated functions for malware, spam, and fraud warnings...

How effective is ajaxcrawling compared to serverside created website SEO?

I'm looking for real world experiences in regards to ajaxcrawling:
http://code.google.com/web/ajaxcrawling/index.html
I'm particularly concerned about the infamous Gizmodo failure of late. I know I can find them via Google now, but it's not clear to me how effective this method of ajaxcrawling is compared to server-side generated sites.
I would like to make a wiki that lives mostly on the client side and is populated by AJAX/JSON. It just feels more fluid, and I think it would be a plus point over my competition (Wikipedia, Wikimedia).
Obviously, for a wiki it's incredibly important to have working SEO.
I would be very happy for any experiences you have had dealing with clientside development.
My research shows that the general consensus on the web right now is, that you should absolutely avoid doing ajax sites unless you don't care about SEO (for example, a portfolio site, a corporate site etc).
Well, these SEO problems arise when you have a single page that loads content dynamically based on sophisticated client-side behavior. Spiders aren't always smart enough to know when JavaScript is being injected, so if they can't follow links to get to your content, most of them won't understand what's going on in a predictable way, and thus won't be able to fully index your site.
If you have the option of unique URLs that lead to static content, even if they all route back to a single page by a URL rewriting scheme, that could solve the problem. It will also yield huge benefits down the road when you've got a lot of traffic: the whole page can be cached at the web server/proxy level, leading to less load on your servers.
Hope that helps.
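For reference, the scheme linked in the question works by mapping "#!" fragments onto an _escaped_fragment_ query parameter that crawlers request instead of the pretty URL. A simplified Python sketch of that mapping (ignoring the spec's finer escaping rules):

```python
from urllib.parse import quote

def crawlable_url(pretty_url):
    """Map a #! URL to the _escaped_fragment_ form a crawler would fetch,
    per Google's AJAX-crawling scheme (simplified)."""
    if "#!" not in pretty_url:
        return pretty_url
    base, fragment = pretty_url.split("#!", 1)
    sep = "&" if "?" in base else "?"
    return f"{base}{sep}_escaped_fragment_={quote(fragment, safe='')}"
```

The server is then expected to return a static HTML snapshot of the page for the _escaped_fragment_ form.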

Mixing Secure and Non-Secure Content on Web Pages - Is it a good idea?

I'm trying to come up with ways to speed up my secure web site. Because there are a lot of CSS images that need to be loaded, it can slow down the site since secure resources are not cached to disk by the browser and must be retrieved more often than they really need to.
One thing I was considering is perhaps moving style-based images and javascript libraries to a non-secure sub-domain so that the browser could cache these resources that don't pose a security risk (a gradient isn't exactly sensitive material).
I wanted to see what other people thought about doing something like this. Is this a feasible idea or should I go about optimizing my site in other ways like using CSS sprite-maps, etc. to reduce requests and bandwidth?
Browsers (especially IE) get jumpy about this and alert users that there's mixed content on the page. We tried it and had a couple of users call in to question the security of our site. I wouldn't recommend it. Having users lose their sense of security when using your site is not worth the added speed.
Do not mix content; there is nothing more annoying than having to go and click the Yes button on that dialog. I wish IE would let me always choose to show mixed content. As Chris said, don't do it.
If you want to optimize your site, there are plenty of ways; if SSL is the only bottleneck left, buy a hardware accelerator. Hmm, if you load an image using HTTP, will it be cached when you load it with HTTPS? Just a side question that I need to go find out.
Be aware that in IE 7 there are issues with mixing secure and non-secure items on the same page, so this may result in some users not being able to view all the content of your pages properly. Not that I endorse IE 7, but recently I had to look into this issue, and it's a pain to deal with.
This is not advisable at all. The reason browsers give you such trouble about insecure content on secure pages is that it exposes information about the current session and leaves you vulnerable to man-in-the-middle attacks. I'll grant there probably isn't much a third party could do to sniff sensitive info if the only insecure content is images, but CSS can contain references to JavaScript/VBScript via behavior files (IE). If your JavaScript is served insecurely, there isn't much that can be done to prevent a rogue script scraping your webpage at an inopportune time.
At best, you might be able to get away with iframing the secure content to keep the look and feel. As a consumer I really don't like it, but as a web developer I've had to do it before, due to no other pragmatic options. Frankly, though, there are just as many if not more defects with that approach: you're trusting that nothing violates the integrity of the insecure page that hosts the secure content.
It's just not a great idea from a security perspective.

websites urls without file extension?

When I look at Amazon.com and I see their URL for pages, it does not have .htm, .html or .php at the end of the URL.
It is like:
http://www.amazon.com/books-used-books-textbooks/b/ref=topnav_storetab_b?ie=UTF8&node=283155
Why and how? What kind of extension is that?
Your browser doesn't care about the extension of the file, only the content type that the server reports. (Well, unless you use IE because at Microsoft they think they know more about what you're serving up than you do). If your server reports that the content being served up is Content-Type: text/html, then your browser is supposed to treat it like it's HTML no matter what the file name is.
Typically, it's implemented using a URL rewriting scheme of some description. The basic notion is that the web should be moving to addressing resources with proper URIs, not classic old URLs which leak implementation detail, and which are vulnerable to future changes as a result.
A thorough discussion of the topic can be found in Tim Berners-Lee's article Cool URIs Don't Change, which argues in favour of reducing the irrelevant cruft in URIs as a means of helping to avoid the problems that occur when implementations do change, and when resources do move to a different URL. The article itself contains good general advice on planning out a URI scheme, and is well worth a read.
More specifically than most of these answers:
Web browsers don't use the file extension to determine what kind of file is being served (unless you're Internet Explorer). Instead, they use the Content-type HTTP header, which is sent down the wire before the content of the image, HTML page, download, or whatever. For example:
Content-type: text/html
denotes that the page you are viewing should be interpreted as HTML, and
Content-type: image/png
denotes that the page is a PNG image.
Web servers often use the file extension, when the file is served directly from disk, to determine what Content-type to assign, but web applications can also generate pages with any Content-type they like in response to a request. No matter the filename's structure or extension, as long as the actual content of the page matches the declared Content-type, the data renders as intended.
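As a toy illustration of that point, here is a hypothetical Python handler that serves HTML from any path, extension or not; the Content-Type header alone tells the browser how to treat it:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def make_response():
    """Build headers and body for an extensionless HTML page."""
    body = b"<h1>No .html needed</h1>"
    headers = {
        "Content-Type": "text/html",  # this, not the URL, decides rendering
        "Content-Length": str(len(body)),
    }
    return headers, body

class NoExtensionHandler(BaseHTTPRequestHandler):
    """Answer any GET, whatever the path, with the same HTML response."""
    def do_GET(self):
        headers, body = make_response()
        self.send_response(200)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)

# To run it: HTTPServer(("localhost", 8000), NoExtensionHandler).serve_forever()
```

Requesting /anything, /anything.JamesRocks, or /books-used-books-textbooks all renders as HTML, because the header says so.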
For websites that use Apache, they are probably using mod_rewrite, which enables them to rewrite URLs (and make them more user- and SEO-friendly).
You can read more here http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html
and here http://www.sitepoint.com/article/apache-mod_rewrite-examples/
EDIT: There are rewriting modules for IIS as well.
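As a sketch, a typical extensionless-URL setup in .htaccess looks something like this (the front-controller script name is illustrative):

```apache
# Send any request that isn't a real file or directory to a single
# script, passing the original path along as a parameter.
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?uri=$1 [L,QSA]
```

The application then parses the uri parameter and decides what to serve.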
Traditionally the file extension represents the file that is being served.
For example
http://someserver/somepath/image.jpg
Later, the same approach was used to let a script process parameters:
http://someserver/somepath/script.php?param=1234&other=7890
In this case the file was a PHP script that processed the "request" and produced a dynamically created page.
Nowadays, applications are much more complex than that (namely Amazon, which you mentioned).
There is no longer a single script that handles the request (but a much more complex app with several files/methods/functions/objects, etc.), and the URL is more like the entry point for a web application (it may have a script behind it, but that's another matter). So now web apps like Amazon, and yes Stack Overflow, don't show a file in the URL; anything coming in is processed by the app on the server side.
For example:
http://stackoverflow.com/questions/322747/websites-urls-without-file-extension
Here "questions" represents the web app and 322747 the parameter.
I hope this little explanation helps you to better understand all the other answers.
Well, how about having an index.html file in the directory and then typing the path into the browser? I see that my Firefox and IE7 both add the trailing slash automatically; I don't have to type it. This is more suited to people like me who don't think every single URL on earth should invoke PHP, Perl, CGI, and 10,000 other applications just in order to send a few kilobytes of data.
A lot of people are using a more "RESTful" type of architecture... or at least REST-looking URLs.
This site (StackOverflow) doesn't show a file extension... it's using ASP.NET MVC.
Depending on the settings of your server, you can use (or not use) any extension you want. You could even set extensions to be ".JamesRocks", but it won't be very helpful :)
Anyway, in case you're new to web programming: all that gibberish at the end consists of arguments to a GET operation, not the page's extension.
A number of posts have mentioned this, and I'll weigh in. It absolutely is a URL rewriting system, and a number of platforms have ways to implement this.
I've worked for a few larger ecommerce sites, and it is now a very important part of the web presence, and offers a number of advantages.
I would recommend taking the technology you want to work with and researching samples of the URL rewriting mechanism for that platform. For .NET, for example, google 'asp.net url rewriting' or use an add-on framework like MVC, which provides this functionality out of the box.
In Django (a web application framework for python), you design the URLs yourself, independent of any file name, or even any path on the server for that matter.
You just say something like "I want /news/<number>/ urls to be handled by this function"
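To illustrate the idea without pulling in Django itself, here is a toy Python sketch of that style of routing (the pattern and handler names are hypothetical, and this is not the actual Django API):

```python
import re

def news_detail(article_id):
    """Hypothetical view function; the router calls it with the captured value."""
    return f"news article {article_id}"

# Patterns live in one place and map straight to handler functions,
# independent of any file on disk -- the core of Django-style routing.
urlpatterns = [
    (re.compile(r"^/news/(?P<article_id>\d+)/$"), news_detail),
]

def resolve(path):
    """Return the first matching view's result, or a 404 marker."""
    for pattern, view in urlpatterns:
        match = pattern.match(path)
        if match:
            return view(**match.groupdict())
    return "404"
```

Because URLs are declared rather than derived from filenames, no extension ever appears, and routes can be reshaped without moving any files.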
