Lighttpd configuration, . (dots) in my query string cause 404 - mod-rewrite

I have an address on my site like so:
http://www.example.com/lookup?q=http%3A%2F%2Fgigaom.com%2F2010%2F10%2F10%2Fangry-birds-for-windows7-phone-dont-count-on-it%2F
In this example, the dot in the 'gigaom.com' part of the query string is screwing with lighttpd and my rewrite rules. I get a 404 with the dot in, no 404 if I take the dot out. My rewrite rules are below. In case it makes a difference, I'm using symfony 1.4.
If anyone could shed some light on this problem it would be much appreciated!
url.rewrite-once = (
"^/(.*\..+)$" => "$0",
"^/(.*)\.(.*)" => "/index.php",
"^/([^.]+)$" => "/index.php/$1",
"^/$" => "/index.php"
)
For anyone having trouble with lighttpd and symfony (I know you're out there, cause there are plenty of unresolved threads on the issue) I ended up solving and answering it below.

OK so after much debugging with the help of:
debug.log-request-handling = "enable"
^^ This is a lifesaver for when you're trying to debug rewrite rules in lighttpd! (it logged everything for me to /var/log/lighttpd/error.log)
I've figured it out. For all those people having trouble getting symfony to work with lighttpd (including the dot problem!) here's a working set of rules:
url.rewrite-once = (
"^/(js|images|uploads|css|sf)/(.*)" => "$0", # we want to load these assets as is, without index.php
"^/[a-zA-Z_-]+\.(html|txt|ico)$" => "$0", # for any static .html files you might be calling in your web root, we don't want to put the index.php controller in front of them
"^/sf[A-Za-z]+Plugin.*" => "$0", # don't want to mess with plugin routes
"^/([a-z_]+)\.php(.*)\.(.*)$" => "/$1.php$2.$3", # same concept as rules below, except for other applications/environments (backend.php, backend_dev.php, etc)
"^/([a-z_]+)\.php([^.]*)$" => "/$1.php$2", # see comment right above this one
"^/(.*)\.(.*)$" => "/index.php/$1.$2", # handle query strings and the dot problem!
"^/([^.]+)$" => "/index.php/$1", # general requests
"^/$" => "/index.php" # the home page
)
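To see why the dot matters, here is a rough Python sketch of first-match-wins rewriting. It assumes lighttpd matches each pattern against the path plus the query string (which is what the behaviour above suggests), and it uses Python's `re` rather than PCRE, so treat it as an approximation. `rewrite_once`, `OLD_RULES` and `NEW_RULES` are names made up for this example, and `NEW_RULES` is only a subset of the rules above.

```python
import re

# Rules from the question: (pattern, target). The first matching rule wins.
OLD_RULES = [
    (r"^/(.*\..+)$", "$0"),
    (r"^/(.*)\.(.*)", "/index.php"),
    (r"^/([^.]+)$", "/index.php/$1"),
    (r"^/$", "/index.php"),
]

# A subset of the working rules from the answer.
NEW_RULES = [
    (r"^/(js|images|uploads|css|sf)/(.*)", "$0"),
    (r"^/(.*)\.(.*)$", "/index.php/$1.$2"),
    (r"^/([^.]+)$", "/index.php/$1"),
    (r"^/$", "/index.php"),
]

def rewrite_once(url, rules):
    """Return the rewritten URL for the first rule whose pattern matches."""
    for pattern, target in rules:
        m = re.search(pattern, url)
        if m:
            # Expand $0..$9 references in the target with the captured groups.
            return re.sub(r"\$(\d)", lambda ref: m.group(int(ref.group(1))) or "", target)
    return url  # no rule matched, URL is left alone

url = "/lookup?q=http%3A%2F%2Fgigaom.com%2F2010%2F"
print(rewrite_once(url, OLD_RULES))  # unchanged: treated like a static file, hence the 404
print(rewrite_once(url, NEW_RULES))  # now routed through /index.php
```

With the old rules the dot in the query string makes the first "static file" rule fire, so the request never reaches index.php. With the new rules the same URL falls through to the dot-handling rule.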
If anyone has more trouble, post here. Thanks!

I am using PHP, MySQL, and .htaccess file for rewriting rules. Just sharing so others can also benefit.
My previous rule:
RewriteRule ^([^/\.]+)/([^/\.]+)\.html$ detail.php?name=$1&id=$2 [L]
It was working with this result: http://www.sitecliff.com/Yahoo-UK/4.html
I wanted the website name instead of the title in the url. After surfing the net, I figured out that the dot is causing the problem as you mentioned above.
"^/(.*)\.(.*)$" => "/index.php/$1.$2", # handle query strings and the dot problem!
So I changed the rule to:
RewriteRule ^([^/]+)$ detail.php?url=$1 [L]
After this, I got my desired result: http://www.sitecliff.com/yahoo.com
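A quick way to check the difference between the two patterns is to try them outside Apache. This is only a sketch in Python (its `re` syntax is close to, but not the same as, Apache's PCRE); `OLD_PATTERN`, `NEW_PATTERN` and `detail_query` are hypothetical names for this example.

```python
import re

# Old rule: two segments, neither of which may contain a slash or a dot.
OLD_PATTERN = re.compile(r"^([^/.]+)/([^/.]+)\.html$")
# New rule: a single segment that may contain dots, but no slash.
NEW_PATTERN = re.compile(r"^([^/]+)$")

def detail_query(path):
    """Mimic the new RewriteRule: map the path to detail.php?url=..."""
    m = NEW_PATTERN.match(path)
    return f"detail.php?url={m.group(1)}" if m else None

print(OLD_PATTERN.match("Yahoo-UK/4.html").groups())
print(OLD_PATTERN.match("yahoo.com"))  # None: the dot is excluded by [^/.]
print(detail_query("yahoo.com"))
```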


Yii basic url rewrite

I am new to PHP, new to MVC, new to Yii, and new to URL rewriting, so I am sorry if I am asking something very basic.
I have hidden index.php (using the .htaccess method discussed in the Yii forums).
In my urlManager, I have this:
'urlFormat'=>'path',
'rules'=>array(
'<controller:\w+>/<id:\d+>'=>'view',
'<controller:\w+>/<action:\w+>/<id:\d+>'=>'<controller>/<action>',
'<controller:\w+>/<action:\w+>'=>'<controller>/<action>'
),
'showScriptName'=>false,
I have 3 files in view/site folder.
'journey',
'invite',
'linkedin'
Now, my home page should redirect to 'journey' action (i.e. should open the 'site/journey.php')
So, I guess, this would be
'/' => 'site/journey'
It works too.
Now, I want 'journey/invite' to invoke the 'invite' action, i.e. open 'site/invite.php',
and 'journey/linkedin' to invoke the 'linkedin' action, i.e. 'site/linkedin.php'.
but,
'journey/invite' => 'site/invite',
'journey/linkedin' => 'site/linkedin'
is not working.
Also, can someone help me understand this,
<controller:\w+>/<id:\d+>
i.e. what is controller in the url, and what does '\w+' mean?
A reference to a guide would help too.
Edited after bool.dev's suggestion:
Changed the code as you said (I tried that earlier too, removing all default rules).
Now my url manager is like,
'/' => 'site/journey',
'journey/invite' => 'site/invite',
'journey/linkedin' => 'site/linkedin',
'<controller:\w+>/<id:\d+>'=>'view',
'<controller:\w+>/<action:\w+>/<id:\d+>'=>'<controller>/<action>',
'<controller:\w+>/<action:\w+>'=>'<controller>/<action>',
But it throws an error
"Warning: require_once(): open_basedir restriction in effect. File(/var/xyz.com/../yii/framework/yii.php) is not within the allowed path(s): (/usr/share/php:/usr/share/pear:/usr/share/php/libzend-framework-php:/var/*/tmp:/var/xyz.com) in /var/xyz.com/journey.php on line 12
Warning: require_once(/var/xyz.com/../yii/framework/yii.php): failed to open stream: Operation not permitted in /var/xyz.com/journey.php on line 12
Fatal error: require_once(): Failed opening required '/var/xyz.com/../yii/framework/yii.php' (include_path='.:/usr/share/php:/usr/share/php/libzend-framework-php') in /var/xyz.com/journey.php on line 12"
when I do xyz.com/journey/invite or even xyz.com/journey
Edit:
It was a permission issue; @bool.dev's suggestion to put the specific rules on top worked :)
These two:
'journey/invite' => 'site/invite',
'journey/linkedin' => 'site/linkedin'
are not working because the url is being matched by a previous rule:
'<controller:\w+>/<action:\w+>'=>'<controller>/<action>'
To prevent that from happening, just make sure that the generic rule is listed last, after any specific (i.e. not matched by regular expressions) rules, so your rules can look somewhat like this:
'journey/invite' => 'site/invite',
'journey/linkedin' => 'site/linkedin',
'<controller:\w+>/<action:\w+>'=>'<controller>/<action>'
Here \w+ matches one or more “word” characters (a-z, A-Z, 0-9, _) and \d+ matches one or more digits (0-9). Check out PHP regular expressions.
Edit
Hadn't read your question thoroughly before, the controller in the rule is simply a name for the matched expression, so you could have had '<contr:\w+>/<act:\w+>'=>'<contr>/<act>'.
Edit2
After reading your edited question with the rules array, afaik you could use ''=>'site/journey' instead of '/'=>'site/journey'.
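The ordering issue described above can be reproduced with a small first-match router. This is not Yii's implementation, just a Python sketch of the idea; `compile_rule` and `route` are hypothetical helpers that turn `<name:regex>` placeholders into named groups and return the route of the first rule that matches.

```python
import re

def compile_rule(pattern):
    """Turn '<name:regex>' placeholders into named groups (hypothetical helper)."""
    regex = re.sub(r"<(\w+):([^>]+)>", r"(?P<\1>\2)", pattern)
    return re.compile("^" + regex + "$")

def route(path, rules):
    """Return the route for the first rule that matches, or None."""
    for pattern, target in rules:
        m = compile_rule(pattern).match(path)
        if m:
            # Fill '<name>' references in the target from the named groups.
            return re.sub(r"<(\w+)>", lambda ref: m.group(ref.group(1)), target)
    return None

GENERIC = (r"<controller:\w+>/<action:\w+>", "<controller>/<action>")
SPECIFIC = ("journey/invite", "site/invite")

print(route("journey/invite", [GENERIC, SPECIFIC]))  # generic rule shadows the specific one
print(route("journey/invite", [SPECIFIC, GENERIC]))  # specific rule wins when listed first
```

With the generic rule first, 'journey/invite' is read as controller "journey" and action "invite", so the specific rule is never reached.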

htaccess - Rewrite to capture friendly URL or querystring

I'm trying to come up with one or more rewrite rules that will take either a friendly url or a url containing the full query string.
The plan is to create a text-only page by reading in the URL using PHP's loadHTML.
For example:
Input
1. http://www.example.com/disclaimer (http://www.example.com/text/disclaimer on text-only version)
2. http://www.example.com/info/aboutus (http://www.example.com/text/info/aboutus on text-only version)
3. http://www.example.com/news?id=123 (http://www.example.com/text/news?id=123 on text-only version)
Output
1. http://www.example.com/includes/textonly.php?page=disclaimer
2. http://www.example.com/includes/textonly.php?page=info/aboutus
3. http://www.example.com/includes/textonly.php?news?id=123
So in textonly.php I would use $_GET['page'] for examples 1) and 2), and $_SERVER['QUERY_STRING'] for example 3).
For example 1) and 2), I came up with:
RewriteRule ^text/(.*) includes/textonly.php?page=$1
And for example 3), I came up with:
RewriteRule ^text/(.[?]) /includes/textonly.php [QSA]
They work independently but not together. Can anyone help?
With guidance from Tom and Michael, this is what I've come up with:
In .htaccess, send everything to PHP and keep the query string:
RewriteRule ^text/(.*) /includes/textonly.php [QSA,L]
Then in PHP:
$page = 'http://'.$_SERVER['HTTP_HOST'].'/'.str_replace('/text/','',$_SERVER['REQUEST_URI']);
Seems to work for both friendly urls (2 levels deep tested so far) and querystrings.
Hopefully it's okay, so I'll go with this as a solution :)
I suggest handing control over to PHP - I wrote an article on this a while ago - http://tomsbigbox.com/elegant-url-rewriting/ - it details how to send the query string to a PHP file that then decides what to do - so if the page exists for example it will load it, otherwise do something else.
I've found that to be the best solution to URL rewriting.
I'd change your rewrite rule to look like this:
RewriteRule ^([^/\.]+)/?([^/\.]+)?/?$ /includes/textonly.php?page=$1&id=$2 [L,NC]
Then slightly modify your Input URLs to look like this:
1. http://www.example.com/disclaimer
2. http://www.example.com/info/aboutus
3. http://www.example.com/news/123
And they would point to these URLs:
1. http://www.example.com/includes/textonly.php?page=disclaimer&id=
2. http://www.example.com/includes/textonly.php?page=info&id=aboutus
3. http://www.example.com/includes/textonly.php?page=news&id=123
The regex above will match anything between the slashes, as long as it contains neither a / nor a dot. The second half is optional, so with that rule you can only go two directories deep.
Controlling all three URLs in the same way will make your logic a little cleaner on the textonly.php page as you do not need to write special logic for the first two URLs compared to the last.
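As a rough check of how that pattern splits the three URLs, here is a Python sketch. It assumes the path is matched without its leading slash, as in a per-directory .htaccess; `RULE` and `textonly_target` are names invented for the example.

```python
import re

# Same pattern as the RewriteRule above; [NC] is approximated with IGNORECASE.
RULE = re.compile(r"^([^/.]+)/?([^/.]+)?/?$", re.IGNORECASE)

def textonly_target(path):
    """Return the rewritten target for a path, or None if the rule does not match."""
    m = RULE.match(path)
    if m is None:
        return None
    page = m.group(1)
    item_id = m.group(2) or ""  # the second segment is optional
    return f"/includes/textonly.php?page={page}&id={item_id}"

for path in ("disclaimer", "info/aboutus", "news/123", "a/b/c"):
    print(path, "->", textonly_target(path))
```

A path with three segments such as a/b/c is not matched at all, which is the "two directories deep" limit mentioned above.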

How to redirect - rewritten urls to short urls

I have changed my site url structure. But, Google indexed urls are giving 404 not found error. Now, I need a .htaccess rewrite rule for,
From url: www.mydomain.com/topic-titles-here-t273.html
To url: www.mydomain.com/sub-folder/topic273.html
(Topic id must be captured and topic title must be removed.)
Some times, like this also,
From url: www.mydomain.com/topic-titles-here-t273-15.html
To url: www.mydomain.com/sub-folder/topic273-15.html
I searched a lot, for about three hours, But couldn't find correct answer. Please help.
I am a little unclear on what you are trying to do, but
RewriteRule ^([a-z-]+)(\d+\.html)$ /sub-folder/topic$2 [NC,L]
would take the second group (just 273.html) and append it to the new path; the first group would be everything before it (topic-titles-here-t).
Is that what you require? If so, the regex could be tidied; I just wanted to demonstrate the two groups.
Update: OK, according to your edit the second group just becomes
(\d{3}-\d+\.html)
If the 273 used in this example can be longer than three digits, amend that number if you know the exact length; otherwise use +.
RewriteRule ^([a-z0-9-]+)t(\d+\.html)$ /redirect/topic$2 [NC,R=301,L]
RewriteRule ^([a-z0-9-]+)t(\d+-\d+\.html)$ /redirect/topic$2 [NC,R=301,L]
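If you want to sanity-check the two patterns before putting them in .htaccess, something like the following Python sketch works. It uses the /sub-folder/ target from the question and assumes the path arrives without its leading slash; `RULES` and `redirect_target` are made-up names.

```python
import re

# (pattern, target template); [NC] is approximated with IGNORECASE.
RULES = [
    (re.compile(r"^([a-z0-9-]+)t(\d+\.html)$", re.IGNORECASE), r"/sub-folder/topic\2"),
    (re.compile(r"^([a-z0-9-]+)t(\d+-\d+\.html)$", re.IGNORECASE), r"/sub-folder/topic\2"),
]

def redirect_target(path):
    """Return the new location for an old-style topic URL, or None if no rule matches."""
    for pattern, template in RULES:
        m = pattern.match(path)
        if m:
            return m.expand(template)  # substitute the second captured group
    return None

print(redirect_target("topic-titles-here-t273.html"))
print(redirect_target("topic-titles-here-t273-15.html"))
print(redirect_target("index.html"))
```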

mod_rewrite newbie needs help with (simple?) rewrite rule

I tried. I failed. Here's what I want to do:
Using firebug I saw that the GET string in the request header for my stylesheets and other content was being munged by the application (which I can't modify). I think a simple rewrite rule might help but I can't get it to work. Here's what I need:
input: /content/2010/08/forum/styles/xyz/theme/normal.css
output: /forum/styles/xyz/theme/normal.css
input: /content/forum/styles/xyz/template/blah.js
output: /forum/styles/xyz/template/blah.js
input: /content/my-own-page/forum/happy.htm
output: /forum/happy.htm
So, whatever comes in, IF it contains "forum/" then get rid of what precedes "forum/" and return everything from "forum/" forward.
Something along these lines should probably do the job.
RewriteRule ^.*/forum/(.*)$ /forum/$1
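To see what that pattern does with the three inputs from the question, here is a small Python sketch (Python's `re`, not mod_rewrite itself; `FORUM_RULE` and `strip_prefix` are names invented for the example).

```python
import re

# Same pattern as the RewriteRule above.
FORUM_RULE = re.compile(r"^.*/forum/(.*)$")

def strip_prefix(path):
    """Drop everything before '/forum/'; leave other paths untouched."""
    m = FORUM_RULE.match(path)
    return "/forum/" + m.group(1) if m else path

for path in (
    "/content/2010/08/forum/styles/xyz/theme/normal.css",
    "/content/forum/styles/xyz/template/blah.js",
    "/content/my-own-page/forum/happy.htm",
):
    print(strip_prefix(path))
```

Because the leading .* is greedy, a path that contains "/forum/" more than once is cut at the last occurrence.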

Remove subdomain from string in ruby

I'm looping over a series of URLs and want to clean them up. I have the following code:
# Parse url to remove http, path and check format
o_url = URI.parse(node.attributes['href'])
# Remove www
new_url = o_url.host.gsub('www.', '').strip
How can I extend this to remove the subdomains that exist in some URLs?
I just wrote a library to do this called Domainatrix. You can find it here: http://github.com/pauldix/domainatrix
require 'rubygems'
require 'domainatrix'
url = Domainatrix.parse("http://www.pauldix.net")
url.public_suffix # => "net"
url.domain # => "pauldix"
url.canonical # => "net.pauldix"
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
This is a tricky issue. Some top-level domains do not accept registrations at the second level.
Compare example.com and example.co.uk. If you would simply strip everything except the last two domains, you would end up with example.com, and co.uk, which can never be the intention.
Firefox solves this by filtering by effective top-level domain, and they maintain a list of all these domains. More information at publicsuffix.org.
You can use this list to filter out everything except the domain right next to the effective TLD. I don't know of any Ruby library that does this, but it would be a great idea to release one!
Update: there are C, Perl and PHP libraries that do this. Given the C version, you could create a Ruby extension. Alternatively, you could port the code to Ruby.
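The suffix-list idea can be sketched in a few lines. The example below is in Python and uses a tiny hand-picked set of suffixes purely for illustration; the real list at publicsuffix.org is far larger and also has wildcard and exception rules, which this sketch ignores. `SAMPLE_SUFFIXES` and `registrable_domain` are hypothetical names.

```python
# Hand-picked sample only; NOT the real public suffix list.
SAMPLE_SUFFIXES = {"com", "net", "nl", "uk", "co.uk", "org.uk", "us", "oh.us", "k12.oh.us"}

def registrable_domain(host, suffixes=SAMPLE_SUFFIXES):
    """Return the label next to the longest known suffix, plus that suffix."""
    labels = host.lower().strip(".").split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix

print(registrable_domain("www.example.co.uk"))
print(registrable_domain("foo.bar.baz.com"))
print(registrable_domain("mylocalschool.k12.oh.us"))
```

Checking the longest suffix first is what keeps example.co.uk from collapsing to co.uk.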
For posterity, here's an update from Oct 2014:
I was looking for a more up-to-date dependency to rely on and found the public_suffix gem (RubyGems) (GitHub). It's being actively maintained and handles all the top-level domain and nested-subdomain issues by maintaining a list of the known public suffixes.
In combination with URI.parse for stripping protocol and paths, it works really well:
PublicSuffix.parse(URI.parse('https://subdomain.google.co.uk/path/on/path').host).domain
=> "google.co.uk"
The regular expression you'll need here can be a bit tricky, because hostnames can be arbitrarily complex: you could have multiple subdomains (e.g. foo.bar.baz.com), or the top-level domain (TLD) can have multiple parts (e.g. www.baz.co.uk).
Ready for a complex regular expression? :)
re = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i
new_url = o_url.host.gsub(re, '\1').strip
Let's break this into two sections. ^(?:(?>[a-z0-9-]*\.)+?|) will collect subdomains, by matching one or more groups of characters followed by a dot (lazily, because of the +?, so groups are added only until the rest of the pattern can match). The empty alternation is needed in the case of no subdomain (such as foo.com). ([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$ will collect the actual hostname and the TLD. It allows either for a one-part TLD (like .info, .com or .museum), or a two-part TLD where the second part is two characters (like .oh.us or .org.uk).
I tested this expression on the following samples:
foo.com => foo.com
www.foo.com => foo.com
bar.foo.com => foo.com
www.foo.ca => foo.ca
www.foo.co.uk => foo.co.uk
a.b.c.d.e.foo.com => foo.com
a.b.c.d.e.foo.co.uk => foo.co.uk
Note that this regex will not properly match hostnames that have more than two "parts" to the TLD!
Something like:
def remove_subdomain(host)
  # Not complete. Add all root domains to the regexp
  host.sub(/.*?([^.]+(\.com|\.co\.uk|\.uk|\.nl))$/, "\\1")
end
puts remove_subdomain("www.example.com") # -> example.com
puts remove_subdomain("www.company.co.uk") # -> company.co.uk
puts remove_subdomain("www.sub.domain.nl") # -> domain.nl
You still need to add every suffix you consider a root domain. '.uk' might be a root domain, but you probably want to keep the label just before the '.co.uk' part.
Detecting the subdomain of a URL is non-trivial to do in a general sense - it's easy if you just consider the basic ones, but once you get into international territory this becomes tricky.
Edit: Consider stuff like http://mylocalschool.k12.oh.us et al.
Why not just strip the .com or .co.uk and then split on '.' and get the last element?
some_url.host.sub(/(\.co\.uk|\.[^.]*)$/, '').split('.')[-1] + $1
Have to say it feels hacky. Are there any other domains like .co.uk?
I've wrestled with this a lot in writing various and sundry crawlers and scrapers over the years. My favorite gem for solving this is FuzzyUrl by Pete Gamache: https://github.com/gamache/fuzzyurl . It's available for Ruby, JavaScript and Elixir.
