How to differentiate, on the server side, between the browser's first request (the HTML file) and the following ones (images, CSS, scripts...)? - mod-rewrite

I'm programming a website with SEO-friendly links, i.e., putting the page title or other descriptive text in the link, separated by slashes. For example: http://www.domain.com/section/page-title-bla-bla-bla/.
I redirect the request to the main script with mod_rewrite, but links in script, img and link tags are not resolved correctly. For example: assuming you are visiting the above link, the tag requests the file at the URL http://www.domain.com/section/page-title-bla-bla-bla/js/file.js, but the file is actually at http://www.domain.com/js/file.js.
I do not want to use a variable or constant in all HTML file URLs.
I'm trying to redirect client requests to one server directory or another. Is it possible to distinguish the first request for a page from the ones that come after it? Can this be done with mod_rewrite for Apache, or with PHP?
I hope I explained it well. :)
Thanks in advance.

Using rewrite rules to fix the problem of relative paths is unwise and has numerous downsides.
Firstly, it makes things more difficult to maintain because there are hundreds of different links in your system.
Secondly and more seriously, you destroy cacheability. A resource requested from here:
http://www.domain.com/section/page-title-bla-bla-bla/js/file.js
will be regarded as a different resource from
http://www.domain.com/section/some-other-page-title/js/file.js
and loaded a second time, so across many pages the number of requests can grow dozenfold.
What to do?
Fix the root cause of the problem instead: Use absolute paths
<script src="/js/file.js">
or a constant, or if all else fails the <base> tag.

This is an issue of resolving relative URIs. Judging by your description, it seems that you reference the other resources using relative URI paths: In /section/page-title-bla-bla-bla a URI reference like js/file.js or ./js/file.js would be resolved to /section/page-title-bla-bla-bla/js/file.js.
To always reference /js/file.js independently of the actual base URI path, use the absolute path /js/file.js. Another solution would be to set the base URI explicitly to / using the BASE element (but note that this will affect all relative URIs).
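To make this concrete, here is a minimal sketch of a front controller that emits the BASE element, assuming all rewritten requests land in a single index.php (the file name and the title extraction are assumptions):

<?php
// index.php -- hypothetical front controller that every rewritten
// request lands in. Emitting <base href="/"> pins the base URI to
// the site root, so a relative reference like js/file.js resolves
// to /js/file.js no matter how deep the rewritten path is.
$title = basename(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH));
?>
<!DOCTYPE html>
<html>
<head>
  <base href="/">
  <title><?php echo htmlspecialchars($title); ?></title>
  <script src="js/file.js"></script> <!-- now resolves to /js/file.js -->
</head>
<body>...</body>
</html>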

Related

Why is a URL not returned when using the pathto(document) helper function?

According to the Sphinx docs on templating, pathto(document) "returns the path to a Sphinx document as a URL." By definition, a URL typically includes the scheme, the host, and the path component. So why does pathto(root_doc) return index.html instead of, for example, https://example.com/index.html? index.html is not a URL. Can someone enlighten me? What is the best way to get an absolute URL for a document in Sphinx?
Use the html_baseurl setting in conf.py, e.g. html_baseurl = 'https://example.com/'. Without setting that value, the default is '', which omits the protocol and host.

ignoring last uri segment via mod_rewrite or CodeIgniter

I was just wondering if it is possible to ignore the last URI segment of my application via either mod_rewrite or CodeIgniter. I don't want to redirect away or remove the URI segment; I just want my app to not know it exists. So in the browser the client will see:
http://example.com/keep/keep/ignore/
but the app is only aware of:
http://example.com/keep/keep/
The idea is, if JavaScript detects /ignore/ in the URI, it will trigger an action.
/ignore/ may appear as 1st, 2nd, 3rd or 4th segment, but will only ever appear as the final one and may sometimes not appear at all.
I found some info online about ignoring sub-directories with mod_rewrite, but none of it really works like this.
In case any CodeIgniter users suggest passing it as an unused extra parameter to my method: the app has far too many controllers and far too many wildcard routes for that to work site-wide.
I think a mod_rewrite solution would be best if possible. If not, perhaps it can be done with a CodeIgniter pre-controller hook or something, but I'm not sure how that would work.
EDIT: How I got it to work
For anyone else who would ever like to know the same thing: in the end I overrode _explode_segments() in MY_URI so that it doesn't include this segment.
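For reference, a rough sketch of what such an override can look like, assuming CodeIgniter 2.x and that the segment to drop is literally "ignore" (both are assumptions; this is not the asker's exact code):

<?php
// application/core/MY_URI.php -- hedged sketch of the override.
class MY_URI extends CI_URI {

    function _explode_segments()
    {
        // Let CodeIgniter split the URI string as usual...
        parent::_explode_segments();

        // ...then silently drop a trailing "ignore" segment so the
        // router and controllers never see it. The browser URL is
        // untouched; only the app's view of the URI changes.
        if (end($this->segments) === 'ignore') {
            array_pop($this->segments);
        }
    }
}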
With the URI class you can inspect what the URI is and what segments it has.
$this->uri->segment(n)
Check out the user guide: http://codeigniter.com/user_guide/libraries/uri.html
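For the JavaScript trigger the question mentions, a controller could also check the last segment itself and pass a flag to the view; a hedged sketch (the variable names are made up):

// Inside any controller method:
$last = $this->uri->segment($this->uri->total_segments());
$data['trigger_action'] = ($last === 'ignore'); // hand this to the view / JS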

Detecting URL rewrites (SEO urls)

How could a client detect if a server is using search engine optimization techniques such as mod_rewrite to implement "SEO friendly URLs"?
For example:
Normal url:
http://somedomain.com/index.php?type=pic&id=1
SEO friendly URL:
http://somedomain.com/pic/1
Since mod_rewrite runs server side, there is no way a client can detect it for sure.
The only thing you can do client side is to look for some clues:
Is the HTML generated dynamically, changing between calls? Then /pic/1 would need to be handled by some script and is most likely not the real URL.
As said before: are there <link rel="canonical"> tags? Then the website wants to tell the search engine which of several URLs with the same content it should use.
Modify parts of the URL and see if you get a 404. In /pic/1 I would modify the "1".
If there is no mod_rewrite, the server will return a 404. If there is, the error is handled by the server-side scripting language, which can return a 404 but in most cases returns a 200 page printing an error (the sketch below shows this probe).
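A hedged sketch of that probe using PHP's curl extension (the URLs and the mangling scheme are just examples):

<?php
// Return the HTTP status code for a URL using a HEAD request.
function status($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $code;
}

$original = status('http://somedomain.com/pic/1');
$mangled  = status('http://somedomain.com/pic/1-mangled');

// A 200 (or an error page instead of a plain 404) on the mangled URL
// hints that a script, not the filesystem, is answering the request.
echo ($original === 200 && $mangled !== 404)
    ? "likely rewritten / script-handled\n"
    : "inconclusive\n";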
You can use a <link rel="canonical" href="..." /> tag.
The SEO aspect is usually about words in the URL, so you can probably ignore any parts that are numeric. Usually SEO is applied over a group of like content, such that it has a common base URL, for example:
Base www.domain.ext/article, with fully URL examples being:
www.domain.ext/article/2011/06/15/man-bites-dog
www.domain.ext/article/2010/12/01/beauty-not-just-skin-deep
Such that the SEO aspect of the URL is the suffix. The algorithm to apply: typify each "folder" after the common base, assigning it a "datatype" (numeric, text, alphanumeric), and then score as follows:
HTTP Response Code is 200: should be obvious, but you can get a 404 like www.domain.ext/errors/file-not-found that would pass the other checks listed.
Non-Numeric, with Separators, Spell-Checked: separators are usually dashes, underscores or spaces. Take each word and perform a spell check; score if the words are valid, including proper names.
Spell-Checked URL Text on Page: if the text passes a spell check, analyze the page content to see if it appears there.
Spell-Checked URL Text on Page Inside a Tag: if the prior is true, score again if the text in its entirety is inside an HTML tag.
Tag is Important: if the prior is true and the tag is <title> or an <h#> tag.
Usually with this approach you'll have a max of 5 points, unless multiple folders in the URL meet the criteria, with higher values being better. Now you can probably improve this by using a Bayesian probability approach that uses the above to featurize URLs (i.e. detect the occurrence of some phenomenon), plus come up with some other clever featurizations. But then you've got to train the algorithm, which may not be worth it.
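A compact sketch of the scoring idea in PHP; the word list stands in for a real spell checker, and every name here is illustrative rather than a tested heuristic:

<?php
// Award points for SEO-looking features in each path "folder".
function seoScore($path, array $dictionary) {
    $score = 0;
    foreach (explode('/', trim($path, '/')) as $segment) {
        if (ctype_digit($segment)) {
            continue; // numeric folders (years, ids) carry no SEO words
        }
        $words = preg_split('/[-_ ]+/', $segment);
        if (count($words) > 1) {
            $score++; // non-numeric, with separators
            $known = array_intersect(array_map('strtolower', $words), $dictionary);
            if (count($known) === count($words)) {
                $score++; // every word passes the "spell check"
            }
        }
    }
    return $score;
}

echo seoScore('/article/2011/06/15/man-bites-dog',
              array('man', 'bites', 'dog')); // prints 2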
Now based on your example, you also want to capture situations where the URL has been designed so that a crawler will index it because the query parameters are now part of the URL instead. In that case you can still typify the suffix's folders to arrive at patterns of data types - in your example's case, a common prefix always trailed by an integer - and score those URLs as being SEO friendly as well.
I presume you would be using one of the curl variants.
You could try sending the same request but with different "user agent" values.
i.e. send the request once using the user agent "Mozilla/5.0" and a second time using the user agent "Googlebot". If the server is doing something special for web crawlers, there should be a different response.
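For example, with PHP's curl extension (the URL and user-agent strings are just examples):

<?php
// Fetch a URL while pretending to be a given user agent.
function fetch($url, $agent) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

$asBrowser = fetch('http://somedomain.com/pic/1', 'Mozilla/5.0');
$asBot     = fetch('http://somedomain.com/pic/1', 'Googlebot/2.1 (+http://www.google.com/bot.html)');

// A differing body suggests the server treats crawlers specially.
echo ($asBrowser === $asBot) ? "same response for both agents\n"
                             : "server varies by user agent\n";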
With today's frameworks and the URL routing they provide, I don't even need to use mod_rewrite to create friendly URLs such as http://somedomain.com/pic/1, so I doubt you can detect anything. I would create such URLs for all visitors, crawlers or not. Maybe you can spoof some bot headers to pretend you're a known crawler and see if there's any change; I don't know how legal that is, to be honest.
For dynamic URL patterns, it's better to use a <link rel="canonical" href="..." /> tag on the duplicate URLs.

how does URL rewrite work in plain english

I have read a lot about URL rewriting but I still don't get it.
I understand that a URL like
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=19
can be replaced with a friendlier one like
http://www.example.com/Blog/2006/12/19/
and the server code can remain unchanged because there is some filter which transforms the new URL and sends it to the old, but does it replace the URLs in the HTML of the response too?
If the server code remains unchanged then it is possible that in my returned HTML code I have links like:
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=20
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=21
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=22
This defeats the purpose of having the nice URLs if in my page I still have the old ones.
Does URL rewriting (with a filter or something) also replace this content in the HTML?
Put another way... do the rewrite rules apply for the incoming request as well as the HTML content of the response?
Thank you!
The URL rewriter simply takes the incoming URL and if it matches a certain pattern it converts it to a URL that the server understands (assuming your rewrite rules are correct).
It does mean that a specific resource can be accessed multiple ways, but this does not "defeat the point", as the point is to have nice-looking URLs, which you still have.
They do not rewrite the outgoing content, only the incoming URL, so it is up to your application to emit the friendly URLs in its HTML.
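A hedged sketch of what that means for the example above (Posts.php and the /Blog/ prefix come from the question; the helper function is made up):

<?php
// The rewrite rule maps /Blog/2006/12/19/ onto
// Posts.php?Year=2006&Month=12&Day=19 on the way IN.
// Emitting pretty links on the way OUT is the application's job:
function prettyBlogUrl($year, $month, $day) {
    return sprintf('/Blog/%04d/%02d/%02d/', $year, $month, $day);
}

// Instead of echoing Posts.php?Year=2006&Month=12&Day=20 ...
echo '<a href="' . prettyBlogUrl(2006, 12, 20) . '">next day</a>';
// ... so visitors only ever see the friendly form, while the server
// still receives the query-string version after the rewrite.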

Serving images for a cakePHP site from a different sub-domain

What is the best practice for serving images from a different sub-domain (like media.host.com) in CakePHP? Should I put the images in the img folder provided by CakePHP and make this folder the document root of the sub-domain, or is there a better way of doing this?
Roland.
I don't think it matters, as long as the images are in the same place. In any case you can't use $html->image() as-is; or you can, but it's probably wiser to modify it or roll your own helper so that you don't always have to give it the full path. While you're doing that, you can make it use whatever URL root you want.
All you need to do is set the IMAGES_URL constant before Cake does, or even overload image() in AppHelper. If you are somehow using array URLs for images, you would also need to overload url() in AppHelper.
I personally just created a separate subdomain pointing to the Cake webroot, which would only serve static files (so deny Cake's index.php, for example). A custom webroot() method in AppHelper modifies the URL for every returned webroot path (script(), image() and css() all use this) so no further code changes are needed.
I wrote a short article about it here: http://metaphoric.nl/blog/serving-static-content-from-a-separate-subdomain-in-cakephp
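For completeness, a hedged sketch of such a webroot() override for CakePHP 1.x (the subdomain comes from the question; treat the rest as assumptions):

<?php
// app/app_helper.php -- illustrative sketch, not the article's code.
class AppHelper extends Helper {

    function webroot($file)
    {
        // Let Cake build the normal webroot-relative path first...
        $path = parent::webroot($file);
        // ...then point it at the static subdomain. script(), image()
        // and css() all call webroot(), so no other changes are needed.
        return 'http://media.host.com' . $path;
    }
}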
