URL Rewriting, SEO and encoding

I found this article regarding URL Rewriting most useful.
But here are a couple of questions.
I would love to use a URL like this (before rewriting, with spaces inside the query string):
http://www.store.com/products.aspx?category=CD s-Dvd s
First of all, should I replace the spaces with the plus sign (+) for any reason? Like this:
http://www.store.com/products.aspx?category=CD+s-Dvd+s
Secondly, my native language is Greek. Should I encode the parameters? Generally speaking, would the result with URL encoding on be different, regarding S.E.O.?

You should replace the spaces with hyphens. That is better for SEO than using underscores.

If the value must come through unaltered, then yes, you must escape it. In a URL query parameter value, a space may be encoded as + or %20. mod_rewrite will generally do this for you as long as the external version was correctly escaped.
In the external version of the URL, only %20 can be used:
http://www.store.com/products/CD%20s-Dvd%20s
http://www.store.com/products.php?category=CD%20s-Dvd%20s
because a + in a URL path part would literally mean a plus.
(Are you sure you want a space there? “CDs-DVDs” without the spaces would seem to be a better title.)
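The difference between the two encodings can be sketched in Python with the standard urllib.parse module. This is only an illustration of the escaping rules described above, not of what mod_rewrite does internally; the Greek sample word is an assumption added to show that non-ASCII text is percent-encoded from its UTF-8 bytes.

```python
from urllib.parse import quote, unquote, unquote_plus, urlencode

category = "CD s-Dvd s"

# In a path segment, a space has to be written as %20.
path_url = "http://www.store.com/products/" + quote(category, safe="")
print(path_url)   # http://www.store.com/products/CD%20s-Dvd%20s

# In a query string, form-style encoding writes a space as "+".
query_url = "http://www.store.com/products.aspx?" + urlencode({"category": category})
print(query_url)  # http://www.store.com/products.aspx?category=CD+s-Dvd+s

# Path-style decoding leaves "+" alone; only form-style decoding
# turns it back into a space.
print(unquote("CD+s"))       # CD+s
print(unquote_plus("CD+s"))  # CD s

# Non-ASCII text (a Greek sample word, chosen only for illustration)
# is encoded as the percent-escaped bytes of its UTF-8 form.
greek = quote("μουσική", safe="")
print(unquote(greek) == "μουσική")  # True
```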
It is non-trivial to get arbitrary strings through from a path part to a parameter. Apart from the escaping issues, you've got problems with /, which should be encoded as %2F in a path part. However Apache will by default block any URL containing %2F for security reasons. (\ is similarly affected under Windows.) You can turn this behaviour off using the AllowEncodedSlashes config, but it means if you want to be portable you can't use “CDs/DVDs” as a category name.
For this reason, and because having a load of %20 escapes in your URL is a bit ugly, strings are usually turned into ‘slugs’ before being put in a URL, where all the contentious ASCII characters that would result in visible %-escapes are replaced with filler characters such as hyphen or underscore. This does mean you can't round-trip the string, so you need to either store a separate title and slug in the database to be able to look up the right entity for a given slug, or just use an additional ID in the URL (like Stack Overflow does).
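A minimal sketch of slug generation in Python, assuming hyphens as the filler character. The function names slugify and product_url and the /products/<id>/<slug> layout are hypothetical, chosen only to illustrate the "ID plus slug" approach mentioned above.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Turn a display title into a lowercase, hyphen-separated slug."""
    # Decompose accented letters so the base letter survives the ASCII filter.
    text = unicodedata.normalize("NFKD", title)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    text = re.sub(r"[^A-Za-z0-9]+", "-", text)
    return text.strip("-").lower()


def product_url(product_id: int, title: str) -> str:
    """Hypothetical URL layout: the ID finds the record, the slug is cosmetic."""
    return f"http://www.store.com/products/{product_id}/{slugify(title)}"


print(slugify("CDs/DVDs"))          # cds-dvds
print(slugify("CD s-Dvd s"))        # cd-s-dvd-s
print(product_url(42, "CDs/DVDs"))  # http://www.store.com/products/42/cds-dvds
```

Note that this simple filter drops non-Latin text entirely (a Greek title would produce an empty slug), so a real site with Greek content would need a transliteration step or would keep the percent-encoded Unicode form instead.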

General practice is to replace spaces with underscores, as in http://www.store.com/products.aspx?category=CD_s-Dvd_s

Related

Web API URL should allow special characters like forward slash (/), ( and )

Some of my Web API endpoints take strings as input parameters, and sometimes I have to pass special characters like /, \, ( and ). However, these are not allowed because they have special meanings in a URL. Is there any way to solve this problem?
Is there any way to solve this problem?
No.
(Please) Stop Using Unsafe Characters in URLs. URLs are by definition machine readable. There are rules to follow about what can and cannot be part of a URL.
Although technically using unsafe characters is supposed to be possible through encoding, all bets are off that all browsers, web servers, and firewalls will treat them as "plain text" when encoded instead of assigning them special meaning. Some may just reject the URL entirely, considering it SPAM or attempted hacking.
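For reference, this is roughly what the encoded form mentioned above looks like, sketched in Python with the standard urllib.parse module. The helper name encode_segment, the sample value, and the parameter name "name" are hypothetical. As the answer warns, an encoded slash in the path may still be rejected by a server or proxy, so moving the value into the query string (or the request body) is often the safer choice.

```python
from urllib.parse import quote, unquote, urlencode


def encode_segment(value: str) -> str:
    """Percent-encode a value for use as a single path segment."""
    # safe="" makes "/" get encoded too; by default quote() leaves it alone.
    return quote(value, safe="")


raw = r"a/b\c(d)"

encoded = encode_segment(raw)
print(encoded)                  # a%2Fb%5Cc%28d%29
print(unquote(encoded) == raw)  # True

# Alternative: carry the same value as a query parameter instead of a path part.
query = urlencode({"name": raw})
print(query)                    # name=a%2Fb%5Cc%28d%29
```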

Restructured text (rst) http links underscore ('__' vs '_' use)

With restructured text, I've seen both these used:
`Some Link <http://www.some.com>`_
`Some Link <http://www.some.com>`__
Both generate the same output from Sphinx.
What's the difference between using a single underscore _ or a double underscore __ for http URL links?
Why would you use one over the other?
In short, if it's a one-off (anonymous) URL which you don't intend to reference, use a double underscore.
In practice you could use either in most cases; they generate the same HTML output, for example.
However, using single underscores for links means that by default you're creating a reference target - which could conflict with other references of the same name.
So this for example will warn:
.. _Thing:
Title
=====
Text with `Thing <http://link.com>`_.
WARNING: Duplicate target name, cannot be used as a unique reference: "thing".
While this could be overlooked in most cases, it could make for confusing situations, especially for anyone inexperienced with reStructuredText. So you may prefer to avoid this entirely, only defining targets when that is your intention.
According to:
http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#anonymous-hyperlinks
With a single trailing underscore, the reference is named and the same target URI may be referred to again. With two trailing underscores, the reference and target are both anonymous, and the target cannot be referred to again. These are "one-off" hyperlinks.
There are examples on the linked page.

Sitecore - rewrite " " to "-" in urls but still allow dashes as legal item names

I've read a half dozen guides on rewriting spaces to something more friendly in Sitecore, but all of them rely on Sitecore's <encodeNameReplacements/> element which also reverses the replacement requiring "-" to be an illegal character for names.
The problem with this is that the url of our application has a "-" in the hostname. Sitecore rewrites this resulting in a 404.
Does anyone have ideas on how to do this URL rewriting in Sitecore without relying on <encodeNameReplacements/> and still allowing "-" as a legal item name character? Our current best idea is to use something slightly more complex than a plain "-", such as "--" or "_". This isn't a very good idea, so I'd appreciate any insight you guys have on the matter.
EDIT: We are running a multi-site setup with Sitecore 6.5
So, if I am assuming correctly, you want to replace spaces in item names with some other SEO-friendlier character. Whatever replacement you configure, you would need to apply the transformation on both sides of the equation (pun intended). So '--' or '_' will have to become illegal item name characters.
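The reason the replacement character has to become illegal in item names can be shown with a small Python sketch. This is a generic illustration of a two-way replacement, not Sitecore's actual implementation; encode_name and decode_name are hypothetical helpers.

```python
def encode_name(item_name: str) -> str:
    """Outgoing direction: item name -> URL part (spaces become hyphens)."""
    return item_name.replace(" ", "-")


def decode_name(url_part: str) -> str:
    """Incoming direction: URL part -> item name (hyphens become spaces)."""
    return url_part.replace("-", " ")


plain = "my new page"
with_dash = "my-new page"

# A name without hyphens survives the round trip.
print(decode_name(encode_name(plain)) == plain)          # True

# A name that already contains a hyphen does not: the decoder cannot
# tell an original hyphen from one that replaced a space.
print(decode_name(encode_name(with_dash)))               # my new page
print(decode_name(encode_name(with_dash)) == with_dash)  # False
```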
I generally think it's a bad idea to do this and would rather have my content editors determine the exact URLs of their content.
But if you absolutely need to implement this as a rule, one of the solutions out there is to implement a custom handler to change your real item name while leaving the item display name in its original form.
Hope this helps.
Which version and build of Sitecore are you using?
I've just taken a look through Sitecore.Links.LinkProvider in Sitecore.Kernel, and the BuildItemUrl method only applies the encodeNameReplacements replacements to the path part of the URL, so it should leave your hostname alone.
I would expect the same to happen in reverse, and looking at Sitecore.Pipelines.HttpRequest.SiteResolver, the SiteContext is resolved by matching the requested Uri to the defined sites. Further down the process, Sitecore.Pipelines.HttpRequest.ItemResolver decodes the item URL, but the site has already been resolved at this stage.
I presume you have set the hostName attribute on the <site> elements? Have you tried setting the targetHostName attribute as well? Have you tried setting multiple hostNames? I don't expect it to work with spaces, but it is worth a try if what you are saying is true:
<site name="website" hostName="my-site.com|my site.com" ... />
I tried replicating your issue on my local machine but it worked as expected for me... Unless you are working with very strange character sets then this is still the best way of encoding names in my opinion.

Can I use underscore in url instead of hyphen?

I would like to use underscore in my url instead of hyphen.
I mean like this wikipedia link
My Current url:
www.example.com/2013/01/hello-this-is-a-test-post/
Desired url
www.example.com/2013/01/hello_this_is_a_test_post/
But a good programmer on WordPress Stack Exchange advised me that Google treats - as a word separator, but not _.
He also mentioned that rule doesn't apply for MediaWiki sites.
Is it true?
It is true that Google treats hyphens as word separators.
The reasoning behind it, as I recall, is that programmers search for function names, which often contain underscores. So Google treats underscores as word joiners instead.
This article elaborates: http://www.ecreativeim.com/blog/2011/03/seo-basics-hyphen-or-underscore-for-seo-urls/

How to remove special characters in WordPress?

I am using Topsy, which returns the title of the highest-ranking article on my website. It returns an RSS file that contains post titles with their links. For now I am only taking the post name, and I am trying to use the post title to search the MySQL database with the following function:
get_post_by_title($postTitle,'post');
The problem is that Topsy returns the post title but also changes some characters in the RSS file, for example replacing " ' " with " ’ ". Because of this, get_post_by_title() does not find the post by its title.
EDIT: It returns a post title like this:
iPad Applications In Bloom’s Taxonomy NEXT
Here the single quote is the special character.
Please help me. Thanks
First let's clear up a misconception: that character in your example is not a "special" character. It is Unicode code point U+2019, "RIGHT SINGLE QUOTATION MARK." Its HTML entity reference is &rsquo;. It's an ordinary character - it just happens to be an ordinary character that has no representation in ASCII. Before getting to an answer to your specific question, I need to tell you to read Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" - it is just what it says on the tin, and unless you absorb at least a little more knowledge about Unicode, you will keep running into problems like this. Don't fret too much: everyone runs into problems like this until they learn how to deal with text. Unicode isn't "hard" so much as it is "prone to exposing unconscious assumptions we make about how text works." †
Now, to your question.
If I'm reading you right, what's happening to you is that you have posts with non-ASCII characters in their titles such as ’ which aren't showing up when you search for them with get_post_by_title() (it seems like you're using something similar to the accepted answer on this question - is that right?) There are two paths to a solution: store the titles in a format that's easier for you to search, or use a searching method that can find non-ASCII characters.
Storing the titles differently would require that you run them through PHP's built-in htmlentities() function before storing them in your WordPress DB - you would also want to make sure that you convert characters with no HTML entity equivalent to '\xNN' form, and to make sure that your DB's collation/charset is set to UTF-8 or another Unicode-aware encoding. This will be a nontrivial amount of effort. ‡
Using a different searching method doesn't require tinkering with your DB or digging into WordPress internals, but it does require very careful fiddling with the search string. You'll need to either use the exact character you're looking for in a search, expressed as a '\xNN' character reference if necessary, or use wildcards carefully in the search.
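One variation on the search-side approach is to normalize both the feed title and the stored titles before comparing them, so that typographic quotes match their ASCII equivalents. The sketch below shows the idea in Python rather than PHP; it is not how WordPress or get_post_by_title() works, and normalize_title, find_post and the sample titles are hypothetical.

```python
# Map common typographic punctuation to ASCII equivalents.
PUNCTUATION = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_title(title: str) -> str:
    """Return a comparison key: ASCII quotes, trimmed, case-insensitive."""
    return title.translate(PUNCTUATION).strip().casefold()


def find_post(feed_title, stored_titles):
    """Return the stored title that matches feed_title, or None."""
    wanted = normalize_title(feed_title)
    for stored in stored_titles:
        if normalize_title(stored) == wanted:
            return stored
    return None


stored = ["Hello World", "iPad Applications In Bloom's Taxonomy"]
print(find_post("iPad Applications In Bloom\u2019s Taxonomy", stored))
# iPad Applications In Bloom's Taxonomy
```

The same key could be computed once and stored alongside each post, which would let the database do the lookup instead of a loop.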
Either way, good luck. It may be possible to offer more specific advice if more of your code is visible.
†: By the way, your life with regards to Unicode will also get much, much easier if you use better languages than PHP and better databases than MySQL. WordPress is inextricably tied to PHP and MySQL: PHP & MySQL are both woefully, horrendously, hilariously bad at handling Unicode issues correctly. Your life as a programmer will get better if you extirpate PHP & MySQL from it.
‡: Seriously, PHP is atrociously bad at this, and MySQL is in a shoelaces-tied-together state of fumbling. Avoid them.
Remove (or comment out) these lines in wp-config.php:
//define('DB_CHARSET', 'utf8');
//define('DB_COLLATE','utf8_unicode_ci');
You can easily remove special characters using preg_replace, see this post -> http://code-tricks.com/filter-non-ascii-characters-using-php/
