Can I use underscore in url instead of hyphen? - mod-rewrite

I would like to use underscore in my url instead of hyphen.
I mean like this wikipedia link
My Current url:
www.example.com/2013/01/hello-this-is-a-test-post/
Desired url
www.example.com/2013/01/hello_this_is_a_test_post/
But a good programmer on WordPress Stack Exchange advised me that Google treats - as a word separator, but not _.
He also mentioned that this rule doesn't apply to MediaWiki sites.
Is it true?

It is TRUE that Google treats hyphens as word separators.
The reasoning behind it, as I recall, is that programmers search for function names, which usually (if not always) have underscores in them. So Google treats underscores as word joiners instead.
This article elaborates: http://www.ecreativeim.com/blog/2011/03/seo-basics-hyphen-or-underscore-for-seo-urls/
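To illustrate the separator/joiner distinction, here is a minimal sketch. It is not Google's tokenizer; it only uses the common regex convention that `\w` counts the underscore as a word character, which has the same effect. The function name `split_words` is made up for this example.

```python
import re

def split_words(slug):
    # \w matches letters, digits and the underscore, so "_" joins words
    # while "-" (not part of \w) separates them.
    return re.findall(r"\w+", slug)

hyphenated = split_words("hello-this-is-a-test-post")
underscored = split_words("hello_this_is_a_test_post")

print(hyphenated)   # six separate words
print(underscored)  # one long token
```

With a tokenizer like this, a search for "test" can match the hyphenated slug but not the underscored one.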

Related

Search using Xapian Omega - with Wild Cards or Regular Expressions

We are comparing different search engines for our research
archives and having browsed the Xapian-Omega documentation, we
decided to try it out since the Omega option appears to be an
appropriate solution with several interesting search options.
We installed Xapian-Omega on a Linux Server (Deb 7) and tested
the setup with success. However we are unsure as to how one can
employ or perhaps even enable the use of Wild Cards or Regular
Expressions with Xapian-Omega.
We read that for Xapian one has to enable the Wild Card option
"QueryParser flags"
Could someone clarify this ?
ie. explain with or indicate a page with an example or two.
But we did not see much information regarding examples with Omega
CGI and although this latter runs well, wild card options
(such as * for the general wild card and ? as a single character),
do not seem to work as expected by default and they would be
useful, even though stemming and substrings etc may be functional.
Eg: It would be interesting to be able to employ standard simple
wild char searches with a certain precision such as :
medic* for medicine medical medicament
or with ? for single characters
Can Regexp be recognised with Omega ?
eg : sep[ae]r[ae]te(\w+)?
or searching for structured formats such as Email or Credit Card
Numbers or certain formula types in research papers etc.
In a note from Olly Betts long ago (Dev Mailing List) regarding
this one suggestion was to grep the index file but this would
defeat the RAD advantage of Omega.
Any examples of searches using Omega with Wild Cards or Regular
Expressions would be most appreciated ... even an indication of
a page where information regarding this theme is well presented
with examples illustrating how to develop advanced searches
using Xapian alone would be most welcome (PHP or Python perhaps).
(We are not concerned for the moment about the eventual
substantial increase in the size of the index or in the
time to index the archive)
You can enable right-wildcards (such as "medic*") in Omega using $set{flag_wildcard,1} (covered in the Omegascript documentation), which enables FLAG_WILDCARD. There's a section in the user manual on using wildcards.
Xapian doesn't provide support for regular expression searching, although in theory I believe it would be possible to support, if potentially costly (depending on the regex). It would have to run the regular expression against unstemmed terms in the database, and then feed them into the search. Where it becomes difficult is if the regex expands to a lot of terms (eg just 'a' as a regex). There's also some subtlety in making it efficient; it's easy to jump through the term list to something with a constant prefix, and you'd want to take advantage of that if possible.
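As a rough illustration of that idea (not something Xapian or Omega provides), the sketch below runs a regex over a sorted list of unstemmed terms and uses a constant prefix to skip straight to the relevant part of the list. The term list, the function name `expand_regex` and the `literal_prefix` parameter are all assumptions made up for this example; the caller has to supply the prefix by hand.

```python
import bisect
import re

def expand_regex(sorted_terms, pattern, literal_prefix=""):
    """Return the terms in sorted_terms that fully match pattern.

    literal_prefix is a constant string every match must start with;
    it lets us jump into the sorted list instead of scanning all of it.
    """
    regex = re.compile(pattern)
    start = bisect.bisect_left(sorted_terms, literal_prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(literal_prefix):
            break  # past the block of terms sharing the prefix
        if regex.fullmatch(term):
            matches.append(term)
    return matches

# Hypothetical unstemmed term list standing in for a database's terms.
terms = sorted([
    "medical", "medicine", "separate", "separated",
    "seperate", "seperately", "september", "server",
])

expanded = expand_regex(terms, r"sep[ae]r[ae]te(\w+)?", literal_prefix="sep")
print(expanded)
```

The expanded terms would then be combined (for example as an OR of terms) and fed into the search. A regex such as plain `a` has no useful prefix and would expand to a very large set, which is the costly case described above.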
For your example of sep[ae]r[ae]te(\w+)?, it sounds like you actually want a combination of spelling correction (for the a-e substitutions, which you can enable using $set{flag_spelling_correction,1}) and stemming (for the trailing letters after 'te'; Omega defaults to English stemming, but that can be changed), or either wildcard or partial match support.
If you do need regular expressions for your use case, then I'd suggest bringing it up on the xapian-discuss mailing list. Xapian has moved on since the last discussion, and I believe it would be easier to build such support now than it was then.
James Ayatt: Thank you for your answer and help, my apologies for this belated reply, a distraction with other work.
We had already seen the Omegascript page but it was not clear to us how to employ these options with the CGI interface. Also the use of * seems to be for trailing chars, is that correct ? ie not for internal groups of words eg: omeg*ipt; there are cases where the stemming option would not be sufficient. We did not see an option for single wild chars, sometimes represented by ? in certain search engines. Could you comment here ?
Regarding the use of regular expressions, we had imagined that it might not be quite as simple as one could hope. The examples mentioned in the preceding post were of course simple possible uses; there are of course many more. Your comment on using the stemming option seems appropriate.
In certain cases it could be interesting to enable some type of regexp option for the extraction of text forms, such as those mentioned. The quick extraction of such text, perhaps together with some surrounding text, could be very useful.
We will certainly try your proposal with the mailing list.
Thank you again.

How to remove special characters in WordPress?

I am using Topsy, which returns the title of the highest-ranking article on my website. It returns an RSS file containing post titles with their links. For now I am only taking the post title and using it to search the MySQL database with the following function:
get_post_by_title($postTitle,'post');
The problem is that the post titles Topsy returns in the RSS file contain some special characters: for example, " ' " is replaced with " ’ ". Because of this, get_post_by_title() does not return the post for that title.
EDIT: It returns a post title like this:
iPad Applications In Bloom’s Taxonomy NEXT
Here the single quote is the special character.
Please help me. Thanks
First let's clear up a misconception: that character in your example is not a "special" character. It is Unicode code point U+2019, "RIGHT SINGLE QUOTATION MARK." Its HTML entity reference is &rsquo;. It's an ordinary character - it just happens to be an ordinary character that has no representation in ASCII. Before getting to an answer to your specific question, I need to tell you to read Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" - it is just what it says on the tin, and unless you absorb at least a little more knowledge about Unicode, you will keep running into problems like this. Don't fret too much: everyone runs into problems like this until they learn how to deal with text. Unicode isn't "hard" so much as it is "prone to exposing unconscious assumptions we make about how text works." †
Now, to your question.
If I'm reading you right, what's happening to you is that you have posts with non-ASCII characters in their titles such as ’ which aren't showing up when you search for them with get_post_by_title() (it seems like you're using something similar to the accepted answer on this question - is that right?) There are two paths to a solution: store the titles in a format that's easier for you to search, or use a searching method that can find non-ASCII characters.
Storing the titles differently would require that you run them through PHP's built-in htmlentities() function before storing them in your WordPress DB - you would also want to make sure that you convert characters with no HTML entity equivalent to '\xNN' form, and to make sure that your DB's collation/charset is set to UTF-8 or another Unicode-aware encoding. This will be a nontrivial amount of effort. ‡
Using a different searching method doesn't require tinkering with your DB or digging into WordPress internals, but it does require very careful fiddling with the search string. You'll need to either use the exact character you're looking for in a search, expressed as a '\xNN' character reference if necessary, or use wildcards carefully in the search.
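One hedged variant of the second path is to normalize both the feed title and the stored titles before comparing them, so that ’ (U+2019) and ' compare equal. The sketch below is Python rather than PHP and does not touch WordPress or MySQL; `normalize_title`, `find_post_by_title` and the sample titles are assumptions made up for illustration.

```python
import unicodedata

# Map typographic quotes to their plain ASCII counterparts.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})

def normalize_title(title):
    # NFC makes composed/decomposed accents compare equal; casefold
    # makes the comparison case-insensitive.
    title = unicodedata.normalize("NFC", title)
    return title.translate(PUNCTUATION_MAP).strip().casefold()

def find_post_by_title(feed_title, stored_titles):
    wanted = normalize_title(feed_title)
    for stored in stored_titles:
        if normalize_title(stored) == wanted:
            return stored
    return None

stored = ["Hello World", "iPad Applications In Bloom's Taxonomy"]
found = find_post_by_title("iPad Applications In Bloom\u2019s Taxonomy", stored)
print(found)
```

The same idea carries over to PHP: apply one normalization function to both sides of the comparison instead of trying to strip characters from only one of them.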
Either way, good luck. It may be possible to offer more specific advice if more of your code is visible.
†: By the way, your life with regards to Unicode will also get much, much easier if you use better languages than PHP and better databases than MySQL. WordPress is inextricably tied to PHP and MySQL: PHP & MySQL are both woefully, horrendously, hilariously bad at handling Unicode issues correctly. Your life as a programmer will get better if you extirpate PHP & MySQL from it.
‡: Seriously, PHP is atrociously bad at this, and MySQL is in a shoelaces-tied-together state of fumbling. Avoid them.
Remove (or comment out) these lines in wp-config.php:
//define('DB_CHARSET', 'utf8');
//define('DB_COLLATE','utf8_unicode_ci');
You can easily remove special characters using preg_replace, see this post -> http://code-tricks.com/filter-non-ascii-characters-using-php/

dashes vs underscores in URL

In URL rewriting, I am quite confused about whether I should use an underscore (_) or a hyphen (-) to replace spaces. According to this and this, a hyphen should be used.
But Wikipedia uses underscores, as in en.wikipedia.org/wiki/Computer_network_programming. So which should be used?
Use dashes.
See Matt Cutts's blog:
http://www.mattcutts.com/blog/dashes-vs-underscores/
You can use whatever you want. It doesn't matter (from a technical point of view; there may be other reasons to choose one over the other).

URL Rewriting, SEO and encoding

I found this article regarding URL Rewriting most useful.
But here are a couple of questions.
I would love to use a URL (before rewriting, with spaces inside the query string)
http://www.store.com/products.aspx?category=CD s-Dvd s
First of all, should I replace the spaces with the plus sign (+) for any reason? Like this:
http://www.store.com/products.aspx?category=CD+s-Dvd+s
Secondly, my native language is Greek. Should I encode the parameters? Generally speaking, would the result with URL encoding on be different, regarding S.E.O.?
Actually you should replace spaces with hyphens. That actually is better for SEO than using an underscore.
If the value must come through unaltered, then yes you must use escaping. In a URL query parameter value, a space may be encoded as + or %20. mod_rewrite will generally do this for you as long as the external version was suitably spelled.
In the external version of the URL, only %20 can be used:
http://www.store.com/products/CD%20s-Dvd%20s
http://www.store.com/products.php?category=CD%20s-Dvd%20s
because a + in a URL path part would literally mean a plus.
(Are you sure you want a space there? “CDs-DVDs” without the spaces would seem to be a better title.)
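The difference between the two encodings can be checked with Python's standard `urllib.parse`; this is only an illustration of the escaping rules described above, using the URLs from this thread, and says nothing about how mod_rewrite itself is configured.

```python
from urllib.parse import quote, quote_plus

category = "CD s-Dvd s"

# In a path segment a space must become %20; safe="" also escapes "/".
path_url = "http://www.store.com/products/" + quote(category, safe="")

# In a query parameter value a space may be written as "+".
query_url = "http://www.store.com/products.aspx?category=" + quote_plus(category)

# A slash inside a category name becomes %2F in a path segment.
slash_segment = quote("CDs/DVDs", safe="")

print(path_url)
print(query_url)
print(slash_segment)
```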
It is non-trivial to get arbitrary strings through from a path part to a parameter. Apart from the escaping issues, you've got problems with /, which should be encoded as %2F in a path part. However Apache will by default block any URL containing %2F for security reasons. (\ is similarly affected under Windows.) You can turn this behaviour off using the AllowEncodedSlashes config, but it means if you want to be portable you can't use “CDs/DVDs” as a category name.
For this reason, and because having a load of %20s in your URL is a bit ugly, strings are usually turned into ‘slugs’ before being put in a URL, where all the contentious ASCII characters that would result in visible %-escapes are replaced with filler characters such as hyphen or underscore. This does mean you can't round-trip the string, so you need to store either a separate title and slug in the database to be able to look up the right entity for a given slug, or just use an additional ID in the URL (like Stack Overflow does).
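A minimal slug sketch, assuming hyphens as the filler character and an in-memory dictionary standing in for the database's slug column; `slugify` and `products_by_slug` are names invented for this example.

```python
import re
import unicodedata

def slugify(title):
    # Decompose accented letters, then drop anything that is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of other characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower())
    return slug.strip("-")

# The slug cannot be turned back into the title, so store both.
products_by_slug = {}
for title in ["CDs/DVDs", "Hello, this is a test post!"]:
    products_by_slug[slugify(title)] = title

print(products_by_slug)
```

Note that this particular sketch simply discards characters with no ASCII decomposition (Greek or Chinese titles would come out empty), so a real site would need transliteration or an ID in the URL for those.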
General practice is to replace spaces with underscores, ala http://www.store.com/products.aspx?category=CD_s-Dvd_s

Is it possible to create INTERNATIONAL permalinks?

I was wondering how you deal with permalinks on international sites. By permalink I mean a link that is unique and human-readable.
For English phrases it's no problem, e.g. /product/some-title/
But what do you do if the product title is in, say, Chinese?
How do you deal with this problem?
I am implementing an international site, and one requirement is to have human-readable URLs.
Thanks for every comment
Characters outside the ISO Latin-1 set are not permitted in URLs according to this spec, so Chinese strings would be out immediately.
Where the product name can be localised, you can use urls like <DOMAIN>/<LANGUAGE>/DIR/<PRODUCT_TRANSLATED>, e.g.:
http://www.example.com/en/products/cat/
http://www.example.com/fr/products/chat/
accompanied by a mod_rewrite rule to the effect of:
RewriteRule ^([a-z]+)/products/([a-z]+)/?$ product_lookup.php?lang=$1&product=$2 [L]
For the first example above, this rule will call product_lookup.php?lang=en&product=cat. Inside this script is where you would access the internal translation engine (from the lang parameter, en in this case) to do the same translation you do on the user-facing side to translate, say, "Chat" on the French page, "Cat" on the English, etc.
Using an external translation API would be a good idea, but tricky to get a reliable one which works correctly in your business domain. Google have opened up a translation API, but it currently only supports a limited number of languages.
English <=> Arabic
English <=> Chinese
English <=> Russian
Take a look at Wikipedia.
They use national characters in URLs.
For example, Russian home page URL is: http://ru.wikipedia.org/wiki/Заглавная_страница. The browser transparently encodes all non-ASCII characters and replaces them by their codes when sending URL to the server.
But on the web page all URLs are human-readable.
So you don't need to do anything special -- just put your product names into URLs as is.
The webserver should be able to decode them for your application automatically.
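What the browser does on the wire can be reproduced with Python's standard `urllib.parse`: the non-ASCII characters are sent as percent-encoded UTF-8 bytes and decode back to the original text. This only illustrates the encoding step; how a given web server or framework decodes it is not shown.

```python
from urllib.parse import quote, unquote

title = "Заглавная_страница"

# quote() encodes as UTF-8 by default; "_" is left untouched.
encoded = quote(title)
decoded = unquote(encoded)

print("http://ru.wikipedia.org/wiki/" + encoded)
print(decoded)
```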
I usually transliterate the non-ascii characters. For example "täst" would become "taest". GNU iconv can do this for you (I'm sure there are other libraries):
$ echo täst | iconv -t 'ascii//translit'
taest
Alas, these transliterations are locale dependent: in languages other than German, 'ä' could be transliterated as simply 'a', for example. But on the other hand, there should be a transliteration for every (commonly used) character set into ASCII.
How about some scheme like /productid/{product-id-number}/some-title/
where the site looks at the {number} and ignores the 'some-title' part entirely. You can put that into whatever language or encoding you like, because it's not being used.
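A small sketch of that scheme, assuming the /productid/{number}/{title}/ layout from the answer; the regex and the function name `product_id_from_path` are made up for illustration.

```python
import re

# Capture the numeric ID; the optional title segment is matched but ignored.
PRODUCT_URL = re.compile(r"^/productid/(\d+)(?:/[^/]*)?/?$")

def product_id_from_path(path):
    match = PRODUCT_URL.match(path)
    return int(match.group(1)) if match else None

print(product_id_from_path("/productid/42/some-title/"))
print(product_id_from_path("/productid/42/何か/"))
print(product_id_from_path("/productid/abc/"))
```

Because only the number is used for the lookup, the title part can be in any language or encoding, and a stale or mistyped title still resolves to the right product.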
If memory serves, you're only able to use English letters in URLs. There's a discussion to change that, but I'm fairly positive that it's not been implemented yet.
That said, you'd need to have a lookup table where you assign translations of products/titles into whatever word they'll be in the other language. For example:
foo.com/cat will need a translation look up for "cat" "gato" "neko" etc.
Then your HTTP module, which parses those human-readable parts into an exact URL, will know which page to serve based upon the translations.
Creating a lookup table for such a thing seems like overkill to me. I cannot create a lookup for all the different words in all languages. Maybe accessing a translation API would be a good idea.
So as far as I can see, it's not possible to use foreign characters in the permalink, as the URL specs do not allow it.
What do you think of encoding the special characters? Are those URLs recognized by Google then?

Resources