I've done some research and can't find any way to transliterate Japanese.
Stringex would have been the best bet, but it's a known issue that it treats Japanese as Mandarin. I've verified this happens even when there are obvious Japanese characters present (ie kana). Same for Unidecoder.
I realise it's a hard problem as Kanji phonetics are ambiguous, but this is for URL slugs so it doesn't have to be perfect.
I've never tried it, but you could try this:
https://github.com/makimoto/romaji
"Romaji" is basically the name for system the Japanese themselves use to write in Roman/Latin characters. I am not certain if transliterations targetted at romaji for Japanese speakers will be what you need if you are targetting non-Japanese speakers, but it seems like it might be good.
So you could try romaji in your google searching, that's what I did. Since ruby, of course, originated in Japan, I figured there'd be something around for it.
Related
I am using Topsy, It returns me title of highest ranking article of my mebsite, It returns me one RSS file which contains post title with there link. For now i am only taking post name and using post title am trying to search in mysql database using following function like this:
get_post_by_title($postTitle,'post');
But the problem is topsy returns me post title but it also add some special characters in RSS file like " ' " replace with " ’ " this charecters.Because of this get_post_by_title() function does not return me post by title name.
EDIT : It returns me one post title like this :
iPad Applications In Bloom’s Taxonomy NEXT
Here single quote is special charecter.
Please help me. Thanks
First let's clear up a misconception: that character in your example is not a "special" character. It is Unicode code point U+2019, "RIGHT SINGLE QUOTATION MARK." Its HTML entity reference is ’. It's an ordinary character - it just happens to be an ordinary character that has no representation in ASCII. Before getting to an answer to your specific question, I need to tell you to read Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" - it is just what it says on the tin, and unless you absorb at least a little more knowledge about Unicode, you will keep running into problems like this. Don't fret too much: everyone runs into problems like this until they learn how to deal with text. Unicode isn't "hard" so much as it is "prone to exposing unconscious assumptions we make about how text works." †
Now, to your question.
If I'm reading you right, what's happening to you is that you have posts with non-ASCII characters in their titles such as ’ which aren't showing up when you search for them with get_post_by_title() (it seems like you're using something similar to the accepted answer on this question - is that right?) There are two paths to a solution: store the titles in a format that's easier for you to search, or use a searching method that can find non-ASCII characters.
Storing the titles differently would require that you run them through PHP's built-in htmlentities() function or before storing them in your Wordpress DB - you would also want to make sure that you convert characters with no HTML entity equivalent to '\xNN' form, and to make sure that your DB's collation/charset is set to UTF-8 or another Unicode-aware encoding. This will be a nontrivial amount of effort. ‡
Using a different searching method doesn't require tinkering with your DB or digging into WordPress internals, but it does require very careful fiddling with search string. You'll need to either use the exact character you're looking for in a search, expressed as a '\xNN' character reference if necessary, or use wildcards carefully in the search.
Either way, good luck. It may be possible to offer more specific advice if more of your code is visible.
†: By the way, your life with regards to Unicode will also get much, much easier if you use better languages than PHP and better databases than MySQL. WordPress is inextricably tied to PHP and MySQL: PHP & MySQL are both woefully, horrendous, hilariously bad at handling Unicode issues correctly. Your life as a programmer will get better if you extirpate PHP & MySQL from it.
‡: Seriously, PHP is atrociously bad at this, and MySQL is in a shoelaces-tied-together state of fumbling. Avoid them.
remove from wp-config.php
//define('DB_CHARSET', 'utf8');
//define('DB_COLLATE','utf8_unicode_ci');
You can easily remove special characters using preg_replace, see this post -> http://code-tricks.com/filter-non-ascii-characters-using-php/
(Using Ruby 1.8)
I only have a brief understanding of encoding and such...but what I want to know is, in any given script handling any given text-file, is there some universal library or call I need to make to turn non-standard characters into their nearest printable equivalent. I realize there's no "all-in-one" fix, but this is for a English (U.S. gov't) text file, and so I'm wondering if there's something that mitigates what must be a relatively common issue in English text formatting.
For example, in a text file, I have an entry like this:
0-823
That hyphen is just literally a hyphen as I've typed it out. In the file though, it's something that looks like a hyphen (an n-dash?) but when copy and pasting it...for example, into this browser text box, it doesn't show up.
Printing it out via a Ruby script gets this:
08�23
How do I get my script to resolve it into a dash. Or something other than a gremlin?
It's very common to run into hyphen-like characters and dashes, especially in the output of word-processors. Converting them isn't too hard if you know what the byte is that represents the character, but gets to be a pain when you get a document with several different ones. It gets worse as you throw other accented characters into the mix.
Ruby 1.8 doesn't support multibyte and Unicode character sets as well as 1.9+, but you can work around that somewhat by using the Iconv library.
Iconv lets you convert between various character-sets, such as US-ASCII, ISO-8859-1 and WIN-1252. It's smarter than a regex, because it knows how to convert from accented characters, to similarly looking characters, or ignore them if nothing similar exists, allowing your transliteration to degrade gracefully.
I have some example code in an answer to a related question. Also read James Grey's article linked in the answer. It explains the problem and ways to fix it, ending up with recommending Iconv too.
You could whitelist with gsub:
string.gsub(/[^a-zA-Z0-9]/)
Without knowing more information, I can't build the perfect regex for you, but the general idea is to replace anything that's not what you're expecting (anything not a letter or number or expected symbols).
Normally I use Recaptcha for all captcha purposes, but now I'm building a website that is translated into Chinese and Japanese, among other languages. I'd like to make the captcha as accessible to those users as possible. Even if they can read and type English characters (which is not necessarily the case), often times even I as an English-speaker have had trouble figuring out what the word in Recaptcha has to be.
One good solution I've seen (from Google) is to use numbers instead of text. Are there other good solutions? Is there a reliable free captcha service out there such as Recaptcha that offers this option?
The Chinese and Japanese both use a keyboard with Latin characters on. The Chinese input their 1000s of characters via Pinyin (Romanized Chinese) and so they are very familiar with all the same letters that you and I are. Therefore, whatever you are using for English speaking people can also be used for them.
PS - I know this is an answer to an old post, but I'm hoping this answer will help anyone who comes here with the same question.
I have encountered the same problem in the past, I resolved the issue by using the following CAPTCHA which uses a numerical validation:
http://www.tipstricks.org/
However, this may not be the best solution for you, so here is an extensive list of different CAPTCHAs you might want to consider (most of them are text based, but some use alternative methods such as numerical expressions):
http://captcha.org/
Hope this helps
When I see a small program which is written for some students, I often see something like this: (haskell, german):
ueber = "What the haeck!"
instead of
über = "What the häck!"
As many modern languages are specified to allow non-standard charactes in declaration names via UTF-8, is there a special reason for avoiding these in a project, which is sure to be only for people who are able to input these characters (say for a team of german students?) or is this just a historical reason?
I know, that you should keep names in a-zA-Z_0-9 if you develop an applicaio internationally, but are there any reason for avoiding this in a "local" project?
is there a special reason for avoiding these in a project, which is sure to be only for people who are able to input these characters
That is certainly the main reason. Other reasons that come to mind is that many development tools, search functions, editors, parsers, documentors, code search engines etc. will not expect non-ASCII input in code.
Also, you never know where your code may be used one day! The smallest innocent school project can grow into a nice Open-Source tool that gets used around the globe one day. In that case, ASCII is the smallest common denominator, at least at the moment.
I've had to work on a project started by French developers. They had to spend quite a bit of time translating their program to English when more people joined the project. Teach your German students this lesson up front, and not only will they be able to share their code with others, they'll no longer need an über or ueber variable either.
BTW, ü is an alphabetic character. + and - are non-alphanumeric, and I'd say it's obvious why they're disliked in function names.
I am working on an application that allows users to input Japanese language characters. I am trying to come up with a way to determine whether the user's input is a Japanese kana (hiragana, katakana, or kanji).
There are certain fields in the application where entering Latin text would be inappropriate and I need a way to limit certain fields to kanji-only, or katakana-only, etc.
The project uses UTF-8 encoding. I don't expect to accept JIS or Shift-JIS input.
Ideas?
It sounds like you basically need to just check whether each Unicode character is within a particular range. The Unicode code charts should be a good starting point.
If you're using .NET, my MiscUtil library has some Unicode range support - it's primitive, but it should do the job. I don't have the source to hand right now, but will update this post with an example later if it would be helpful.
Not sure of a perfect answer, but there is a Unicode range for katakana and hiragana listed on Wikipedia. (Which I would expect are also available from unicode.org as well.)
Hiragana: Unicode: 3040-309F
Katakana: Unicode: 30A0–30FF
Checking those ranges against the input should work as a validation for hiragana or katakana for Unicode in a language-agnostic manner.
For kanji, I would expect it to be a little more complicated, as I
expect that the Chinese characters used in Chinese and Japanese are both included in the same range, but then again, I may be wrong here. (I can't expect that Simplified Chinese and Traditional Chinese to be included in the same range...)
oh oh! I had this one once... I had a regex with the hiragana, then katakana and then the kanji. I forget the exact codes, I'll go have a look.
regex is great because you double the problems. And I did it in PHP, my choice for extra strong auto problem generation
--edit--
$pattern = '/[^\wぁ-ゔァ-ヺー\x{4E00}-\x{9FAF}_\-]+/u';
I found this here, but it's not great... I'll keep looking
--edit--
I looked through my portable hard drive.... I thought I had kept that particular snippet from the last company... sorry.