Validating Kana Input - validation

I am working on an application that allows users to input Japanese language characters. I am trying to come up with a way to determine whether the user's input is a Japanese kana (hiragana, katakana, or kanji).
There are certain fields in the application where entering Latin text would be inappropriate and I need a way to limit certain fields to kanji-only, or katakana-only, etc.
The project uses UTF-8 encoding. I don't expect to accept JIS or Shift-JIS input.
Ideas?

It sounds like you basically need to just check whether each Unicode character is within a particular range. The Unicode code charts should be a good starting point.
If you're using .NET, my MiscUtil library has some Unicode range support - it's primitive, but it should do the job. I don't have the source to hand right now, but will update this post with an example later if it would be helpful.

Not sure of a perfect answer, but there is a Unicode range for katakana and hiragana listed on Wikipedia. (Which I would expect are also available from unicode.org as well.)
Hiragana: Unicode: 3040-309F
Katakana: Unicode: 30A0–30FF
Checking those ranges against the input should work as a validation for hiragana or katakana for Unicode in a language-agnostic manner.
For kanji, I would expect it to be a little more complicated, as I
expect that the Chinese characters used in Chinese and Japanese are both included in the same range, but then again, I may be wrong here. (I can't expect that Simplified Chinese and Traditional Chinese to be included in the same range...)

oh oh! I had this one once... I had a regex with the hiragana, then katakana and then the kanji. I forget the exact codes, I'll go have a look.
regex is great because you double the problems. And I did it in PHP, my choice for extra strong auto problem generation
--edit--
$pattern = '/[^\wぁ-ゔァ-ヺー\x{4E00}-\x{9FAF}_\-]+/u';
I found this here, but it's not great... I'll keep looking
--edit--
I looked through my portable hard drive.... I thought I had kept that particular snippet from the last company... sorry.

Related

Using ASCII delimiters (29-31) in modern programming

I'm currently building a hash key string (collapsed from a map) where the values that are delimited by the special ASCII unit delimiter 31 (1F).
This nicely solves the problem of trying to guess what ASCII characters won't be used in the string values and I don't need to worry about escaping or quoting values etc.
However reading about the history of this is it appears to be a relic from the 1960s and I haven't seen many examples where strings are built and tokenised using this special character so it all seems too easy.
Are there any issues to using this delimiter in a modern application?
I'm currently doing this in a non-Unicode C++ application, however I'm interested to know how this applies generally in other languages such as Java, C# and with Unicode.
The lower 128 char map of ASCII is fully set in stone into the Unicode standard, this including characters 0->31. The only reason you don't see special ASCII chars in use in strings very often is simply because of human interfacing limitations: they do not visualize well (if at all) when displayed to screen or written to file, and you can't easily type them in from a keyboard either. They're also not allowed in un-escaped form within various popular 'human readable' file formats, such as XML.
For logical processing tasks within a program that do not need end-user interaction, however, they are perfectly suitable for whatever use you can find for them. Your particular use sounds novel and efficient and I think you should definitely run with it.
Your application is free to accept whatever binary format it pleases. However, if you need to embed arbitrary binary data in your input, you need to escape whatever delimiters or other special codes your format uses. This is true regardless of which ones you choose.
I'd also not ignore Unicode. It's 2012, by now it's rather silly to work with an outdated model for dealing with text. If your input data is textual, handle it as such.
The one issue that comes to mind is why invent another format instead of using XML or JSON; or if you need a compact encoding, a "binary" variant of those two (Fast Infoset, msgpack, who knows what else), or ASN.1? There's probably a whole bunch of other issues that you'll encounter when rolling your own that the design and tooling for those formats already solved.
I work with barcodes in a warehouse setting. We use ASCII code 31 as a field-separator so that a single scan can populate multiple data fields with a single scan. So, consider the ramifications if you think your hash key could end up on a barcode.

Things to take into account when internationalizing web app to handle chinese language

I have a MVC3 web app with i18n in 4 latin languages... but I would like to add CHINESE in the future.
I'm working with standard resource file.
Any tips?
EDIT: Anything about reading direction? Numbers? Fonts?
I would start with these observations:
Chinese is a non-character-based language, meaning that a search engine (if needed) must not use only punctuation and whitespace to find words (basically, each character is a word); also, you might have mixed Latin and Chinese words
make sure to use UTF-8 for all your HTML documents (.resx files are UTF-8 by default)
make sure that your database collation supports Chinese - or use a separate database with an appropriate collation
make sure you don't reverse strings or do other unusual text operations that might break with multi-byte characters
make sure you don't call ToLower and ToUpper to check user-input text because again this might break with other alphabets (or rather scripts) - aka the Turkey Test
To test for all of the above and other possible issues, a good way is pseudolocalization.

In Ruby, how to automatically convert non-supported characters in text-processing?

(Using Ruby 1.8)
I only have a brief understanding of encoding and such...but what I want to know is, in any given script handling any given text-file, is there some universal library or call I need to make to turn non-standard characters into their nearest printable equivalent. I realize there's no "all-in-one" fix, but this is for a English (U.S. gov't) text file, and so I'm wondering if there's something that mitigates what must be a relatively common issue in English text formatting.
For example, in a text file, I have an entry like this:
0-8­23
That hyphen is just literally a hyphen as I've typed it out. In the file though, it's something that looks like a hyphen (an n-dash?) but when copy and pasting it...for example, into this browser text box, it doesn't show up.
Printing it out via a Ruby script gets this:
08�23
How do I get my script to resolve it into a dash. Or something other than a gremlin?
It's very common to run into hyphen-like characters and dashes, especially in the output of word-processors. Converting them isn't too hard if you know what the byte is that represents the character, but gets to be a pain when you get a document with several different ones. It gets worse as you throw other accented characters into the mix.
Ruby 1.8 doesn't support multibyte and Unicode character sets as well as 1.9+, but you can work around that somewhat by using the Iconv library.
Iconv lets you convert between various character-sets, such as US-ASCII, ISO-8859-1 and WIN-1252. It's smarter than a regex, because it knows how to convert from accented characters, to similarly looking characters, or ignore them if nothing similar exists, allowing your transliteration to degrade gracefully.
I have some example code in an answer to a related question. Also read James Grey's article linked in the answer. It explains the problem and ways to fix it, ending up with recommending Iconv too.
You could whitelist with gsub:
string.gsub(/[^a-zA-Z0-9]/)
Without knowing more information, I can't build the perfect regex for you, but the general idea is to replace anything that's not what you're expecting (anything not a letter or number or expected symbols).

What does I18N safe mean?

I came across a comment in some code referring to said code being "I18N safe".
What does this refer to?
I + (some 18 characters) + N = InternationalizatioN
I18N safe means that steps were taken during design and development that will facilitate Localization (L10N) at a later point.
This is most often referred to a code or construct ready for I18N - i.e easily supported by common I18N techniques. For instance, the following is ready:
printf(loadResourceString("Result is %s"), result);
while the following is not:
printf("Result is " + result);
because the word order may vary in different languages. Unicode support, international date-time formatting and the like also qualify.
EDIT: added loadResourceString to make an example close to real life.
i18n means internationalization => i (18 letters) n. Code that's marked as i18n safe would be code that correctly handles non-ASCII character data (e.g. Unicode).
Internationalization. The derivation of it is "the letter I, eighteen letters, the letter N".
I18N stands for Internationalization.
its a numeronym for Internationalization.
Different from Acronym, numeronym is a number based word (eg 411 = information, k9 = canine);
In code, this will normally be a folder title, that It generally refers to code that will work in international environments - with different locale, keyboard, character sets etc. ... "
Read more about it here: http://www.i18nguy.com/origini18n.html
i18n is a shorthand for "internationalization". This was coined at DEC and actually uses lowercase i and n.
As a sidenote: L10n stands for "localization" and uses capital L to distinguish it from the lowercase i.
Without any additional information, I would guess that it means the code handles text as UTF8 and is locale-aware. See this Wikipedia article for more information.
Can you be a bit more specific?
i18n-safe is a vague concept. It generally refers to code that will work in international environments - with different locale, keyboard, character sets etc. True i18n-safe code is hard to write.
It means that code cannot rely on:
sizeof (char) == 1
because that character could be a UTF-32 4-byte character, or a UTF-16 2-byte character, and occupy multiple bytes.
It means that code cannot rely on the length of a string equalling the number of bytes in a string. It means that code cannot rely on zero bytes in a string indicating a nul terminator. It means that code cannot simply assume ASCII encoding of text files, strings, and inputs.
I18N stands for Internationalization.
In a nutshell: I18N safe code means that it uses some kind of a lookup table for texts on the UI. For this you have to support non-ASCII encodings. This might seem to be easy, but there are some gotchas.
"I18N safe" coding means the code that doesn't introduce I18N bugs. I18N is a numeronym for Internationalization, where there are 18 characters between I and N.
There are multiple categories of issues related to i18n such as:
Culture Format: Date Time formats(DD/MM/YY in UK and MM/DD/YY in US ), number formats, Timezone, Measuring units change from culture to culture. The data must be accepted, processed and displayed in the right format for the right culture/locale.
International Characters Support: All the characters from all the different languages should be accepted, processed and displayed correctly.
Localizability: The translatable strings should not be hard code. They should be externalized in resource files.
"I18N Safe" coding means that none of the above issue is introduced by the way the code is written.
i18n deals with
- moving hard coded strings out of the code (not all should be by the way) so they can be localized/translated (localization == L10n), as others have pointed out, and also deals with
- locale sensitive method, such as
--methods dealing with text handling (how many words in a Japanese text is far obvious:), order/collation in different languages/writing systems,
--dealing with date/time (the simplest example is showing am/pm for the US, 24 h clocks for France for instance, going to more complex calendars for specific countries),
--dealing with arabic or hebrew (orientation of UI, of text, etc.),
--encoding as others have pointed out
--database issues
it's a fairly comprehensive angle. Just dealing with "String externalization" is far from enough.
Some (software) languages are better than others in helping developers write i18n code (meaning code which will run on different locales), but it remains a software engineering responsibility.

How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?

My program has to read files that use various encodings. They may be ANSI, UTF-8 or UTF-16 (big or little endian).
When the BOM (Byte Order Mark) is there, I have no problem. I know if the file is UTF-8 or UTF-16 BE or LE.
I wanted to assume when there was no BOM that the file was ANSI. But I have found that the files I am dealing with often are missing their BOM. Therefore no BOM may mean that the file is ANSI, UTF-8, UTF-16 BE or LE.
When the file has no BOM, what would be the best way to scan some of the file and most accurately guess the type of encoding? I'd like to be right close to 100% of the time if the file is ANSI and in the high 90's if it is a UTF format.
I'm looking for a generic algorithmic way to determine this. But I actually use Delphi 2009 which knows Unicode and has a TEncoding class, so something specific to that would be a bonus.
Answer:
ShreevatsaR's answer led me to search on Google for "universal encoding detector delphi" which surprised me in having this post listed in #1 position after being alive for only about 45 minutes! That is fast googlebotting!! And also amazing that Stackoverflow gets into 1st place so quickly.
The 2nd entry in Google was a blog entry by Fred Eaker on Character encoding detection that listed algorithms in various languages.
I found the mention of Delphi on that page, and it led me straight to the Free OpenSource ChsDet Charset Detector at SourceForge written in Delphi and based on Mozilla's i18n component.
Fantastic! Thank you all those who answered (all +1), thank you ShreevatsaR, and thank you again Stackoverflow, for helping me find my answer in less than an hour!
Maybe you can shell out to a Python script that uses Chardet: Universal Encoding Detector. It is a reimplementation of the character encoding detection that used by Firefox, and is used by many different applications. Useful links: Mozilla's code, research paper it was based on (ironically, my Firefox fails to correctly detect the encoding of that page), short explanation, detailed explanation.
Here is how notepad does that
There is also the python Universal Encoding Detector which you can check.
My guess is:
First, check if the file has byte values less than 32 (except for tab/newlines). If it does, it can't be ANSI or UTF-8. Thus - UTF-16. Just have to figure out the endianness. For this you should probably use some table of valid Unicode character codes. If you encounter invalid codes, try the other endianness if that fits. If either fit (or don't), check which one has larger percentage of alphanumeric codes. Also you might try searchung for line breaks and determine endianness from them. Other than that, I have no ideas how to check for endianness.
If the file contains no values less than 32 (apart from said whitespace), it's probably ANSI or UTF-8. Try parsing it as UTF-8 and see if you get any invalid Unicode characters. If you do, it's probably ANSI.
If you expect documents in non-English single-byte or multi-byte non-Unicode encodings, then you're out of luck. Best thing you can do is something like Internet Explorer which makes a histogram of character values and compares it to histograms of known languages. It works pretty often, but sometimes fails too. And you'll have to have a large library of letter histograms for every language.
ASCII? No modern OS uses ASCII any more. They all use 8 bit codes, at least, meaning it's either UTF-8, ISOLatinX, WinLatinX, MacRoman, Shift-JIS or whatever else is out there.
The only test I know of is to check for invalid UTF-8 chars. If you find any, then you know it can't be UTF-8. Same is probably possible for UTF-16. But when it's no Unicode set, then it'll be hard to tell which Windows code page it might be.
Most editors I know deal with this by letting the user choose a default from the list of all possible encodings.
There is code out there for checking validity of UTF chars.

Resources