what kind of keyboard layout can type ISO 8859-1 Characters? - internationalization

what kind of keyboard layout can type ISO 8859-1 Characters?
Example of what needs to be typed are:-
Ánam àbìa èbèa Ógbuá

First of all: Keyboard layouts and character sets are not directly tied to each other. If I type Ü on my keyboard while in a UTF-8 application, the resulting character will be
a UTF-8 character. If I type it in a ISO-8859-1 application, it will be a character from that character set.
That said, there isn't a keyboard layout that covers all ISO-8859-1 characters; every country layout covers a part of them.
Full list of characters
According to Wikipedia, ISO-8859-1 covers the following languages' special characters in full:
Afrikaans, Albanian, Basque, Breton, Catalan, English (UK and US), Faroese, Galician, German, Icelandic, Irish, (new
- orthography), Italian, Kurdish (The
Kurdish Unified Alphabet), Latin
(basic classical orthography), Leonese,
Luxembourgish (basic classical
orthography), Norwegian (Bokmål and
Nynorsk), Occitan, Portuguese,
Rhaeto-Romanic, Scottish, Gaelic,
Spanish, Swahili, Swedish, Walloon
so you can safely assume that the keyboard layouts of those countries cover a part of ISO-8859-1.

This is what I have decided to do. Hope it puts somebody else on the right footing.
With Special thanks to #Pekka for the patience, guidance and support.
// Replaces combination char with special chars
$phrase = "`U `are ^here tod`ay.";
$search = array("`U", "`a", "^h");
$replace = array("û", "ñ", "à");
$resulte = str_replace($search, $replace, $phrase);
Could be cleaner in a function though

Related

the relationship between Visual Studio "Character Set Configuration" and the encoding scheme?

As Microsoft stated that:
Multibyte character sets, in particular the double-byte character sets (DBCS). Multibyte character sets provide a means to represent the large number of characters in many Asian languages.
DBCS code pages are used for languages such as Japanese and Chinese. In such a code page, some characters have two-byte encodings
So based on above, I have contradicting results: (2 out of 4 all possible cases, and I have three questions under 3 cases out of 4)
So Case 1(Contracditing):
I asumme When I choose Use Multi-Byte Character Set, the following will automatically choose DBCS encoding:
string chineseString = "我是路人";
but instead compiler said:
warning C4566: character represented by universal-character-name '\u6211' cannot be represented in the current code page (1252)
which is contradicting the config itself, because 1252 is only western language encoding. Isn't is supposed to use the MBCS/DBCS here?
Case 2 (Understandble, non-contradicting):
I choose "Use Unicode Character Set"
Now I assume I have to specify an encoding, so I will do like this:
string chineseString = u8"我是路人"
which works and makes sense for me.
Case 3(Contracdicting):
I choose "Use Multi-Byte Character Set":
wstring chineseStringW = L"我是路人"
so is now using the encoding DBCS? If so, why string does not pick up DBCS? or just because \u6211 fits in wchar_t?
Case 4:
I choose "Use Unicode Character Set":
wstring chineseStringW = L"我是路人"
so is it now the encoding UTF16-LE?

How do I filter out invisible characters without affecting Japanese character set?

I noticed that some of my input is getting U+2028. I don't know what this is, but how can I prevent this with consideration of UTF-8 and English/Japanese characters?
The character U+2028 is LINE SEPARATOR and is one of space characters.
To select only the Japanese characters is (I am afraid) quite tricky in the Unicode space, because CJK characters spread all over across so many planes, even though Ruby supports an extensive Unicode category format in Regexp like \p{Hiragana}. However, if your only interest is Japanese and ASCII, the NKF library is useful. Here is an example:
require 'nkf'
orig = "b2αÇ()あ相〜\u2028\u3000_━●★】"
p orig
p NKF.nkf('-w -E', NKF.nkf('-e', orig))
# =>
# "b2αÇ()あ相〜\u2028 _━●★】"
# "b2α()あ相〜 _━●★】"
As you see, the unicode character U+2028 is filtered out, whereas a Greek character "α" is preserved because it is included in the Japanese JIS-X-0208 code. Note the accented alphabets like "Ç" are filtered out, because they are not included. The set of so-called hankaku-kana is filtered out (Edited-from) converted into zenkaku-kana (Edited-to) in this formula. The JIS-X-0212 character set is not supported, either.
A solution for your specific case.
I have come up with other solutions (for Ruby 2) in addition to the solution with the NKF library. The comparison as described below is in a way interesting, as they are slightly different from one another. This is a major revision, and so I am posting it as a separate answer. I am also describing the background about this at the end of this post.
I am assuming the original input is in UTF-8 encoding except for the first section (if not, convert it to the UTF-8 first to apply any of the examples).
Solutions to filtering out illegitimate characters
"illegitimate" means the character code that is not included in the encoding defined for a String instance.
In Ruby 2, such String usually should have the encoding ASCII-8BIT. However, some may wrongly have UTF-8 encoding.
If it has the encoding ASCII-8BIT, but if you want to get a legitimate UTF-8 String,
s1 = String.new("あ\x99", encoding: 'ASCII-8BIT') # An example ASCII-8BIT
# => "\xE3\x81\x82\x99"
s1.encoding # => #<Encoding:ASCII-8BIT>
s1.valid_encoding? # => true because 'ASCII-8BIT' accepts anything.
s1.force_encoding('UTF-8')
# => s1=="あ\x99"
s1.valid_encoding? # => false
s2 = s1.encode('UTF-8', invalid: :replace, replace: '')
# => "あ"
s2.valid_encoding? # => true
If it has wrongly the encoding UTF-8, and if you want to filter out the illegitimate codepoints,
s1 = String.new("あ\x99", encoding: 'UTF-8') # An example 'UTF-8'
# => "あ\x99"
s1.encoding # => #<Encoding:UTF-8>
s1.valid_encoding? # => false
s2 = s1.encode('UTF-8', invalid: :replace, replace: '')
# => "あ"
s2.valid_encoding? # => true
Solutions to filtering out "non-Japanese" characters
All the following methods are to filter out "non-Japanese" characters.
Basically, "non-Japanese" characters are those that are not included in one or more of the traditional standard of the Japanese character set.
See the next section for the detailed background of the definition of the "non-Japanese" characters.
The strategy here is to convert the encoding of the original String to a Japanese JIS encoding (referred to as ISO-2022-JP or EUC-JP; basically JIS-X-0208) and to convert back to UTF-8.
Use String#encode
Ruby-2 built-in String#encode does the exact job.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
print "Orig:"; p orig
print "Enc: "; p orig.encode('ISO-2022-JP', undef: :replace, replace: '').encode('UTF-8')
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": filtered out
Euro-sign: filtered out
Latin1: filtered out
JISX0212: filtered out
CJK Compatibility: filtered out
Circled Digit: filtered out
Unicode Line Separator: filtered out
Use NKF library
The NKF library is one of the standard libraries that come with the official Ruby release.
The library is traditional and has been used for decades; cf., NKF stands for Network Kanji Filter.
It does a very similar, though slightly different, job to/from Ruby Encoding.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
require 'nkf'
print "NKF: "; p NKF.nkf('-w -E', NKF.nkf('-e', orig))
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": converted into "zenkaku" (aka full-width)
Euro-sign: filtered out
Latin1: filtered out
JISX0212: filtered out
CJK Compatibility: preserved
Circled Digit: preserved
Unicode Line Separator: filtered out
Use iconv Gem
Ruby Gem iconv does not come with the standard Ruby anymore (I think it used to, up to Ruby 2.1 or something). But you can easily install it with the gem command like gem install iconv .
It can handle ISO-2022-JP-2, unlike the above-mentioned 2 methods, which may be handy (n.b., the encoding ISO-2022-JP-2 is actually defined in Ruby Encoding, but no conversion is defiend for or from it in Ruby in default). Once installed, the following is an example.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
require 'iconv'
output = ''
Iconv.open('iso-2022-jp-2', 'utf-8') do |cd|
cd.discard_ilseq=true
output = cd.iconv orig << cd.iconv(nil)
end
s2 = Iconv.conv('utf-8', 'iso-2022-jp-2', output)
print "Icon:"; p s2
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": preserved
Euro-sign: preserved
Latin1: preserved
JISX0212: preserved
CJK Compatibility: filtered out
Circled Digit: preserved
Unicode Line Separator: filtered out
Summary
Here are the outputs of the above-mentioned three methods:
Orig:"b2◇〒α()あ相〜 _8D━●★】$£€Ç♡㌔③\u2028ハンカク"
Enc: "b2◇〒α()あ相〜 _8D━●★】$£"
NKF: "b2◇〒α()あ相〜 _8D━●★】$£㌔③ハンカク"
Icon:"b2◇〒α()あ相〜 _8D━●★】$£€Ç♡③ハンカク"
All the code snippets above here are available as a gist in Github for convenience — download or git clone and run it.
Background
What is an invalid character? The character U+2028, for example as in the question, is a legitimate UTF-8 character (Line Separator). So, there is no general reason to filter such characters out, though some individual situations may require to.
What is an English character? The lower- and upper-case alphabets (52 in total) probably are. Then, how about the dollar sign ($)? Pound sign (£)? Euro sign (€)? The dollar sign is an ASCII character, whereas neither of the pound and Euro signs is not. The pound sign is included in the traditional Latin-1 (ISO-8859-1) character set, whereas the Euro sign is not. As such, what is an English character is not a trivial question.
You may define ASCII (or Latin-1, or whatever) is the only English character set in your definition, but it is somewhat arbitrary.
What is a Japanese character? OK, Hiragana and Katakana are unique to Japanese. How about Kanji? Do you accept simplified Chinese characters, which are not used in Japan, as Kanji? How about symbols? OK, a few symbols, such as 。 (U+3002; Ideographic Full Stop) and 「 (U+300c; Left Corner Bracket) are essential punctuations in the Japanese text. But, is there any reason to regard characters like ▼ (Black Down-Pointing Triangle), which has been used widely among Japanese-language computer users for decades, as Japanese specific? Perhaps not. They are just symbols that can be used anywhere in the world. And worse, it is not a clear cut; for example, although it is perhaps fair to argue Postal Mark 〒 is Japanese specific, it is not an essential punctuation like the full stop but just a symbol fairly popularly used in Japan. I would not be surprised if the very similar symbol is actually used elsewhere in the world, unknown to me.
Being similar to the argument of ASCII and Latin-1 for English characters, you could define the traditionally used characters included in the JIS (X 0208) character set are the valid Japanese characters. Again, it is inevitably arbitrary. For example, the Pound sign (£) is included in it, whereas Euro sign is not. The diamond mark ◇ (White Diamond) is included, whereas the heart mark ♡ (White Heart Suit) is not. Or, what about those so-called "zenkaku" (aka full-width) characters, which are just duplications of alphabets and Arabic numerals of 0 to 9 of ASCII?
After all, the Unicode is the unified set of the characters used in the world regardless of the languages (— well, ideally at least, though you may argue the real Unicode is not quite idealistic). In this sense there is no definite answer to filter out non-English or non-Japanese characters. Consequently, the original question about filtering out U+2028 is one of those arbitrary demands coming from some specific situations, even though it can well be a popular demand in fact (and hence my answer).
Only the definitive thing you could do is to filter out illegitimate characters for the chosen character encoding, such as UTF-8, as described in the first section of this answer. The rest is, really, up to each individual's need in their specific situations.
Background of the "Japanese" character sets
The Japanese character set was traditionally defined in the JIS standards in the official term. Specifically, JIS-X-0208 and much less popular JIS-X-0212 (often casually called "補助漢字") are the two standards (n.b., they have their specific details like 1983 and 1990). Unfortunately, in practice, NEC, Microsoft and Apple adopted their own variations (called broadly Shift_JIS or SJIS, though each has their own variation). Due to the popularity of their OSs, they were (and to some extent still are(!)) more widely used in Japan in reality than the strict official ones before the era where the UTF-8 is widely accepted.
Note that all of them accept the ASCII at least. So, it has been always safe to use ASCII in pretty much any situations (excepting some in early 80s or before).
The Unicode is very inclusive, containing pretty much any of the characters that have been defined in any of these character codesets. That means any of the characters that have once stirred hot debate (whether you should not use or you may) can be legitimately used in (any of) the Unicode encoding now – I mean legitimate as far as the character encoding is concerned.
I presume this confused practical situation has lead to the results as shown above that slightly differ from one another, depending which method you use. Pick your favourite, depending on your need!

Which characters in UTF8 can have upper/lower case pair?

I know of a-z (A-Z), å (Å), ä (Ä), ö (Ö). But is there any official definition of which characters actually have a sibling in another case?
This is language specific. But do check out Case Mappings. This is part of the standard.
5.18 Case Mappings
Case is a normative property of characters in specific alphabets such as Latin, Greek, Cyrillic,
Armenian, and archaic Georgian, whereby characters are considered to be variants of a
single letter.
You may also want to check the European Alphabetic Scripts part for language specific information.

Convert unicode codepoint to string character in Ruby

I have these values from a unicode database but I'm not sure how to translate them into the human readable form. What are these even called?
Here they are:
U+2B71F
U+2A52D
U+2A68F
U+2A690
U+2B72F
U+2B4F7
U+2B72B
How can I convert these to there readable symbols?
How about:
# Using pack
puts ["2B71F".hex].pack("U")
# Using chr
puts (0x2B71F).chr(Encoding::UTF_8)
In Ruby 1.9+ you can also do:
puts "\u{2B71F}"
I.e. the \u{} escape sequence can be used to decode Unicode codepoints.
The unicode symbols like U+2B71F are referred to as a codepoint.
The unicode system defines a unique codepoint for each character in a multitude of world languages, scientific symbols, currencies etc. This character set is steadily growing.
For example, U+221E is infinity.
The codepoints are hexadecimal numbers. There is always exactly one number defined per character.
There are many ways to arrange this in memory. This is known as an encoding of which the common ones are UTF-8 and UTF-16. The conversion to and fro is well defined.
Here you are most probably looking for converting the unicode codepoint to UTF-8 characters.
codepoint = "U+2B71F"
You need to extract the hex part coming after U+ and get only 2B71F. This will be the first group capture. See this.
codepoint.to_s =~ /U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/
And you're UTF-8 character will be:
utf_8_character = [$1.hex].pack("U")
References:
Convert Unicode codepoints to UTF-8 characters with Module#const_missing.
Tim Bray on the goodness of unicode.
Joel Spolsky - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Dissecting the Unicode regular expression

Allowed characters in submit forms (including UTF-8)

Suppose I allow my users to submit a form containing some text fields (I'm not talking about passwords). My users would occasionally use non-ASCII characters like Russian, Chinese, etc. so I use UTF-8 charsets in my database. The question is, should I really allow all of the possible UTF-8 characters? I had a look at the ASCII table and saw that characters 0 to 31 have nothing to do with text, except for newlines and white spaces. Characters 176 to 223 seem to be for decorative purposes :p. Should I restrict them?
The W3C skips these characters in their example regular expression in Multilingual form encoding:
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
Make sure it is valid UTF-8 and Unicode? Yes
Make sure it does not include certain characters, such as control codes? Probably not necessary
You should be aware that even though you are using UTF-8 in your form, you may not get valid UTF-8 from all user-agents when they send form data to you, and you will have to filter it as necessary. Invalid UTF-8 can take many forms, some of them being
Overlong encodings (which can lead to security issues)
Other invalid UTF-8 byte sequences, which may indicate that the user-agent ignored the character encoding and has submitted something like Windows-1252 or ISO-8859-1 encoding instead.
Code points that lie in reserved surrogate space in Unicode
All the above need to be filtered out during input, otherwise you are not storing valid Unicode.
If you want to serve valid HTML or XHTML, which use a subset of Unicode, you will need also need to filter out (either at input or output):
C0 control codes 0x00 to 0x19 (apart from tab, space, new line, carraige return)
0x7F
C1 control codes 0x80 to 0xBF
(probably) any code point above 0x10FFFF
No.
It's a very bad idea to try to "pre-clean" user input. What you consider "decorative" might be absolutely necessary to readers of another language. The best solution is to store the text as-is in the database, and then sanitize it before writing to the page.
When you say "the ASCII table" you're talking about this page, aren't you? That page is garbage. Only the first 128 characters (ie, 0..127) are "ASCII"; the mappings they show for the numbers 128..255 are from an ASCII extension called cp437. There are a lot of "extended ASCII's" out there, and cp437 is far from the most common one.
But I digress. Your question isn't about character encodings, it's about filtering, and a filter should be based on the properties of the characters: is it a letter, a digit, a control character? Most modern programming languages provide methods or functions to obtain such information, and most provide regex support as well. As for what you should filter, or whether you should filter at all, only you can know that.
It sounds like you need to learn more about character encodings and Unicode, though. Start here.

Resources