I am using IMultilanguage2::ConvertStringFromUnicode to convert from UTF-16. For some languages (Japanese, Chinese, Korean), I am getting an escape sequence (e.g. 0x1B, 0x24, 0x29, 0x43 for codepage 50225 (ISO-2022 Korean)). WideCharToMultiByte exhibits the same behavior.
I am building a MIME message, so the encoding is specified in the header itself and the escape prefix is displayed as-is.
Is there a way to convert without the prefix?
Thank you!
I don't really see a problem here. That is a valid byte sequence in ISO 2022:
Escape sequences to designate character sets take the form ESC I [I...] F, where there are one or more intermediate I bytes from the range 0x20–0x2F, and a final F byte from the range 0x40–0x7F. (The range 0x30–0x3F is reserved for private-use F bytes.) The I bytes identify the type of character set and the working set it is to be designated to, while the F byte identifies the character set itself.
...
Code: ESC $ ) F
Hex: 1B 24 29 F
Abbr: G1DM4
Name: G1-designate multibyte 94-set F
Effect: selects a 94n-character set to be used for G1.
As F is 0x43 (C), this byte sequence tells a decoder to switch to ISO-2022-KR:
Character encodings using ISO/IEC 2022 mechanism include:
...
ISO-2022-KR. An encoding for Korean.
ESC $ ) C to switch to KS X 1001-1992, previously named KS C 5601-1987 (2 bytes per character) [designated to G1]
In this case, you have to specify iso-2022-kr as the charset in a MIME Content-Type or RFC2047-encoded header. But an ISO 2022 decoder still has to be able to switch charsets dynamically while decoding, so it is valid for the data to include an intial switch sequence to the Korean charset.
Is there a way to convert without the prefix?
Not with IMultiLanguage2 and WideCharToMultiByte(), no. They have no clue how you are going to use their output, so it makes sense why they include an initial switch sequence to the Korean charset - so a decoder without access to charset info from MIME (or other source) would still know what charset to use initially.
When you put the data into a MIME message, you will have to manually strip off the charset switch sequence when you set the MIME charset to iso-2022-kr. If you do not want to strip it manually, you will have to find (or write) a Unicode encoder that does not output that initial switch sequence.
That was a red herring - turned out the escape sequence is necessary. The problem was with my code that was trimming the names and addresses using Trim() Delphi function, which trims all characters less than or equal to space (0x20); that includes the escape character (0x1B).
Switching to my own trimming function that removes only spaces fixed the problem.
Related
When testing my code that uses a routine that checks for chars to show using an ASCII value routine, my program should drop control chars but keep chars that may be entered by the user. It seems that while the ASCII value routine is called "ascii", it does not just return ascii values: giving it a char of ƒ returns 402.
For example have found this web site
but it doesn't have ƒ 402 that I can see.
Need to know whether there are other ascii codes above 402 that I need to test my code with. The character set used internally by the software that 'ascii' is written in uses UCS2. The web site found doesn't mention USC2.
There are probably many interpretations ouf »Control Character« out there, but I'll assume you mean C0 and C1 control characters (includes references to the relevant Unicode Standards).
The commonly used 32-bit integer representation of Unicode characters in general is the codepoint notation: »U+« followed by a at least 4 digit positive hex number, which you will find near mentions of characters, e.g. as in »U+007F (delete)«. The result of your »ASCII value« routine will probably be this number without the »U+«;
UCS-2 is a specific encoding for Unicode characters, which you probably won't need to care about directly), and is equivalent to Unicode codepoints for all characters within the the range of the BMP only.
I have a character appearing over the wire that has a hex value and octal value \xb1 and \261.
This is what my header looks like:
From: "\261Central Station <sip#...>"
Looking at the ASCII table the character in the picture is "±":
What I don't understand:
If I try to test the same by passing "±Central Station" in the header I see it converted to "\xC2\xB1". Why?
How can I have "\xB1" or "\261" appearing over the wire instead of "\xC2\xB1".
e. If I try to print "\xB1" or "\261" I never see "±" being printed. But if I print "\u00b1" it prints the desired character, I'm assuming because "\u00b1" is the Unicode format.
From the page you linked to:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1.
That's worth reading twice. The character codes 128–255 aren't ASCII (ASCII is a 7-bit encoding and ends at 127).
Assuming that you're correct that the character in question is ± (it's likely, but not guaranteed), your text could be encoded ISO 8850-1 or, as #muistooshort kindly pointed out in the comments, any of a number of other ISO 8859-X or CP-12XX (Windows-12XX) encodings. We do know, however, that the text isn't (valid) UTF-8, because 0xb1 on its own isn't a valid UTF-8 character.
If you're lucky, whatever client is sending this text specified the encoding in the Content-Type header.
As to your questions:
If I try to test the same by passing ±Central Station in header I see it get converted to \xC2\xB1. Why?
The text you're passing is in UTF-8, and the bytes that represent ± in UTF-8 are 0xC2 0xB1.
How can I have \xB1 or \261 appearing over the wire instead of \xC2\xB1?
We have no idea how you're testing this, so we can't answer this question. In general, though: Either send the text encoded as ISO 8859-1 (Encoding::ISO_8859_1 in Ruby), or whatever encoding the original text was in, or as raw bytes (Encoding::ASCII_8BIT or Encoding::BINARY, which are aliases for each other).
If I try to print \xB1 or \261 I never see ± being printed. But if I print \u00b1 it prints the desired character. (I'm assuming because \u00b1 is the unicode format but I will love If some can explain this in detail.)
That's not a question, but the reason is that \xB1 (\261) is not a valid UTF-8 character. Some interfaces will print � for invalid characters; others will simply elide them. \u00b1, on the other hand, is a valid Unicode code point, which Ruby knows how to represent in UTF-8.
Brief aside: UTF-8 (like UTF-16 and UTF-32) is a character encoding specified by the Unicode standard. U+00B1 is the Unicode code point for ±, and 0xC2 0xB1 are the bytes that represent that code point in UTF-8. In Ruby we can represent UTF-8 characters using either the Unicode code point (\u00b1) or the UTF-8 bytes (in hex: \xC2\xB1; or octal: \302\261, although I don't recommend the latter since fewer Rubyists are familiar with it).
Character encoding is a big topic, well beyond the scope of a Stack Overflow answer. For a good primer, read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", and for more details on how character encoding works in Ruby read Yehuda Katz's "Encodings, Unabridged". Reading both will take you less than 30 minutes and will save you hundreds of hours of pain in the future.
Suppose I allow my users to submit a form containing some text fields (I'm not talking about passwords). My users would occasionally use non-ASCII characters like Russian, Chinese, etc. so I use UTF-8 charsets in my database. The question is, should I really allow all of the possible UTF-8 characters? I had a look at the ASCII table and saw that characters 0 to 31 have nothing to do with text, except for newlines and white spaces. Characters 176 to 223 seem to be for decorative purposes :p. Should I restrict them?
The W3C skips these characters in their example regular expression in Multilingual form encoding:
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
Make sure it is valid UTF-8 and Unicode? Yes
Make sure it does not include certain characters, such as control codes? Probably not necessary
You should be aware that even though you are using UTF-8 in your form, you may not get valid UTF-8 from all user-agents when they send form data to you, and you will have to filter it as necessary. Invalid UTF-8 can take many forms, some of them being
Overlong encodings (which can lead to security issues)
Other invalid UTF-8 byte sequences, which may indicate that the user-agent ignored the character encoding and has submitted something like Windows-1252 or ISO-8859-1 encoding instead.
Code points that lie in reserved surrogate space in Unicode
All the above need to be filtered out during input, otherwise you are not storing valid Unicode.
If you want to serve valid HTML or XHTML, which use a subset of Unicode, you will need also need to filter out (either at input or output):
C0 control codes 0x00 to 0x19 (apart from tab, space, new line, carraige return)
0x7F
C1 control codes 0x80 to 0xBF
(probably) any code point above 0x10FFFF
No.
It's a very bad idea to try to "pre-clean" user input. What you consider "decorative" might be absolutely necessary to readers of another language. The best solution is to store the text as-is in the database, and then sanitize it before writing to the page.
When you say "the ASCII table" you're talking about this page, aren't you? That page is garbage. Only the first 128 characters (ie, 0..127) are "ASCII"; the mappings they show for the numbers 128..255 are from an ASCII extension called cp437. There are a lot of "extended ASCII's" out there, and cp437 is far from the most common one.
But I digress. Your question isn't about character encodings, it's about filtering, and a filter should be based on the properties of the characters: is it a letter, a digit, a control character? Most modern programming languages provide methods or functions to obtain such information, and most provide regex support as well. As for what you should filter, or whether you should filter at all, only you can know that.
It sounds like you need to learn more about character encodings and Unicode, though. Start here.
Does anyone know of a Windows app that can scan through a directory and check which scripts are/aren't encoded as a specified charset (UTF-8 in this case)? I could do it manually, but that could take a while and is quite error prone!
UTF-8 isn't a character set, it's an encoding for Unicode characters. And, since this is not programming related, I'm nudging it over to superuser.
If you do want to write a program for detecting those sequences, it's pretty easy:
Illegal UTF-8 initial sequences
UTF-8 Sequence Reason for Illegality
10xxxxxx illegal as initial byte of character (80..BF)
1100000x illegal, overlong (C0 80..BF)
11100000 100xxxxx illegal, overlong (E0 80..9F)
11110000 1000xxxx illegal, overlong (F0 80..8F)
11111000 10000xxx illegal, overlong (F8 80..87)
11111100 100000xx illegal, overlong (FC 80..83)
1111111x illegal; prohibited by spec
Then, provided the first octet is legal, just remember that the number of octets forming a code point can be obtained by counting the number of 1 bits before the first 0 bit.
For example, 11110xxx is the start of a 4-octet sequence so you should skip ahead 4 octets once you've established its legality.
The other thing to do is ensure that all continuation octets start with 10.
Not sure if this is what you're looking for, but I use a command shell for-loop and dump the first few bytes of each file using my hdump utility, which displays the bytes of the file in hexadecimal form. I then look for the leading 3-byte UTF-8 signature (Byte Order Mark) at the start of each file.
My hdump utility is available at: http://david.tribble.com/programs.html
I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF8 character.
Update: If I happen to know the equivalent Oracle character set for the incoming text can I then convert the raw bytes to UTF8? The source text will always be single byte.
There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 127 code points, that the term is too vague to be meaningful. You need to find out if the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.