UTF-8 charecter - UC2-Decimal(150).How to handle this char? - utf-8

In my app some UTF characters like – whose value is UC2-Decimal(150),UC2-Hex(0096),UTF8-Hex(C296) which is fetched from some website, cannot be displayed properly in the browser.
It displays some unwanted characters.
Can anyone help me on this?

I have no idea what UC2-Decimal and UC2-Hex mean, but you seem to be talking about Unicode code point 150 (decimal) or 0x96 (hex). This Unicode character exists but it isn't what you are looking for: it's an obscure control character in the C1 series known as "SPA". Presumably this is the problem!
The character which you actually copied in your question is U+2013, known as "EN DASH", a valid character. Its UTF-8 representation would be the three-byte sequence 0xe2 0x80 0x93.

Related

How do I handle encoding for importing foreign movie titles with diacritics into a database?

There may not be one rule to this, but I am trying to find a way to deal with encoding between a PSQL import and retrieving and displaying records in bash or other programs. I am really new to the idea of encoding, so please bear with me! I'm running into issues with the character 'é'. I've gotten the error ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x72 0x61 on import (I believe the default on it is UTF-8) and was able to temporarily fix it by changing the encoding to Windows-1251 on the import. However, trying to retrieve the data in bash gave me the error ERROR: character with byte sequence 0xd0 0xb9 in encoding "UTF8" has no equivalent in encoding "WIN1252", so I assumed bash was using 1252 encoding.
I deleted it all and re-imported with WIN1252 encoding, and it worked for both import and retrieval. My concern is whether I may run into issues down the line with displaying this character or trying to retrieve it on a browser. Currently, if I select the movie by id in bash, I get Les MisΘrables. Honestly, not ideal, but it's okay with me if there won't be errors. It did scare me, though, when the query couldn't be completed because of an encoding mismatch.
From my little understanding of UTF-8, I feel like the character 'é' should have been accepted in the first place. I searched online for a character set and that was on it. Can anyone tell me a little more about any of this? My inclination is to go with UTF-8 as it seems the most ubiquitous, but I don't know why it was giving me trouble. Thanks!
EDIT: My complete lack of knowledge surrounding encoding led me to never save the file specifically encoded as UTF-8. That solved it. Thanks to all who looked into this!
Described three different individual problems in the question. Code examples given in Python (commonly comprehensible, I hope).
1st ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x72 0x61 means exactly what said: it's not UTF-8:
>>> print(bytearray([0xe9,0x72,0x61]).decode('latin1'))
éra
2nd ERROR: character with byte sequence 0xd0 0xb9 in encoding "UTF8" has no equivalent in encoding "WIN1252". Source is UTF-8, result is Cyrillic ('WIN1251'):
>>> print(bytearray([0xd0,0xb9]).decode('UTF-8'))
й
3rd I get Les MisΘrables (probably instead of Les Misérables?) It's a flagrant mojibake case:
>>> print('Les Misérables'.encode('latin1').decode('cp437'))
Les MisΘrables
In fine, here's a recapitulation of discussed byte values and corresponding characters. Note that the column 'CodePoint' contains Unicode (U+hhhh) and UTF-8 bytes:
Char CodePoint Category Description
---- --------- -------- -----------
é {U+00E9, 0xC3,0xA9} LowercaseLetter Latin Small Letter E With Acute
r {U+0072, 0x72} LowercaseLetter Latin Small Letter R
a {U+0061, 0x61} LowercaseLetter Latin Small Letter A
Ð {U+00D0, 0xC3,0x90} UppercaseLetter Latin Capital Letter Eth
¹ {U+00B9, 0xC2,0xB9} OtherNumber Superscript One
й {U+0439, 0xD0,0xB9} LowercaseLetter Cyrillic Small Letter Short I
Θ {U+0398, 0xCE,0x98} UppercaseLetter Greek Capital Letter Theta

Octal, Hex, Unicode

I have a character appearing over the wire that has a hex value and octal value \xb1 and \261.
This is what my header looks like:
From: "\261Central Station <sip#...>"
Looking at the ASCII table the character in the picture is "±":
What I don't understand:
If I try to test the same by passing "±Central Station" in the header I see it converted to "\xC2\xB1". Why?
How can I have "\xB1" or "\261" appearing over the wire instead of "\xC2\xB1".
e. If I try to print "\xB1" or "\261" I never see "±" being printed. But if I print "\u00b1" it prints the desired character, I'm assuming because "\u00b1" is the Unicode format.
From the page you linked to:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1.
That's worth reading twice. The character codes 128–255 aren't ASCII (ASCII is a 7-bit encoding and ends at 127).
Assuming that you're correct that the character in question is ± (it's likely, but not guaranteed), your text could be encoded ISO 8850-1 or, as #muistooshort kindly pointed out in the comments, any of a number of other ISO 8859-X or CP-12XX (Windows-12XX) encodings. We do know, however, that the text isn't (valid) UTF-8, because 0xb1 on its own isn't a valid UTF-8 character.
If you're lucky, whatever client is sending this text specified the encoding in the Content-Type header.
As to your questions:
If I try to test the same by passing ±Central Station in header I see it get converted to \xC2\xB1. Why?
The text you're passing is in UTF-8, and the bytes that represent ± in UTF-8 are 0xC2 0xB1.
How can I have \xB1 or \261 appearing over the wire instead of \xC2\xB1?
We have no idea how you're testing this, so we can't answer this question. In general, though: Either send the text encoded as ISO 8859-1 (Encoding::ISO_8859_1 in Ruby), or whatever encoding the original text was in, or as raw bytes (Encoding::ASCII_8BIT or Encoding::BINARY, which are aliases for each other).
If I try to print \xB1 or \261 I never see ± being printed. But if I print \u00b1 it prints the desired character. (I'm assuming because \u00b1 is the unicode format but I will love If some can explain this in detail.)
That's not a question, but the reason is that \xB1 (\261) is not a valid UTF-8 character. Some interfaces will print � for invalid characters; others will simply elide them. \u00b1, on the other hand, is a valid Unicode code point, which Ruby knows how to represent in UTF-8.
Brief aside: UTF-8 (like UTF-16 and UTF-32) is a character encoding specified by the Unicode standard. U+00B1 is the Unicode code point for ±, and 0xC2 0xB1 are the bytes that represent that code point in UTF-8. In Ruby we can represent UTF-8 characters using either the Unicode code point (\u00b1) or the UTF-8 bytes (in hex: \xC2\xB1; or octal: \302\261, although I don't recommend the latter since fewer Rubyists are familiar with it).
Character encoding is a big topic, well beyond the scope of a Stack Overflow answer. For a good primer, read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", and for more details on how character encoding works in Ruby read Yehuda Katz's "Encodings, Unabridged". Reading both will take you less than 30 minutes and will save you hundreds of hours of pain in the future.

à © and other codes

I got a file full of those codes, and I want to "translate" it into normal chars (a whole file, I mean). How can I do it?
Thank you very much in advance.
Looks like you originally had a UTF-8 file which has been interpreted as an 8 bit encoding (e.g. ISO-8859-15) and entity-encoded. I say this because the sequence C3A9 looks like a pretty plausible UTF-8 encoding sequence.
You will need to first entity-decode it, then you'll have a UTF-8 encoding again. You could then use something like iconv to convert to an encoding of your choosing.
To work through your example:
à © would be decoded as the byte sequence 0xC3A9
0xC3A9 = 11000011 10101001 in binary
the leading 110 in the first octet tells us this could be interpreted as a UTF-8 two byte sequence. As the second octet starts with 10, we're looking at something we can interpret as UTF-8. To do that, we take the last 5 bits of the first octet, and the last 6 bits of the second octet...
So, interpreted as UTF8 it's 00011101001 = E9 = é (LATIN SMALL LETTER E WITH ACUTE)
You mention wanting to handle this with PHP, something like this might do it for you:
//to load from a file, use
//$file=file_get_contents("/path/to/filename.txt");
//example below uses a literal string to demonstrate technique...
$file="&Précédent is a French word";
$utf8=html_entity_decode($file);
$iso8859=utf8_decode($utf8);
//$utf8 contains "Précédent is a French word" in UTF-8
//$iso8859 contains "Précédent is a French word" in ISO-8859

Allowed characters in submit forms (including UTF-8)

Suppose I allow my users to submit a form containing some text fields (I'm not talking about passwords). My users would occasionally use non-ASCII characters like Russian, Chinese, etc. so I use UTF-8 charsets in my database. The question is, should I really allow all of the possible UTF-8 characters? I had a look at the ASCII table and saw that characters 0 to 31 have nothing to do with text, except for newlines and white spaces. Characters 176 to 223 seem to be for decorative purposes :p. Should I restrict them?
The W3C skips these characters in their example regular expression in Multilingual form encoding:
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
Make sure it is valid UTF-8 and Unicode? Yes
Make sure it does not include certain characters, such as control codes? Probably not necessary
You should be aware that even though you are using UTF-8 in your form, you may not get valid UTF-8 from all user-agents when they send form data to you, and you will have to filter it as necessary. Invalid UTF-8 can take many forms, some of them being
Overlong encodings (which can lead to security issues)
Other invalid UTF-8 byte sequences, which may indicate that the user-agent ignored the character encoding and has submitted something like Windows-1252 or ISO-8859-1 encoding instead.
Code points that lie in reserved surrogate space in Unicode
All the above need to be filtered out during input, otherwise you are not storing valid Unicode.
If you want to serve valid HTML or XHTML, which use a subset of Unicode, you will need also need to filter out (either at input or output):
C0 control codes 0x00 to 0x19 (apart from tab, space, new line, carraige return)
0x7F
C1 control codes 0x80 to 0xBF
(probably) any code point above 0x10FFFF
No.
It's a very bad idea to try to "pre-clean" user input. What you consider "decorative" might be absolutely necessary to readers of another language. The best solution is to store the text as-is in the database, and then sanitize it before writing to the page.
When you say "the ASCII table" you're talking about this page, aren't you? That page is garbage. Only the first 128 characters (ie, 0..127) are "ASCII"; the mappings they show for the numbers 128..255 are from an ASCII extension called cp437. There are a lot of "extended ASCII's" out there, and cp437 is far from the most common one.
But I digress. Your question isn't about character encodings, it's about filtering, and a filter should be based on the properties of the characters: is it a letter, a digit, a control character? Most modern programming languages provide methods or functions to obtain such information, and most provide regex support as well. As for what you should filter, or whether you should filter at all, only you can know that.
It sounds like you need to learn more about character encodings and Unicode, though. Start here.

Importing extended ASCII into Oracle

I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF8 character.
Update: If I happen to know the equivalent Oracle character set for the incoming text can I then convert the raw bytes to UTF8? The source text will always be single byte.
There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 127 code points, that the term is too vague to be meaningful. You need to find out if the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.

Resources