&Atilde;&copy; and other codes - utf-8

I got a file full of those codes, and I want to "translate" it into normal chars (a whole file, I mean). How can I do it?
Thank you very much in advance.

It looks like you originally had a UTF-8 file that was interpreted as an 8-bit encoding (e.g. ISO-8859-15) and then entity-encoded. I say this because the sequence C3A9 looks like a plausible UTF-8 encoding sequence.
You will first need to entity-decode it, which gives you UTF-8 again. You could then use something like iconv to convert to an encoding of your choosing.
To work through your example:
&Atilde;&copy; would be decoded as the byte sequence 0xC3A9
0xC3A9 = 11000011 10101001 in binary
The leading 110 in the first octet tells us this could be the start of a UTF-8 two-byte sequence. As the second octet starts with 10, the pair can be interpreted as UTF-8. To do that, we take the last 5 bits of the first octet and the last 6 bits of the second octet.
So, interpreted as UTF-8, it is 00011 101001 = 0xE9 = é (U+00E9, LATIN SMALL LETTER E WITH ACUTE)
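The bit arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of the original answer: the function name decode_two_byte_utf8 is made up here, it handles two-byte sequences only, and it does not reject overlong encodings.

```python
def decode_two_byte_utf8(first: int, second: int) -> int:
    """Return the code point encoded by a two-byte UTF-8 sequence."""
    # A two-byte sequence must look like 110xxxxx 10yyyyyy.
    if first >> 5 != 0b110 or second >> 6 != 0b10:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the last 5 bits of the first octet and the last 6 bits of the second.
    return ((first & 0b00011111) << 6) | (second & 0b00111111)


code_point = decode_two_byte_utf8(0xC3, 0xA9)
print(hex(code_point), chr(code_point))  # 0xe9 é
```

The result agrees with what Python's built-in decoder gives for bytes([0xC3, 0xA9]).decode("utf-8").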
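You mention wanting to handle this with PHP. Something like this might do it for you:
// To load from a file, use:
// $file = file_get_contents("/path/to/filename.txt");
// The example below uses a literal string to demonstrate the technique.
$file = "Pr&Atilde;&copy;c&Atilde;&copy;dent is a French word";
// Decode each entity to a single byte. Since PHP 5.4 the default target charset is UTF-8, so name ISO-8859-1 explicitly.
$utf8 = html_entity_decode($file, ENT_QUOTES, 'ISO-8859-1');
// Note: utf8_decode() is deprecated as of PHP 8.2; iconv() or mb_convert_encoding() can be used instead.
$iso8859 = utf8_decode($utf8);
// $utf8 contains "Précédent is a French word" in UTF-8
// $iso8859 contains "Précédent is a French word" in ISO-8859-1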
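For comparison, the same two steps (entity-decode, then reinterpret the bytes as UTF-8) might look like this in Python. This is a sketch under the assumption that the file was misread as ISO-8859-1; the function name fix_entity_encoded_utf8 is invented for the example, and text misread as Windows-1252 or ISO-8859-15 could need a different codec in the second step.

```python
import html


def fix_entity_encoded_utf8(text: str) -> str:
    # Step 1: turn entities such as &Atilde;&copy; back into characters.
    decoded = html.unescape(text)
    # Step 2: map each character back to the single byte it stood for
    # (assumes the misreading was ISO-8859-1), then decode those bytes as UTF-8.
    return decoded.encode("iso-8859-1").decode("utf-8")


sample = "Pr&Atilde;&copy;c&Atilde;&copy;dent is a French word"
print(fix_entity_encoded_utf8(sample))  # Précédent is a French word
```

For a whole file, the same function could be applied to the contents read as text and the result written back out with encoding="utf-8".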

How do I handle encoding for importing foreign movie titles with diacritics into a database?

There may not be one rule to this, but I am trying to find a way to deal with encoding between a PSQL import and retrieving and displaying records in bash or other programs. I am really new to the idea of encoding, so please bear with me! I'm running into issues with the character 'é'. I've gotten the error ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x72 0x61 on import (I believe the default on it is UTF-8) and was able to temporarily fix it by changing the encoding to Windows-1251 on the import. However, trying to retrieve the data in bash gave me the error ERROR: character with byte sequence 0xd0 0xb9 in encoding "UTF8" has no equivalent in encoding "WIN1252", so I assumed bash was using 1252 encoding.
I deleted it all and re-imported with WIN1252 encoding, and it worked for both import and retrieval. My concern is whether I may run into issues down the line with displaying this character or trying to retrieve it on a browser. Currently, if I select the movie by id in bash, I get Les MisΘrables. Honestly, not ideal, but it's okay with me if there won't be errors. It did scare me, though, when the query couldn't be completed because of an encoding mismatch.
From my little understanding of UTF-8, I feel like the character 'é' should have been accepted in the first place. I searched online for a character set and that was on it. Can anyone tell me a little more about any of this? My inclination is to go with UTF-8 as it seems the most ubiquitous, but I don't know why it was giving me trouble. Thanks!
EDIT: My complete lack of knowledge surrounding encoding led me to never save the file specifically encoded as UTF-8. That solved it. Thanks to all who looked into this!
The question describes three separate problems. The code examples are in Python (which I hope is widely understood).
1st: ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x72 0x61 means exactly what it says: the bytes are not valid UTF-8. Decoded as Latin-1 they make sense:
>>> print(bytearray([0xe9,0x72,0x61]).decode('latin1'))
éra
2nd: ERROR: character with byte sequence 0xd0 0xb9 in encoding "UTF8" has no equivalent in encoding "WIN1252". The source is UTF-8 and the character is Cyrillic: importing as Windows-1251 turned the byte 0xe9 into й, which has no counterpart in WIN1252:
>>> print(bytearray([0xd0,0xb9]).decode('UTF-8'))
й
3rd: I get Les MisΘrables (presumably instead of Les Misérables). This is a plain case of mojibake, where the Latin-1 byte for é is displayed using code page 437:
>>> print('Les Misérables'.encode('latin1').decode('cp437'))
Les MisΘrables
Finally, here is a recap of the byte values discussed and the corresponding characters. Note that the 'CodePoint' column contains both the Unicode code point (U+hhhh) and the UTF-8 bytes:
Char  CodePoint             Category         Description
----  --------------------  ---------------  -------------------------------
é     {U+00E9, 0xC3,0xA9}   LowercaseLetter  Latin Small Letter E With Acute
r     {U+0072, 0x72}        LowercaseLetter  Latin Small Letter R
a     {U+0061, 0x61}        LowercaseLetter  Latin Small Letter A
Ð     {U+00D0, 0xC3,0x90}   UppercaseLetter  Latin Capital Letter Eth
¹     {U+00B9, 0xC2,0xB9}   OtherNumber      Superscript One
й     {U+0439, 0xD0,0xB9}   LowercaseLetter  Cyrillic Small Letter Short I
Θ     {U+0398, 0xCE,0x98}   UppercaseLetter  Greek Capital Letter Theta
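The chain of events behind the first two errors can be replayed in one short script. This is a sketch that assumes, as the error messages suggest, that the database stores text as UTF-8 and that the client asked for WIN1252; the variable names are only for illustration.

```python
raw = bytes([0xE9, 0x72, 0x61])  # the bytes from the first error message

# 1) The bytes are not valid UTF-8, hence the first error.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# 2) Importing them as Windows-1251 turns 0xE9 into a Cyrillic letter,
#    which is then stored as the UTF-8 bytes 0xD0 0xB9.
as_cp1251 = raw.decode("cp1251")
stored = as_cp1251[0].encode("utf-8")

# 3) That letter has no equivalent in Windows-1252, hence the second error.
try:
    as_cp1251.encode("cp1252")
    convertible = True
except UnicodeEncodeError:
    convertible = False

print(valid_utf8, as_cp1251, stored.hex(), convertible)  # False йra d0b9 False
```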

Special character 'Â' inserted before copyright symbol

Our source code contains a copyright at the top of every CSS file...
/* Copyright © ... */
Every time CSS files are loaded by the Firefox Style Editor, a special character is inserted before the copyright symbol...
/* Copyright Â© ... */
It adds an additional special character each time the file is loaded. I do not believe this is limited to Firefox, but that's what I use at the moment for CSS dynamic styling. It's annoying to have to delete this char every time and occasionally it gets into commits and pushed.
Question: How can the special character insertion be prevented?
Instead of using the copyright symbol itself, try using its numeric character reference:
&#169;
My advice is to open the files in Notepad++ and check the detected encoding, as displayed under the Encoding menu. I expect that it will read:
Encode in UTF-8
If so, apply Convert to UTF-8-BOM. It will prepend 3 magic bytes to your text file, making the UTF-8 encoding explicit. Save the files and see if it works.
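To see what those three bytes are, here is a small Python sketch (Python is used only for illustration; the answer itself relies on Notepad++). The "utf-8-sig" codec writes the same bytes as "utf-8" with the BOM in front.

```python
import codecs

text = "/* Copyright © ... */"

plain = text.encode("utf-8")         # no BOM
with_bom = text.encode("utf-8-sig")  # BOM followed by the same bytes

print(with_bom[:3].hex())                   # efbbbf
print(with_bom == codecs.BOM_UTF8 + plain)  # True
```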
Explanation
The reason this Â appears is that some tool is not detecting the encoding correctly and assumes it is ANSI (a.k.a. Windows-1252) or ISO 8859-1. Those single-byte encodings and UTF-8 are very much alike for normal English texts and code files. The standard ASCII set is encoded in exactly the same way. Only special characters, like the copyright symbol in your case, are encoded differently, using two, three, or four bytes rather than one.
Now, the copyright symbol has bytes 0xC2 0xA9 or 11000010 10101001 in UTF-8 encoding, and byte 0xA9 in ANSI encoding.
The Latin capital letter A with circumflex (Â) has byte 0xC2 or 11000010 in ANSI encoding.
When 11000010 10101001 is encountered and interpreted as UTF-8, the first three bits of the first byte (110), in combination with the first two bits of the second byte (10), indicate a two-byte UTF-8 character. So this is the correct UTF-8 encoding of the copyright symbol.
If, however, 11000010 10101001 is encountered and interpreted as ANSI, two separate characters are seen, Â and ©.
I think it is no coincidence that the second byte of the UTF-8 encoding of © is the same as the one-byte ANSI encoding. The first 256 Unicode code points follow ISO 8859-1, and for code points U+00A0 through U+00BF UTF-8 uses the lead byte 0xC2 followed by a byte equal to the code point itself, leaving the second byte equal to the ANSI byte. E.g. a UTF-8 encoded
µ
would show up as
Âµ
if wrongly interpreted as ANSI.
Maybe, this was done to preserve some information about the original characters, if an encoding error were made.
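The effect is easy to reproduce. The following Python sketch encodes a character as UTF-8 and then decodes the bytes with the wrong codec; the helper name misread_as_ansi is made up for this example, and "cp1252" stands in for what the answer calls ANSI.

```python
def misread_as_ansi(text: str) -> str:
    # Encode correctly as UTF-8, then decode with the wrong (ANSI) codec.
    return text.encode("utf-8").decode("cp1252")


print(misread_as_ansi("©"))  # Â©
print(misread_as_ansi("µ"))  # Âµ

# For U+00A0..U+00BF the UTF-8 form is 0xC2 followed by a byte equal to
# the code point, which is why the original character stays visible.
for cp in range(0xA0, 0xC0):
    assert chr(cp).encode("utf-8") == bytes([0xC2, cp])
```

Each further round of saving as UTF-8 and misreading as ANSI adds more stray characters, which matches the extra character appearing on every load.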
Check if you have set a correct charset-meta-tag in your html head
<meta charSet="UTF-8"/>

Octal, Hex, Unicode

I have a character appearing over the wire whose hex and octal values are \xb1 and \261.
This is what my header looks like:
From: "\261Central Station <sip#...>"
Looking at the ASCII table, the character is "±".
What I don't understand:
If I try to test the same by passing "±Central Station" in the header, I see it converted to "\xC2\xB1". Why?
How can I have "\xB1" or "\261" appearing over the wire instead of "\xC2\xB1"?
If I try to print "\xB1" or "\261", I never see "±" being printed. But if I print "\u00b1", it prints the desired character, I'm assuming because "\u00b1" is the Unicode format.
From the page you linked to:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1.
That's worth reading twice. The character codes 128–255 aren't ASCII (ASCII is a 7-bit encoding and ends at 127).
Assuming that you're correct that the character in question is ± (it's likely, but not guaranteed), your text could be encoded as ISO 8859-1 or, as @muistooshort kindly pointed out in the comments, any of a number of other ISO 8859-X or CP-12XX (Windows-12XX) encodings. We do know, however, that the text isn't (valid) UTF-8, because 0xb1 on its own isn't a valid UTF-8 character.
If you're lucky, whatever client is sending this text specified the encoding in the Content-Type header.
As to your questions:
If I try to test the same by passing ±Central Station in header I see it get converted to \xC2\xB1. Why?
The text you're passing is in UTF-8, and the bytes that represent ± in UTF-8 are 0xC2 0xB1.
How can I have \xB1 or \261 appearing over the wire instead of \xC2\xB1?
We have no idea how you're testing this, so we can't answer this question. In general, though: Either send the text encoded as ISO 8859-1 (Encoding::ISO_8859_1 in Ruby), or whatever encoding the original text was in, or as raw bytes (Encoding::ASCII_8BIT or Encoding::BINARY, which are aliases for each other).
If I try to print \xB1 or \261 I never see ± being printed. But if I print \u00b1 it prints the desired character. (I'm assuming because \u00b1 is the Unicode format, but I would love it if someone could explain this in detail.)
That's not a question, but the reason is that \xB1 (\261) is not a valid UTF-8 character. Some interfaces will print � for invalid characters; others will simply elide them. \u00b1, on the other hand, is a valid Unicode code point, which Ruby knows how to represent in UTF-8.
Brief aside: UTF-8 (like UTF-16 and UTF-32) is a character encoding specified by the Unicode standard. U+00B1 is the Unicode code point for ±, and 0xC2 0xB1 are the bytes that represent that code point in UTF-8. In Ruby we can represent UTF-8 characters using either the Unicode code point (\u00b1) or the UTF-8 bytes (in hex: \xC2\xB1; or octal: \302\261, although I don't recommend the latter since fewer Rubyists are familiar with it).
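The same relationships can be checked outside Ruby. Here is a rough Python equivalent, offered only as an illustration of the byte values discussed above (the variable names are arbitrary):

```python
char = "\u00b1"  # the code point U+00B1, i.e. ±

print(char.encode("utf-8"))       # b'\xc2\xb1'
print(char.encode("iso-8859-1"))  # b'\xb1'

lone = b"\xb1"
print(lone == b"\261")  # True: the octal escape names the same byte

# On its own, 0xB1 is a continuation byte and cannot be decoded as UTF-8.
try:
    lone.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

replaced = lone.decode("utf-8", errors="replace")  # U+FFFD replacement character
as_latin1 = lone.decode("iso-8859-1")              # ±
print(decodes_as_utf8, as_latin1)
```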
Character encoding is a big topic, well beyond the scope of a Stack Overflow answer. For a good primer, read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", and for more details on how character encoding works in Ruby read Yehuda Katz's "Encodings, Unabridged". Reading both will take you less than 30 minutes and will save you hundreds of hours of pain in the future.

convert text from utf to read-able text

I have some UTF-Text starting with "ef bb bf". How can I turn this message to human read-able text? vim, gedit, etc. interpret the file as plain text and show all the ef-text even when I force them to read the file with several utf-encodings. I tried the "recode" tool, it doesn't work. Even php's utf8_decode failed to produce the expected text output.
Please help, how can I convert this file so that I can read it?
ef bb bf is the UTF-8 BOM. Strip off the first three bytes and try to utf8_decode the remainder.
$text = "\xef\xbb\xbf....";
echo utf8_decode(substr($text, 3));
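If PHP is not a requirement, the BOM can also be dropped while decoding. A minimal Python sketch, assuming the file really is UTF-8 with a BOM (the sample bytes here are made up for the demonstration):

```python
import codecs

data = codecs.BOM_UTF8 + "héllo".encode("utf-8")

# "utf-8-sig" drops a leading BOM if present and decodes the rest as UTF-8.
text = data.decode("utf-8-sig")
print(text)  # héllo

# Plain "utf-8" keeps the BOM as an invisible U+FEFF character.
with_bom_char = data.decode("utf-8")
print(repr(with_bom_char[0]))  # '\ufeff'
```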
Is it UTF-8, UTF-16, or UTF-32? It matters a lot! I assume you want to convert the text into old-fashioned ASCII (all characters are 1 byte long).
UTF8 should already be (at least mostly) readable as it uses 1 byte for standard ASCII characters and only uses multiple bytes for special/multilingual characters (Character codes > 127). It sounds like your file isn't UTF8, or you'd already be able to read it! Online content is generally UTF-8.
Unicode character codes are the same as the old ASCII codes up to 127.
UTF-16 uses 2 or 4 bytes per character and UTF-32 always uses 4, whether those characters can be represented in a single byte or not. That makes the text unreadable if the text editor is expecting UTF-8.
Gedit supports UTF-16 and UTF-32, but you need to 'add' those encodings explicitly in the open dialog box (and possibly select them explicitly instead of using auto-detect).
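To make the size differences concrete, here is a small Python sketch that counts the bytes each encoding needs for a few sample characters. The helper name encoded_sizes is made up, and the "-le" codec variants are used so that no BOM is counted. Note that characters outside the Basic Multilingual Plane, such as the emoji below, take 4 bytes in UTF-16.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch: str) -> dict:
    # Number of bytes each encoding needs for one character (no BOM).
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for ch in ("A", "é", "€", "😀"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# ASCII text in UTF-16 is interleaved with zero bytes, which is why an
# editor expecting UTF-8 shows it as garbage.
print("Hi".encode("utf-16-le"))  # b'H\x00i\x00'
```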

Importing extended ASCII into Oracle

I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF8 character.
Update: If I happen to know the equivalent Oracle character set for the incoming text can I then convert the raw bytes to UTF8? The source text will always be single byte.
There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 128 code points (0 through 127), that the term is too vague to be meaningful. You need to find out if the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.
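A short Python sketch shows why the source encoding has to be known first: the same byte 224 becomes a different character, and therefore different UTF-8 bytes, depending on which encoding is assumed. The three encodings below are just examples, not a claim about what the file actually uses.

```python
raw = bytes([224])  # the byte the question calls CHR(224)

results = {}
for source_encoding in ("iso-8859-1", "cp1252", "cp437"):
    char = raw.decode(source_encoding)
    results[source_encoding] = (char, char.encode("utf-8").hex())
    print(source_encoding, char, char.encode("utf-8").hex())
# iso-8859-1 à c3a0
# cp1252 à c3a0
# cp437 α ceb1
```

Once the source character set is known, the conversion itself is a plain decode followed by an encode to UTF-8.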
