How can I read a utf-8 encoded text file in Mathematica?
This is what I'm doing now:
text = Import["charData.txt", "Text", CharacterEncoding -> "UTF8"];
but it tells me that
$CharacterEncoding::utf8: "The byte sequence {240} could not be interpreted as a character in the UTF-8 character encoding"
and so on. I am not sure why. I believe the file is valid utf-8.
Here's the file I'm trying to read:
http://dl.dropbox.com/u/38623/charData.txt
Short version: Mathematica's UTF-8 functionality does not work for character codes with more than 16 bits. Use UTF-16 encoding instead, if possible. But be aware that Mathematica's treatment of 17+ bit character codes is generally buggy. The long version follows...
As noted by numerous commenters, the problem appears to be with Mathematica's support for Unicode characters whose codes are larger than 16 bits. The first such character in the cited text file is U+20B9B (𠮛) which appears on line 10.
Some versions of the Mathematica front-end (like 8.0.1 on 64-bit Windows 7) can handle the character in question when entered directly:
In[1]:= $c="𠮛";
But we run into trouble if we attempt to create the character from its Unicode:
In[2]:= 134043 // FromCharacterCode
During evaluation of In[2]:= FromCharacterCode::notunicode:
A character code, which should be a non-negative integer less
than 65536, is expected at position 1 in {134043}. >>
Out[2]= FromCharacterCode[134043]
One then wonders, what does Mathematica think the code is for this character?
In[3]:= $c // ToCharacterCode
BaseForm[%, 16]
BaseForm[%, 2]
Out[3]= {55362,57243}
Out[4]//BaseForm= {d842, df9b}
Out[5]//BaseForm= {1101100001000010, 1101111110011011}
Instead of a single Unicode code point as one might expect, we get two codes, which match the UTF-16 surrogate-pair representation of that character. Mathematica can perform the inverse transformation as well:
In[6]:= {55362,57243} // FromCharacterCode
Out[6]= 𠮛
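For reference, the surrogate-pair arithmetic works out as follows: 0x20B9B - 0x10000 = 0x10B9B; the high surrogate is 0xD800 + (0x10B9B >> 10) = 0xD842 = 55362, and the low surrogate is 0xDC00 + (0x10B9B & 0x3FF) = 0xDF9B = 57243, exactly the pair returned above.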
What, then, is Mathematica's conception of the UTF-8 encoding of this character?
In[7]:= ExportString[$c, "Text", CharacterEncoding -> "UTF8"] // ToCharacterCode
BaseForm[%, 16]
BaseForm[%, 2]
Out[7]= {237,161,130,237,190,155}
Out[8]//BaseForm= {ed, a1, 82, ed, be, 9b}
Out[9]//BaseForm= {11101101, 10100001, 10000010, 11101101, 10111110, 10011011}
The attentive reader will spot that this is the UTF-8 encoding of the UTF-16 encoding of the character (the non-standard scheme sometimes called CESU-8). Can Mathematica decode this, um, interesting encoding?
In[10]:= ImportString[
ExportString[{237,161,130,237,190,155}, "Byte"]
, "Text"
, CharacterEncoding -> "UTF8"
]
Out[10]= 𠮛
Yes it can! But... so what?
How about the real UTF-8 encoding of this character:
In[11]:= ImportString[
ExportString[{240, 160, 174, 155}, "Byte"]
, "Text"
, CharacterEncoding -> "UTF8"
]
During evaluation of In[11]:= $CharacterEncoding::utf8: The byte sequence {240} could not be
interpreted as a character in the UTF-8 character encoding. >>
$CharacterEncoding::utf8: The byte sequence {160} could not be
interpreted as a character in the UTF-8 character encoding. >>
$CharacterEncoding::utf8: The byte sequence {174} could not be
interpreted as a character in the UTF-8 character encoding. >>
General::stop: Further output of $CharacterEncoding::utf8 will be suppressed
during this calculation. >>
Out[11]= ð ®
No luck: we see the failure reported in the original question.
How about UTF-16? UTF-16 is not on the list of valid character encodings, but "Unicode" is. Since we have already seen that Mathematica seems to use UTF-16 as its native format, let's give it a whirl (using big-endian UTF-16 with a byte-order-mark):
In[12]:= ImportString[
ExportString[
FromDigits[#, 16]& /@ {"fe", "ff", "d8", "42", "df", "9b"}
, "Byte"
]
, "Text"
, CharacterEncoding -> "Unicode"
]
Out[12]= 𠮛
It works. As a more complete experiment, I re-encoded the cited text file from the question into UTF-16 and imported it successfully.
The Mathematica documentation is largely silent on this subject. It is notable that mentions of Unicode in the documentation appear to be accompanied by the assumption that character codes contain 16 bits. See, for example, the references to Unicode in Raw Character Encodings.
The conclusion to be drawn from this is that Mathematica's support for UTF-8 transcoding is missing or buggy for codes longer than 16 bits. UTF-16, the apparent internal format of Mathematica, appears to work correctly. So that is a work-around if you are in a position to re-encode your files and can accept that the resulting strings will actually hold UTF-16 code units rather than true Unicode code points.
Postscript
A little while after writing this response, I attempted to re-open the Mathematica notebook that contains it. Every occurrence of the problematic character in the notebook had been wiped out and replaced with gibberish. I guess there are yet more Unicode bugs to iron out, even in Mathematica 8.0.1 ;)
Related
According to its source code (https://www.rubydoc.info/stdlib/core/Integer:chr), this method uses ASCII encoding if no argument is provided, and indeed it gives different results when called with and without an argument:
irb(main):002:0> 255.chr
=> "\xFF"
irb(main):003:0> 255.chr 'utf-8'
=> "ÿ"
Why does this happen? Isn't Ruby supposed to use UTF-8 everywhere by default? At least all strings seem to be encoded with UTF-8:
irb(main):005:0> "".encoding
=> #<Encoding:UTF-8>
Why does this happen?
For characters from U+0000 to U+007F (127), the vast majority of single-octet and variable-length character encodings agree on the encoding. In particular, they all agree on being strict supersets of ASCII.
In other words: for characters up to and including U+007F, ASCII, the entire ISO8859 family, the entire DOS codepage family, the entire Windows family, as well as UTF-8 are actually identical. So, for characters between U+0000 and U+007F, ASCII is the logical choice:
0.chr.encoding
#=> #<Encoding:US-ASCII>
127.chr.encoding
#=> #<Encoding:US-ASCII>
However, for anything above 127, more or less no two character encodings agree. In fact, the overwhelming majority of characters above 127 don't even exist in most character sets, and thus have no encoding at all in most character encodings.
In other words: it is practically impossible to find a single default encoding for characters above 127.
Therefore, the encoding that is chosen by Ruby is Encoding::BINARY, which is basically a pseudo-encoding that means "this isn't actually text, this is unstructured unknown binary data". (For hysterical raisins, this encoding is also aliased to ASCII-8BIT, which I find absolutely horrible, because ASCII is 7 bit, period, and anything using the 8th bit is by definition not ASCII.)
128.chr.encoding
#=> #<Encoding:ASCII-8BIT>
255.chr.encoding
#=> #<Encoding:ASCII-8BIT>
Note also that the argument-less form of Integer#chr is limited to a single octet, i.e. to the range 0 to 255, so multi-octet or variable-length encodings are not really required there.
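With an explicit encoding, on the other hand, Integer#chr can produce multi-byte characters. A quick sketch (exact inspect output may vary by Ruby version):
128.chr                       #=> "\x80" (Encoding::ASCII-8BIT, i.e. BINARY)
0xE9.chr(Encoding::UTF_8)     #=> "é" (two bytes in UTF-8)
0x1F600.chr(Encoding::UTF_8)  #=> "😀" (four bytes in UTF-8)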
Isn't Ruby supposed to use UTF-8 everywhere by default?
Which encoding are you talking about? Ruby has about a half dozen of them.
For the vast majority of encodings, your statement is incorrect.
the locale encoding is the default encoding of the environment
the filesystem encoding is the encoding that is used for file paths: the value is determined by the file system
the external encoding of an IO object is the encoding that text read from it is assumed to be in and that text written to it is transcoded to: the default is the locale encoding
the internal encoding of an IO object is the encoding that Strings that are written to the IO object must be in and that Strings that are read from the IO object are transcoded into: the default is the default internal encoding, whose default value, in turn, is nil, meaning no transcoding occurs
the script encoding is the encoding in which a Ruby script is read, and which String literals in the script inherit: it is set with a magic comment at the beginning of the script, and the default is UTF-8
So, as you can see, there are many different encodings, and many different defaults, and only one of them is UTF-8. And none of those encodings are actually relevant to your question, because 128.chr is neither a String literal nor an IO object. It is a String object that is created by the Integer#chr method using whatever encoding it sees fit.
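If you want to see what those defaults are on your own machine, you can ask Ruby directly. The values shown below are merely what a typical UTF-8 locale would report, not guarantees:
Encoding.default_external  #=> #<Encoding:UTF-8>  (the external/locale default)
Encoding.default_internal  #=> nil                (no transcoding by default)
Encoding.locale_charmap    #=> "UTF-8"            (the environment's locale)
__ENCODING__               #=> #<Encoding:UTF-8>  (the script encoding)
"".encoding                #=> #<Encoding:UTF-8>  (literals inherit the script encoding)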
I have a character appearing over the wire with hex value \xb1 (octal \261).
This is what my header looks like:
From: "\261Central Station <sip#...>"
Looking at the extended ASCII table, the character in question appears to be "±".
What I don't understand:
1. If I try to test the same by passing "±Central Station" in the header, I see it converted to "\xC2\xB1". Why?
2. How can I have "\xB1" or "\261" appear over the wire instead of "\xC2\xB1"?
3. If I try to print "\xB1" or "\261" I never see "±" being printed. But if I print "\u00b1" it prints the desired character; I'm assuming that's because "\u00b1" is the Unicode format.
From the page you linked to:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1.
That's worth reading twice. The character codes 128–255 aren't ASCII (ASCII is a 7-bit encoding and ends at 127).
Assuming that you're correct that the character in question is ± (it's likely, but not guaranteed), your text could be encoded in ISO 8859-1 or, as @muistooshort kindly pointed out in the comments, any of a number of other ISO 8859-X or CP-12XX (Windows-12XX) encodings. We do know, however, that the text isn't (valid) UTF-8, because 0xB1 on its own isn't a valid UTF-8 byte sequence.
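You can verify that claim directly in Ruby:
"\xB1".force_encoding(Encoding::UTF_8).valid_encoding?      #=> false
"\xC2\xB1".force_encoding(Encoding::UTF_8).valid_encoding?  #=> true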
If you're lucky, whatever client is sending this text specified the encoding in the Content-Type header.
As to your questions:
If I try to test the same by passing ±Central Station in the header I see it get converted to \xC2\xB1. Why?
The text you're passing is in UTF-8, and the bytes that represent ± in UTF-8 are 0xC2 0xB1.
How can I have \xB1 or \261 appearing over the wire instead of \xC2\xB1?
We have no idea how you're testing this, so we can't answer this question. In general, though: Either send the text encoded as ISO 8859-1 (Encoding::ISO_8859_1 in Ruby), or whatever encoding the original text was in, or as raw bytes (Encoding::ASCII_8BIT or Encoding::BINARY, which are aliases for each other).
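For the first option, a minimal sketch (assuming the receiving side really does expect ISO 8859-1):
header = "±Central Station"                  # UTF-8 in the script encoding
wire   = header.encode(Encoding::ISO_8859_1) # transcode before sending
wire.bytes.first.to_s(16)                    #=> "b1" -- a single byte on the wire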
If I try to print \xB1 or \261 I never see ± being printed. But if I print \u00b1 it prints the desired character. (I'm assuming that's because \u00b1 is the Unicode format, but I would love it if someone could explain this in detail.)
That's not a question, but the reason is that \xB1 (\261) is not a valid UTF-8 character. Some interfaces will print � for invalid characters; others will simply elide them. \u00b1, on the other hand, is a valid Unicode code point, which Ruby knows how to represent in UTF-8.
Brief aside: UTF-8 (like UTF-16 and UTF-32) is a character encoding specified by the Unicode standard. U+00B1 is the Unicode code point for ±, and 0xC2 0xB1 are the bytes that represent that code point in UTF-8. In Ruby we can represent UTF-8 characters using either the Unicode code point (\u00b1) or the UTF-8 bytes (in hex: \xC2\xB1; or octal: \302\261, although I don't recommend the latter since fewer Rubyists are familiar with it).
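To make that concrete, here is a small sketch showing those spellings side by side:
a = "\u00b1"                                    # by Unicode code point
b = "\xC2\xB1".force_encoding(Encoding::UTF_8)  # by raw UTF-8 bytes
a == b                                          #=> true
a.bytes.map { |byte| byte.to_s(16) }            #=> ["c2", "b1"]
"\xB1".force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)  #=> "±"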
Character encoding is a big topic, well beyond the scope of a Stack Overflow answer. For a good primer, read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", and for more details on how character encoding works in Ruby read Yehuda Katz's "Encodings, Unabridged". Reading both will take you less than 30 minutes and will save you hundreds of hours of pain in the future.
I have these values from a Unicode database but I'm not sure how to translate them into human-readable form. What are these even called?
Here they are:
U+2B71F
U+2A52D
U+2A68F
U+2A690
U+2B72F
U+2B4F7
U+2B72B
How can I convert these to their readable symbols?
How about:
# Using pack
puts ["2B71F".hex].pack("U")
# Using chr
puts (0x2B71F).chr(Encoding::UTF_8)
In Ruby 1.9+ you can also do:
puts "\u{2B71F}"
I.e. the \u{} escape sequence can be used to decode Unicode codepoints.
Values like U+2B71F are referred to as code points.
The Unicode standard defines a unique code point for each character in a multitude of world languages, scientific symbols, currencies, etc. This character set is steadily growing.
For example, U+221E is infinity (∞).
Code points are written as hexadecimal numbers, and exactly one code point is defined per character.
There are many ways to arrange these numbers in memory; each such arrangement is known as an encoding, the common ones being UTF-8 and UTF-16. The conversions back and forth are well defined.
Here you are most probably looking to convert Unicode code points into UTF-8 characters.
codepoint = "U+2B71F"
You need to extract the hex part after U+, i.e. 2B71F; this will be the first capture group of the regex below.
codepoint.to_s =~ /U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/
And your UTF-8 character will be:
utf_8_character = [$1.hex].pack("U")
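Putting it together for the whole list from the question, a short sketch reusing the same regex and pack:
codepoints = %w[U+2B71F U+2A52D U+2A68F U+2A690 U+2B72F U+2B4F7 U+2B72B]
codepoints.each do |codepoint|
  if codepoint =~ /U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/
    puts [$1.hex].pack("U")  # prints the corresponding character
  end
end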
References:
Convert Unicode codepoints to UTF-8 characters with Module#const_missing.
Tim Bray on the goodness of unicode.
Joel Spolsky - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Dissecting the Unicode regular expression
I got a file full of those codes, and I want to "translate" it into normal chars (a whole file, I mean). How can I do it?
Thank you very much in advance.
Looks like you originally had a UTF-8 file which has been interpreted as an 8 bit encoding (e.g. ISO-8859-15) and entity-encoded. I say this because the sequence C3A9 looks like a pretty plausible UTF-8 encoding sequence.
You will need to first entity-decode it, then you'll have a UTF-8 encoding again. You could then use something like iconv to convert to an encoding of your choosing.
To work through your example:
&Atilde;&copy; would be decoded as the byte sequence 0xC3A9
0xC3A9 = 11000011 10101001 in binary
the leading 110 in the first octet tells us this could be interpreted as a UTF-8 two byte sequence. As the second octet starts with 10, we're looking at something we can interpret as UTF-8. To do that, we take the last 5 bits of the first octet, and the last 6 bits of the second octet...
So, interpreted as UTF8 it's 00011101001 = E9 = é (LATIN SMALL LETTER E WITH ACUTE)
You mention wanting to handle this with PHP, something like this might do it for you:
//to load from a file, use
//$file=file_get_contents("/path/to/filename.txt");
//example below uses a literal string to demonstrate technique...
$file="&Précédent is a French word";
$utf8=html_entity_decode($file);
$iso8859=utf8_decode($utf8);
//$utf8 contains "Précédent is a French word" in UTF-8
//$iso8859 contains "Précédent is a French word" in ISO-8859
I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF8 character.
Update: If I happen to know the equivalent Oracle character set for the incoming text can I then convert the raw bytes to UTF8? The source text will always be single byte.
There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 128 code points, that the term is too vague to be meaningful. You need to find out if the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.
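To see why that matters, here's a tiny sketch (in Ruby, purely because it keeps the demonstration short; the principle is the same regardless of tool): the very same byte 224 (0xE0) decodes to different characters depending on which encoding you assume it is in.
byte = "\xE0"
byte.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)  #=> "à"
byte.force_encoding(Encoding::ISO_8859_5).encode(Encoding::UTF_8)  #=> "р" (Cyrillic er)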