MATLAB: how to display UTF-8-encoded text read from file? - macos

The gist of my question is this:
How can I display Unicode characters in Matlab's GUI (OS X) so that they are properly rendered?
Details:
I have a table of strings stored in a file, and some of these strings contain UTF-8-encoded Unicode characters. I have tried many different ways (too many to list here) to display the contents of this file in the MATLAB GUI, without success. For example:
>> fid = fopen('/Users/kj/mytable.txt', 'r', 'n', 'UTF-8');
>> [x, x, x, enc] = fopen(fid); enc
enc =
UTF-8
>> tbl = textscan(fid, '%s', 35, 'delimiter', ',');
>> tbl{1}{1}
ans =
ÎÎÎÎÎΠΣΦΩαβγδεζηθικλμνξÏÏÏÏÏÏÏÏÏÏ
>>
As it happens, if I paste the string directly into the MATLAB GUI, the pasted string is displayed properly. This shows that the GUI is not fundamentally incapable of displaying these characters, but once MATLAB reads the string in from a file, it no longer displays it correctly. For example:
>> pasted = 'ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω'
pasted =
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω
>>
Thanks!

I present below my findings after doing some digging... Consider these test files:
a.txt
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω
b.txt
தமிழ்
First, we read files:
%# open file in binary mode, and read a list of bytes
fid = fopen('a.txt', 'rb');
b = fread(fid, '*uint8')'; %'# read bytes
fclose(fid);
%# decode as unicode string
str = native2unicode(b,'UTF-8');
If you try to print the string, you get a bunch of nonsense:
>> str
str =
Nonetheless, str does hold the correct string. We can check the Unicode code point of each character; as you can see, they fall outside the ASCII range (the last two are the non-printable CR-LF line ending):
>> double(str)
ans =
Columns 1 through 13
915 916 920 923 926 928 931 934 937 945 946 947 948
Columns 14 through 26
949 950 951 952 953 954 955 956 957 958 960 961 962
Columns 27 through 35
963 964 965 966 967 968 969 13 10
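The same byte-level picture can be reproduced outside MATLAB. Below is a minimal sketch in Ruby (my choice for illustration only; the helper name decode_as is hypothetical). It decodes the first six bytes of a.txt once as UTF-8 and once as ISO-8859-1, assuming a single-byte default charset is what turns each two-byte Greek letter into "Î" followed by an invisible control character:

```ruby
# Decode a list of raw byte values using the given source encoding
# and return the result as a UTF-8 string.
def decode_as(bytes, encoding_name)
  bytes.pack('C*').force_encoding(encoding_name).encode('UTF-8')
end

# UTF-8 bytes of the first three letters in a.txt (U+0393, U+0394, U+0398).
raw = [0xCE, 0x93, 0xCE, 0x94, 0xCE, 0x98]

good = decode_as(raw, 'UTF-8')       # three characters
bad  = decode_as(raw, 'ISO-8859-1')  # six characters, one per byte

p good.codepoints   # [915, 916, 920]
p bad.codepoints    # [206, 147, 206, 148, 206, 152]
```

The first result matches the start of the double(str) listing above; in the second, 206 is "Î" and 147, 148, 152 are non-printable C1 control characters.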
Unfortunately, MATLAB seems unable to display this Unicode string in a GUI on its own. For example, all these fail:
figure
text(0.1, 0.5, str, 'FontName','Arial Unicode MS')
title(str)
xlabel(str)
One trick I found is to use the embedded Java capability:
%# Java Swing
label = javax.swing.JLabel();
label.setFont( java.awt.Font('Arial Unicode MS',java.awt.Font.PLAIN, 30) );
label.setText(str);
f = javax.swing.JFrame('frame');
f.getContentPane().add(label);
f.pack();
f.setVisible(true);
As I was preparing to write the above, I found an alternative solution. We can use the DefaultCharacterSet undocumented feature and set the charset to UTF-8 (on my machine, it is ISO-8859-1 by default):
feature('DefaultCharacterSet','UTF-8');
Now with a proper font (you can change the font used in the Command Window from Preferences > Font), we can print the string in the prompt (note that DISP is still incapable of printing Unicode):
>> str
str =
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω
>> disp(str)
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπÏςστυφχψω
And to display it in a GUI, UICONTROL should work (under the hood, I think it is really a Java Swing component):
uicontrol('Style','text', 'String',str, ...
'Units','normalized', 'Position',[0 0 1 1], ...
'FontName','Arial Unicode MS', 'FontSize',30)
Unfortunately, TEXT, TITLE, XLABEL, etc. are still showing garbage.
As a side note: It is difficult to work with m-file sources containing Unicode characters in the MATLAB editor. I was using Notepad++, with files encoded as UTF-8 without BOM.

Related

How to calculate Content-length properly in tclhttpd?

My Tcl source files are in utf-8. Tclhttpd would not send national characters properly, so I modified it a bit. However, I also send binary stuff like jpg images and sometimes binary chunks are present in my otherwise utf-8 HTML. I have difficulty calculating the proper Content-length to match exactly what the browser receives (otherwise some trailing characters clobber the next-request headers or the browser keeps waiting 30 sec per request, until a timeout).
In other words, how can I find out how many bytes puts $socket actually wrote to the socket?
I have discovered a particular 11-byte sequence that messes up counting:
proc dump3 string {
    binary scan $string c* c
    binary scan $string H* hex
    return [sdump $string]\n$c\n$hex
};#dump3
proc Httpd_ReturnData {sock type content {code 200} {close 0}} {
    global Httpd
    upvar #0 Httpd$sock data
    #...skip non-pertinent code...
    set content \x4f\x4e\xc2\x00\x03\xff\xff\x80\x00\x3c\x2f
    #content=ONÂÿÿ�</
    #79 78 -62 0 3 -1 -1 -128 0 60 47
    #4f4ec20003ffff80003c2f
    puts content=[dump3 $content]
    puts utf8=[dump3 [encoding convertto utf-8 $content]]
    if {[catch {
        puts "string length=[string length $content] type=$type"
        puts "stringblength=[string bytelength $content]"
        set len [string length $content]
        if [string match -nocase *utf-8* $type] {
            fconfigure $sock -encoding utf-8
            set len [string bytelength $content]
        }
        puts "len=$len fcon=[fconfigure $sock]"
        HttpdRespondHeader $sock $type $close $len $code
        HttpdSetCookie $sock
        puts $sock ""
        if {$data(proto) != "HEAD"} {
            ##fconfigure $sock -translation binary -blocking $Httpd(sockblock)
            ##native: -translation {auto crlf}
            fconfigure $sock -translation lf -blocking $Httpd(sockblock)
            puts -nonewline $sock $content
        }
        Httpd_SockClose $sock $close
    } err]} {
        HttpdCloseFinal $sock $err
    }
}
The output on console is:
content=ONÂÿÿ�</
79 78 -62 0 3 -1 -1 -128 0 60 47
4f4ec20003ffff80003c2f
utf8=ON�ÿÿ�</
79 78 -61 -126 0 3 -61 -65 -61 -65 -62 -128 0 60 47
4f4ec3820003c3bfc3bfc280003c2f
string length=11 type=text/html;charset=utf-8
stringblength=17
len=17 fcon=-blocking 0 -buffering full -buffersize 16384 -encoding utf-8 -eofchar {{} {}} -translation {auto crlf} -peername {128.0.0.71 128.0.0.71 55305} -sockname {128.0.0.8 gen 8016}
HttpdRespondHeader 17
The resulting Content-Length: 17 is too large, so the browser keeps waiting. If only I could know beforehand how many bytes puts will produce from my string, the rest would be easy. Is there a way?
For data going over HTTP, the content length should be the number of bytes in the data as observed on the wire. When working with Httpd_ReturnData you need to ensure that you provide it the binary data to transfer; it does not handle encoding the data for you.
Sending binary data with a length is actually easy:
set binaryData [...]
Httpd_ReturnData $sock "application/octet-stream" $binaryData
# There are many other binary encodings; that's just the most universal one
# Choose the right one for your application, of course
To send text data with a length, you need to do a little more work with encoding convertto:
set textData [...]
Httpd_ReturnData $sock "text/plain; charset=utf-8" \
[encoding convertto utf-8 $textData]
# Similarly, text/plain is a decent fallback here too
(Yes, if you choose a different encoding then you should mention that in both places. You probably ought to use UTF-8 for all text content in this day and age.)
If you can pull the data from a file, you should do so; Httpd_ReturnFile is more efficient than Httpd_ReturnData as it can move the data using efficient data transfer techniques. If sending a text file, you need to be careful to describe the encoding of the file correctly. By far the easiest way to do that is by convention, such as deciding that all text files on your system are UTF-8...
You should virtually never use string bytelength, because it reports the length in one of Tcl's internal-only encodings (a lightly-denormalized almost-UTF-8), not in the UTF-8 that goes over the wire. The measure it returns is only correct when you're doing something very weird, like generating C code that needs to know the sizes of buffers holding strings that will be fed into Tcl's implementation, which is very much not what you're doing (I've only done that sort of thing once in more than 20 years of using Tcl; I've never heard of another legitimate use). I believe it is deprecated precisely because it has a bunch of subtle bugs in how it is used by all too many people.
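To make the three different counts from the question concrete, here is a small sketch in Ruby rather than Tcl (chosen only for illustration; the helper names are hypothetical). It assumes that the internal encoding differs from real UTF-8 for this input only in storing U+0000 as the two bytes C0 80:

```ruby
# The 11 characters from the question, given as Unicode code points.
CODE_POINTS = [0x4f, 0x4e, 0xc2, 0x00, 0x03, 0xff, 0xff, 0x80, 0x00, 0x3c, 0x2f]

# Number of bytes the text occupies on the wire when encoded as UTF-8.
def utf8_wire_bytes(code_points)
  code_points.pack('U*').bytesize
end

# Assumption: the internal "almost-UTF-8" form spends one extra byte
# on every NUL character (C0 80 instead of 00).
def internal_form_bytes(code_points)
  utf8_wire_bytes(code_points) + code_points.count(0)
end

puts CODE_POINTS.length                         # 11 characters
puts utf8_wire_bytes(CODE_POINTS)               # 15 bytes of real UTF-8
puts internal_form_bytes(CODE_POINTS)           # 17, the value string bytelength reported
puts CODE_POINTS.pack('U*').unpack('H*').first  # 4f4ec3820003c3bfc3bfc280003c2f
```

Under that assumption, the correct Content-Length for this content sent as UTF-8 is 15, which is also the number of bytes in the hex dump that encoding convertto utf-8 printed.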

How to restore PDF from ASCII?

How can I restore a PDF file if all I have is its ASCII output?
Example:
%PDF-1.3
%���������
4 0 obj
<< /Length 5 0 R /Filter /FlateDecode >>
stream
x�ѽ
�0�ݧ8O�����[�AAqp� �jK|{S�"�f�2���[�
�(M#���#�FFIw�=*��?J4'�P�y^TP`�Q�
+�i�E�8ψ�g���º��(6�񮽗֭,���s0�T��ZL�~�e�.EA��`J�f��<��M�
[...]
0000120481 00000 n
0000122448 00000 n
trailer
<</Size 94 /Root 57 0 R /Prev 116103 /Info 1 0 R>>
startxref
122488
%%EOF
This is the beginning and end of the output I have, and I need to restore it to a readable form. I tried a few things, without luck.
It is impossible; the information was lost.
You can't represent arbitrary binary data as printable ASCII text at a ratio of one byte to one character.
The ASCII table contains many non-printable characters, and these can be suppressed or replaced when the binary PDF content is converted to text, which destroys the original data. In the sample above, every byte sequence that could not be decoded has already been replaced by the same � character, so different original bytes are now indistinguishable.
Quoted-Printable and Base64 encodings are more suitable for such applications.
See: Binary-to-text encoding
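A short Ruby sketch of the difference (the sample bytes and helper names are hypothetical, not taken from the PDF above): a lossy conversion to ASCII maps different inputs to the same output, while Base64 round-trips exactly.

```ruby
# Lossy conversion: every byte outside 7-bit ASCII becomes '?' (0x3F).
def lossy_ascii(binary)
  binary.bytes.map { |b| b < 0x80 ? b : 0x3F }.pack('C*')
end

# Base64 without line breaks, via the core pack/unpack 'm0' directive.
def to_base64(binary)
  [binary].pack('m0')
end

def from_base64(text)
  text.unpack('m0').first
end

# Two different made-up binary chunks.
chunk_a = [0x78, 0x9C, 0xD1, 0xBD, 0x0A].pack('C*')
chunk_b = [0x78, 0xFF, 0x80, 0xC2, 0x0A].pack('C*')

puts lossy_ascii(chunk_a) == lossy_ascii(chunk_b)   # true: the difference is gone
puts from_base64(to_base64(chunk_a)) == chunk_a     # true: nothing was lost
```

Once two different inputs produce the same text, no tool can tell which one was the original, which is why the dump cannot be turned back into the PDF.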

Weird output characters (Chinese characters) when using Ruby to read / write CSV

I'm trying to print the first 5 lines from a set of large (>500MB) csv files into small headers in order to inspect the content more easily.
I'm using Ruby code to do this but am getting each line padded out with extra Chinese characters, like this:
week_num type ID location total_qty A_qty B_qty count਍㌀㐀ऀ猀漀爀琀愀戀氀攀ऀ㄀㤀㜀ऀ䐀䔀开伀渀氀礀ऀ㔀㐀㜀㈀ ㌀ऀ㔀㐀㜀㈀ ㌀ऀ ऀ㤀㄀㈀㔀㌀ഀ
44 small 14 A 907859 907859 0 550360਍㐀㄀ऀ猀漀爀琀愀戀氀攀ऀ㐀㈀㄀ऀ䐀䔀开伀渀氀礀ऀ㌀ ㈀㄀㜀㐀ऀ㌀ ㈀㄀
The first few lines of input file are like so:
week_num type ID location total_qty A_qty B_qty count
34 small 197 A 547203 547203 0 91253
44 small 14 A 907859 907859 0 550360
41 small 421 A 302174 302174 0 18198
The strange characters appear to be Line 1 and Line 3 of the data.
Here's my Ruby code:
num_lines=ARGV[0]
fh = File.open(file_in,"r")
fw = File.open(file_out,"w")
until (line=fh.gets).nil? or num_lines==0
fw.puts line if outflag
num_lines = num_lines-1
end
Any idea what's going on and what I can do to simply stop at the line end character?
Looking at input/output files in hex (useful suggestion by @user1934428)
Input file - each character looks to be two bytes.
Output file - notice the NULL (00) between each single byte character...
Ruby version 1.9.1
The problem is an encoding mismatch: the encoding is not specified explicitly in either the read or the write part of the code, so the two-byte UTF-16 characters are handled as single bytes. Open the input csv in binary mode ("rb") with utf-16le encoding, and write the output the same way.
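# ARGV values are strings, so convert before doing arithmetic
num_lines = ARGV[0].to_i
# file_in and file_out are assumed to be set earlier, as in the question
# Specifying the right encodings is the key
fh = File.open(file_in, "rb:utf-16le")
fw = File.open(file_out, "wb:utf-16le")
until (line = fh.gets).nil? or num_lines == 0
  # write, not puts: line already ends with its own UTF-16LE newline
  fw.write line
  num_lines = num_lines - 1
end
fh.close
fw.close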
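As an illustration of how a single stray byte produces the "Chinese" characters, here is a sketch of one plausible mechanism. It assumes, without confirmation from the question, that the output file was written in Windows text mode, where each 0x0A byte is expanded to 0x0D 0x0A; the helper name is hypothetical.

```ruby
# Encode text as UTF-16LE, expand every 0x0A byte to 0x0D 0x0A the way a
# byte-oriented text-mode write would, then read the result back as UTF-16LE.
def simulate_text_mode_write(text)
  raw     = text.encode('UTF-16LE').b
  shifted = raw.gsub("\n".b, "\r\n".b)
  # Drop a trailing odd byte so the data is a whole number of 16-bit units.
  even = shifted.byteslice(0, shifted.bytesize - shifted.bytesize % 2)
  even.force_encoding('UTF-16LE').codepoints
end

p simulate_text_mode_write("t\r\n34")
# [116, 13, 2573, 13056, 13312]
```

The inserted byte shifts everything after it by one position: 2573 is U+0A0D, and 13056 and 13312 are U+3300 and U+3400, the same "਍㌀㐀" sequence that follows "count" in the question's output.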
Useful references:
Working with encodings in Ruby 1.9
CSV encodings
Determining the encoding of a CSV file

Explain what those escaped numbers mean in unicode encoding in ruby 1.8.7

0186 is the Unicode "code". Where do 198 and 134 come from? How can I go the other way around, from these byte codes to Unicode strings?
>> c = JSON '["\\u0186"]'
[
[0] "Ɔ"
]
>> c[0][0]
198
>> c[0][1]
134
>> c[0][2]
nil
Another confusing thing is unpack. Another seemingly arbitrary number. Where does that come from? Is it even correct? From the 1.8.7 String#unpack documentation:
U | Integer | UTF-8 characters as unsigned integers
>> c[0].unpack('U')
[
[0] 390
]
>
You can find your answers on the reference page for Unicode Character 'LATIN CAPITAL LETTER OPEN O' (U+0186):
Note that 186 (hexadecimal) === 390 (decimal)
C/C++/Java source code : "\u0186"
UTF-32 (decimal) : 390
UTF-8 (hex) : 0xC6 0x86 (i.e. 198 134)
You can read more about UTF-8 encoding on Wikipedia's article on UTF-8.
UTF-8 (UCS Transformation Format, 8-bit) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.
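To go in both directions, here is a minimal sketch. It assumes Ruby 1.9 or later, where strings carry an encoding (on 1.8.7, c[0][0] returns a byte, as in the question); the helper names are hypothetical.

```ruby
# Code point -> UTF-8 bytes.
def utf8_bytes(code_point)
  [code_point].pack('U').bytes
end

# Two-byte UTF-8 sequence -> code point, done by hand:
# the lead byte 110xxxxx carries 5 bits, the trail byte 10xxxxxx carries 6.
def code_point_from_two_bytes(lead, trail)
  ((lead & 0x1F) << 6) | (trail & 0x3F)
end

# Raw bytes -> Unicode string.
def string_from_bytes(bytes)
  bytes.pack('C*').force_encoding('UTF-8')
end

p utf8_bytes(0x186)                              # [198, 134]
p code_point_from_two_bytes(198, 134)            # 390
p string_from_bytes([198, 134]).unpack('U*')     # [390]
```

So 198 and 134 are simply 0xC6 and 0x86, the two bytes that UTF-8 uses to store code point 390 (0x186), and unpack('U') reverses that packing.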

Displaying whole ORACLE 8-bit CHARSETS in UNICODE

I maintain a Java EE web application backed by an Oracle database that uses an 8-bit charset.
The application will be used from abroad, and I want to be able to check strings (for example with Unicode regexps, from both Java and JavaScript) to see whether they fit into the database charset.
One function in the GDK (Globalization Development Kit) gives the equivalent Java name of the Oracle charset; I think it was ISO-8859-15. But I'm not certain the correspondence is exact.
What I want is to display the whole charset (not the ISO one, but the Oracle one) character by character, for use from both Java and JavaScript, and also to display the Unicode code points and tell the control characters apart from the printable ones.
Is there a function in Oracle's GDK for that purpose?
Thank you.
I think I've found it! (Eureka!)
A little Java JDBC program returned exactly the characters in ISO-8859-15 that differ from ISO-8859-1 (by the way, I've learned that ISO-8859-1 occupies code points 0x00 to 0xff in Unicode).
Program output:
CHR: 164 UNICODE: 8364 euro sign
CHR: 166 UNICODE: 352
CHR: 168 UNICODE: 353
CHR: 180 UNICODE: 381
CHR: 184 UNICODE: 382
CHR: 188 UNICODE: 338
CHR: 189 UNICODE: 339
CHR: 190 UNICODE: 376
Program code (not using GDK at all):
NOTE: the statement "SELECT CHR(i using nchar_cs) FROM DUAL" just gave back the same numbers... WHY?
for(int i=0; i<256; i++)
{
    Statement select = con.createStatement();
    ResultSet result = select.executeQuery("select CHR(" + i +") from DUAL");
    while(result.next())
    {
        int unicodePoint = result.getString(1).codePointBefore(1);
        //int unicodePoint = result.getString(1).codePointAt(0);
        if (unicodePoint != i)
            System.out.println("CHR: " + i + "\tUNICODE: " + unicodePoint);
    }
    result.close();
    result = null;
    select.close();
    select = null;
}
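For comparison, the same table can be produced without a database. The sketch below uses Ruby (chosen only for illustration) and assumes that the Oracle charset really is equivalent to ISO-8859-15, which the JDBC output suggests but does not prove; the helper name is hypothetical.

```ruby
# For each byte in the upper half of the table, decode it as ISO-8859-15
# and keep it only if the resulting code point differs from the byte value,
# that is, where ISO-8859-15 departs from ISO-8859-1.
def iso_8859_15_differences
  (0xA0..0xFF).each_with_object({}) do |byte, diffs|
    code_point = [byte].pack('C').force_encoding('ISO-8859-15').encode('UTF-8').ord
    diffs[byte] = code_point unless code_point == byte
  end
end

iso_8859_15_differences.each do |byte, code_point|
  puts "CHR: #{byte}\tUNICODE: #{code_point}"
end
```

If the database output and this list ever disagree, the Oracle charset is not plain ISO-8859-15 and the Java name returned by the GDK should not be trusted for validation.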
