Bytes decoding in D - utf-8

I have a text fragment that was decoded incorrectly: it was decoded as cp866, but it is actually UTF-8 ("нажал кабан на баклажан" --> "╨╜╨░╨╢╨░╨╗ ╨║╨░╨▒╨░╨╜ ╨╜╨░ ╨▒╨░╨║╨╗╨░╨╢╨░╨╜"). I'd like to fix it, and I've already written Python code that solves the task:
broken = "╨╜╨░╨╢╨░╨╗ ╨║╨░╨▒╨░╨╜ ╨╜╨░ ╨▒╨░╨║╨╗╨░╨╢╨░╨╜"
fixed = bytes(broken, 'cp866').decode('utf-8')
print(fixed) # it will print 'нажал кабан на баклажан'
However, I originally tried to solve this in D and failed to find an answer. So, how can this task be solved in D?

At the moment, D does not have extensive native facilities for converting text between encodings.
Here are some options:
As ratchet freak mentioned, D does have std.encoding, but it does not cover many encodings at the moment.
On Windows, you could use std.windows.charset.fromMBSz and toMBSz, which wrap MultiByteToWideChar and WideCharToMultiByte.
You could simply embed the encodings that interest you in your program (example).
On POSIX, you could invoke the iconv program (example), or use the libiconv library (D1 binding).
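The "embed the encodings that interest you" option can be sketched as follows. The sketch is in Python rather than D so that it lines up with the snippet in the question, and the names CP866_HIGH, CHAR_TO_BYTE and undo_cp866 are made up for illustration. It assumes that all that is needed is a 128-entry table for the upper half of cp866; in a D program that table would be a hard-coded array rather than something derived from a library codec.

```python
# Upper half of cp866 (bytes 0x80-0xFF) as a 128-character lookup string.
# Python's codec is used here only to generate the table; a D program would
# embed the same 128 characters as a literal array.
CP866_HIGH = bytes(range(0x80, 0x100)).decode("cp866")

# Reverse mapping: character -> original byte value.
CHAR_TO_BYTE = {ch: 0x80 + i for i, ch in enumerate(CP866_HIGH)}


def undo_cp866(text):
    """Recover the original bytes of text mis-decoded as cp866, then decode them as UTF-8."""
    raw = bytearray()
    for ch in text:
        code = ord(ch)
        # ASCII maps to itself; everything else must come from the table.
        raw.append(code if code < 0x80 else CHAR_TO_BYTE[ch])
    return raw.decode("utf-8")


broken = "╨╜╨░╨╢╨░╨╗ ╨║╨░╨▒╨░╨╜ ╨╜╨░ ╨▒╨░╨║╨╗╨░╨╢╨░╨╜"
print(undo_cp866(broken))  # нажал кабан на баклажан
```

A character that is not in cp866 at all raises a KeyError in this sketch, which is a reasonable signal that the input was not cp866 mojibake in the first place.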

Related

Convert AnsiString to UnicodeString in Lazarus with FreePascal

I found similar topics here but none of them had the solution to my question, so I am asking it in a new thread.
A couple of days ago, I changed the format in which the preferences of an application I am developing are saved, from INI to JSON.
I use the jsonConf unit for this.
A sample of the code I use to save a key-value pair in the file would be like below.
Procedure TMyClass.SaveSettings();
var
  c: TJSONConfig;
begin
  c := TJSONConfig.Create(nil);
  try
    c.Filename := m_settingsFilePath;
    c.SetValue('/Systems/CustomName', m_customName);
  finally
    c.Free;
  end;
end;
In my code, m_customName is an AnsiString variable. The TJSONConfig.SetValue procedure requires both the key and the value to be of type UnicodeString. The application compiles fine, but I get warnings such as:
Warning: Implicit string type conversion from "AnsiString" to "UnicodeString".
Some messages warn of potential data loss.
Of course I could change everything to UnicodeString, but that is too risky. I haven't seen any issues so far from ignoring these warnings, but they show up all the time, and they might cause issues on a different PC.
How do I fix this?
To avoid the warning, do an explicit conversion; this tells the compiler that you know what you are doing (I hope...). c.SetValue expects a UnicodeString (UTF-16). m_customName should be declared as a plain string unless there is a good reason to do otherwise (see below); otherwise you may trigger unwanted internal conversions.
A string in Lazarus is UTF8-encoded, by default. Therefore, you can use the function UTF8Decode() for the conversion from UTF8 to Unicode, or UTF8ToUTF16() (unit LazUtf8).
var
  c: TJSONConfig;
  m_customName: String;
...
  c.SetValue('/Systems/CustomName', UTF8Decode(m_customName));
You say above that the key-value pairs are in a file. In that case the conversion depends on the encoding of the file. Normally I open the file in a good text editor and look up the encoding there; Notepad++, for example, displays the name of the encoding at the right-hand end of the status bar. Suppose the encoding is code page 1252 (Windows Western European, often loosely called Latin-1). These are ANSI strings, so you can declare the strings read from the file as ansistring. Because UTF-8 strings are so common in Lazarus, there is no direct conversion from ansistring to Unicode, and you must convert to UTF-8 first. The unit lconvencoding contains many conversion routines between various encodings. Use CP1252ToUTF8() to get to UTF-8, and then apply UTF8Decode() to finally get Unicode.
var
  c: TJSONConfig;
  m_customName: ansistring;
...
  c.SetValue('/Systems/CustomName', UTF8Decode(CP1252ToUTF8(m_customName)));
The FreePascal compiler 3.0 can handle many of these conversions automatically using strings with predefined encodings. But I think explicit conversions make it very clear what is happening. And fpc 3.0 still emits the warnings that you want to avoid...
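The two-step chain described above (code page 1252 to UTF-8, then UTF-8 to UTF-16) can be illustrated outside Lazarus. The following is only a sketch in Python, not Pascal; the function names cp1252_to_utf8 and utf8_to_utf16 are made up and merely mirror the roles of CP1252ToUTF8() and UTF8Decode(), and the sample value is invented.

```python
def cp1252_to_utf8(raw):
    """Rough analogue of CP1252ToUTF8(): re-encode code page 1252 bytes as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


def utf8_to_utf16(raw):
    """Rough analogue of UTF8Decode(): UTF-8 bytes to UTF-16 (little-endian, no BOM)."""
    return raw.decode("utf-8").encode("utf-16-le")


ansi_value = b"Caf\xe9"                  # "Café" as stored in a CP1252 file
utf8_value = cp1252_to_utf8(ansi_value)  # b"Caf\xc3\xa9"
utf16_value = utf8_to_utf16(utf8_value)  # b"C\x00a\x00f\x00\xe9\x00"
```

Skipping the first step matters: the byte 0xE9 on its own is not valid UTF-8, so feeding the CP1252 bytes straight into the UTF-8 step fails in this sketch instead of producing the intended character.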

Undefined procedure error in Prolog when using R-2-L words

I want to make an Arabic morphological analyzer using Prolog.
I have implemented the following code.
check(ي,1,male).
check(ت,1,female).
check(ا,1,me).
dict(لعب,3).
ending('',0,single).
ending(ون,2,plur).
parse([]).
parse(Word,Gender,Verb,Plurality):-
    sub_atom(Word,0,LenHead,_,FirstCut),
    check(FirstCut,LenHead,Gender),
    sub_atom(Word,LenHead,_,LenAfter,Verb),
    dict(Verb,LenOfVerb),
    Location is LenHead+LenOfVerb,
    sub_atom(Word,Location,LenAfter,_,EndOfWord),
    ending(EndOfWord,_,Plurality).
This is called using:
parse(يلعب,A,S,D).
Expectation:
A = male
S = لعب
D = single
Explanation of code:
It should parse the word يلعب. Note that in Arabic the ي (the first letter, on the right) indicates that it is a masculine word, and لعب is a verb.
Error:
When running the code, I get the following error:
ERROR: parse/4: Undefined procedure: dict/2
Note that when mimicking the Arabic word using English letters, the code behaves as expected and doesn't produce this error.
How can I resolve such error, or make the Prolog understand R-to-L words?
Edit:
In the attached image, note that in the red box it succeeded in matching the ي to male. In the blue box, when it failed, it should have backtracked and started concatenating to try to match a new word, but instead it produces the error shown.
You have to be careful when you are using SWI-Prolog on the Mac. There is a slight problem with copy and paste. If you use [user] and then paste multiple lines, it doesn't read all of the lines:
This happens all the time and isn't related to the Arabic script, Unicode, or anything of that sort. I have filed a bug report with SWI-Prolog here. When you use [user] and enter the lines one by one, you get the right result.
In the above screenshot you can see that I pasted the lines one by one, since there are multiple '|:' prompts. Other Prolog systems don't necessarily have this problem; for example, in Jekejeke Prolog I get:
The best workaround for SWI-Prolog is probably to store the facts in a file and consult them from there. In Jekejeke Prolog I still have to investigate why the space after the comma is showing on the wrong side.

does anyone have a libxpm example?

This is the dumbest thing but I have tried and tried and I cannot use libxpm.
I have found some snippets of code, but very few, and what I have found is very old code that I cannot compile.
My understanding so far is that I need to:
connect to x windows server (done)
create a window (done)
use libxpm to create a pixmap from data (not done)
copy the pixmap to the window (not done)
If you happen to have a small example lying around or know where to send me, that would be great. If you happen to know how to use xcb and libxpm, that would be even better. xcb seems to use an integer for its connection while xlib uses a display struct. I haven't found any examples at all that deal with xcb and libxpm, and the connection issue is a dead end for me.
Thanks for reading
This may help. If I use ImageMagick's convert, I can convert a JPEG to an XPM and that may help you see how they are supposed to look.
convert image.jpg image.xpm
And we can look at it like this:
more image.xpm
/* XPM */
static char *background[] = {
/* columns rows colors chars-per-pixel */
"906 603 181 2 ",
" c #040403",
". c #080704",
"X c #070803",
"o c #090906",
"O c #060608",
...
...
The libXpm sources include a simple “show xpm” command, but it still uses Xlib & Xt, not xcb:
https://cgit.freedesktop.org/xorg/lib/libXpm/tree/sxpm/sxpm.c
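To make the layout of the XPM data above more concrete, here is a small sketch that reads the string array by hand. It is written in Python purely for illustration and does not use libXpm or any X library; parse_xpm and the tiny 2x2 sample image are made up, and the parser assumes every colour line uses the "c" key with a single-word colour value (no colour names containing spaces).

```python
def parse_xpm(strings):
    """Parse the strings of an XPM image into (width, height, colors, pixels)."""
    # First string: "columns rows colors chars-per-pixel"
    width, height, ncolors, cpp = (int(v) for v in strings[0].split()[:4])

    # Next ncolors strings: <cpp characters of pixel key> followed by key/value pairs.
    colors = {}
    for line in strings[1:1 + ncolors]:
        key = line[:cpp]
        fields = line[cpp:].split()
        colors[key] = dict(zip(fields[0::2], fields[1::2]))["c"]

    # Remaining strings: one per row, cpp characters per pixel.
    rows = strings[1 + ncolors:1 + ncolors + height]
    pixels = [
        [colors[row[i:i + cpp]] for i in range(0, width * cpp, cpp)]
        for row in rows
    ]
    return width, height, colors, pixels


sample = [
    "2 2 2 1",       # 2 columns, 2 rows, 2 colours, 1 char per pixel
    ". c #000000",
    "X c #FFFFFF",
    ".X",
    "X.",
]
width, height, colors, pixels = parse_xpm(sample)
print(pixels)  # [['#000000', '#FFFFFF'], ['#FFFFFF', '#000000']]
```

This only shows how the header, colour table and pixel rows relate to each other; turning the data into a pixmap on the X server is still the job of the library.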

Weird behaviour with UNICODE text on windows

I have some UNICODE text in my win32 code.
I have declared it something like this:
std::wstring a = L"Träna"; // Swedish for "practice"
I copy that value into a variable using something like...
std::wstring b = a;
While debugging, I don't see what I'm supposed to get in b.
I should be getting Träna in b, but what I get is TrÃ¤na
This is observed only on Windows; the program works fine on OS X.
I'm sure it's some rookie mistake. What am I missing here?
As #SigTerm and #jukka said above, the issue was with the UTF-8 encoding of the source file.
After saving the cpp file as <Unicode UTF-8 with signature>, the issue was solved.
The file had previously been saved as <Unicode UTF-8 without signature>.
It wasn't an issue with the L prefix; I had already defined my variables that way.
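The effect of the missing signature (byte order mark) can be reproduced in a few lines. This is only a sketch in Python, and it rests on an assumption that is not stated in the thread: that a source file without a signature is read using a legacy code page, taken here to be cp1252.

```python
text = "Träna"

# File saved as UTF-8 without signature, then read as cp1252 (assumed fallback).
without_signature = text.encode("utf-8")
misread = without_signature.decode("cp1252")
print(misread)  # TrÃ¤na

# File saved as UTF-8 with signature: the leading EF BB BF bytes mark it as UTF-8.
with_signature = text.encode("utf-8-sig")
print(with_signature[:3])  # b'\xef\xbb\xbf'
print(with_signature.decode("utf-8-sig"))  # Träna
```

The two bytes that encode ä in UTF-8 (C3 A4) are shown as two separate cp1252 characters, which is the kind of garbling that disappears once the reader is told the file is UTF-8.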

How to display UTF-8 in a HITextView (MLTE) control?

I have a UTF-8 string which I want to display in an HITextView (MLTE) control. Theoretically, HITextView requires either "Text" or UTF-16, so I'm converting:
UniChar uniput[STRSIZE];
ByteCount converted,unilen;
err = ConvertFromTextToUnicode(C2UInfo, len, output,
                               kUnicodeUseFallbacksMask, 0, NULL, 0, NULL,
                               sizeof(UniChar)*STRSIZE,
                               &converted, &unilen, uniput);
err = TXNSetData(MessageObject, kTXNUnicodeTextData, uniput, unilen, kTXNEndOffset,
                 kTXNEndOffset);
I have defined the converter C2UInfo as follows:
UnicodeMapping uMapping;
uMapping.unicodeEncoding = CreateTextEncoding(kTextEncodingUnicodeV2_0,
                                              kUnicodeCanonicalDecompVariant,
                                              kUnicode16BitFormat);
uMapping.otherEncoding = GetTextEncodingBase(kUnicodeUTF8Format);
uMapping.mappingVersion = kUnicodeUseLatestMapping;
err = CreateTextToUnicodeInfo(&uMapping, &C2UInfo);
It works fine for plain old ASCII characters, but multi-byte UTF-8 is being mapped to the wrong characters. For example, æ (LATIN SMALL LETTER AE) is being mapped to 疆 (CJK UNIFIED IDEOGRAPH-7586).
I've tried checking and unchecking "Output Text in Unicode" in Interface Builder, and I've tried varying some of the conversion constants, with no effect.
This is being built with Xcode 3.2.6 using the MacOSX10.5.sdk and tested under 10.6.
The “Text” that ConvertFromTextToUnicode expects is probably the same “Text” that is one of your two options for MLTE. If you had the sort of “Text” that ConvertFromTextToUnicode converts from, you could just pass it to MLTE directly.
(For the record, “Text” is almost certainly either MacRoman or whatever is dictated by the user's locale-determined current script.)
Instead, you should use a Text Encoding Converter. Create one, use it, finish using it, and dispose of it when you're done.
There are two other ways.
One is to create a CFString from the UTF-8, then Get its characters. You would do this instead of using a TEC. It's functionally equivalent and possibly a little bit easier. On the other hand, you don't get to reuse the converter, for whatever that's worth.
The other, since you have an HITextView, would be to create a CFString from the UTF-8 and just use that. Like Cocoa objects, HIToolbox objects have an inheritance hierarchy; since an HITextView is a kind of HIView, HIViewSetText should just work. (And if not, try HIViewSetValue.)
The last method also gets you that much closer to your eventual move away from MLTE/HITextView, since it's essentially what you'll do with an NSTextView. (HITextView and MLTE are deprecated.)
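Whichever of these routes is used, the result the control needs is the same: the UTF-8 bytes decoded and re-expressed as 16-bit UTF-16 code units. As a reference for what a correct conversion should yield, here is a sketch in Python (not Carbon code); utf8_to_utf16_units is a made-up helper name.

```python
import struct


def utf8_to_utf16_units(data):
    """Decode UTF-8 bytes and return the equivalent UTF-16 code units as integers."""
    encoded = data.decode("utf-8").encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(encoded) // 2), encoded))


# æ (LATIN SMALL LETTER AE) is two bytes in UTF-8 but a single UTF-16 code unit.
print([hex(u) for u in utf8_to_utf16_units(b"\xc3\xa6")])  # ['0xe6']

# Characters outside the Basic Multilingual Plane become a surrogate pair.
print([hex(u) for u in utf8_to_utf16_units("😀".encode("utf-8"))])  # ['0xd83d', '0xde00']
```

If the converted buffer for æ holds anything other than the single unit 0x00E6, the input was not interpreted as UTF-8.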

Resources