Convert AnsiString to UnicodeString in Lazarus with FreePascal

I found similar topics here but none of them had the solution to my question, so I am asking it in a new thread.
A couple of days ago, I changed the format in which the preferences of the application I am developing are saved, from INI to JSON.
I use the jsonConf unit for this.
A sample of the code I use to save a key-value pair in the file is shown below.
procedure TMyClass.SaveSettings();
var
  c: TJSONConfig;
begin
  c := TJSONConfig.Create(nil);
  try
    c.Filename := m_settingsFilePath;
    c.SetValue('/Systems/CustomName', m_customName);
  finally
    c.Free;
  end;
end;
In my code, m_customName is an AnsiString variable. The TJSONConfig.SetValue procedure requires both the key and the value to be of type UnicodeString. The application compiles fine, but I get warnings such as:
Warning: Implicit string type conversion from "AnsiString" to "UnicodeString".
Some messages warn saying there is a potential data loss.
Of course I could go and change everything to UnicodeString, but this is too risky. I haven't seen any issues so far by ignoring these warnings, but they show up all the time and might cause issues on a different PC.
How do I fix this?

To avoid the warning, do an explicit conversion, because this way you tell the compiler that you know what you are doing (I hope...). In the case of c.SetValue the expected type is UnicodeString (UTF-16). m_customName should be declared as a String unless there is a good reason to do otherwise (see below); anything else may trigger unwanted internal conversions.
A String in Lazarus is UTF-8 encoded by default. Therefore, you can use the function UTF8Decode() for the conversion from UTF-8 to UTF-16, or UTF8ToUTF16() (unit LazUTF8).
var
  c: TJSONConfig;
  m_customName: String;
...
  c.SetValue('/Systems/CustomName', UTF8Decode(m_customName));
You say above that the key-value pairs are in a file. Then the conversion depends on the encoding of that file. Normally I open the file in a good text editor and find the encoding somewhere; Notepad++, for example, displays the name of the encoding at the right end of the status bar. Suppose the encoding is codepage 1252 (Windows-1252, similar to Latin-1). These are AnsiStrings, so you can declare the strings read from the file as AnsiString. Because UTF-8 strings are so common in Lazarus, there is no direct conversion from AnsiString to UnicodeString; you must convert to UTF-8 first. The unit LConvEncoding contains many conversion routines between various encodings. Pick CP1252ToUTF8() to go to UTF-8, and then apply UTF8Decode() to finally get a UnicodeString.
var
  c: TJSONConfig;
  m_customName: AnsiString;
...
  c.SetValue('/Systems/CustomName', UTF8Decode(CP1252ToUTF8(m_customName)));
The FreePascal 3.0 compiler can handle many of these conversions automatically by using strings with predefined encodings. But I think explicit conversions make it very clear what is happening. And FPC 3.0 still emits the warnings which you want to avoid...

Related

How to add a hexadecimal watch in CLion?

I need to add a watch in hexadecimal format in CLion.
ltoa(variable, 16) doesn't work, at least not on my system.
In Java/Python, I can have a workaround: write a custom toString()/__str__ for my class and have it displayed the way I need. gdb has p/x. How do I do it in CLion?
Edit: ltoa(variable, 16) works if I define ltoa in my code, as it's not always present in the standard library.
set output-radix 16
You can set this as a default option in a file called .gdbinit, which you can put in your home directory, or the working directory from which you start GDB (project root, for instance).
They added hex view as an experimental feature: Hexadecimal view
To enable:
Invoke the Maintenance popup: press Ctrl+Alt+Shift+/, or call Help | Find Action and search for Maintenance. Choose Experimental features.
Select the cidr.debugger.value.numberFormatting.hex checkbox
Go to Settings / Preferences | Build, Execution, Deployment | Debugger | Data Views | C/C++ and set the checkbox Show hex values for numbers. Choose to have hexadecimal values displayed instead or alongside the original values.
Now the hexadecimal formatting is shown both in the Variables pane of the Debug tool window and in the editor's inline variables view.
...after refining the formulation, I see it.
I wrote my own char *lltoa(long long value, int radix) function. I can use it in watches now.
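A minimal sketch of such a helper could look like this (the body below is illustrative; anything that returns a printable hex string will do, and only radixes 10 and 16 are handled):

#include <cstdio>

// Formatter intended for debugger watch expressions: writes `value` in the
// requested radix into a static buffer and returns it. Not thread-safe,
// which is acceptable for a watch window.
char *lltoa(long long value, int radix)
{
    static char buf[32];
    if (radix == 16)
        std::snprintf(buf, sizeof(buf), "0x%llx", (unsigned long long)value);
    else
        std::snprintf(buf, sizeof(buf), "%lld", value);
    return buf;
}

With that compiled into the binary, a watch such as lltoa(variable, 16) evaluates in CLion's debugger.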
Update: in the respective feature request, Chris White found a workaround on OS X with lldb:
I decided to do a bit more digging and found a way to set lldb on OS X
to force HEX output for unsigned char data types:
type format add --format hex "unsigned char"
If you want to make this setting persistent you can also create a
.lldbinit file and add this command to it. Once you do this CLion
will display this data type in HEX format.
This makes ALL the variables of this type display in hex.
Update 2: My first workaround is pretty dirty; here's a better one.
You can assign formats to more specific types. The debugger keeps track of type inheritance. So, adding a hex format to uint8_t will not affect unsigned char. You can fine-tune the displays.
You can assign formats to structs also. Here's an example from my .lldbinit:
type format add --format dec int32_t
# https://lldb.llvm.org/varformats.html
type summary add --summary-string "addr=${var.address} depth=${var.depth}" Position

Predefined Windows icons: Unicode

I am assigning to the lpszIcon member of the MSGBOXPARAMSW structure (notice the W). I want to use one of the predefined icons like IDI_APPLICATION or IDI_WARNING, but they are all ASCII (defined as MAKEINTRESOURCE). I tried doing this:
MSGBOXPARAMSW mbp = { 0 };
mbp.lpszIcon = (LPCWSTR) IDI_ERROR;
but then no icon displayed at all. So how can I use the unicode versions of the IDI_ icons?
There is no ANSI or Unicode variant of a numeric resource ID. The code that you use to set lpszIcon is correct. It is idiomatic to use the MAKEINTRESOURCE macro rather than a cast, but the cast has identical meaning. Your problem lies in the other code, the code that we cannot see.
Reading between the lines, I think that you are targeting ANSI or MBCS. You tried to use MAKEINTRESOURCE but that expands to MAKEINTRESOURCEA. That's what led you to cast. You should have used MAKEINTRESOURCEW to match MSGBOXPARAMSW. That would have resolved the compilation error you encountered. You could equally have changed the project to target UNICODE.
But none of that explains why the icon does not appear in the dialog. There has to be a problem elsewhere. If the dialog appears then the most likely explanation is that you have set hInstance to a value other than NULL. But the code to set lpszIcon is correct, albeit not idiomatic.
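For completeness, a minimal sketch of a call that shows the standard error icon; per the MSGBOXPARAMS documentation, hInstance must stay NULL for the predefined IDI_ icons and dwStyle must include MB_USERICON for lpszIcon to be honored (the text, caption, and function name below are just placeholders):

#include <windows.h>

void ShowErrorBox(HWND owner)
{
    MSGBOXPARAMSW mbp = { 0 };
    mbp.cbSize      = sizeof(mbp);
    mbp.hwndOwner   = owner;
    mbp.hInstance   = NULL;                        // must be NULL for the predefined IDI_ icons
    mbp.dwStyle     = MB_OK | MB_USERICON;         // MB_USERICON is required for lpszIcon to be used
    mbp.lpszText    = L"Something went wrong.";
    mbp.lpszCaption = L"Error";
    mbp.lpszIcon    = MAKEINTRESOURCEW(IDI_ERROR); // same meaning as the (LPCWSTR) cast above
    MessageBoxIndirectW(&mbp);
}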

How to display UTF-8 in a HITextView (MLTE) control?

I have a UTF-8 string which I want to display in an HITextView (MLTE) control. Theoretically, HITextView requires either "Text" or UTF-16, so I'm converting:
UniChar uniput[STRSIZE];
ByteCount converted, unilen;

err = ConvertFromTextToUnicode(C2UInfo, len, output,
                               kUnicodeUseFallbacksMask, 0, NULL, 0, NULL,
                               sizeof(UniChar) * STRSIZE,
                               &converted, &unilen, uniput);
err = TXNSetData(MessageObject, kTXNUnicodeTextData, uniput, unilen,
                 kTXNEndOffset, kTXNEndOffset);
I have defined the converter C2UInfo as follows:
UnicodeMapping uMapping;
uMapping.unicodeEncoding = CreateTextEncoding(kTextEncodingUnicodeV2_0,
                                              kUnicodeCanonicalDecompVariant,
                                              kUnicode16BitFormat);
uMapping.otherEncoding = GetTextEncodingBase(kUnicodeUTF8Format);
uMapping.mappingVersion = kUnicodeUseLatestMapping;

err = CreateTextToUnicodeInfo(&uMapping, &C2UInfo);
It works fine for plain old ASCII characters, but multi-byte UTF-8 is being mapped to the wrong characters. For example, æ (LATIN SMALL LETTER AE) is being mapped to 疆 (CJK UNIFIED IDEOGRAPH-7586).
I've tried checking and unchecking "Output Text in Unicode" in Interface Builder, and I've tried varying some of the conversion constants, with no effect.
This is being built with Xcode 3.2.6 using the MacOSX10.5.sdk and tested under 10.6.
The “Text” that ConvertFromTextToUnicode expects is probably the same “Text” that is one of your two options for MLTE. If you had the sort of “Text” that ConvertFromTextToUnicode converts from, you could just pass it to MLTE directly.
(For the record, “Text” is almost certainly either MacRoman or whatever is dictated by the user's locale-determined current script.)
Instead, you should use a Text Encoding Converter. Create one, use it, finish using it, and dispose of it when you're done.
There are two other ways.
One is to create a CFString from the UTF-8, then Get its characters. You would do this instead of using a TEC. It's functionally equivalent and possibly a little bit easier. On the other hand, you don't get to reuse the converter, for whatever that's worth.
The other, since you have an HITextView, would be to create a CFString from the UTF-8 and just use that. Like Cocoa objects, HIToolbox objects have an inheritance hierarchy; since an HITextView is a kind of HIView, HIViewSetText should just work. (And if not, try HIViewSetValue.)
The last method also gets you that much closer to your eventual move away from MLTE/HITextView, since it's essentially what you'll do with an NSTextView. (HITextView and MLTE are deprecated.)
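If you go the HIViewSetText route, a minimal sketch could look like this (the function name and parameters are placeholders; it assumes the UTF-8 bytes and the HIViewRef of the text view are already at hand):

#include <Carbon/Carbon.h>

/* Sets the content of an HITextView from UTF-8 bytes by way of a CFString. */
static OSStatus SetTextViewFromUTF8(HIViewRef textView,
                                    const char *utf8Bytes, size_t utf8Len)
{
    CFStringRef str = CFStringCreateWithBytes(kCFAllocatorDefault,
                                              (const UInt8 *)utf8Bytes,
                                              (CFIndex)utf8Len,
                                              kCFStringEncodingUTF8,
                                              false /* no BOM */);
    if (str == NULL)
        return paramErr;                         /* not valid UTF-8 */

    OSStatus err = HIViewSetText(textView, str); /* HITextView is a kind of HIView */
    CFRelease(str);
    return err;
}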

convert case of wide characters, given the LCID (Visual C++)

I have some existing Visual C++ code where I need to add the conversion of wide character strings to upper or lower case.
I know there are pitfalls to this (such as the Turkish "I"), but most of these can be ironed out if you know the language. Fortunately, in this area of the code I know the LCID value (locale ID), which I guess is the same as knowing the language.
As LCID is a Windows type, is there a Windows function that will convert wide strings to upper or lower case?
The C runtime function _towupper_l() sounds like it would be ideal but it takes a _locale_t parameter instead of LCID, so I guess it's unsuitable unless there is a completely reliable way of converting an LCID to a _locale_t.
The function you're searching for is called LCMapString and it is part of the Windows NLS APIs. The LCMAP_UPPERCASE flag maps characters to uppercase, while LCMAP_LOWERCASE maps characters to lowercase.
For applications targeting Windows Vista and later, there is an Ex variant (LCMapStringEx) that works on locale names instead of locale identifiers; locale names are what Microsoft now says you should prefer to use.
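A sketch of what the locale-name variant might look like (the helper name and the hard-coded Turkish locale are just examples, and for simplicity it assumes the mapped text is no longer than the source):

#include <windows.h>
#include <string>

// Uppercases a wide string with linguistic casing, using a locale name
// instead of an LCID. Requires Windows Vista or later.
std::wstring ToUpperTurkish(const std::wstring &src)
{
    if (src.empty())
        return src;

    std::wstring dst(src.size(), L'\0');
    int written = LCMapStringEx(L"tr-TR",
                                LCMAP_UPPERCASE | LCMAP_LINGUISTIC_CASING,
                                src.c_str(), static_cast<int>(src.size()),
                                &dst[0], static_cast<int>(dst.size()),
                                NULL, NULL, 0);
    if (written == 0)
        dst.clear();   // call failed; GetLastError() has the details
    return dst;
}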
In fact, in the CRT implementation provided with VS 2010 (and presumably other versions as well), functions such as _towupper_l ultimately end up calling LCMapString after they extract the locale ID (LCID) from the specified _locale_t.
If you're like me, and less familiar with the i18n APIs than you should be, you probably already know about the CharUpper, CharLower, CharUpperBuff, and CharLowerBuff family of functions. These have been the old standbys from the early days of Windows for altering the case of chars/strings, but as their documentation warns:
Note that CharXxx always maps uppercase I to lowercase I ("i"), even when the current language is Turkish or Azeri. If you need a function that is linguistically sensitive in this respect, call LCMapString.
What it neglects to mention is filled in by a couple of posts on Michael Kaplan's wonderful blog on internationalization issues: What does "linguistic casing" mean?, How best to alter case. The executive summary is that you achieve the same results as the CharXxx family of functions by calling LCMapString and not specifying the LCMAP_LINGUISTIC_CASING flag, whereas you can be linguistically sensitive by ensuring that you do specify the LCMAP_LINGUISTIC_CASING flag.
Sample code:
std::wstring test(L"Does my code pass the Turkey test?");
if (!LCMapStringW(lcid,                             /* your LCID, defined elsewhere */
                  LCMAP_UPPERCASE | LCMAP_LINGUISTIC_CASING,
                  test.c_str(),                     /* input string */
                  static_cast<int>(test.length()),  /* length of input string */
                  &test[0],                         /* output buffer (can reuse the input) */
                  static_cast<int>(test.length()))) /* length of output buffer (same as input) */
{
// Uh-oh! Something went wrong in the call to LCMapString, so you need to
// handle the error somehow here.
// A good start is calling GetLastError to determine the error code.
}

How can I programmatically determine the current default codepage of Windows?

I have to convert the encoding of a string output of a VB6 application to a specific encoding.
The problem is, I don't know the encoding of the string, because of that:
According to the VB6 documentation, when accessing certain API functions, the internal Unicode strings are converted to ANSI strings using the default codepage of Windows.
Because of that, the encoding of the string output can be different on different systems, but I have to know it to perform the conversion.
How can I read the default codepage using the Win32 API or - if there's no other way - by reading the registry?
It can be done even more succinctly by using GetACP - the Win32 API call that returns the default code page! (The default code page is often called the "ANSI" code page.)
int nCodePage = GetACP();
Also many API calls (such as MultiByteToWideChar) accept the constant value CP_ACP (zero) which always means "use the system code page". So you may not actually need to know the current code page, depending on what you want to do with it.
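For example, a conversion to UTF-16 that never looks up the code page number explicitly could be sketched like this (the helper name is made up; error handling is minimal):

#include <windows.h>
#include <string>

// Converts a string in the system default ANSI code page to UTF-16
// by passing CP_ACP instead of an explicit code page number.
std::wstring AnsiToWide(const char *ansi)
{
    // First call asks for the required length (including the terminator).
    int len = MultiByteToWideChar(CP_ACP, 0, ansi, -1, NULL, 0);
    if (len <= 0)
        return std::wstring();

    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_ACP, 0, ansi, -1, &wide[0], len);
    wide.resize(len - 1);   // drop the extra null terminator
    return wide;
}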
GetSystemDefaultLCID() gives you the system locale.
If the LCID is not enough and you truly need the codepage, use this code:
TCHAR szCodePage[10];
int nCodePage = 0;
int cch = GetLocaleInfo(
    GetSystemDefaultLCID(),      // or any LCID you may be interested in
    LOCALE_IDEFAULTANSICODEPAGE,
    szCodePage,
    _countof(szCodePage));       // buffer size in TCHARs (_countof is the MSVC macro)
nCodePage = cch > 0 ? _ttoi(szCodePage) : 0;
That worked for me, thanks, but can be written more succinctly as:
UINT nCodePage = CP_ACP;
const int cch = ::GetLocaleInfo(LOCALE_SYSTEM_DEFAULT,
                                LOCALE_RETURN_NUMBER | LOCALE_IDEFAULTANSICODEPAGE,
                                (LPTSTR)&nCodePage,
                                sizeof(nCodePage) / sizeof(TCHAR));
