Getting the default RTL codepage in Lazarus - utf-8

The Lazarus wiki states:
Lazarus (actually its LazUtils package) takes advantage of that API
and changes it to UTF-8 (CP_UTF8). It means also Windows users now use
UTF-8 strings in the RTL
In our cross-platform and cross-compiler code, we'd like to detect this specific situation. The GetACP() Windows API function still returns 1252, and so does the GetDefaultTextEncoding() function in Lazarus. But the text (specifically, the filename returned by the FindFirst() function) is UTF-8 encoded, and the codepage of the string variable is 65001 (CP_UTF8).
So, how do we figure out that the RTL operates with UTF-8 strings by default? I've spent several hours trying to work this out from the Lazarus source code, but I am probably missing something ...
I understand that in many scenarios we need to inspect the codepage of each specific string, but I am interested in a way to find out the default RTL codepage, which is UTF-8 in Lazarus, yet the Windows-defined one in FPC on Windows without Lazarus.

It turns out that there is no single codepage variable or function. The results of filesystem API calls are converted to the codepage defined in the DefaultRTLFileSystemCodePage variable. The only problem is that this variable is present in the source code and is supposed to live in the System unit, but the compiler doesn't see it.

Related

What is the difference between the GetMessageA and GetMessageW functions?

I am learning Windows GUI programming. I don't know the difference between the two functions GetMessageA and GetMessageW. I see that the GetMessage function doesn't have any parameters related to ANSI or Unicode.
All older Win32 calls that involve strings are actually macros that expand to either a Unicode version or an ANSI version, based upon the "Character Set" property of the project.
GetMessage(...) will map to either GetMessageA(...) or GetMessageW(...) where the "A" version will handle messages that involve strings as ANSI formatted text and the "W" version will use UTF-16.
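For illustration, here is a standard message loop calling the W variant explicitly (a minimal sketch; the helper name RunMessageLoop is made up, and nothing here is specific to the question beyond the A/W naming):

#include <windows.h>

// With UNICODE defined for the project, GetMessage and DispatchMessage expand
// to GetMessageW/DispatchMessageW; without it, to the ...A variants. Calling
// the suffixed names explicitly, as below, pins the character set regardless
// of project settings.
int RunMessageLoop(void)
{
    MSG msg;
    while (GetMessageW(&msg, NULL, 0, 0) > 0)
    {
        TranslateMessage(&msg);
        DispatchMessageW(&msg);
    }
    return (int)msg.wParam;
}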

Ruby system() doesn't accept UTF-8?

I am using Ruby 1.9.3 in Windows and trying to perform an action where I write filenames to a file one per line (we'll call it a filelist) and then later read this filelist, and call system() to run another program where I will pass it a filename from the filelist. That program I'm calling with system() will take the filename I pass it and convert it to a binary format to be used in a proprietary system.
Everything works up to the point of calling system(). I have a UTF-8 filelist, and reading the filename from the filelist is giving me the proper result. But when I run
system("c:\foo.exe -arg #{bar}")
the arg "bar" being passed is not in UTF-8 format. If I run the program manually with a Japanese, chinese, or whatever filename it works fine and codes the file correctly, but if I do it using system() it won't. I know the variable in bar is stored properly because I use it elsewhere without issue.
I've also tried:
system("c:\foo.exe -arg #{bar.encoding("UTF-8")}")
system("c:\foo.exe -arg #{bar.force_encoding("UTF-8")}")
and neither work. I can only assume the issue here is passing unicode to system.
Can someone else confirm if system does, in fact, support or not support this?
Here is the block of code:
$fname.each do |file|
  flist.write("#{file}\n")                    # This is written properly in UTF-8
  system("ia.exe -r \"#{file}\" -q xbfadd")   # The file being passed here is not encoding right!
end
Ruby's system() function, like that in most scripting languages, is a veneer over the C standard library system() call. The MS C runtime uses Win32 ANSI APIs for all the byte-oriented C stdlib functions.
The ANSI APIs use the Windows system locale (aka 'ANSI codepage') to map between byte-oriented strings and Windows's native-UTF16LE strings which are used for filenames and shell commands. Unfortunately, it is impossible to set the system locale to UTF-8; you can set the codepage to 65001 (Windows's equivalent to UTF-8) on a particular console, but the MS CRT has long-standing bugs in its handling of code page 65001 which make a lot of applications fail.
So using the standard cross-platform byte-oriented C interfaces means you can't support Unicode filenames or shell commands, which is rather sad. Some scripting languages have added support for Unicode filenames by calling the Win32 'W' (Unicode) APIs explicitly instead of the C stdlib interfaces. Ruby 1.9.x is making progress in this area, but system() has not been looked at yet.
You can fix it by calling the Win32 API yourself, for example CreateProcessW, but it's not especially pretty.
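To make that concrete, here is a minimal C++ sketch of the native side of that workaround (the helper name LaunchWide is made up for illustration; wiring it up from Ruby via a native extension or FFI is not shown):

#include <windows.h>
#include <string.h>   // _wcsdup
#include <stdlib.h>   // free

// Launch a command line containing non-ANSI characters through the wide
// (UTF-16) Win32 API, bypassing the byte-oriented CRT system() entirely.
int LaunchWide(const wchar_t* commandLine)
{
    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi = { 0 };

    // CreateProcessW may modify the command-line buffer, so pass a writable copy.
    wchar_t* cmd = _wcsdup(commandLine);
    if (!cmd) return -1;

    BOOL ok = CreateProcessW(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi);
    free(cmd);
    if (!ok) return -1;

    WaitForSingleObject(pi.hProcess, INFINITE);
    DWORD exitCode = 0;
    GetExitCodeProcess(pi.hProcess, &exitCode);
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return (int)exitCode;
}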
I upvoted bobince's answer; I believe it is correct.
The only thing I'd add is that an additional workaround, this being a Windows problem, is to write the command line out to a batch file and then use system() to call the batch file.
I used this approach to successfully get around the problem while running Calibre's ebook-convert command-line tool for a book with UTF-8/non-English characters in its title.
I think bobince's answer is correct, and the solution that worked for me was:
system("c:\\foo.exe -arg #{bar.encode("ISO-8859-1")}")

Why does GetWindowLong have ANSI and Unicode variants?

I found out today that GetWindowLong (and GetWindowLongPtr) has 'ANSI' (A) and 'Unicode' (W) flavours, even though they don't have TSTR arguments. The MSDN page on GetWindowLong only indicates that these variants exist, but doesn't mention why.
I can imagine that it must match the encoding of CreateWindowEx (which also has A/W flavours) or RegisterClass, but for many reasons, I don't think this makes sense. Apparently, it matters, because someone reported that the Unicode version may fail on XP (even though XP is NT and, as I understand it, all Unicode under the hood). I have also tried to disassemble the 32-bit version of USER32.DLL (which contains both flavours of GetWindowLong), and there is extra work done based on some apparent encoding difference*.
Which function am I supposed to choose?
*The flavours of GetWindowLong are identical, except for a boolean they pass around to other functions. This boolean is compared to a flag bit in a memory structure I can't be bothered to track down using static code analysis.
I believe the reason is explained in Raymond Chen's article, What are these strange values returned from GWLP_WNDPROC?
If the current window procedure is incompatible with the caller of GetWindowLongPtr, then the real function pointer cannot be returned since you can't call it. Instead, a "magic cookie" is returned. The sole purpose of this cookie is to be recognized by CallWindowProc so it can translate the message parameters into the format that the window procedure expects.
For example, suppose that you are running Windows XP and the window is a UNICODE window, but a component compiled as ANSI calls GetWindowLong(hwnd, GWL_WNDPROC). The raw window procedure can't be returned, because the caller is using ANSI window messages, but the window procedure expects UNICODE window messages. So instead, a magic cookie is returned. When you pass this magic cookie to CallWindowProc, it recognizes it as a "Oh, I need to convert the message from ANSI to UNICODE and then give the UNICODE message to that window procedure over there."
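In practical terms, this is why subclassing code should treat the value returned by GetWindowLongPtr/SetWindowLongPtr as opaque and always forward through CallWindowProc, and why you should pick the A or W flavour that matches how your own code handles text. A minimal sketch of the usual pattern (the names SubclassProc and Subclass are made up for illustration):

#include <windows.h>

// The previous window procedure, as returned by SetWindowLongPtrW. It may be a
// real function pointer or one of the "magic cookies" described above, so it
// must never be called directly.
static WNDPROC g_prevProc = NULL;

LRESULT CALLBACK SubclassProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    // ... inspect or handle selected messages here ...

    // CallWindowProc recognizes magic cookies and performs any needed
    // ANSI <-> Unicode message translation before the original procedure runs.
    return CallWindowProcW(g_prevProc, hwnd, msg, wParam, lParam);
}

void Subclass(HWND hwnd)
{
    // This module treats all text as UTF-16, so it uses the W flavour throughout.
    g_prevProc = (WNDPROC)SetWindowLongPtrW(hwnd, GWLP_WNDPROC, (LONG_PTR)SubclassProc);
}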

C++/CLI Changing encoding

Good Day,
I've been writing a simple program using the Windows API; it's written in C++/CLI.
The problem I've encountered is this: I'm loading a library (.dll) and then calling its functions. One of the functions returns a char*, and I add the returned value to my textbox:
output->Text = System::Runtime::InteropServices::Marshal::PtrToStringAnsi(IntPtr(Function()));
Now, as you can see, this is converted as ANSI, and the returned char* is, I presume, also treated as ANSI (or Windows-1252, whatever you guys call it :>). The original data, which the function in the library reads, is encoded in UTF-8: a variable-length byte field terminated by 0x00. There are a lot of non-Latin characters in my program, so this is troubling. I've also tried this:
USES_CONVERSION;
wchar_t* pUnicodeString = 0;
pUnicodeString = A2W( Function());
output->Text = System::Runtime::InteropServices::Marshal::PtrToStringUni(IntPtr(pUnicodeString));
using atlconv.h. It still prints malformed/wrong characters. So my question would be: can I convert it to something like UTF-8 so I would be able to see the correct output, or does the char* lose the necessary information required to do so? Maybe changing the .dll source code would help, but it's quite old and written in C, so I don't want to mess with it :/
I hope the information I provided was sufficient, if you need anything more, just ask.
As far as I know, there is no standard way to handle UTF-8 here. Try searching for appropriate converters, e.g. http://www.nuclex.org/articles/cxx/10-marshaling-strings-in-cxx-cli or "Convert from C++/CLI pointer to native C++ pointer".
Also, your second code snippet doesn't use pUnicodeString; it doesn't look right.
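For the specific case in the question, since the bytes coming out of the DLL are already UTF-8, one option is to convert them with MultiByteToWideChar using CP_UTF8 (A2W converts via the ANSI codepage, which is why it mangles the text). A hedged sketch in plain Win32 C++; the helper name Utf8ToWide is made up, and the resulting pointer could then be handed to Marshal::PtrToStringUni as in the second snippet above and freed afterwards:

#include <windows.h>
#include <stdlib.h>

// Convert a NUL-terminated UTF-8 string into a newly allocated UTF-16 string.
// Returns NULL on failure; the caller frees the result with free().
wchar_t* Utf8ToWide(const char* utf8)
{
    if (!utf8) return NULL;

    // First call: measure the required buffer size, in wchar_t units, including the NUL.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (len <= 0) return NULL;

    wchar_t* wide = (wchar_t*)malloc((size_t)len * sizeof(wchar_t));
    if (!wide) return NULL;

    // Second call: perform the actual conversion.
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len);
    return wide;
}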

how does windows wchar_t handle unicode characters outside the basic multilingual plane?

I've looked at a number of other posts here and elsewhere (see below), but I still don't have a clear answer to this question: How does windows wchar_t handle unicode characters outside the basic multilingual plane?
That is:
Many programmers seem to feel that UTF-16 is harmful because it is a variable-length encoding.
wchar_t is 16 bits wide on Windows, but 32 bits wide on Unix/macOS.
The Windows APIs use wide characters, not Unicode.
So what does Windows do when you want to encode something like the Han character 𠂊 (U+2008A)?
The implementation of wchar_t under the Windows stdlib is UTF-16-oblivious: it knows only about 16-bit code units.
So you can put a UTF-16 surrogate sequence in a string, and you can choose to treat that as a single character using higher level processing. The string implementation won't do anything to help you, nor to hinder you; it will let you include any sequence of code units in your string, even ones that would be invalid when interpreted as UTF-16.
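Concretely, on Windows (where wchar_t is 16 bits) a character outside the BMP such as U+2008A occupies two code units. A small sketch showing both the literal form and the surrogate arithmetic from the UTF-16 definition:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    // U+2008A written directly: the compiler stores it as the surrogate pair
    // D840 DC8A, so this "one character" string is two code units long.
    const wchar_t* han = L"\U0002008A";
    printf("code units: %u\n", (unsigned)wcslen(han));          // 2
    printf("%04X %04X\n", (unsigned)han[0], (unsigned)han[1]);  // D840 DC8A

    // The same pair computed by hand.
    unsigned long cp = 0x2008A;
    unsigned hi = 0xD800 + (unsigned)((cp - 0x10000) >> 10);
    unsigned lo = 0xDC00 + (unsigned)((cp - 0x10000) & 0x3FF);
    printf("%04X %04X\n", hi, lo);                              // D840 DC8A

    // wcslen and the rest of the wide-string CRT simply count 16-bit units;
    // they neither validate nor combine surrogate pairs.
    return 0;
}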
Many of the higher-level features of Windows do support characters made out of UTF-16 surrogates, which is why you can call a file 𐐀.txt and see it both render correctly and edit correctly (taking a single keypress, not two, to move past the character) in programs like Explorer that support complex text layout (typically using Windows's Uniscribe library).
But there are still places where you can see the UTF-16-obliviousness shining through, such as the fact you can create a file called 𐐀.txt in the same folder as 𐐨.txt, where case-insensitivity would otherwise disallow it, or the fact that you can create [U+DC01][U+D801].txt programmatically.
This is how pedants can have a nice long and basically meaningless argument about whether Windows "supports" UTF-16 strings or only UCS-2.
Windows used to use UCS-2 but adopted UTF-16 with Windows 2000. Windows wchar_t APIs now produce and consume UTF-16.
Not all third party programs handle this correctly and so may be buggy with data outside the BMP.
Also, note that UTF-16, being a variable-length encoding, does not conform to the C or C++ requirements for an encoding used with wchar_t. This causes some problems: standard functions that take a single wchar_t, such as wctomb, can't handle characters beyond the BMP on Windows, and Windows defines some additional functions that use a wider type in order to be able to handle single characters outside the BMP. I forget what function it was, but I ran into a Windows function that returned int instead of wchar_t (and it wasn't one where EOF was a possible result).
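As a small illustration of the per-string versus per-character difference: the string-oriented Win32 conversion below handles a surrogate pair without trouble, while a single wchar_t simply cannot hold a character outside the BMP, which is the wctomb limitation mentioned above:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    // U+2008A as a UTF-16 surrogate pair; string-oriented APIs accept it.
    const wchar_t wide[] = L"\U0002008A";
    char utf8[8] = { 0 };

    // Returns the number of bytes written, including the terminating NUL.
    int n = WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, (int)sizeof(utf8), NULL, NULL);

    // U+2008A encodes to the four UTF-8 bytes F0 A0 82 8A.
    printf("%d bytes: %02X %02X %02X %02X\n", n - 1,
           (unsigned char)utf8[0], (unsigned char)utf8[1],
           (unsigned char)utf8[2], (unsigned char)utf8[3]);
    return 0;
}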

Resources