How to read/write Chinese/Japanese characters from/to INI files? - winapi

Using WritePrivateProfileString and GetPrivateProfileString results in ??? instead of the real characters.

GetPrivateProfileString() and WritePrivateProfileString() will work with Unicode, sort of.
If the INI file is UTF-16LE encoded, i.e. it has a UTF-16LE BOM, then the functions will work in Unicode. However, if the functions have to create the file, they will create an ANSI file and only work in ANSI.
So to use the functions with Unicode, create your ini file before you first use it and write a UTF-16LE Byte Order Mark to it. Then carry on as normal.
Note that the functions do not work at all with UTF-8.
See Michael Kaplan's blog for more detail than you ever wanted to know about this.
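For example, a minimal sketch of the approach described above (the path C:\temp\settings.ini and the section/key names are purely illustrative):

#include <windows.h>

// Create the INI file with a UTF-16LE BOM if it does not exist yet, so that the
// W variants of the profile functions operate in Unicode rather than ANSI.
void EnsureUtf16Ini(const wchar_t* path)
{
    HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, NULL,
                           CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h != INVALID_HANDLE_VALUE)
    {
        const unsigned char bom[] = { 0xFF, 0xFE };   // UTF-16LE byte order mark
        DWORD written = 0;
        WriteFile(h, bom, sizeof(bom), &written, NULL);
        CloseHandle(h);
    }
}

int main()
{
    const wchar_t* ini = L"C:\\temp\\settings.ini";   // example path only
    EnsureUtf16Ini(ini);

    // Write a mixed Chinese/Japanese value ("\u4E2D\u6587" and "\u30C6\u30B9\u30C8")
    WritePrivateProfileStringW(L"Section", L"Key", L"\u4E2D\u6587\u30C6\u30B9\u30C8", ini);

    wchar_t buf[256] = {0};
    GetPrivateProfileStringW(L"Section", L"Key", L"", buf, 256, ini);
    return 0;
}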

The WritePrivateProfileStringW function will write the INI file in the legacy system encoding (e.g. Shift-JIS on a Japanese system) because it is a legacy support function. If you want a fully Unicode-enabled INI file, you will need to use an external library.
Try SimpleIni http://code.jellycan.com/simpleini/
It is a C++, single-header-file template library with an MIT licence (i.e. commercial use is OK). Include it in your source file and use it. It is cross-platform, supports UTF-8 and legacy encoded files, and can read and write the INI file while largely preserving comments and structure, etc. It is easiest to check out the page.
It's been around for a while and appears to be used by quite a number of people. I wrote it and continue to support it.
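A minimal sketch of typical SimpleIni usage (the file name, section, and key are placeholders; check the project page for the exact interface of the version you download):

#include "SimpleIni.h"

int main()
{
    CSimpleIniA ini;                 // char-based interface
    ini.SetUnicode();                // treat file contents as UTF-8
    if (ini.LoadFile("config.ini") < 0)
        return 1;                    // file missing or unreadable

    const char* value = ini.GetValue("section", "key", "default");
    ini.SetValue("section", "key", "new value");
    ini.SaveFile("config.ini");
    return 0;
}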

According to the WritePrivateProfileString documentation, there is a Unicode version: WritePrivateProfileStringW. Use that, and you should be able to use Unicode characters.

It might just be a problem with how you are displaying or handling the strings.
For example, the normal console window can't display Japanese strings with printf.
Can you post some of your code?

Related

Character set conversion problem - debug invalid characters - reverse engineer earlier conversions

Character conversion problem.
I have a few strings which are incorrectly encoded or decoded.
The strings came in an ASCII format CSV file.
The current strings I have are:
N‚met
Tet‹
I know that the:
"‚" character (0x82) should originally be "é" (e with acute accent)
"‹" character (0x8B) should originally be "ő" (o with double acute accent)
How can I debug and reverse engineer, what conversions happened with the original characters to get the current characters?
I suppose that multiple rounds of encoding and decoding happened, but I was not able to reproduce the original characters.
I put an expanded version of my comment as answer:
Your viewer uses CP1252 (English and Western Europe, also called ANSI on Windows), CP1250 (Eastern Europe), or another similar code page. Most characters are coded the same way; there are just a few language-specific differences. Your example does not include a character that differs between the two encodings, so I cannot say precisely which one it is.
Those code pages are used on Microsoft Windows and are based on (but not 100% compatible with) Latin-1, so it is common to see text interpreted with such an encoding. macOS and Linux are now heavily UTF-8 based. Windows uses Unicode internally (but as UTF-16).
The old encoding is probably CP437, the standard code page in DOS, which was therefore also used frequently for CSV files. Other common old encodings are CP850 (Western Europe) and CP852 (Central Europe).
As for the other questions you put in the comments: if you are asking about tools, that belongs on Super User. Some editors allow you to specify the encoding; browsers (opening a local file) also let you choose the encoding, and I think you may be able to copy the text out as Unicode (not sure); other tools sometimes have hidden options to import files, but possibly not with all the options you need. If you want to do it programmatically, ask a new question on this site and specify the language. Python is well suited for such conversions (most scripting languages were created to handle text): it has many encodings built in, and you just specify them when reading and writing the files. R can also be told the input encoding.
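If you want to test the CP437/CP850/CP852 hypothesis programmatically on Windows, a sketch like the following re-interprets the raw bytes under each candidate code page (the sample bytes come from the question; everything else is illustrative, and a Unicode-capable console is assumed for the output):

#include <windows.h>
#include <cstdio>

int main()
{
    // The bytes that a CP1252 viewer shows as "N‚met" (0x82 is the suspect byte).
    const char raw[] = "N\x82met";
    const UINT candidates[] = { 437, 850, 852, 1250, 1252 };

    for (UINT cp : candidates)
    {
        wchar_t wide[32] = {0};
        // Re-interpret the same bytes as if they were encoded in code page 'cp'.
        MultiByteToWideChar(cp, 0, raw, -1, wide, 32);
        wprintf(L"CP%u -> %ls\n", cp, wide);
    }
    return 0;
}

Whichever code page turns 0x82 into "é" and 0x8B into "ő" is the likely original encoding; in this example that points to CP852, which covers Hungarian.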
I wrote my own utility that helped me to diagnose and fix many thorny encoding issues. It is available as part of an open source library. The utility converts any String to a Unicode escape sequence and vice versa. All you have to do is:
String codes = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Hello world");
and it will return the String "\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u006f\u0072\u006c\u0064".
The same works for any String in any language, including special characters. Here is the link to the article Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison, which explains the library and where to get it (it is available both on Maven Central and GitHub); in the article, search for the paragraph "String Unicode converter". So when you read your String, convert it and see what comes up. This way you will see which symbols are there, and whether the information is correct and only distorted by a wrong encoding, or the information itself is lost. You can easily find tables on the internet that map any symbol to a Unicode code point.

fopen with unicode filename

I have to use a library that accepts file names as strings (const char*). Internally, files are opened with fopen. Is there a way to make this library accept Unicode file names? Can I use WideCharToMultiByte to convert Unicode names to UTF-8 before passing them to the library?
One possible (undesirable) solution is to change the library interface (char* -> wchar_t*) and replace fopen with the Windows-specific _wopen. Another solution is to create symbolic links to the files and pass those to the library, but that is limited to NTFS volumes only.
Best way would be to rewrite the lib... Just my 2 Cents.
But if it is just about opening an existing file, you can use GetShortPathName.
You can find an existing discussion about this approach here.
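A sketch of that approach, assuming the file already exists (GetShortPathName only works on existing files) and that 8.3 short-name generation is enabled on the volume; the path and library_open are placeholders:

#include <windows.h>
#include <string>

// Convert a wide path to its 8.3 short form and then to the ANSI code page,
// so it can be handed to a library that only accepts const char*.
std::string ShortAnsiPath(const wchar_t* widePath)
{
    wchar_t shortPath[MAX_PATH] = {0};
    if (GetShortPathNameW(widePath, shortPath, MAX_PATH) == 0)
        return std::string();   // fails if the file does not exist, among other reasons

    char ansiPath[MAX_PATH] = {0};
    WideCharToMultiByte(CP_ACP, 0, shortPath, -1, ansiPath, MAX_PATH, NULL, NULL);
    return std::string(ansiPath);
}

// Usage (library_open stands in for the library's char*-based API):
//   library_open(ShortAnsiPath(L"C:\\data\\report-\u00E9.txt").c_str());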
Using WideCharToMultiByte you are only able to open files whose names contain only ANSI characters. This is because the ANSI variants of the file functions (those taking a "char *" argument) cannot open files whose names contain characters outside the current ANSI code page.
Using GetShortPathName has the disadvantage that it might not work on certain file systems (for example, some types of network drives) that do not support "8.3" file names.
I would rewrite the library using the "_wfopen" function (the Unicode equivalent of "fopen" is "_wfopen", not "_wopen").
Please note that when using "_wfopen", the second (mode) argument must also be a Unicode string.
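If you can change the library, the rewrite itself is small; a sketch:

#include <cstdio>

// Wide-character replacement for the library's fopen call; note that the mode
// string must be wide as well.
FILE* open_file(const wchar_t* path)
{
    return _wfopen(path, L"rb");
}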

What's the best way to identify unicode encoded text files in Windows?

I am working on a codebase which has some Unicode-encoded files scattered throughout, as a result of multiple team members developing with different editors (and default settings). I would like to clean up our code base by finding all the Unicode-encoded files and converting them back to ANSI encoding.
Any thoughts on how to accomplish the "finding" part of this task would be truly appreciated.
See “How to detect the character encoding of a text-file?” or “How to reliably guess the encoding [...]?”
UTF-8 can be detected with validation. You can also look for the BOM EF BB BF, but don't rely on it.
UTF-16 can be detected by looking for the BOM.
UTF-32 can be detected by validation, or by the BOM.
Otherwise assume the ANSI code page.
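A minimal BOM-sniffing sketch for the "finding" part (it only reports the BOM; BOM-less UTF-8 or UTF-16 would still need validation as described above):

#include <cstdio>

// Classify a file by its leading bytes. Order matters: the UTF-32LE BOM
// (FF FE 00 00) starts with the UTF-16LE BOM (FF FE), so check it first.
const char* DetectBom(const unsigned char* b, size_t n)
{
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8 BOM";
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32LE BOM";
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32BE BOM";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE BOM";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE BOM";
    return "no BOM (ANSI or BOM-less Unicode)";
}

int main(int argc, char** argv)
{
    for (int i = 1; i < argc; ++i)
    {
        FILE* f = fopen(argv[i], "rb");
        if (!f) continue;
        unsigned char buf[4] = {0};
        size_t n = fread(buf, 1, sizeof(buf), f);
        fclose(f);
        printf("%s: %s\n", argv[i], DetectBom(buf, n));
    }
    return 0;
}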
Our codebase doesn't include any non-ASCII chars. I will try to grep for the BOM in files in our codebase. Thanks for the clarification.
Well that makes things a lot simpler. UTF-8 without non-ASCII chars is ASCII.
Unicode is a standard, not an encoding. There are many encodings that implement Unicode, including UTF-8, UTF-16, UCS-2, and others. How any of these encodings translates to ASCII depends entirely on what encoding your "different editors" use.
Some editors insert byte-order marks, or BOMs, at the start of Unicode files. If your editors do that, you can use them to detect the encoding.
ANSI is a standards body that has published several encodings for digital character data. The "ANSI" encoding used by Windows is actually Windows-1252 (CP-1252), which is not an ANSI standard.
Does your codebase include non-ASCII characters? You may have better compatibility using a Unicode encoding rather than an ANSI one or CP-1252.
Actually, if you want to find out on Windows whether a file is Unicode, simply run findstr on the file for a string you know is in there:
findstr /I /C:"SomeKnownString" file.txt
If the file is UTF-16, this will come back empty, because the NUL bytes between the characters break the match. Then, to be sure, run findstr on a single letter or digit you know is in the file:
FindStr /I /C:"P" file.txt
You will probably get many occurrences, and the key is that they will be spaced apart. This is a sign that the file is Unicode (UTF-16) and not ASCII.
Hope this helps.
If you're looking for a programmatic solution, IsTextUnicode() might be an option.
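A sketch of how that might be called (IsTextUnicode() uses statistical heuristics and is known to guess wrong on short or unusual input, so treat the result as a hint rather than proof):

#include <windows.h>

// Returns true if the buffer looks like UTF-16 text according to the
// standard IsTextUnicode heuristics.
bool LooksLikeUtf16(const void* data, int size)
{
    INT tests = IS_TEXT_UNICODE_UNICODE_MASK;   // tests to run; receives the results
    return IsTextUnicode(data, size, &tests) != FALSE;
}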
It's kind of hard to say, but I'd start by looking for a BOM. Most Windows programs that write Unicode files emit BOMs.
If these files exist in your codebase, presumably they compile. You might ask yourself whether you really need to do this "tidying up". If you do need to do it, then I would ask how the tool chain that processes these files discovers their encoding. If you know that, then you'll be able to use the same diagnostic.

Extended character code pages

I was trying to validate characters (the extended ones) and I see that on various PCs the extended characters are in different places. I mean, we do not see the same code number for a certain character (for non-Latin ones).
Now my issue is: what do I have to do so that my program, when it starts, always uses a certain code table?
For extended characters, of course.
Since .NET strings are UTF-16, this issue generally relates only to reading and writing text files. In that case, just use Encoding.GetEncoding(codePage) to choose the appropriate encoding, and use it when accessing any text files. All the standard built-in text/file utility operations take an encoding, for example:
string contents = File.ReadAllText("foo.txt", encoding);

GetPrivateProfileString Bug

I encrypted some text and put it in an INI file. Then I used GetPrivateProfileString() to retrieve the value, but some of the end characters are missing. I suspect it may be a newline character causing it to be incomplete. Writing to the INI file is OK, and opening the INI file and looking at the sections and keys, everything is in order. It's just the retrieving part that causes the bug.
Please any help would be appreciated.
Thanks
Eddie
First off, when encrypting strings, make sure they are converted to Base64 before dumping them into the INI file.
Most likely, the encryption produced a character that is not handled very well by the INI-related APIs.
WritePrivateProfileStringW writes files in the active ANSI codepage by default; WritePrivateProfileStringA will always write ANSI.
To achieve the best results, follow the directions here and use GetPrivateProfileStringW when reading the data back.
It's more than likely that the encryption is injecting a NUL character into the stream you are writing; GetPrivateProfileString will only read a string until it finds a NUL character.
So I agree with Angry Hacker: convert to Base64 or some other friendly, human-readable encoding and you won't have any problems.
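A sketch of the write side using the Crypt32 Base64 helpers (the section, key, and path names are placeholders; link with crypt32.lib; reading back would use GetPrivateProfileStringA plus CryptStringToBinaryA):

#include <windows.h>
#include <wincrypt.h>
#include <string>
#pragma comment(lib, "crypt32.lib")

// Base64-encode the encrypted bytes before writing them, so no NUL, newline,
// or other control bytes ever reach the INI file.
bool WriteEncryptedValue(const BYTE* cipher, DWORD cipherLen)
{
    DWORD chars = 0;
    // First call: ask for the required buffer size (includes the terminator).
    if (!CryptBinaryToStringA(cipher, cipherLen,
                              CRYPT_STRING_BASE64 | CRYPT_STRING_NOCRLF, NULL, &chars))
        return false;

    std::string base64(chars, '\0');
    if (!CryptBinaryToStringA(cipher, cipherLen,
                              CRYPT_STRING_BASE64 | CRYPT_STRING_NOCRLF, &base64[0], &chars))
        return false;

    return WritePrivateProfileStringA("Crypto", "Secret", base64.c_str(),
                                      "C:\\temp\\app.ini") != FALSE;
}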
