fopen with unicode filename - windows

I have to use a library that accepts file names as strings (const char*). Internally, files are opened with fopen. Is there a way to make this library accept Unicode file names? Can I use WideCharToMultiByte to convert Unicode names to UTF-8 before passing them to the library?
One possible (undesirable) solution is to change the library interface (char* -> wchar_t*) and replace fopen with the Windows-specific _wopen. Another solution is to create symbolic links to the files and pass those to the library, but that is limited to NTFS volumes.

The best way would be to rewrite the library; just my two cents.
But if you only need to open an existing file, you can use GetShortPathName.
There is an existing discussion of this approach here.

Using WideCharToMultiByte, you can only open files whose names consist entirely of characters in the current ANSI code page. This is because the ANSI variants of the file functions (those taking a "char *" argument) cannot open files whose names contain characters outside that code page.
Using GetShortPathName has the disadvantage that it might not work on certain file systems (maybe certain types of network drives) that do not support "8.3" file names.
I would rewrite the library to use the "_wfopen" function (the Unicode equivalent of "fopen" is "_wfopen", not "_wopen").
Please note that the second argument (the mode) must also be a wide string when using _wfopen, for example L"rb".
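If rewriting the library is an option, a small wrapper can keep the char* interface and still reach _wfopen. The following is only a sketch under some assumptions: the name fopen_utf8 is made up for illustration, callers are assumed to pass UTF-8, and non-Windows platforms are assumed to accept UTF-8 byte strings in fopen directly.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>
#endif

/* Hypothetical helper (not part of any existing library): opens a file
   whose name and mode are given as UTF-8 strings. */
FILE *fopen_utf8(const char *path, const char *mode)
{
    assert(path != NULL && mode != NULL);
#ifdef _WIN32
    /* First pass: ask for the required sizes in wchar_t units,
       including the terminating null. */
    int path_len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, path, -1, NULL, 0);
    int mode_len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, mode, -1, NULL, 0);
    if (path_len <= 0 || mode_len <= 0)
        return NULL;                      /* input was not valid UTF-8 */

    wchar_t *wpath = malloc((size_t)path_len * sizeof *wpath);
    wchar_t *wmode = malloc((size_t)mode_len * sizeof *wmode);
    FILE *fp = NULL;
    if (wpath != NULL && wmode != NULL &&
        MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, path_len) > 0 &&
        MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, mode_len) > 0) {
        fp = _wfopen(wpath, wmode);       /* the mode is a wide string too */
    }
    free(wpath);
    free(wmode);
    return fp;
#else
    /* Assumption: other platforms take UTF-8 byte strings as-is. */
    return fopen(path, mode);
#endif
}
```

Inside the library, each fopen(name, mode) call would then become fopen_utf8(name, mode), and the caller converts its wide names to UTF-8 with WideCharToMultiByte(CP_UTF8, ...) before passing them in.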

Related

Rules for file extensions?

Are there any rules for file extensions? For example, I wrote some code which reads and writes a byte pattern that is only understood by that specific program. I'm assuming my antivirus program won't be too happy if I give it the name "pleasetrustme.exe"... Is it generally allowed to use those extensions? And what about the lesser known ones, like ".arw"?
You can use any file extension you want (or none at all). Using standard extensions that reflect the actual type of the file just makes things more convenient. On Windows, file extensions control stuff like how the files are displayed in Windows Explorer and what happens when you double click on it.
I wrote some code which reads and writes a byte pattern that is only
understood by that specific program.
A file extension is only an indication of what type of data will be inside, never a guarantee that certain data formatted in a specific way will be inside the file.
For your own specific data structure it is of course always best to choose an extension that is not already in use for other file formats (or maybe use a general extension like .dat or .bin). This also has the advantage that you can use your own icon without it being overwritten by other software using the same extension, or the other way around.
But maybe even more important when creating a custom (binary?) file format, is to provide a magic number as the first bytes of that file, maybe followed by a file header structure containing a version number etc. That way your own software can first check the header data to make sure it's the right type and version (for example: anyone could rename any file type to your extension, so your program needs to have a way to do some checks inside the file before reading the remaining data).
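As a rough illustration of the magic number idea, here is a minimal sketch. The four magic bytes "MYF1", the one-byte version field, and the function names are all invented for this example; a real format would pick its own values and probably a larger header.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical format: 4 magic bytes followed by a one-byte version. */
static const unsigned char MAGIC[4] = { 'M', 'Y', 'F', '1' };
enum { FORMAT_VERSION = 1, HEADER_SIZE = 5 };

/* Writes the header at the current position.
   Returns 0 on success, -1 on failure. */
int write_header(FILE *fp)
{
    assert(fp != NULL);
    unsigned char header[HEADER_SIZE];
    memcpy(header, MAGIC, sizeof MAGIC);
    header[4] = FORMAT_VERSION;
    return fwrite(header, 1, sizeof header, fp) == sizeof header ? 0 : -1;
}

/* Reads and validates the header at the current position.
   Returns the version number, or -1 if this is not our format. */
int read_header(FILE *fp)
{
    assert(fp != NULL);
    unsigned char header[HEADER_SIZE];
    if (fread(header, 1, sizeof header, fp) != sizeof header)
        return -1;                        /* file is too short */
    if (memcmp(header, MAGIC, sizeof MAGIC) != 0)
        return -1;                        /* renamed file of another type */
    return header[4];
}
```

A reader would call read_header right after opening the file in binary mode and refuse to go on if it returns -1 or a version it does not understand.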

Unicode filenames on FAT-32?

As far as I understand, NTFS supports Unicode filenames (UTF-16, as Microsoft claims?).
But official MSDN documentation is very vague regarding what codepage(s) is used to store filenames (filepaths) on FAT-32.
Here it says that OEM code page (CP437 I assume) is used to store filenames: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317748.aspx
But here it turns out that there can be different OEM codepages with CP437 being one of them: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317752.aspx
And we all know that utilities like mount support many more different codepages for FAT, more than just the set of OEM codepages.
So what is the actual codepage for FAT-32 filenames? It depends on the system codepage at the time when FAT volume was created? Can FAT support true Double Byte Character Set codepages like UTF-16? Or Multi Byte Character Set codepages like UTF-8 is the limit?
And more specific question:
What happens when I use the CreateFileW function (which, as MSDN states, uses UTF-16 as the filename codepage) to create a file on a FAT-32 volume?
You might have to experiment here. This is a great question, and I'm not 100% confident, but:
So what is the actual codepage for FAT-32 filenames? It depends on the system codepage at the time when FAT volume was created?
The "OEM codepage", whatever that is for the system.
Can FAT support true Double Byte Character Set codepages like UTF-16? Or Multi Byte Character Set codepages like UTF-8 is the limit?
No, I don't believe FAT is directly capable of either UTF-16 or UTF-8. That said, Microsoft stores the Unicode filename in an out of band method. A file thus has two filenames. (This is how you can have longer than 8.3 character filenames, as well.)
And more specific question: What happens when I use the CreateFileW function (which, as MSDN states, uses UTF-16 as the filename codepage) to create a file on a FAT-32 volume?
The Unicode filename, as passed to CreateFileW, is stored directly in the out-of-band (long) filename. It is also re-encoded into the OEM codepage (whatever that happens to be on the system) and stored as the short 8.3 name. If it cannot be converted into the OEM codepage, or exceeds 8.3 characters, Windows will give the file a short name something like FILENA~1.TXT.
Some citations for these answers:
First, this page tells us that the OEM code page != the Windows code page:
Non-Unicode applications that create FAT files sometimes have to use the standard C runtime library conversion functions to translate between the Windows code page character set and the OEM code page character set. With Unicode implementations of the file system functions, it is not necessary to perform such translations.
On a typical American system, the OEM code page is "CP437", but the Windows code page is Windows-1252 (The FooA calls, I believe, use the Windows code page, typically Windows-1252 on an American machine, but depends on locale).
If you have a FAT volume available, you can see this in action. The character "Σ" (U+03a3) is not present in Windows-1252, however, it is in CP437. You can see both the short and long filenames with dir /X. With a file named asdfΣ.txt, you'll see:
ASDFΣ.TXT asdfΣ.txt
However, with a file named "asdfΛ.txt" (Λ is not present in either CP437 or Windows-1252), you'll see:
ASDF~1.TXT asdf?.txt
(You'll likely see ?, because cmd.exe's font cannot display a Λ.)
For information about long filenames, see this Wikipedia article.
Also, interestingly, if you name a file "asdf©.txt", you might get:
ASDFC.TXT asdfc.txt
… I'm not 100% sure here, but I think Windows cleverly decided to substitute "c" for ©, and did likewise for displaying it. If you change the font to something not raster based, like Consolas, you'll see:
ASDFC.TXT asdf©.txt
And this is why you should use the FooW functions.
The basic FAT or FAT32 directory entries support only short names (the old DOS 8.3 format) in the current OEM codepage. However, VFAT (FAT with long filename support) which is used while under Windows, can store an additional, so-called long filename for each file, in UTF-16.

Meaning of this string \\.\c:

I'm reading this. Here I've found some code lines, for example: wsprintf(szDrive, "\\\\.\\%c:", *lpszSrc); I want to ask, what does this string give?
I tried to look for information but all that I've found is:
In the ANSI version of this function, the name is limited to MAX_PATH
characters. To extend this limit to 32,767 wide characters, call the
Unicode version of the function and prepend "\\?\" to the path. For
more information, see Naming Files, Paths, and Namespaces.
This does not answer my question, so I am asking here. I think it is connected with something Windows-specific or with NTFS, but I am not sure about that.
The %c is the single character format specifier for wsprintf.
The code is used to generate path names of this form:
\\.\C:
This is the path to a physical volume. You use such a path when performing file operations directly on a volume, bypassing the file system. So you'd use such a path when implementing raw disk copy, for example. The documentation for CreateFile has more detail.
This all ties in with the fact that the code you found this in performs a raw disk copy.
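To see what the format string produces, here is a small sketch. It uses the standard snprintf as a stand-in for wsprintf so it compiles anywhere, and the function name make_volume_path is made up for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Formats the volume path for a drive letter: 'c' gives the six
   characters  \\.\C:  followed by the terminating null.
   Returns 0 on success, -1 on a bad letter or a too-small buffer. */
int make_volume_path(char drive_letter, char *out, size_t out_size)
{
    assert(out != NULL);
    int is_letter = (drive_letter >= 'A' && drive_letter <= 'Z') ||
                    (drive_letter >= 'a' && drive_letter <= 'z');
    if (!is_letter)
        return -1;
    /* In C source every backslash is doubled, so the literal below
       starts with two real backslashes, a dot, and one more backslash. */
    int n = snprintf(out, out_size, "\\\\.\\%c:",
                     toupper((unsigned char)drive_letter));
    return (n > 0 && (size_t)n < out_size) ? 0 : -1;
}
```

On Windows the resulting string would be passed to CreateFile to open the volume itself; opening a volume this way normally needs administrator rights.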

What encoding are filenames in NTFS stored as?

I'm just getting started on some programming to handle filenames with non-english names on a WinXP system. I've done some recommended reading on unicode and I think I get the basic idea, but some parts are still not very clear to me.
Specifically, what encoding (UTF-8, UTF-16LE/BE) are the file names (not the content, but the actual name of the file) stored in on NTFS? Is it possible to open any file using fopen(), which takes a char*, or do I have no choice but to use _wfopen(), which uses a wchar_t*, and presumably takes a UTF-16 string?
I tried manually feeding in a UTF-8 encoded string to fopen(), eg.
unsigned char filename[] = {0xEA, 0xB0, 0x80, 0x2E, 0x74, 0x78, 0x74, 0x0}; // 가.txt
FILE* f = fopen((char*)filename, "wb+");
but this came out as 'ê°€.txt'.
I was under the impression (which may be wrong) that a UTF8-encoded string would suffice in opening any filename under Windows, because I seem to vaguely remember some Windows application passing around (char*), not (wchar_t*), and having no problems.
Can anyone shed some light on this?
NTFS stores filenames in UTF-16; however, fopen uses ANSI (not UTF-8).
In order to use a UTF-16-encoded file name you will need to use the Unicode versions of the file open calls. Do this by defining UNICODE and _UNICODE in your project. Then use the CreateFile call or the _wfopen call.
fopen() in MSVC on Windows does not (by default) take a UTF-8-encoded char*.
Unfortunately, UTF-8 was invented rather recently in the great scheme of things. Windows APIs are divided into Unicode and ANSI versions. Every Windows API that takes or deals with strings is actually available with a W or A suffix: W for "Wide" character/Unicode and A for ANSI. Macro magic hides all this from the developer, so you just call CreateFile with either a char* or a wchar_t* depending on your build configuration, without knowing the difference.
The "ANSI" encoding is actually not one specific encoding; it means that the encoding used for "char" strings is specific to the locale setting of the PC.
Now, because C runtime functions like fopen need to work by default without developer knowledge, on Windows systems they expect to receive their strings in the Windows local encoding. MSDN indicates that the Microsoft C runtime function setlocale can change the locale of the current thread, but specifically says that it will fail for any locales that need more than 2 bytes per character, like UTF-8.
So, on Windows there is no shortcut. You need to use _wfopen, or the native API CreateFileW (or create your project using the Unicode build settings and just call CreateFile), with wchar_t* strings.
As answered by others, the best way to handle UTF-8-encoded strings is to convert them to UTF-16 and use native Unicode APIs such as _wfopen or CreateFileW.
However, this approach won't help when calling into libraries that use fopen() unconditionally because they do not support Unicode or because they are written in portable C. In that case it is still possible to make use of the legacy "short paths" to convert a UTF-8-encoded string into an ASCII form usable with fopen, but it requires some legwork:
1. Convert the UTF-8 representation to UTF-16 using MultiByteToWideChar.
2. Use GetShortPathNameW to obtain a "short path", which is typically ASCII-only. GetShortPathNameW returns it as a wide string; once you have checked that its content is all ASCII, you can convert it to a narrow string losslessly by copying and casting each wchar_t to char.
3. Pass the short path to fopen() or to the code that will eventually use fopen(). Be aware that error messages printed by that code, if any, will refer to the unsightly "short path" (e.g. KINTO~1 instead of kinto-un-筋斗雲).
While this is not exactly a recommended long-term strategy, as Windows short paths are a legacy feature that can be turned off per-volume, it is likely the only way to pass file names to code that uses fopen() and other file-related API calls (stat, access, ANSI versions of CreateFile and similar).
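The conversion, short-path lookup, and narrowing described above could look roughly like this. It is a sketch, not tested production code: the names narrow_ascii and utf8_to_short_path are made up, paths are assumed to fit in MAX_PATH, and the file must already exist for GetShortPathNameW to succeed. The narrowing helper is portable, and the Windows-only part is guarded so the rest compiles elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Copies an all-ASCII wide string into a narrow buffer.
   Returns 0 on success, -1 if a non-ASCII character is found
   or the buffer is too small. */
int narrow_ascii(const wchar_t *src, char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL);
    if (dst_size == 0)
        return -1;
    size_t i;
    for (i = 0; src[i] != L'\0'; i++) {
        /* Keep one slot for the terminator; reject anything above 0x7F
           (the cast also rejects negative values where wchar_t is signed). */
        if (i + 1 >= dst_size || (unsigned long)src[i] > 0x7F)
            return -1;
        dst[i] = (char)src[i];
    }
    dst[i] = '\0';
    return 0;
}

#ifdef _WIN32
/* UTF-8 -> UTF-16 -> short path -> narrow ASCII.
   Returns 0 on success, -1 on any failure, including volumes where
   short names are disabled and the result is therefore not ASCII. */
int utf8_to_short_path(const char *utf8, char *out, size_t out_size)
{
    wchar_t wide[MAX_PATH], shortw[MAX_PATH];
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8, -1, wide, MAX_PATH) == 0)
        return -1;
    DWORD n = GetShortPathNameW(wide, shortw, MAX_PATH);
    if (n == 0 || n >= MAX_PATH)
        return -1;
    return narrow_ascii(shortw, out, out_size);
}
#endif
```

The string written to out is what you would hand to the fopen-based code. Treating a failed narrowing as an error, rather than substituting '?', avoids silently opening the wrong file.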

How to read/write Chinese/Japanese characters from/to INI files?

Using WritePrivateProfileString and GetPrivateProfileString results in ??? instead of the real characters.
GetPrivateProfileString() and WritePrivateProfileString() will work with Unicode, sort of.
If the ini file is UTF-16LE encoded, i.e. it has a UTF-16 BOM, then the functions will work in Unicode. However if the functions have to create the file they will create an ANSI file and only work in ANSI.
So to use the functions with Unicode, create your ini file before you first use it and write a UTF-16LE Byte Order Mark to it. Then carry on as normal.
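Creating the file with a BOM up front only takes a few lines. This is a minimal sketch with a made-up function name; note that it overwrites any file already at that path, so you would only call it when the INI file does not exist yet.

```c
#include <assert.h>
#include <stdio.h>

/* Creates an empty file that starts with the UTF-16LE byte order mark
   (bytes FF FE). Overwrites an existing file at that path.
   Returns 0 on success, -1 on failure. */
int create_utf16le_ini(const char *path)
{
    assert(path != NULL);
    static const unsigned char bom[2] = { 0xFF, 0xFE };
    FILE *fp = fopen(path, "wb");         /* binary mode: bytes written as-is */
    if (fp == NULL)
        return -1;
    int ok = fwrite(bom, 1, sizeof bom, fp) == sizeof bom;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

After that, the W variants (WritePrivateProfileStringW and GetPrivateProfileStringW) are called on the same path as usual.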
Note that the functions do not work at all with UTF-8.
See Michael Kaplan's blog for more detail than you ever wanted to know about this.
The WritePrivateProfileStringW function will write the INI file in legacy system encoding (e.g. Shift-JIS on a Japanese system) because it is a legacy support function. If you want to have a fully Unicode-enabled INI file, you will need to use an external library.
Try SimpleIni http://code.jellycan.com/simpleini/
It is C++, single header file, template library with an MIT licence (i.e. commercial use is OK). Include it into your source file and use it. It is cross-platform, supports UTF-8 and legacy encoded files, and can read and write the INI file largely preserving comments and structure, etc. Easiest to check out the page.
It's been around for a while and appears to be used by quite a number of people. I wrote it and continue to support it.
According to the WritePrivateProfileString documentation, there is a Unicode version: WritePrivateProfileStringW. Use that, and you should be able to use Unicode characters.
It might just be a problem with how you are displaying or handling the strings.
For example, the normal console window can't display Japanese strings with printf.
Can you post some of your code?
