How to load hunspell dictionary in Windows path with non-ASCII characters? - winapi

Hunspell manual suggests:
In WIN32
environment, use UTF-8 encoded paths started with the long path prefix \\?\ to handle
system-independent character encoding and very long path names, too.
So I have code to do the following:
QString spell_aff = QStringLiteral(R"(\\?\%1%2.aff)").arg(path, newDict);
QString spell_dic = QStringLiteral(R"(\\?\%1%2.dic)").arg(path, newDict);
// while normally not an issue, you can't mix forward and back slashes with the prefix
spell_aff = spell_aff.replace(QChar('/'), QStringLiteral("\\"));
spell_dic = spell_dic.replace(QChar('/'), QStringLiteral("\\"));
qDebug() << "right before Hunspell_create";
mpHunspell_system = Hunspell_create(spell_aff.toUtf8().constData(), spell_dic.toUtf8().constData());
qDebug() << "right after Hunspell_create";
This prefixes \\?\ to the path, uses a consistent directory separator as required by the note in the Microsoft documentation, and converts the result to UTF-8 with .toUtf8().
Yet running the code on Windows 10 Pro fails.
How to fix?
Using Qt5, MinGW 7.3.0.
I've also done my research, and as far as I can see, LibreOffice does the same thing and it seemingly works for them: sspellimp.cxx, lingutil.hxx, and lingutil.cxx.

You can use GetShortPathNameW to obtain a pure-ASCII path that Hunspell will understand. See QTIFW-175 for an example.
(Thanks to the question "Windows directory that will never contain non-ASCII characters for temp file?")
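A minimal sketch of the short-path approach, assuming a Windows build where <windows.h> is available (MinGW ships it). The names toShortPath and isPureAscii are hypothetical helpers, not part of Hunspell or Qt. On other platforms the function simply returns its input so the snippet still compiles. Because 8.3 short names can be unavailable on some volumes, the result may still contain non-ASCII characters, so it is worth checking before handing the path to Hunspell.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: true if every character is 7-bit ASCII.
bool isPureAscii(const std::wstring& s) {
    for (wchar_t c : s) {
        if (static_cast<unsigned long>(c) > 0x7F) return false;
    }
    return true;
}

// Hypothetical helper: returns the 8.3 short form of an existing path,
// or the input unchanged if the conversion fails or is unsupported.
std::wstring toShortPath(const std::wstring& longPath) {
#ifdef _WIN32
    // First call asks for the required buffer size (including the terminator).
    DWORD needed = GetShortPathNameW(longPath.c_str(), nullptr, 0);
    if (needed == 0) return longPath;
    std::wstring buf(needed, L'\0');
    // Second call fills the buffer and returns the length without the terminator.
    DWORD written = GetShortPathNameW(longPath.c_str(), &buf[0], needed);
    if (written == 0 || written >= needed) return longPath;
    buf.resize(written);
    return buf;
#else
    return longPath;
#endif
}
```

A caller would convert the dictionary paths with toShortPath, verify them with isPureAscii, and fall back to another strategy (such as copying the files) if the check fails.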

Related

Windows directory that will never contain non-ASCII characters for temp file?

Using MinGW 7.3.0 on Windows, Hunspell can't load the dictionary files from locations that have non-ASCII characters because of Windows limitations. I've tried everything[1] and I'm now resorting to copying the file to a path without non-ASCII characters before giving it to Hunspell. What is a good location to copy it to?
[1]
Windows requires a wchar_t overload of std::fstream::open() for this to work right, which MinGW does not implement
std::filesystem can solve this, but it is only available from GCC 8
Hunspell insists on loading files on its own, it is not possible to pass the read files as strings to it
The "natural" fit would be to use the user's chosen temporary directory (or a subdirectory thereof; see %temp% or GetTempPath()). However, that defaults to something that contains the user name (which can contain "non-ASCII" characters; e.g. c:\users\Ø¥Ć¼\AppData\LocalLow\Temp) or something arbitrary (regarding character set) altogether.
So you're most likely best off to choose some directory that
a) does not contain off-limits characters from the get go. For example, a directory underneath C:\ProgramData that you choose yourself (e.g. the application name) that you know does not contain non-ASCII characters.
b) let the user decide where to put these files and make sure it is only permissible to enter a path that contains only allowed characters.
c) Pass the "short path name" to Hunspell, which should not contain non-ASCII characters for compatibility with FAT file system traits. For example, the short path name for c:\temp\Ø¥Ć¼ is c:\temp\571D~1.
You can see the short names for directories using cmd.exe /c dir /x:
C:\temp>dir /x
...
19.07.2019 15:30 <DIR> .
19.07.2019 15:30 <DIR> ..
19.07.2019 15:30 <DIR> 571D~1 Ø¥Ć¼
How you can invoke the GetShortPathName Win32 API from MinGW I don't know, but I would assume that it is possible.
Also make sure to review the MSDN page for the above function for trade-offs, e.g. short names are not supported everywhere (e.g. SMB; see also the comments below).
From this bug tracker:
In WIN32 environment, use UTF-8 encoded paths started with the long
path prefix \\?\ to handle system-independent character encoding
and very long path names (without the long path prefix Hunspell will
use fopen() with system-dependent character encoding instead of
_wfopen()).
So the actual solution seems to be:
Call GetFullPathNameW to normalize the path. Required because paths with long path prefix \\?\ are passed to the NT API unchanged.
Prepend L"\\\\?\\" to the normalized path (backslashes doubled because of C string literal requirements).
For a UNC path, you have to use the "UNC" device directly, i.e. L"\\\\server\\share" → L"\\\\?\\UNC\\server\\share" (thanks eryksun).
Encode the path in UTF-8, e. g. using WideCharToMultiByte() with CP_UTF8.
Pass the final UTF-8 encoded path to Hunspell.
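The prefixing step in the list above can be sketched as plain string handling. This is only an illustration under stated assumptions: addLongPathPrefix is a hypothetical name, the input is assumed to be an absolute path already normalized by GetFullPathNameW, and device paths such as \\.\ are not handled. The surrounding GetFullPathNameW and WideCharToMultiByte calls are left out.

```cpp
#include <string>

// Hypothetical helper: adds the long path prefix to an absolute,
// already normalized Windows path. Assumes backslash separators only.
std::wstring addLongPathPrefix(const std::wstring& fullPath) {
    const std::wstring prefix = L"\\\\?\\";
    const std::wstring uncPrefix = L"\\\\?\\UNC\\";

    // Already prefixed: leave the path untouched.
    if (fullPath.compare(0, prefix.size(), prefix) == 0) {
        return fullPath;
    }
    // UNC path: drop the two leading backslashes and use the UNC device.
    if (fullPath.compare(0, 2, L"\\\\") == 0) {
        return uncPrefix + fullPath.substr(2);
    }
    // Ordinary drive-letter path.
    return prefix + fullPath;
}
```

The returned wide string would then be encoded as UTF-8 and passed to Hunspell.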
It looks like C:\Windows\Temp is still a valid path you can write to yourself.

How do I properly unzip a zip with Chinese characters from Windows on OS X?

I zipped a file with a Chinese name, 周國賢 - 密封罩.flac, using Bandizip with the encoding set to UTF-8.
Then I tried to unzip it on my MacBook Pro, which is (probably) using Macintosh as its encoding. The unzipped file is called ©P∞ÍΩ - ±K´ ∏n.flac, which does not match the Chinese name above.
So I tested the encoding and found that converting Macintosh->Big5 turns the mysterious Macintosh symbols back into Chinese characters, but with some mismatches: 周衰�璀� - 密封罩.flac.
I tried another file, §˝µ· - ¨ı®ß.ape, and it actually produced the correct file name: 王菲 - 紅豆.ape
So, here is my question: how do I unzip a file with Big5 Chinese characters in its name properly and without any information loss? Or how do I zip a file correctly to prevent information loss or incorrect characters? (edit #2: you can use Bandizip to zip the file with UTF-8 encoding)
BTW, the encoding converter I am using is https://r12a.github.io/apps/encodings/, which can be quite helpful for checking encodings. Don't forget to click "change encodings shown". I am not the owner of the encoding converter.
edit #1: I have found that my setting in Bandizip was wrong; sorry for the inconvenience caused. Nonetheless, I found that The Unarchiver from the Mac App Store can unzip Big5 names correctly. This can be a workaround, but I still don't know how to unzip Big5 characters properly WITHOUT any loss.

How to automate search and replace for i18n.properties files with WebStorm

For SAPUI5 there are i18n.properties files.
For the German language I need to replace the special German characters with their Unicode escape codes.
# AE = \u00C4, ae = \u00E4
# OE = \u00D6, oe = \u00F6
# UE = \u00DC, ue = \u00FC
# SZ = \u00DF
How can I automate this search and replace with WebStorm?
You could just use WebStorm's 'Replace in Path' (CMD+SHIFT+R on Mac) on your i18n folder. IntelliJ IDEA has better editing support for .properties files though (since they come from Java).
It would also be easy to do this via a Node script, a bash script, or a gulp task.
Btw: Is this really needed? Having all .properties files in UTF-8 should just do the trick. Afaik only Tomcat got confused by that since in the Java spec these files are ISO-8859-1 by definition. As long as you are deploying to a platform that accepts them as UTF-8 there shouldn't be an issue.
BR
Chris
PS: That code looks really familiar ;D
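For the scripted route, the replacement itself is small. The following is a rough sketch in C++ rather than Node or bash, assuming the input is valid UTF-8; escapeNonAscii is a hypothetical name. It turns every non-ASCII character into a \uXXXX escape, which covers the German characters listed in the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: replaces each non-ASCII character in a UTF-8 string
// with a \uXXXX escape. Assumes the input is well-formed UTF-8.
std::string escapeNonAscii(const std::string& utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp;
        std::size_t len;
        // Determine the sequence length from the lead byte.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else                            { cp = lead & 0x07; len = 4; }
        // Fold in the continuation bytes (6 payload bits each).
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
            continue;
        }
        char buf[16];
        if (cp > 0xFFFF) {
            // Characters outside the BMP become a UTF-16 surrogate pair.
            cp -= 0x10000;
            std::snprintf(buf, sizeof buf, "\\u%04X\\u%04X",
                          static_cast<unsigned>(0xD800 + (cp >> 10)),
                          static_cast<unsigned>(0xDC00 + (cp & 0x3FF)));
        } else {
            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
        }
        out += buf;
    }
    return out;
}
```

Applied line by line to a UTF-8 .properties file, this would produce the ISO-8859-1-safe form described in the question.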

Accessing files with invalid names in Windows

I'm running Windows 7 in VirtualBox on an OS X 10.8 host. The host has a shared folder with a file named >>>FILE<<< inside. Apparently, OS X itself has no problem with such file names. Unfortunately, I can't seem to open this file in Windows 7 because of the <s and >s in the name.
In C, this call fails:
CreateFileW(
L"\\\\VBOXSVR\\ft1\\>>>FILE<<<",
GENERIC_READ,
FILE_SHARE_READ,
NULL,
OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL,
NULL
);
GetLastError returns ERROR_INVALID_NAME (123). If I change the file name into FILE, I get a valid handle and everything is fine.
Is there a known way in Windows to access files with invalid characters in their names? Assume a production environment with no direct write access to the host's file system.
@jcopenha's answer was on the right track. However, if you read the page that @jcopenha linked to, it states that the \\?\ prefix is for local paths only. You have to use the \\?\UNC\ prefix instead for UNC paths, e.g.:
L"\\\\?\\UNC\\VBOXSVR\\ft1\\>>>FILE<<<"
The < and > characters are not allowed to be used in a Windows file name. And so that file cannot be opened under Win32.
The naming conventions documentation lists the following reserved characters:
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
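A check against the list above is straightforward to sketch. hasReservedChar is a hypothetical helper name; it only tests a single file name component for the nine characters listed and does not cover other Windows naming rules such as reserved device names.

```cpp
#include <string>

// Hypothetical helper: true if a file name component contains any of the
// reserved characters from the list above.
bool hasReservedChar(const std::wstring& name) {
    static const std::wstring reserved = L"<>:\"/\\|?*";
    return name.find_first_of(reserved) != std::wstring::npos;
}
```

For the file from the question, hasReservedChar(L">>>FILE<<<") reports true, while hasReservedChar(L"FILE") reports false.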
Windows differs significantly in this area from *nix systems. On *nix there are typically no such OS-enforced limitations on the characters that can be used in a file name, as a friend of mine once discovered when he tried to delete a file named * and suffered the most unfortunate consequences.
Now, it is conceivable that these limitations do not apply when using the native API. You could try and open the file with NtCreateFile. That may just work!

How to get the parent folder of a Windows user's profile path using C++

I am trying to get the parent folder of a Windows user's profile path, but I couldn't find any "parameter" for SHGetSpecialFolderPath that returns it. So far I am using CSIDL_PROFILE.
Expected Path:
Win7 - "C:\Users"
Windows XP - "C:\Documents and Settings"
For most purposes other than displaying the path to a user, it should work to append "\\.." (or "..\\" if it ends with a backslash) to the path in question.
With the shell library version 6.0 you have CSIDL_PROFILES (not to be confused with CSIDL_PROFILE), which gives you what you want. This value was later removed (see here), so you have to use your own workaround.
On any prior version you'll have to implement your own workaround, such as looking for the possible path separator(s), i.e. \ and / on Windows, and terminate the string at the last one. A simple version of this could use strrchr (or wcsrchr) to locate the backslash and then, assuming the string is writable, terminate the string at that location.
Example:
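char path[MAX_PATH];
// Retrieve the path into the buffer at this point, e.g. "C:\\Users\\username"
char* lastSlash = strrchr(path, '\\'); // strrchr is declared in <string.h>
if(!lastSlash)
    lastSlash = strrchr(path, '/'); // fall back to forward slashes
if(lastSlash)
    *lastSlash = 0; // cut the string at the last separator, leaving the parent folder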
Or of course GetProfilesDirectory (that eluded me) which you pointed out in a comment to this answer.
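The truncation approach can also be sketched with std::wstring, which matches the wide-character strings the W variants of the Win32 functions return. parentDirectory is a hypothetical name. Unlike a backslash-first search, find_last_of looks for whichever separator occurs last, so mixed separators are handled too.

```cpp
#include <string>

// Hypothetical helper: returns everything before the last path separator,
// or the input unchanged if it contains no separator.
std::wstring parentDirectory(const std::wstring& path) {
    std::wstring::size_type pos = path.find_last_of(L"\\/");
    if (pos == std::wstring::npos) {
        return path;
    }
    return path.substr(0, pos);
}
```

A path with a trailing separator would need that separator stripped first; this sketch does not do so.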
