I understand that the sqlite3 command (as well as the library version) can be built with a statically linked ICU library to fully support Unicode operations. It should also be possible to dynamically load an ICU extension.
But neither of these seems to be the case with OS X's sqlite3 command (as of 10.10.5).
Here's the test I'm using to determine ICU presence:
SELECT upper('ä');
This should result in "Ä" if ICU is used by the engine. "ä" indicates missing ICU support. Is that a valid test?
Apart from compiling a new command (and probably a new dylib, too), is there another way to enable Unicode support so that upper, lower and comparisons know about Unicode letters and not just ASCII?
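For an application that embeds SQLite, rather than the stock sqlite3 command, one possible workaround is to register Unicode-aware functions from the host language. The following is only a sketch using Python's sqlite3 module. The names unicode_upper and unicode_nocase are made up for this example, and Python's str.upper and str.casefold stand in for ICU, so the results are not locale-sensitive the way ICU's would be.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The test from the question: without ICU, the built-in upper()
# only folds ASCII letters and returns 'ä' unchanged.
builtin = conn.execute("SELECT upper('ä')").fetchone()[0]
has_icu = builtin == "Ä"

def unicode_upper(value):
    """Upper-case using Python's Unicode tables (hypothetical helper)."""
    return None if value is None else str(value).upper()

def unicode_nocase(a, b):
    """Case-insensitive comparison; returns -1, 0 or 1 as SQLite expects."""
    a, b = a.casefold(), b.casefold()
    return (a > b) - (a < b)

# Register the helpers under example names chosen for this sketch.
conn.create_function("unicode_upper", 1, unicode_upper)
conn.create_collation("unicode_nocase", unicode_nocase)

custom = conn.execute("SELECT unicode_upper('ä')").fetchone()[0]
equal = conn.execute("SELECT 'Ä' = 'ä' COLLATE unicode_nocase").fetchone()[0]
print(has_icu, custom, equal)
```

This does not change the behavior of the sqlite3 command-line tool; it only helps code that opens the database through its own connection.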
I am developing an application with the Corona SDK. I want to send UTF-8 encoded text to the server and also decode the responses (also in UTF-8).
Is anybody familiar with UTF-8 encoding and decoding functions that I can use?
The Lua 5.1 standard library, which is what Corona uses, doesn't have any functions dealing with UTF-8 characters; a built-in utf8 library only arrived in Lua 5.3. You'd have to use an external library (like ICU), or an existing Lua binding for one.
The Windows _setmbcp function allows any valid code page...
(except UTF-7 and UTF-8, which are not supported)
OK, not supporting UTF-7 makes sense: Characters have non-unique representations and that introduces complexity and security risks.
But why not UTF-8?
As I understand it, the "ANSI" versions of the Windows API functions convert their arguments to UTF-16, call the equivalent "W" function, and convert any strings in the output to "ANSI". This is what I've been doing manually. So why can't Windows do it for me?
The "ANSI" codepage is basically legacy: Windows 9X era. All modern software should be Unicode (that is, UTF-16) based anyway.
Basically, when the ANSI code page stuff was originally designed, UTF-8 wasn't even invented, and so support for multi-byte encodings was rather haphazard (i.e. most ANSI code pages are single byte, with the exception of some East Asian code pages, which are one-or-two byte). Adding support for "proper" multi-byte encodings was probably deemed not worth the effort when all new development should be done in UTF-16 anyway.
_setmbcp() is a VC++ RTL function, not a Win32 API function. It only affects how the RTL interprets strings. It has no effect whatsoever on the Win32 API "A" functions. When they call their "W" counterparts internally, the "A" functions always use MultiByteToWideChar() and WideCharToMultiByte(), specifying codepage 0 (CP_ACP) to use the system default ANSI codepage for the conversions.
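The decode/call/encode pattern described above can be imitated in a few lines. This is only a model of the idea, not the Windows implementation: call_ansi_wrapper is a made-up name, cp1252 is assumed as the system ANSI code page, and Python's codecs stand in for MultiByteToWideChar() and WideCharToMultiByte().

```python
def call_ansi_wrapper(arg_bytes, wide_impl, ansi_codepage="cp1252"):
    """Model of an "A" function: convert in, call the "W" version, convert out."""
    # Roughly what MultiByteToWideChar(CP_ACP, ...) does for the argument.
    wide_arg = arg_bytes.decode(ansi_codepage)
    wide_result = wide_impl(wide_arg)
    # Roughly what WideCharToMultiByte(CP_ACP, ...) does for the result;
    # characters missing from the code page degrade to '?'.
    return wide_result.encode(ansi_codepage, errors="replace")

# "café" in cp1252, passed through a stand-in "W" function that upper-cases.
upper_result = call_ansi_wrapper(b"caf\xe9", str.upper)

# A result containing a character outside cp1252 is silently damaged.
lossy_result = call_ansi_wrapper(b"caf\xe9", lambda s: s + "Ω")

print(upper_result, lossy_result)
```

The second call shows why the round trip is lossy whenever the "W" side produces text the ANSI code page cannot represent.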
Michael Kaplan, an internationalization expert from Microsoft, tried to answer this on his blog.
Basically his explanation is that even though the "ANSI" versions of Windows API functions are meant to handle different code pages, historically there was an implicit expectation that character encodings would require at most two bytes per code point. UTF-8 doesn't meet that expectation, and changing all of those functions now would require a massive amount of testing.
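To make the "at most two bytes" expectation concrete, here is a small sketch that measures how many bytes a few characters need in UTF-8 compared with a classic double-byte code page. The sample characters and the encoded_length helper are just choices for this illustration; cp932 (Japanese) is used as the example DBCS code page.

```python
def encoded_length(ch, encoding):
    """Bytes needed for one character, or None if the code page lacks it."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

samples = ["A", "ä", "€", "日", "😀"]

utf8_lengths = {ch: encoded_length(ch, "utf-8") for ch in samples}
cp932_lengths = {ch: encoded_length(ch, "cp932") for ch in samples}

# The longest sequences seen in each encoding for these samples.
max_utf8 = max(utf8_lengths.values())
max_cp932 = max(n for n in cp932_lengths.values() if n is not None)

print(utf8_lengths)
print(max_utf8, max_cp932)
```

Code that sizes buffers or steps through strings assuming a lead byte plus at most one trail byte holds for cp932, but not for the three- and four-byte sequences UTF-8 produces.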
The reason is exactly what was said in jamesdlin's answer and the comments below it: MBCS is the same as DBCS in Windows, and some functions don't work with characters that are longer than 2 bytes.
Microsoft said that a UTF-8 locale might break some functions, as they were written to assume that multibyte encodings use no more than 2 bytes per character; thus code pages that need more bytes, such as UTF-8 (and also GB 18030, cp54936), could not be set as the locale.
https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8
So UTF-8 was allowed in functions like read/write, but not when used as a locale.
However, Microsoft has finally fixed that, so UTF-8 can now be used as a locale. In fact, Microsoft has even started recommending the ANSI (-A) APIs again instead of the Unicode (-W) versions. MSVC has new options, /execution-charset:utf-8 and /utf-8, to set the charset, and you can also set the ActiveCodePage property in the appxmanifest of a UWP app.
Before those options were introduced, Windows 10 Insider build 17035 had already added a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox for setting the locale code page to UTF-8.
To open that dialog box, open the Start menu, type "region", and select Region settings > Additional date, time & regional settings > Change date, time, or number formats > Administrative.
After enabling it you can call setlocale() to change to UTF-8 locale:
Starting in Windows 10 build 17134 (April 2018 Update), the Universal C Runtime supports using a UTF-8 code page. This means that char strings passed to C runtime functions will expect strings in the UTF-8 encoding. To enable UTF-8 mode, use "UTF-8" as the code page when using setlocale. For example, setlocale(LC_ALL, ".utf8") will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
UTF-8 Support
You can also use this in older Windows versions:
To use this feature on an OS prior to Windows 10, such as Windows 7, you must use app-local deployment or link statically using version 17134 of the Windows SDK or later. For Windows 10 operating systems prior to 17134, only static linking is supported.
See also
Is it possible to set “locale” of a Windows application to UTF-8?
Under Windows the only way to get Unicode support is to use wchar_t (UTF-16 under Windows) instead of char.
The problem is that I found that at least one of the boost libraries (boost::program_options) doesn't support Unicode at all: you are not able to compile the examples as Unicode.
Shouldn't Boost be able to be compiled with wide strings? I would even expect this to be the default behavior under Windows.
I'm using Eclipse in Ubuntu to edit PHP files.
But, unfortunately, some of these PHP files were created in Notepad++ in Windows XP, with ANSI encoding defined.
Also, these files generate HTML with charset=ISO-8859-1.
When I configured Eclipse to use ISO-8859-1, many special characters were lost and changed to '???', and when I try to save a file with ISO encoding, Eclipse displays an error saying that it was not possible to save the file because some characters aren't compatible with the charset.
How can I save these files without changing the encoding, or how can I change the encoding without losing characters?
To the point, you need to read those files using ANSI encoding and then write them using ISO-8859-1 encoding. In Notepad++ you can change the encoding via the Format menu. Unfortunately there's no ISO-8859-1 option there, but UTF-8 should suffice and is nowadays also the preferred choice for world domination, since the ISO-8859-1 encoding only covers Latin characters and not, for example, Cyrillic, Greek, Chinese, Arabic, etcetera.
By "ANSI" do you mean "Windows code page 1252"?
In either case, once you figure out the source encoding you can use iconv to convert from that encoding to UTF-8.
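As an illustration of that conversion, and of why saving as ISO-8859-1 can fail, here is a minimal sketch. It assumes the source really is Windows code page 1252, and the reencode helper and sample text are invented for the example.

```python
def reencode(data, source="cp1252", target="utf-8"):
    """Decode bytes from the source encoding and re-encode them as the target."""
    return data.decode(source).encode(target)

# Sample bytes as an editor would have saved them in "ANSI" (cp1252 assumed).
original = "preço: 5 €".encode("cp1252")
converted = reencode(original)

# cp1252 has characters such as '€' (byte 0x80) that ISO-8859-1 lacks,
# which is one way to end up with a "cannot save in this charset" error.
try:
    "€".encode("iso-8859-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

print(original, converted, euro_fits_latin1)
```

With iconv, the same conversion would typically be written as iconv -f WINDOWS-1252 -t UTF-8 followed by the input file, with the output redirected to a new file.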
The latest version can convert between ISO-8859-1 and UTF-8 without losing information.
Version 5.6.8 is able to do so.