Is there an i18n shell on Windows that supports a large character set? Testing my application on Windows results in some math characters not being rendered correctly. The Lucida Console font in cmd.exe and PowerShell does not have a wide enough glyph selection.
UTF-8 would be most preferable, followed by the other Unicode encodings.
I'm not sure if this is a problem with the font or the console itself, but you could try installing the DejaVu Sans Mono font and see if that provides the necessary characters.
CMD.EXE supports it just fine; the issue is that it doesn't allow a whole lot of other fonts by default, and Lucida Console, usually the only TrueType font there, has no fonts defined in its font fallback chain. See http://www.siao2.com/2008/03/19/8323216.aspx and the screenshots I link to in the comments for that blog post.
You may want to see http://www.siao2.com/2006/10/19/842895.aspx on how to make more fonts appear amongst those you can choose as the main console font.
Also, make sure that your application really uses a Unicode codepage for its output - http://illegalargumentexception.blogspot.com/2009/04/i18n-unicode-at-windows-command-prompt.html probably explains the issue better than I could (or, at the very least, as well as I could).
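As a minimal sketch in C (assuming a console font that actually has the glyphs), switching the console output code page to UTF-8 before writing might look like this:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Switch console output to UTF-8 (code page 65001). */
        SetConsoleOutputCP(CP_UTF8);

        /* These bytes are the UTF-8 encoding of U+2211 (N-ary summation). */
        printf("\xE2\x88\x91\n");
        return 0;
    }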
I just found that ActiveState Tcl does a really good job with tkcon.
When starting tkcon.tcl, I just have to type:
encoding system utf-8
It works well and even has tab completion. Of course, it is a Tcl shell and not a system shell.
It seems to be able to find characters for all of the symbols I am currently using in the test suite for my application.
While working under Windows, I use the DejaVu Sans Mono font along with the Console terminal emulator to get better Unicode (UTF-8) support.
Related
I am having precisely the problem reported by this guy. I have installed the latest version of Emacs on my Windows computer, and I find that pasting Greek text into an Emacs buffer works fine as long as I change the font from the default, but that typing in Greek does not work. I did some web searching about the problem, and it seems there may be some old-fashioned workarounds, but I don't understand why entering Greek using the standard Windows polytonic Greek keyboard doesn't just work, as it does in all (most?) other Windows programs.
By the way, another issue I have noticed is that there seems to be quite a restricted number of fonts that have polytonic Greek glyphs. (And I haven't found any fixed-width ones at all!) Is there any way to make Emacs always display the correct characters, even if it has to borrow the glyphs from another font? Surely the ugliness of something being in the wrong font is better than the brokenness of it not showing up at all.
For the keyboard problem, you might want to M-x report-emacs-bug since it sounds like something we should fix (but I can't fix it myself). For the fonts, this part has seen regular improvements with every new Emacs release, so if you're running Emacs-23, you might want to try again with Emacs-24, and if it's still not showing you those glyphs, please M-x report-emacs-bug as well.
Can anyone tell me whether Windows Server 2003 comes with a Unicode font that can be used in Crystal Reports?
"Unicode font" is an imprecise term for a font with wide coverage of the Unicode character set. Microsoft has two such fonts (that I'm aware of): Arial Unicode MS and Lucida Sans Unicode. Neither one comes with older versions of Windows.
So the answer to your question is no.
Arial Unicode MS is included in most versions of Office, so it's not uncommon to find that one on a machine with an older OS, but you cannot rely on it being there. It also has some deficiencies with respect to kerning and certain combining marks, even compared to the regular Arial font (that doesn't have the broad script support).
Your best bet is to rely on the OS to do font linking and font fallback for you. If that's not an option, you'll have to implement your own, but it's not easy.
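For reference, Windows drives font linking from a registry key. A hypothetical sketch (the base font and the fallback entries here are illustrative assumptions; each entry in the REG_MULTI_SZ value is a "font file,face name" pair):

    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink
        "Tahoma" (REG_MULTI_SZ):
            MSGOTHIC.TTC,MS UI Gothic
            MINGLIU.TTC,PMingLiU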
If you can't find a suitable font installed by default in Windows, perhaps you can use one that can be installed freely. There are many such fonts, and guides to help you find them. One guide: http://www.unifont.org/fontguide/
As far as I know, Unix-like systems use UTF-8 for encoding filenames, while Windows systems use their own single-byte encodings.
I work with archives containing Japanese filenames quite often. When I open such an archive created on Windows, the Japanese characters come out garbled, because the filename encoding is incorrect.
The same thing happens when I create an archive on my Linux machine and then someone opens it under Windows.
So I thought this should be quite a common problem, and, because the filenames are recoverable, there must already exist a corrective .sh script for Linux and a .bat script for Windows.
But after googling for quite a long time I still have not found anything.
Do such scripts exist at all? If not, what difficulties may have stopped people from creating them?
Update
I would be happy with a solution that works for most Linux systems and most Windows systems.
Windows uses the two-byte encoding UTF-16 for filenames. Your problem is most likely that you are using single-byte ANSI versions of whatever archive tool you are using.
Until you give more details of the code and tools you are using, it's hard to give specific advice. However, there are no limitations on using the full range of Unicode characters in modern Windows file systems.
Thank you for your input. The case indeed looks too complex for a simple bash script; I'll need to use a programming language.
I don't see anything like a "close question" button, so I'll use this answer to do so.
Take a look at the convmv tool, available for Unix-like systems.
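For example, assuming the extracted filenames are in Shift-JIS (cp932, the Windows variant; the source encoding is an assumption you'd need to verify), a recursive conversion to UTF-8 looks like:

    convmv -f cp932 -t utf-8 -r /path/to/extracted/files

By default convmv only prints what it would rename; add --notest to actually perform the renames.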
The Windows _setmbcp function allows any valid code page...
(except UTF-7 and UTF-8, which are not supported)
OK, not supporting UTF-7 makes sense: Characters have non-unique representations and that introduces complexity and security risks.
But why not UTF-8?
As I understand it, the "ANSI" versions of the Windows API functions convert their arguments to UTF-16, call the equivalent "W" function, and convert any strings in the output to "ANSI". This is what I've been doing manually. So why can't Windows do it for me?
The "ANSI" codepage is basically legacy: Windows 9X era. All modern software should be Unicode (that is, UTF-16) based anyway.
Basically, when the Ansi code page stuff was originally designed, UTF-8 wasn't even invented and so support for multi-byte encodings was rather haphazard (i.e. most Ansi code pages are single byte, with the exception of some East Asian code pages which are one-or-two byte). Adding support for "proper" multi-byte encodings was probably deemed not worth the effort when all new development should be done in UTF-16 anyway.
_setmbcp() is a VC++ RTL function, not a Win32 API function. It only affects how the RTL interprets strings. It has no effect whatsoever on Win32 API A functions. When they call their W counterparts internally, the A functions always use MultiByteToWideChar() and WideCharToMultiByte() specifying codepage 0 (CP_ACP) to use the system default Ansi codepage for the conversions.
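For illustration, the manual round-trip the question describes (decode UTF-8 to UTF-16 yourself, then call the W function directly) might look like this minimal sketch; error handling is omitted, and SetConsoleTitleW merely stands in for any W API:

    #include <windows.h>
    #include <stdlib.h>

    /* Convert a UTF-8 string to UTF-16 and hand it to a W function,
       bypassing the A layer's CP_ACP conversion entirely. */
    void set_title_utf8(const char *utf8)
    {
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
        wchar_t *wide = malloc(len * sizeof(wchar_t));
        if (!wide) return;
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len);
        SetConsoleTitleW(wide);
        free(wide);
    }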
Michael Kaplan, an internationalization expert from Microsoft, tried to answer this on his blog.
Basically his explanation is that even though the "ANSI" versions of Windows API functions are meant to handle different code pages, historically there was an implicit expectation that character encodings would require at most two bytes per code point. UTF-8 doesn't meet that expectation, and changing all of those functions now would require a massive amount of testing.
The reason is exactly what was said in jamesdlin's answer and the comments below it: MBCS is the same as DBCS in Windows, and some functions don't work with characters that are longer than 2 bytes.
Microsoft said that a UTF-8 locale might break some functions, as they were written to assume multibyte encodings use no more than 2 bytes per character; thus, code pages with more bytes per character, such as UTF-8 (and also GB 18030, cp54936), could not be set as the locale.
https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8
So UTF-8 was allowed in functions like read/write, but not for use as a locale.
However, Microsoft has finally fixed that, so now we can use UTF-8 as a locale. In fact, MS even started recommending the ANSI APIs (-A) again, instead of the Unicode (-W) versions as before. There are some new options in MSVC: /execution-charset:utf-8 and /utf-8 to set the charset, or you can also set the ActiveCodePage property in the appxmanifest of a UWP app.
Before those options were introduced, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox had already been added in Windows 10 insider build 17035 for setting the locale code page to UTF-8.
To open that dialog box, open the Start menu, type "region" and select Region settings > Additional date, time & regional settings > Change date, time, or number formats > Administrative.
After enabling it, you can call setlocale() to switch to a UTF-8 locale (see the sketch after the quoted documentation):
Starting in Windows 10 build 17134 (April 2018 Update), the Universal C Runtime supports using a UTF-8 code page. This means that char strings passed to C runtime functions will expect strings in the UTF-8 encoding. To enable UTF-8 mode, use "UTF-8" as the code page when using setlocale. For example, setlocale(LC_ALL, ".utf8") will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
UTF-8 Support
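A minimal sketch in C (assuming a UCRT new enough to support the UTF-8 code page, and a source file compiled with /utf-8 so the narrow string literal below really is UTF-8 bytes):

    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        /* Default locale, with UTF-8 as the code page. */
        setlocale(LC_ALL, ".UTF8");

        /* The CRT now interprets this char* path as UTF-8. */
        FILE *f = fopen("日本語.txt", "w");
        if (f) {
            fputs("hello\n", f);
            fclose(f);
        }
        return 0;
    }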
You can also use this on older Windows versions:
To use this feature on an OS prior to Windows 10, such as Windows 7, you must use app-local deployment or link statically using version 17134 of the Windows SDK or later. For Windows 10 operating systems prior to 17134, only static linking is supported.
See also
Is it possible to set “locale” of a Windows application to UTF-8?
I would like my whole toolkit to use UTF-8, but I find that some tools on Windows seem to use CP1252 (which appears to be Windows-specific). Does this create output which is incompatible, and if so, at which code points? If so, can I do anything about it?
(I don't completely understand the issues so I'd be grateful for basic education on these encodings).
A tool hard-coding code page 1252 on Windows is very unlikely. Much more likely is that it happens to be the default code page on your machine. 1252 is used in Western Europe and the Americas. It is configured in Control Panel, Regional and Language options. They've used different names for it; on Windows 7 it is in the Administrative tab, under Change system locale.
Yes, many tools use the default code page unless they have a good reason to choose another encoding. The BOM is such a good reason. Notable examples are Notepad (unless you change the Encoding in the File + Open dialog to something other than Ansi) and C/C++ compilers. There typically isn't anything special you need to do to use the default code page. Guessing the correct code page for a text file when you don't have a BOM is impossible to do accurately. Google "bush hid the facts" for a very amusing war story.
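If you want to check programmatically which ANSI code page your machine defaults to, a tiny sketch using the Win32 GetACP() function:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* GetACP() returns the current ANSI code page,
           e.g. 1252 on US and Western European installs. */
        printf("Active ANSI code page: %u\n", GetACP());
        return 0;
    }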
Six years old and still relevant: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Now, about your question: Yes, there are still tools out there that choke on UTF-8 files. But more and more tools are "getting it". If you're developing your own stuff, you might want to look into Python 3 where all strings are Unicode. The philosophy is to convert all your inputs into Unicode (if necessary) as early as possible, and reconvert them to a target encoding as late as possible. There are toolkits out there that will do a good job of guessing the encoding of a particular file (for example, Mark Pilgrim's chardet, a port of Mozilla's encoding detector). This is nice if you're working with files that don't specify an encoding.
CP1252 and UTF-8 are the same for all code points below 128; they differ above that. For example, é is the single byte 0xE9 in CP1252 but the two-byte sequence 0xC3 0xA9 in UTF-8. So if you stick to English and stay away from diacritical marks, the output will be identical.
Most of the Windows tools will use whatever is set as the current user's current codepage, which will default to 1252 for US Windows. You can change that to another codepage pretty easily. But UTF-8 is NOT one of the available codepage options for Windows. (I wish it was).
Some utilities under Windows will understand the UTF-8 byte-order mark at the start of a file. Unfortunately I don't know how to determine if this will work except to try it.
UTF-8 is supported on Windows, but not as the current codepage. You can use UTF-8 for converting to/from it, but you cannot set it as the current codepage.
First, do not waste time trying to set the codepage; this approach will remind you of the myth of Sisyphus. You can't really solve the problem using codepages; you have to use Unicode.
The only real solution for you is to build your application as Unicode, so it uses UTF-16 internally, and to convert to/from UTF-8 on input/output operations. This is quite simple to do because fopen supports reading and writing UTF-8 (see the sketch below).
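A minimal sketch of the Microsoft CRT's ccs mode flag (a Microsoft extension to fopen; once a stream is opened in Unicode mode you must use the wide-character I/O functions):

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* "ccs=UTF-8" opens the stream in Unicode mode: the file on
           disk is UTF-8, but in memory you work with wchar_t (UTF-16). */
        FILE *f = fopen("out.txt", "wt, ccs=UTF-8");
        if (f) {
            fwprintf(f, L"math: \u2211\n");  /* U+2211, N-ary summation */
            fclose(f);
        }
        return 0;
    }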
Regarding the use of other Windows tools with UTF-8 files, you should not worry: if a tool is able to work with ASCII, it will work with UTF-8 (even if it cannot interpret the individual Unicode characters, it will at least be able to load/parse the files).
BTW, you forgot to specify what programming language you are using and which Windows tools you are considering.
Also, if you are interested in more internationalization material, please visit my blog, blog.i18n.ro