In Win7, Unicode/ UTF-8 text file: gibberish on Windows console (Trying to display hebrew) - windows

I have a wide-character file (with Hebrew text) that looks fine in Notepad (saved in "UTF-8 encoding"), reads fine in Notepad++, and when I copy-and-paste into MS Word it looks fine too. But when I open a "DOS box" (Windows console) and go: "type file.txt", it prints gibberish. And yes, I've done all the recommendations for Unicode on Windows console: I opened the console using "cmd /u", I changed the font to Lucida, and I've entered: "chcp 65001".
The problem is identical on a PC running Windows 7, and on another PC running Windows XP SP3.

The Font Courier New supports hebrew and can be added to the command prompt. The default fonts are consolas, lucida, raster, none of them support hebrew. So add Courier New to the command prompt.
It's a registry hack to do that
http://www.howtogeek.com/howto/windows-vista/stupid-geek-tricks-enable-more-fonts-for-the-windows-command-prompt/
http://www.techrepublic.com/blog/windows-and-office/quick-tip-add-fonts-to-the-command-prompt/
This is a good example of how to install fonts, but I should remove a lot of these entries, because most of them didn't get added to cmd because cmd didn't support them.
Lucida and Consolas are defaults.
Raster is a default not listed here, maybe 'cos it's not a TTF
Of all these I tried to add, only 3 added(are supported by cmd)
Courier New, DejaVu Sans Mono, Droid Sans Mono
DejaVu Sans Mono and Droid Sans Mono are downloadable, supported by cmd, might have some good unicode support/characters, but don't include Hebrew
I have
Consolas <-- default
Courier New <--- added
DejaVu Sans Mono <-- added
Droid Sans Mono <-- added
Lucida Console <-- default
Raster Fonts <-- default
Common hebrew fonts are Miriam and David, but they can't be added to the command prompt.
For the record, Babelmap can list all fonts on your system that support hebrew e.g. in babelmap- click fonts..font coverage, then enter 05D0(that's aleph). I think all these fonts exist on a default windows 7 installation
Aharoni, Arial, Courier New, David, FrankRuehl, Gisha, Levenim MT, Lucida Sans Unicode, Microsoft Sans Serif, Miriam, Miriam Fixed, Narkisim, Rod, Segoe WP, Tahoma, Times New Roman
But most or all of those fonts with hebrew aren't supported in the command prompt, except Courier New. In fact most fonts full stop aren't supported in the command prompt, not even "times new roman"(because "times new roman" is not mono-spaced / fixed width, and that's one of a number of criteria for it to be supported, other criteria seem to be more obscure).
So now you can have Courier New added and selected for use in the command prompt.
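For anyone who'd rather script the registry addition than click through regedit, here is a minimal sketch that just builds the text of a .reg file. The key path is the console TrueTypeFont key used by the registry hack above. The function name and the idea of starting at the value name "000" are my assumptions (on my understanding "0" and "00" are already taken by the defaults), so check what your own registry has before importing anything.

```python
# Sketch only: builds .reg text, it does not touch the registry itself.
KEY = (r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT"
       r"\CurrentVersion\Console\TrueTypeFont")


def console_font_reg(fonts, first_zeros=3):
    """Return .reg text that adds each font under a value name made of zeros.

    first_zeros=3 assumes "0" and "00" are already used by the default
    fonts (an assumption, verify in regedit first).
    """
    lines = ["Windows Registry Editor Version 5.00", "", "[" + KEY + "]"]
    for offset, font in enumerate(fonts):
        value_name = "0" * (first_zeros + offset)
        lines.append('"{}"="{}"'.format(value_name, font))
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(console_font_reg(["Courier New"]))
```

Saving the text to a file and importing it (with admin rights) is left out on purpose.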
And so you can paste unicode characters onto cmd provided the selected font supports it.
To copy/paste, click the Copy button in charmap
Now it's in the clipboard
To paste it into the command prompt, in win7 paste into command prompt isn't ctrl-v. You right click and choose paste. (or if in quickedit mode then just rightclick)
That's the main thing.
Additionally
Often in windows one might use notepad and character map.. but one should be aware of some limitations with them.
Character map shows the first 65536 unicode characters when the font you selected supports it, and character map shows you the UTF-16 code. That's ok, you can still paste from character map into a cmd.exe window, but you should know that commands run in cmd.exe and pipes don't support utf-16. So you can use character map, find a character e.g. aleph 05d0, but it's worth looking up the character on http://www.fileformat.info/info/unicode/char/05d0/index.htm and seeing that while the utf-16 code is 05d0, the utf-8 code is d790. The xxd command and file command is useful for seeing the real contents of a file and determining the file's type.
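The difference between the code that character map shows and the bytes that actually land in a UTF-8 file can be checked with a few lines of Python. This is only an illustration, and the helper name is made up for the example.

```python
ALEPH = "\u05d0"  # Hebrew letter aleph, shown as 05D0 in character map


def hex_bytes(text, encoding):
    """Return the bytes of text in the given encoding as a hex string."""
    return text.encode(encoding).hex()


if __name__ == "__main__":
    print(hex_bytes(ALEPH, "utf-16-be"))  # 05d0 (same digits as the code point)
    print(hex_bytes(ALEPH, "utf-16-le"))  # d005 (byte order swapped on disk)
    print(hex_bytes(ALEPH, "utf-8"))      # d790
```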
Notepad is a bit limited when it comes to unicode or any character in the unicode character set whose UTF16 code is > FF. And cmd is a bit limited in regard to some commands like 'type', and in regard to pipes and redirection.
If using cmd.exe you really need pipes to work 'cos pipes are important..
Pipes are limited to the encodings that can be specified by the CHCP Command.
(Note that if CHCP tells you you are on a particular codepage, e.g. 850, it's telling you the input encoding. If you run the command chcp 850 it will change both the input and output encodings. Usually they are the same. It's simpler when they are the same. But if you used some other program to change the encoding of cmd eg the c# compiler has a switch that changes it, then it's best to change it with chcp so you know both encodings are set ).
There is a CHCP 1200 (UTF-16LE) and 1201(UTF-16BE) , but neither are supported, if you try it it will say invalid codepage (tested in win7). CHCP doesn't support UTF-16(it doesn't support UTF16LE or UTF16BE). There is CHCP 65001 (That's UTF-8 without BOM). And there is CHCP 862 (the old fashioned way as in MSDOS days way, of encoding Hebrew, that I mentioned)
The type command supports UTF16LE as does notepad(What notepad calls Unicode, is UTF-16 LE), But pipes and redirection don't support that. The type command also supports any codepage specified/supported by CHCP. So type supports 862 or 65001.
So you could use notepad, save it as UTF8 (which is with BOM), then fiddle around to remove the BOM. (That's a bit overkill).. Or you could use notepad, save it as Unicode UTF 16LE.. But then you can't use pipes.. (that's bad).. Easiest thing to do is use a text editor like notepad2 or notepad++, that supports UTF8 without BOM.
Or if doing everything from cmd you could use 862 or 65001. Though many text editors might not give good support of 862. So you might prefer 65001.
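To see why the code page matters, here is a rough sketch of what happens to the same Hebrew word under 862 and 65001, and what you get when bytes written for one are read as the other. The helper names are invented for the example, and Python's "cp862" codec stands in for what the console does.

```python
SHALOM = "\u05e9\u05dc\u05d5\u05dd"  # the Hebrew word "shalom", four letters


def codec_for(codepage):
    """Map a CHCP number to a Python codec name (65001 is UTF-8)."""
    return "utf-8" if codepage == 65001 else "cp{}".format(codepage)


def encode_for(text, codepage):
    return text.encode(codec_for(codepage))


def decode_as(data, codepage):
    # errors="replace" so an invalid byte sequence never raises
    return data.decode(codec_for(codepage), errors="replace")


if __name__ == "__main__":
    print(len(encode_for(SHALOM, 862)))    # 4 bytes, one per letter
    print(len(encode_for(SHALOM, 65001)))  # 8 bytes, two per letter
    # UTF-8 bytes read under code page 862 come out as gibberish:
    print(decode_as(encode_for(SHALOM, 65001), 862) == SHALOM)  # False
```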
If you want to write any file in notepad and it has a character greater than what in UTF16 is referred to as \uFF, and you want to run commands in cmd.exe on that file, then some commands (e.g. the type command), will have problems if you don't take into account what is supported by what.
Notepad supports UTF-16BE, UTF-16LE and UTF-8 with BOM. None of those is good for cmd, where you want UTF-8 without BOM. And there's no need to fiddle around with xxd and sed or other commands to remove the BOM: if you have any file with a so-called unicode character (a character outside of the regular ascii range, one that character map shows as being > \uFF), then use Notepad2 or notepad++.
Type supports UTF16LE, and any codepage set by CHCP e.g. 65001 or 862.
Pipes and redirection go by whatever is set by CHCP.
Codepage 862 is old so Codepage 65001 is a good way to go.
xxd and file are useful for seeing how a file is encoded which can be helpful if you have issues. But not absolutely necessary.
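If xxd and file aren't at hand, a very rough stand-in is to look at the first bytes of the file for a BOM. This sketch only recognises the three BOMs discussed here; the function name is made up, and a file without a BOM simply gives None (it could be UTF-8 without BOM, 862, or anything else).

```python
import codecs


def sniff_bom(data):
    """Guess an encoding from a leading BOM, or return None if there is none."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # what Notepad calls UTF-8
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"   # what Notepad calls Unicode
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


if __name__ == "__main__":
    print(sniff_bom(b"\xef\xbb\xbf\xd7\x90"))  # utf-8-sig
    print(sniff_bom(b"\xff\xfe\xd0\x05"))      # utf-16-le
    print(sniff_bom(b"\xd7\x90"))              # None
```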
So if you want to write a file for use in CMD, and it has some unicode characters, there are commands like xxd and sed that could be used to remove a BOM, but the easiest way to make such a file is to use a text editor like notepad2 or notepad++, which supports UTF8 without BOM.
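If you are stuck with a file that Notepad already saved as UTF-8 with BOM, removing the BOM is only a few lines in a scripting language. A minimal sketch in Python (the function name is just for illustration):

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF8 + "\u05d0".encode("utf-8")
    print(strip_utf8_bom(with_bom).hex())  # d790
```

To use it on a real file, read the file in binary mode, pass the bytes through the function, and write the result back in binary mode.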
Getting hebrew displaying might be the most important thing to do first, as described above. And the next thing is being able to save files in a text editor that you can display with e.g. 'type'.
And if you ever want to copy from the command prompt, if not in quickedit mode, then right click then choose mark then select it then hit ENTER. And to paste right click and choose paste.
A further additional point is
Apparently there are bugs in chcp 65001 where some batch files won't run and maybe some C programs won't work either (see the question "How to use unicode characters in Windows command line?"). And I've even seen the C# compiler crash when cmd is in codepage 65001, though while one may blame the C# compiler, one could also blame 65001 (see "Why is csc.exe crashing when I last left the output encoding as UTF8?").
Note- an earlier revision of this answer had some command line examples but they were unnecessarily complex. I might at some point add some commands that demonstrate what I have been describing but it's fairly trivial.

/u is for UTF-16LE, not UTF-8. This is why saving the file as UTF-16LE (what Windows/Notepad misleadingly calls "Unicode") and running with /u works, in as much as it does.
UTF-8 should be achievable with chcp 65001, but there are some nasty low-level bugs in the Microsoft C Runtime for this code page, which makes some apps unreliable and some not run at all.
So yeah, I'm sorry, but UTF-8 is a second-class citizen under Windows. Anything that uses the 'ANSI' interfaces for IO, including anything that uses the C standard IO library, including the Command Prompt, won't be able to cope with it properly.
The only reliable way to get Unicode output in Command Prompt is to use the Windows-specific WriteConsoleW interface to push Unicode strings directly. Unfortunately as this is not available cross-platform, many tools won't use it.
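As a rough illustration of what "push Unicode strings directly" means, here is a sketch that calls WriteConsoleW through Python's ctypes. The wrapper function is hypothetical; on non-Windows systems it just falls back to normal stdout so the example still runs. On Windows the call fails when output is redirected to a file or pipe (the handle is then not a console), in which case the sketch returns None.

```python
import sys


def write_console(text):
    """Write text with WriteConsoleW on Windows; plain stdout elsewhere.

    Returns the number of characters written (UTF-16 code units on
    Windows), or None if the Windows call failed.
    """
    if sys.platform != "win32":
        sys.stdout.write(text)
        return len(text)

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.WriteConsoleW.argtypes = [
        wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
        ctypes.POINTER(wintypes.DWORD), wintypes.LPVOID,
    ]
    kernel32.WriteConsoleW.restype = wintypes.BOOL

    STD_OUTPUT_HANDLE = 0xFFFFFFF5  # (DWORD)-11
    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    units = len(text.encode("utf-16-le")) // 2  # length in UTF-16 code units
    written = wintypes.DWORD(0)
    ok = kernel32.WriteConsoleW(handle, text, units,
                                ctypes.byref(written), None)
    return written.value if ok else None


if __name__ == "__main__":
    write_console("\u05e9\u05dc\u05d5\u05dd\n")
```

The font caveat still applies: the right characters reach the console, but they only display if the selected font has glyphs for them.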
In any case, even when you've got the encoding right, you still have to have a font in the Command Prompt that contains the characters you want. I believe this is why you still aren't getting Hebrew in the /u+UTF-16LE route.
Summary: Command Prompt + non-ASCII == almost certain fail. Give up and find some other interface you can use that supports Unicode better.

You should convert file.txt to UTF-16(Little Endian) before type file.txt
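One way to do that conversion without an editor is a short script. This is a sketch, assuming the input is UTF-8 (with or without BOM); the function name is made up. It writes the FF FE byte-order mark first, which is what Notepad's "Unicode" files start with.

```python
import codecs


def utf8_to_utf16le(data):
    """Convert UTF-8 bytes (BOM optional) to UTF-16LE bytes with a BOM."""
    text = data.decode("utf-8-sig")  # utf-8-sig drops a leading BOM if present
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


if __name__ == "__main__":
    aleph_utf8 = b"\xd7\x90"
    print(utf8_to_utf16le(aleph_utf8).hex())  # fffed005
```

For a file, read it in binary mode, convert, and write the result in binary mode under a new name before running type on it.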
Reference: What encoding/code page is cmd.exe using?

I presume you mean "Lucida Console" when you say "Lucida".
Using the charmap application I couldn't find any Hebrew characters in the font. I don't know if the font was more capable in earlier versions of Windows, but in Windows 7 there appears to be nothing outside of the European characters.
My system also has Lucida Sans Typewriter which does include the Hebrew characters. Unfortunately the Cmd window doesn't show it as a choice. You need to edit the registry to open up more choices, as shown in this question on SuperUser: https://superuser.com/questions/5035/how-to-change-the-windows-console-font
P.S. I have been unable to verify this solution because Windows is being difficult. See https://superuser.com/questions/390933/how-to-add-a-font-to-the-cmd-window-choices-in-windows-7-64-bit

How to get an Hebrew enabled XP installation?
First of all, this is about an XP home SP3, Hebrew enabled. By that I mean it is a standard XP US installation, or so I believe, with the addition of Hebrew capabilities for keyboard and display. I believe every XP CD can install such a system. In particular, I believe the following is all that is needed for such a system:
Control panel -> Date, Time, Language and Regional Options -> Language and Regional Options -> in Language tab:
1) Click Details and add an Hebrew keyboard.
2) mark with a V the Install files for complex script and right-to-left languages (including Thai) option.
Control panel -> Date, Time, Language and Regional Options -> Language and Regional Options -> in Advanced tab:
Accept, mark with a V, 10004 (MAC - Arabic) and 10005 (Mac - Hebrew). Not sure if Arabic is a must have here.
Now to the cmd console
One has to explicitly add Courier New fonts to the console fonts registry, as described earlier. Otherwise, explicit Hebrew fonts will not be displayed.
Now when cmd console is opened, all there is to do in order to input Hebrew characters is to enable the Courier New fonts, and change the keyboard to an Hebrew mode. Having Windows scroll the languages it has for the keyboard is easy. Either repetitive pressing of left Alt combined with left shift keys, or with the mouse.
As an aside, a dir command will show file names that have Hebrew characters. However, one can't just issue a
dir file_name
and see the usual output if the file begins with a Hebrew letter. It must be
dir *file_name
I assume the asterisk character adds the BOM unicode character.
One can also open Notepad, input Hebrew characters, save the file as UTF8, and run the following in the console commands:
chcp 65001
type that_Notepad_file_I_saved
Saving the file as UTF8 is done on Notepad save screen.

Related

How to print chinese characters to the windows console correctly?

What I want to do is to print Chinese log messages to the windows console.
I searched Google and learned that it's related to the Chinese font.
But after setting a Chinese font, the Chinese characters were still printed incorrectly.
I hope someone who has dealt with this before can tell me how to solve it.
You can simply reproduce this situation.
Create new txt file and write to this "echo '你好!'",and save this file as 'test.bat'
Run this bat file in the console.
What I want to do is print this Chinese character correctly.
All you need to do is set the appropriate code page in the console before running the bat file containing Chinese text. Code page 936 is for simplified Chinese, so call chcp 936, and then run test.bat.
In my case (on Windows 10) I didn't need to worry about the font being used, since chcp 936 automatically changed the font to one that can render Chinese characters: NSimSun.
Depending on your requirements, it might be preferable to include the call to chcp 936 at the start of the bat file instead.
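One thing to watch with this approach is that the bytes in the bat file have to match the code page, so a file saved as UTF-8 will not print correctly under chcp 936. Here is a sketch that generates such a bat file with the chcp call at the top. The function name is made up, and I'm assuming Python's "cp936" codec matches what the console expects for code page 936.

```python
def make_bat(message, codepage=936):
    """Return the bytes of a small bat file that sets the code page,
    then echoes the message, encoded in that same code page."""
    script = "@echo off\r\nchcp {} >nul\r\necho {}\r\n".format(codepage, message)
    return script.encode("cp{}".format(codepage))


if __name__ == "__main__":
    data = make_bat("\u4f60\u597d!")  # the two characters from test.bat
    with open("test.bat", "wb") as handle:
        handle.write(data)
```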

Displaying unicode characters in Windows 10 cmd

I want to type and print Sinhala unicode characters in the Windows 10 CMD, but it just displays a question mark surrounded by a square for each Sinhala character I type.
Is there any mechanism to display exact unicode characters in windows console?
Try modifying the registry settings for the cmd console (run regedit). Unfortunately, I am uncertain exactly which value you should enter for the font family, since it is a number.
The screen shot below shows my registry settings for a font of 'Courier New', which somehow translates to 30 (hexadecimal, 48 in base 10) in the registry. Hopefully you can experiment some and determine what number corresponds to a Sinhala font you have installed on your machine.
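A possible explanation for the number, offered as an assumption rather than something I have confirmed for the console registry values: in the Windows GDI headers, a "pitch and family" byte keeps the font family in the high four bits and pitch flags in the low four bits. If the FontFamily value follows that layout, 0x30 is the "modern" (fixed-stroke) family and describes a class of font rather than one specific font. The sketch below decodes a value under that assumption, using the constant names from the GDI headers.

```python
# Assumption: FontFamily uses the same bit layout as GDI's pitch-and-family byte.
FAMILIES = {
    0x00: "FF_DONTCARE", 0x10: "FF_ROMAN", 0x20: "FF_SWISS",
    0x30: "FF_MODERN", 0x40: "FF_SCRIPT", 0x50: "FF_DECORATIVE",
}
PITCH_FLAGS = {
    0x01: "TMPF_FIXED_PITCH", 0x02: "TMPF_VECTOR",
    0x04: "TMPF_TRUETYPE", 0x08: "TMPF_DEVICE",
}


def decode_font_family(value):
    """Split a pitch-and-family byte into (family name, list of flag names)."""
    family = FAMILIES.get(value & 0xF0, "unknown")
    flags = [name for bit, name in sorted(PITCH_FLAGS.items()) if value & bit]
    return family, flags


if __name__ == "__main__":
    print(decode_font_family(0x30))  # ('FF_MODERN', [])
    print(decode_font_family(0x36))  # ('FF_MODERN', ['TMPF_VECTOR', 'TMPF_TRUETYPE'])
```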
Additionally, you can select fonts using the cmd window's property dialog, illustrated in the screen shot below. Possibly you already have a font installed that you can use:
You've probably already done 1-3 since you can already type Sinhala, but you need a supporting font. Try the following:
1) Go to Region & language settings.
2) Add a language and select Sinhala.
3) Click the language, select Options, and you can select a keyboard type.
For Chinese, I was able to add a language pack, which gave me console fonts that support Chinese. I don't see that option for Sinhala. You may have to manually install a monospace font that supports Sinhala. I couldn't find one, but if you do, this answer explains how to install it.

Why do Netbeans, Aptana Studio and Komodo Edit all not save in UTF-8?

I'm getting back into development and want to find a good editor for HTML5/JQuery.
Being able to save files in UTF-8 is important.
However, although I set my project in NetBeans 7.0 to encode in UTF-8, when I create a file in the project, then look at it in Notepad++, the file is encoded in ANSI and I have to manually set the encoding to UTF-8:
In Aptana Studio 3 I set the workspace to UTF-8 encoding, and my project inherits from that, but when I create a file in the project and look at it in Notepad++, it is encoded in ANSI and I have to change the encoding manually to UTF-8:
So I tried Komodo Edit 7 and in the file manually set the encoding to UTF-8, saved the file, looked at it in Notepad++ which said the file is in ANSI.
I notice in any of these editors if I put a German umlaut character in the file, then Notepad++ shows it as "ANSI as UTF-8" but I still have to manually change it to UTF-8 in Notepad++ where it will stay.
The reason I want an editor that saves in UTF-8 is I remember having a project a couple years ago which had German and French characters in the files and after they were viewed and saved in various editors, the characters would be replaced with garbage characters. The solution was to always initially set the encoding of the file to UTF-8.
I assumed that editors would be so far advanced now that if you specify that the files should be saved in UTF-8, that they actually save in UTF-8 in a way that is recognized by every modern text editor. Is this not the case? What am I not understanding about modern text editors and development environments in regard to UTF-8?
How can I get these editors to save their files in UTF-8 encoding?
A UTF-8 encoded file that only contains characters also present in the ASCII table (the first 128 Unicode characters, i.e. your basic alphanumeric characters) is indistinguishable from an ASCII/ANSI encoded file. My guess is that Notepad++ simply can't make the distinction (because there is none) and defaults to ANSI. You can see the difference when you include a character that is not in the ASCII table. By "ANSI as UTF-8" I can only guess that it means "this document contains characters from the ANSI table (a.k.a. Latin-1) and is saved in UTF-8".
In other words, your IDEs are probably fine, the problem is with Notepad++.
Try a character like 漢字, that will result in a pretty unique UTF-8 byte sequence that's most certainly not ANSI.
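The "no distinction" point is easy to check yourself. A small sketch, using Python's cp1252 codec as a stand-in for what Notepad++ calls ANSI on a Western European system (the helper name is made up):

```python
def same_bytes_in_ansi_and_utf8(text):
    """True if text encodes to identical bytes in cp1252 and UTF-8."""
    try:
        return text.encode("cp1252") == text.encode("utf-8")
    except UnicodeEncodeError:
        return False  # the text cannot be written in cp1252 at all


if __name__ == "__main__":
    print(same_bytes_in_ansi_and_utf8("<p>Hello</p>"))  # True: pure ASCII
    print(same_bytes_in_ansi_and_utf8("\u00fc"))        # False: u-umlaut differs
    print("\u00fc".encode("cp1252").hex())              # fc
    print("\u00fc".encode("utf-8").hex())               # c3bc
    print("\u6f22\u5b57".encode("utf-8").hex())         # e6bca2e5ad97
```

So an editor looking at an ASCII-only file has nothing to go on, while a single umlaut or CJK character gives it a byte sequence it can recognise.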
From what I've seen on this topic, Notepad's UTF-8 equates to Notepad++'s UTF-8, which means with BOM included. If a file is saved with this encoding and opened in NetBeans, it will actually show a - character or the ï»¿ characters for the BOM sequence (depending on whether the encoding for the project or IDE is set to UTF-8.) But if you save the file in Notepad++ encoded as "UTF-8 without BOM", and have either your project defined as UTF-8 or have your netbeans_default_options included with this -J-Dfile.encoding=UTF-8, you'll see what I think is UTF-8 as it should be. Unfortunately, if you try to edit this file in NetBeans without including characters that are outside of the ANSI code set, you see the behavior that you referred to in your question with the file having its encoding set to ANSI.
So in an attempt to make this a "sort-of" answer to your question, please remember that not all editors' concepts of UTF-8 are the same. Notepad++ gives the most actual info on what the real encoding for a file is. I'd say that developing in either a Linux or Mac environment might be a possible good choice for making sure that localization is correct, but on Windows a decent workaround might be to just include a non-ANSI character in the file to ensure it always gets saved as a UTF-8 (non-BOM) file. This is all geared towards NetBeans dev by the way. I haven't tested this with the others, though I'm willing to bet that they will save the file correctly on a Windows machine if they have non-ANSI characters in them. Sorry for the kluge, gang, but either way, I hope it helps someone struggling with this same issue.

TextPad and Unicode: full support?

I've got some UTF-8 files created in Mac, and when trying to open them using TextPad in Windows, I get the following warning:
WARNING: (file name) contains characters that do not exist in code
page 1252 (ANSI Latin 1). They will be converted to the system default
character, if you click OK.
Linux (GNOME gEdit) can open the same file without complaints. What does the above mean? I thought that TextPad had full UTF-8 support. Can I safely open and edit UTF-8 files using it without corrupting the file?
It seems that TextPad cannot handle characters outside windows-1252 (CP1252, here carrying the misnomer “ANSI Latin 1”). I tested it on Windows, opening a plain text file created on the same system, as UTF-8 encoded, both with and without BOM, with the same result. The program’s help does not seem to contain anything related to character encodings, and its tools for writing “international characters” are for Latin-1 characters only.
There are several text editors for Windows that can deal with UTF-8 (even Notepad can open a UTF-8 file, but it can hardly be recommended for serious editing). See Alan Wood’s collection of information on Unicode editors and word processors for Windows. (Personally, I like Notepad++ and BabelPad, which are both free.)
TextPad 8, the newest as of 2016-01-28, does finally properly support BMP Unicode. It's a paid upgrade, but so far has been working flawlessly for me.
TextPad ‘supports’ UTF-8 and UTF-16 documents only in as much as it will import and export them. But it still edits files as simple bytes, and not Unicode characters (using the ANSI code page, which is code page 1252 for Western European).
So unless the file happened to contain only characters that also exist in that code page, you will lose content. This rather defeats the point of Unicode.
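What "you will lose content" looks like in practice can be sketched with Python's cp1252 codec. This is only an analogy for the conversion TextPad warns about, not TextPad's actual code; errors="replace" is Python's own substitution with "?" and stands in for the "system default character".

```python
def survives_cp1252(text):
    """True if every character in text exists in code page 1252."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


def squeeze_into_cp1252(text):
    """Force text through cp1252, replacing anything that does not fit."""
    return text.encode("cp1252", errors="replace").decode("cp1252")


if __name__ == "__main__":
    print(survives_cp1252("caf\u00e9"))                # True
    print(survives_cp1252("\u05e9\u05dc\u05d5\u05dd"))  # False
    print(squeeze_into_cp1252("\u05e9\u05dc\u05d5\u05dd"))  # ????
```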
Indeed, this was the issue that made me flee—to EmEditor, at the time, though now I would agree with the previous comments and recommend Notepad++. The era of paying for text editors is long gone.
Actually TextPad does support displaying Unicode code points, granted they went about it the wrong way. In order to display the Unicode characters you have to choose Configure->Preferences and expand "Document Classes->Text->Font".
You need to choose a Unicode font AND set the Script to match. E.g. Arial Unicode MS with script CHINESE_BIG5.
However, this is a backward approach since the application should handle this when the user tells TextPad to open the file in Unicode or UTF-8. The built in Notepad application with MS Windows will detect the encoding automatically and display the glyphs correctly based upon the encoding.
I found a discussion on this in the Textpad forums:
http://forums.textpad.com/viewtopic.php?t=11019
While I have Notepad++, Textpad handles large files with ease while other editors I've tried, including Notepad++, either slow to a crawl or die. I'm currently trying to edit a 475MB file and Notepad++ is not up to the task.
Textpad Configure Menu --> Preferences --> Document Classes --> Default --> Default encoding --> UTF-8
Try the ANSI code set with File/Open, that should solve the problem in TextPad

Force Visual Studio (2010) to save all files in UTF-8

Is there any way I can force Visual Studio (2010) to save all files in UTF-8, always?
I do not know of a way to force it to save everything in UTF-8, but you can do so on a case-by-case basis. When you first save a document and the Save As... dialog appears, the Save button will actually be a drop-down button with two options. You want "Save with Encoding...", which will then present you the entire list of installed Windows encodings.
The encoding you really want is way down the bottom:
Unicode (UTF-8 without signature) - Codepage 65001
although if you want to save yourself a lot of pain, you will probably want to pick the option near the top:
Unicode (UTF-8 with signature) - Codepage 65001
The difference is that the latter option sticks the UTF-8 signature at the start of the file (the signature is just the UTF-16 byte-order mark encoded in UTF-8). This is one of my pet peeves, as UTF-8 doesn't have multiple byte orders, so the BOM is redundant at best, and breaks all kinds of text processing tools at worst. MS uses it to "detect" UTF-8 automatically, since for single-byte characters, UTF-8, ISO-8859-1, and CP-1252 are identical except for a sequence of 32 characters (0x80 - 0x9f) that MS basically made up.
If you only ever edit or process your files with Visual Studio or the .NET tools, then saving with signature will probably work fine. If you need to save files for use by other tools (batch files, SQL queries, PHP scripts, etc), the signature will cause problems, and you should save them without it. If you do this, you may want to enable the option (Under Tools -> Options -> Text Editor) to "Auto-detect UTF-8 encoding without signature", or else, right-click on the file and choose "Open With..." and select the editor option that says " editor with Encoding".
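Both claims are easy to verify. A short Python check, purely illustrative, showing that the signature is the code point U+FEFF written in UTF-8, and that CP-1252 and ISO-8859-1 disagree inside the 0x80 - 0x9f range:

```python
import codecs

BOM = "\ufeff"  # the byte-order mark code point

if __name__ == "__main__":
    print(BOM.encode("utf-8").hex())      # efbbbf, the UTF-8 "signature"
    print(BOM.encode("utf-16-le").hex())  # fffe
    print(BOM.encode("utf-16-be").hex())  # feff
    print(BOM.encode("utf-8") == codecs.BOM_UTF8)  # True

    # Byte 0x80 is the euro sign in CP-1252 but a control code in ISO-8859-1.
    print(b"\x80".decode("cp1252") == "\u20ac")    # True
    print(b"\x80".decode("latin-1") == "\x80")     # True
```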
I think it saves files in the current codepage. There's an option under Tools->Options->Environment->Documents that will make it save in unicode when it cannot save in current codepage. But I don't know if that helps...
I think you want to try the ForceUtf8 (with BOM) / ForceUtf8 (without BOM) extension.
Just search UTF8 in the VS extension gallery (Tools -> Extensions and Updates)

Resources