Vim: Encoding (Unicode) in Terminal under Windows

I don't know why, but this topic seems to be badly documented and is covered with controversies as nobody knows the real answer (except for maybe Mr. Moolenaar, who rarely answers anyway).
So basically I've raised a discussion here, and it went dead pretty quickly, probably because there are not too many people using Vim in terminal mode on Windows.
My encoding settings look as follows:
if has('multi_byte')
  if empty(&termencoding)
    let &termencoding = &encoding
  endif
  let &encoding = 'utf-8'
  let &fileencoding = 'utf-8'
endif
Of course, I have no problems running under GVim: can type any characters, and my patched Consolas for Powerline works just fine. The problems start when I try to run Vim in terminal mode. I use ConEmu, a feature-rich terminal emulator for Windows. It claims to officially support Unicode out of the box. For example, I can run the following test script:
chcp 65001 & (cmd /c type "%~dpn0.cmd") & pause & goto :EOF
English: texts, web pages and documents
Graves,etc: à á â ã ä å æ ç è é ê ë ì í î ï
Greek: ΐ Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο
Arabic: ڠ ڡ ڢ ڣ ڤ ڥ ڦ ڧ ڨ ک ڪ ګ ڬ ڭ ڮ گ
Full width: ＠ Ａ Ｂ Ｃ Ｄ Ｅ Ｆ Ｇ Ｈ Ｉ Ｊ Ｋ Ｌ Ｍ Ｎ Ｏ
Romanian: texte, pagini Web şi a documentelor
Vietnamese: văn bản, các trang web và các tài liệu
Russian: тексты, веб-страницы и документы
Japanese: テキスト、Webページや文書
Yiddish: טעקסץ, וועב זייַטלעך און דאָקומענטן
Hindi: पाठ, वेब पृष्ठों और दस्तावेज
Thai: ข้อความ หน้า เว็บ และ เอกสาร
Korean: 텍스트, 웹 페이지 및 문서
Chinese: 文本,網頁和文件
and I can see all the symbols correctly in ConEmu. Yes, the test script turns on the 65001 codepage. I've already discovered that Vim cannot work with the 65001 codepage at all, so this seems not to be an option anyway. The default codepage in the terminal is 437, and I can also type something like Russian in ConEmu with this default codepage, and it is displayed correctly.
Reading through :h termencoding, I see that Windows uses Unicode by default to pass symbols. So why do I see ? symbols whenever I type anything non-ANSI in terminal Vim? Airline does not display the fancy symbols from the patched Consolas either. How do I configure true Unicode for terminal Vim on Windows? By the way, &termencoding reports 437 as well.
Could somebody please explain, once and for all, whether terminal Vim on Windows supports Unicode (and, if so, how to configure it)?

I've wondered about this myself, and in the past I tried ConEmu but gave up after struggling to get console Vim working in it with 256 colors and fancy fonts.
Today I tried again for some time and, surprise, surprise, things seem to be working. Given how sensitive all of this is to versions, I'll list the versions of everything:
VIM - Vi IMproved 7.4 (2013 Aug 10, compiled Aug 1 2014 09:38:34)
MS-Windows 32-bit console version
Included patches: 1-389
Compiled by raghuramanr#ADITI
ConEmu
140723 Alpha
Windows: Win 7x64
ConEmu settings in .vimrc:
" ConEmu
if !empty($CONEMUBUILD)
  echom "Running in conemu"
  set termencoding=utf8
  set term=xterm
  set t_Co=256
  let &t_AB="\e[48;5;%dm"
  let &t_AF="\e[38;5;%dm"
  " termcap codes for cursor shape changes on entry to and exit
  " from insert mode
  " doesn't work
  "let &t_ti="\e[1 q"
  "let &t_SI="\e[5 q"
  "let &t_EI="\e[1 q"
  "let &t_te="\e[0 q"
endif
Steps:
chcp 65001
vim.exe
I still can't get a blinking cursor in Vim, which is confusing. Still, it is better than before, when everything was messed up.

There was recently a patch for "Windows 8 IME in console Vim". It was cleaned up by mattn and posted here:
https://gist.github.com/mattn/8312677
It was included with 7.4.142. Does that version fix your issue?

Related

On Windows, PowerShell misinterprets non-ASCII characters in mosquitto_sub output

Note: This self-answered question describes a problem that is specific to using Eclipse Mosquitto on Windows; there, however, it affects both Windows PowerShell and the cross-platform PowerShell (Core) edition.
I use something like the following mosquitto_pub command to publish a message:
mosquitto_pub -h test.mosquitto.org -t tofol/test -m '{ \"label\": \"eé\" }'
Note: The extra \-escaping of the " characters, still required as of PowerShell 7.1, shouldn't be necessary, but that is a separate problem - see this answer.
Receiving that message via mosquitto_sub unexpectedly mangles the non-ASCII character é and prints Θ instead:
PS> $msg = mosquitto_sub -h test.mosquitto.org -t tofol/test; $msg
{ "label": "eΘ" } # !! Note the 'Θ' instead of 'é'
Why does this happen?
How do I fix the problem?
Problem:
While the mosquitto_sub man page makes no mention of character encoding as of this writing, it seems that on Windows mosquitto_sub exhibits nonstandard behavior in that it uses the system's active ANSI code page to encode its string output rather than the OEM code page that console applications are expected to use.[1]
There also appears to be no option that would allow you to specify what encoding to use.
PowerShell decodes output from external applications into .NET strings, based on the encoding stored in [Console]::OutputEncoding, which defaults to the OEM code page. Therefore, when it sees the ANSI byte representation of character é, 0xe9, in the output, it interprets it as the OEM representation, where it represents character Θ (the assumption is that the active ANSI code page is Windows-1252, and the active OEM code page IBM437, as is the case in US-English systems, for instance).
You can verify this as follows:
# 0xe9 is "é" in the (Windows-1252) ANSI code page, and coincides with *Unicode* code point
# U+00E9; in the (IBM437) OEM code page, 0xe9 represents "Θ".
PS> $oemEnc = [System.Text.Encoding]::GetEncoding([int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage OEMCP));
$oemEnc.GetString([byte[]] 0xe9)
Θ # Greek capital letter theta
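For readers without PowerShell at hand, the same mismatch can be sketched in Python, whose cp1252 and cp437 codecs correspond to the Windows-1252 and IBM437 code pages assumed above. This is only an illustration under that assumption; the variable names are made up for the example.

```python
import unicodedata

# Illustrative only: decode the single byte 0xE9 under two code pages.
raw = bytes([0xE9])

as_ansi = raw.decode("cp1252")  # Windows-1252 (ANSI on US-English systems)
as_oem = raw.decode("cp437")    # IBM437 (OEM on US-English systems)

# Print code points and names in plain ASCII so the output is
# readable regardless of the console's own encoding.
for label, ch in (("ANSI", as_ansi), ("OEM", as_oem)):
    print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The same byte yields two different characters, which is exactly what happens when PowerShell decodes ANSI-encoded output as OEM.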
Note that the decoding to .NET strings (System.String) that invariably happens means that the characters are stored as UTF-16 code units in memory, essentially as [uint16] values underlying the System.Char instances that make up a .NET string. Such a code unit encodes a Unicode character either in full, or - for characters outside the so-called BMP (Basic Multilingual Plane) - half of a Unicode character, as part of a so-called surrogate pair.
In the case at hand this means that the Θ character is stored as a different code point, namely a Unicode code point: Θ (Greek capital letter theta, U+0398).
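The UTF-16 storage described above can be made concrete with a small Python sketch. The helper name utf16_code_units is hypothetical and exists only for this illustration; it lists the 16-bit code units that a UTF-16 string such as a .NET string would hold for a given text.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

# A BMP character occupies a single code unit ...
theta_units = utf16_code_units("\u0398")        # Greek capital letter theta
# ... while a character outside the BMP needs a surrogate pair.
emoji_units = utf16_code_units("\U0001F600")    # grinning face emoji

print([hex(u) for u in theta_units])
print([hex(u) for u in emoji_units])
```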
Solution:
Note: A simple way to solve the problem is to activate system-wide support for UTF-8 (available in Windows 10), which sets both the ANSI and the OEM code page to 65001, i.e. UTF-8. However, this feature is (a) still in beta as of this writing and (b) has far-reaching consequences - see this answer for details.
However, it amounts to the most fundamental solution, as it also makes cross-platform Mosquitto use work properly (on Unix-like platforms, Mosquitto uses UTF-8).
PowerShell must be instructed what character encoding to use in this case, which can be done as follows:
PS> $msg = & {
  # Save the original console output encoding...
  $prevEnc = [Console]::OutputEncoding
  # ... and (temporarily) set it to the active ANSI code page.
  # Note: In *Windows PowerShell* - only - [System.Text.Encoding]::Default works as the RHS too.
  [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP))
  # Now PowerShell will decode mosquitto_sub's output correctly.
  mosquitto_sub -h test.mosquitto.org -t tofol/test
  # Restore the original encoding.
  [Console]::OutputEncoding = $prevEnc
}; $msg
{ "label": "eé" } # OK
Note: The Get-ItemPropertyValue cmdlet requires PowerShell version 5 or higher; in earlier versions, either use [Console]::OutputEncoding = [System.Text.Encoding]::Default or, if the code must also run in PowerShell (Core), [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([int] (Get-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP).ACP)
Helper function Invoke-WithEncoding can encapsulate this process for you. You can install it directly from a Gist as follows (I can assure you that doing so is safe, but you should always check):
# Download and define advanced function Invoke-WithEncoding in the current session.
irm https://gist.github.com/mklement0/ef57aea441ea8bd43387a7d7edfc6c19/raw/Invoke-WithEncoding.ps1 | iex
The workaround then simplifies to:
PS> Invoke-WithEncoding -Encoding Ansi { mosquitto_sub -h test.mosquitto.org -t tofol/test }
{ "label": "eé" } # OK
A similar function focused on diagnostic output is Debug-NativeInOutput, discussed in this answer.
As an aside:
While PowerShell isn't the problem here, it too can exhibit problematic character-encoding behavior.
GitHub issue #7233 proposes making PowerShell (Core) windows default to UTF-8 to minimize encoding problems with most modern command-line programs (it wouldn't help with mosquitto_sub, however), and this comment fleshes out the proposal.
[1] Note that Python too exhibits this nonstandard behavior, but it offers UTF-8 encoding as an opt-in, either by setting environment variable PYTHONUTF8 to 1, or via the v3.7+ CLI option -X utf8 (must be specified case-exactly!).

How to change font properties in PowerShell 5?

I want to make the font bold for the path that I'm printing using Write-Host. I'm flexible about using other methods, such as echo.
I've tried other methods like Write-Debug, etc., and also checked the WindowsConsoleFonts module.
But none of them supports font properties such as bold or italic when printing.
$pathString = "[" + (Get-Location) + "]"
Write-Host $pathString -ForegroundColor Cyan
I'm using PowerShell 5.1, which doesn't support Markdown rendering; otherwise I would have done it using Markdown.
You can achieve bold text via VT (Virtual Terminal) escape sequences.
However, regular Windows console windows (conhost.exe) do not support italics, and neither does their upcoming successor, Windows Terminal (at least as of this writing).[1]
In recent versions of Windows 10, support for VT sequences is enabled by default in both Windows PowerShell and PowerShell Core.
However, Write-Host has no support for them, so you must embed the escape sequences directly into the strings you pass to Write-Host (or strings you send to the success output stream, if it goes to the console):
Note:
I'm omitting Write-Host from the examples below, because it isn't strictly necessary, but colored text generally should indeed be written to the display (host), not to the success output stream.
While it is better to consistently use VT sequences for all formatting needs - including colors - it is possible to combine them with Write-Host -ForegroundColor / -BackgroundColor.
PowerShell Core:
PowerShell Core supports embedding escape sequences directly in "..." (double-quoted strings), via the `e escape sequence, which expands to a literal ESC character, which is the character that initiates a VT escape sequence.
"To `e[1mboldly`e[m go ..., in `e[36mcyan`e[m."
Windows PowerShell:
There's no support for `e; the easiest solution is to use \e as placeholders and use -replace to substitute actual ESC characters (Unicode code point 0x1b) for them:
"To \e[1mboldly\e[m go ..., in \e[36mcyan\e[m." -replace '\\e', [char] 0x1b
[1] From PowerShell Core, you can run the following test command to see if the word italics prints in italics: "`e[3mitalics`e[m after"
Note that italics in the terminal are supported on macOS and at least in some Linux distros; e.g., Ubuntu 18.04
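The escape sequences themselves are not PowerShell-specific: any program that writes the ESC character (0x1b) followed by the same codes to a VT-enabled terminal should get the same effect. A minimal Python sketch follows; the helper name sgr is hypothetical and only mirrors the bold (1) and cyan (36) codes used in the examples above.

```python
ESC = "\x1b"  # the ESC character that starts a VT escape sequence

def sgr(text, code):
    """Wrap text in the given SGR code and reset formatting afterwards."""
    return f"{ESC}[{code}m{text}{ESC}[m"

line = f"To {sgr('boldly', 1)} go ..., in {sgr('cyan', 36)}."
print(line)
```

Whether the formatting is actually rendered still depends on the terminal having VT processing enabled, as described above.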

Vim conversion error on EUR-sign € and dash-sign - when saving the file

In (mac)vim I use the following .vimrc
set guifont=Meslo_LG_M_Regular_for_Powerline:h12
set encoding=utf-8
When I write a € sign in the buffer and try to save it, I always get an error
CONVERSION ERROR in line xxxx
The same happens with some types of dashes. As a workaround I have to replace the signs, e.g. € -> EUR, and then saving the file works fine. But this is annoying.
How can I get vim to write the € so that I can save the file without error messages?
This is what I use:
VIM - Vi IMproved 7.4 (2013 Aug 10, compiled Apr 21 2014 14:54:22) MacOS X (unix) version
ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G1510
With your 'encoding' setting, Vim is able to represent the € sign internally. Buffers loaded into Vim for editing still have a particular 'fileencoding'. If that one cannot represent the character (e.g. if it is latin1), you'll get the conversion error.
A quick fix is to force saving in UTF-8, e.g. via :w ++enc=utf-8.
Better still, adapt the global 'fileencodings' setting so that the file's encoding gets detected correctly right from the start. Note: 'fileencodings' is the global option controlling the detection; 'fileencoding' is the actual, buffer-local encoding used.
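Why the conversion fails can be illustrated outside of Vim. The following Python sketch (the helper name can_encode is hypothetical) checks whether a character is representable in a given encoding; latin1 has no slot for the € sign or for the en dash, while UTF-8 can represent both.

```python
def can_encode(text, encoding):
    """Return True if text is representable in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

euro = "\u20ac"     # EUR sign
en_dash = "\u2013"  # one of the dash characters that latin1 lacks

for ch in (euro, en_dash):
    print(f"U+{ord(ch):04X}",
          "latin1:", can_encode(ch, "latin-1"),
          "utf-8:", can_encode(ch, "utf-8"))
```

A buffer whose 'fileencoding' is latin1 hits the same limitation when Vim tries to write such a character.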

Python 3 not aware of Windows filename encodings?

The following code works well in Win7 until it crashes in the last print(f). It does so when it finds some "exotic" characters in the filenames, such as the French "oe" in œuvre and the C in Karel Čapek. The program crashes with an encoding error, saying the character x in the filename isn't a valid utf-8 char.
Shouldn't Python 3 be aware of the UTF-16 encoding of Windows 7 paths?
How should I modify my code?
import os
rootDir = '.'
extensions = ['mobi', 'lit', 'prc', 'azw', 'rtf', 'odt', 'lrf', 'fb2', 'azw3']
files = []
for dirName, subdirList, fileList in os.walk(rootDir):
    files.extend(os.path.join(dirName, fn) for fn in fileList if any(fn.endswith(ext) for ext in extensions))
for f in files:
    print(f)
eryksun answered my question in a comment. I'm copying his answer here so the thread doesn't stand as unanswered. The win-unicode-console module solved the problem:
Python 3's raw FileIO class forces binary mode, which precludes using
a UTF-16 text mode for the Windows console. Thus the default setup is
limited to using an OEM/ANSI codepage. To avoid raising an exception,
you'd have to use a less-strict 'replace' or 'backslashreplace' mode
for sys.stdout. Switching to codepage 65001 (UTF-8) seems like it
should be the answer, but the console host (conhost.exe) has problems
with multibyte encodings. That leaves the UTF-16 wide-character API,
such as via the win-unicode-console module.
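The less-strict 'backslashreplace' mode mentioned in the quote can be sketched as follows. This is only an illustration of that fallback, not the win-unicode-console approach that actually solved the problem; the helper name console_safe is hypothetical.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the encoding lacks shown as \\uXXXX escapes."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

# Code page 437 has no "Č", so it is printed as an escape instead of
# raising UnicodeEncodeError.
print(console_safe("Karel Čapek", "cp437"))
```

In the original loop this would amount to print(console_safe(f)). On Python 3.7 and later, sys.stdout.reconfigure(errors="backslashreplace") should have a similar effect for all subsequent prints.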

Ruby - Win32Console - using colors changes encoding?

I have recently installed the Win32Console gem for my program. The program has a Polish “interface”, which includes Polish special characters. These display fine for a plain call such as:
puts "Ciekawym polskim słowem jest: żółć"
However, using escape characters to colorize the text (which works) seems to change the encoding, and the Windows 7 CMD displays such diacritic marks incorrectly:
green = "\e[1;32;40m"
puts "#{green}Ciekawym polskim słowem jest: żółć"
Honestly, with my limited knowledge of how Ruby treats different encodings, I don't really even know where to start - is that a problem with Ruby, Win32Console or Command Prompt itself?
The Windows console does not support ANSI escape sequences (\e[...) at all (see ANSI escape code - Wikipedia).
Turns out it was the gem I installed. I later found out that Ruby 2.0 and higher has built-in support for escape codes and it works just fine with UTF-8.

Resources