Python 3.7 & Windows: incorrect Unicode characters in docstrings in interactive mode

Save the following program as test.py:
def f():
    """àâùç"""
    return

print("àâùç")
and execute it in a Windows cmd-window in interactive mode:
python -i test.py
The printed text is correct, but when I call help(f) I get scrambled eggs:
P:\>python -i test.py
àâùç
>>> help(f)
Help on function f in module __main__:
f()
ÓÔ¨þ
Changing the code page to 65001 brings up the classic replacement characters (�) instead:
P:\>python -i test.py
àâùç
>>> help(f)
Help on function f in module __main__:
f()
����
Is there any (easy) workaround?

help() hits two bugs here, both in the pager implementation, which writes the text to a temporary file and shells out to more. From pydoc.py:
def tempfilepager(text, cmd):
    """Page through text by invoking a program on a temporary file."""
    import tempfile
    filename = tempfile.mktemp()
    with open(filename, 'w', errors='backslashreplace') as file:
        file.write(text)
    try:
        os.system(cmd + ' "' + filename + '"')
    finally:
        os.unlink(filename)
The file is opened with the default file encoding (cp1252 on U.S. and Western European Windows), which won't support characters outside the Windows-1252 character set (don't write Chinese help documentation, for example). pydoc then shells out to a command (in this case more) to handle paging. more uses the encoding of the terminal (the OEM code page: cp850 by default in Western Europe, cp437 in the US), so help will look corrupt for most characters outside the ASCII set.
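One possible repair, sketched here rather than taken from any shipped pydoc version, is to write the temporary file in the code page that the console (and therefore more) will read it with; os.device_encoding(0) reports the console's code page on Windows and returns None when stdin is not a console:

import os
import sys
import tempfile

def tempfilepager(text, cmd):
    """Page through text by invoking a program on a temporary file."""
    filename = tempfile.mktemp()
    # Use the console code page on Windows so `more` decodes what we wrote;
    # keep the platform default elsewhere.
    encoding = os.device_encoding(0) if sys.platform == 'win32' else None
    with open(filename, 'w', encoding=encoding,
              errors='backslashreplace') as file:
        file.write(text)
    try:
        os.system(cmd + ' "' + filename + '"')
    finally:
        os.unlink(filename)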
Without touching pydoc, changing the terminal code page with chcp 1252 will print the characters correctly:
C:\>chcp 850
Active code page: 850
C:\>py -i test.py
àâùç
>>> help(f)
Help on function f in module __main__:
f()
ÓÔ¨þ
>>> ^Z
C:\>chcp 1252
Active code page: 1252
C:\>py -i test.py
àâùç
>>> help(f)
Help on function f in module __main__:
f()
àâùç
>>>
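If changing the code page is not convenient, another workaround sketch is to bypass the temp-file pager entirely and let pydoc write straight to sys.stdout, which goes through Python's own console I/O (this is a monkey-patch, not a supported pydoc API):

import pydoc

# plainpager() simply writes the help text to sys.stdout, so the
# temp-file + `more` round-trip (and its code page mismatch) is skipped.
pydoc.pager = pydoc.plainpager

help(f)  # the docstring characters now survive intact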

Related

ANSI escape code. SGR 10-19. Alternative fonts

Please help me understand SGR parameters 10 to 19 of the ANSI escape codes, e.g.:
\033[12m
The sources only say the following about them:

SGR parameter   Name                Note
11-19           Alternative font    Select alternative font n − 10
I would like to figure out what these alternative fonts are and how these ANSI commands work.
Operating system: Windows 11. The .py file is run in the default Command Line. Python code:
print('\033[10m', end='')
print('font0 TEST123test')
print('\033[11m', end='')
print('font1 TEST123test')
print('\033[12m', end='')
print('font2 TEST123test')
print('\033[0;33m', end='')
print('TEST123test')
print('\033[1;33m', end='')
print('TEST123test')
Result of code execution: (terminal screenshot)
I believe the font should change to some alternative one, but I don't understand what should happen, how it should happen, or what else I need to configure.

Difficulty with dealing with Unicode from sys.stdin

This is driving me somewhat nutty at the moment. It is clear from my last days of research that Unicode is a complex topic. But here is behavior that I do not know how to address.
If I read a file with non-ASCII characters from disk and write it back to a file, everything works as planned. However, when I read the same file from sys.stdin, it does not work and the non-ASCII characters are not encoded properly. The sample code is here:
# -*- coding: utf-8 -*-
import sys

with open("testinput.txt", "r") as ifile:
    lines = ifile.read()

with open("testout1.txt", "w") as ofile:
    for line in lines:
        ofile.write(line)

with open("testout2.txt", "w") as ofile:
    for line in sys.stdin:
        ofile.write(line)
The input file testinput.txt is this:
を
Sōten_Kōro
When I run the script from the command line as cat testinput.txt | python test.py, I get the following output in the two files:
testout1.txt:
を
Sōten_Kōro
testout2.txt:
???
S??ten_K??ro
Any ideas on how to address this would be of great help. Thanks. Paul.
The reason is that you took a shortcut, which should never be taken.
You should always define an encoding. So when you read the file, you should specify that you are reading UTF-8, or whatever. Or just make it explicit that you are reading binary files.
In your case, the Python interpreter uses UTF-8 as the default encoding when reading from files, because that is the default on Linux and macOS.
But when you read from standard input, the default is defined by the locale encoding, or by the PYTHONIOENCODING environment variable.
For how to solve it, I refer to How to change the stdin encoding on python. This answer just explains the cause.
Thanks for the pointers. I have landed on the following implementation based on @GiacomoCatenazzi's answer and reference:
# -*- coding: utf-8 -*-
import sys
import codecs

with open("testinput.txt", "r") as ifile:
    lines = ifile.read()

with open("testout1.txt", "w") as ofile:
    for line in lines:
        ofile.write(line)

UTF8Reader = codecs.getreader('utf-8')
sys.stdin = UTF8Reader(sys.stdin)

with open("testout2.txt", "w") as ofile:
    for line in sys.stdin:
        ofile.write(line.encode('utf-8'))
I am however not sure why it is necessary to encode again after using codecs.getreader?
Paul
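The extra encode is needed because testout2.txt was opened with the plain Python 2 open(), which deals in byte strings: the codecs reader yields unicode objects, and writing non-ASCII unicode to a byte-oriented file triggers an implicit ASCII encode, which fails. A sketch that lets codecs wrap the output file as well, removing the manual encode:

# -*- coding: utf-8 -*-
import codecs
import sys

# Decode stdin from UTF-8 and let the output file encode back to UTF-8,
# so only unicode objects travel through the loop.
reader = codecs.getreader('utf-8')(sys.stdin)
with codecs.open("testout2.txt", "w", encoding='utf-8') as ofile:
    for line in reader:
        ofile.write(line)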

Encoding issue on subprocess.Popen args

Yet another encoding question on Python.
How can I pass non-ASCII characters as parameters on a subprocess.Popen call?
My problem is not on the stdin/stdout as the majority of other questions on StackOverflow, but passing those characters in the args parameter of Popen.
Python script used for testing:
import subprocess
cmd = 'C:\Python27\python.exe C:\path_to\script.py -n "Testç on ã and ê"'
process = subprocess.Popen(cmd,stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
output, err = process.communicate()
result = process.wait()
print result, '-', output
For this example call, script.py receives a mangled version of Testç on ã and ê. If I copy-paste this same command string into a CMD shell, it works fine.
What I've tried, besides what's described above:
Checked if all Python scripts are encoded in UTF-8. They are.
Changed to unicode (cmd = u'...'), received a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 128: ordinal not in range(128) on line 5 (the Popen call).
Changed to cmd = u'...'.decode('utf-8'), received a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 128: ordinal not in range(128) on line 3 (the decode call).
Changed to cmd = u'...'.encode('utf8'), which results in the same mangled text.
Added PYTHONIOENCODING=utf-8 env. variable with no luck.
Looking at tries 2 and 3, it seems like Popen issues a decode call internally, but I don't have enough experience in Python to advance based on this suspicion.
Environment: Python 2.7.11 running on Windows Server 2012 R2.
I've searched for similar problems but haven't found any solution. A similar question is asked in what is the encoding of the subprocess module output in Python 2.7?, but no viable solution is offered.
I read that Python 3 changed the way string and encoding works, but upgrading to Python 3 is not an option currently.
Thanks in advance.
As noted in the comments, subprocess.Popen in Python 2 calls the Windows function CreateProcessA, which accepts a byte string in the currently configured ANSI code page. Luckily, Python has an encoding named mbcs that stands in for the current code page.
cmd = u'C:\Python27\python.exe C:\path_to\script.py -n "Testç on ã and ê"'.encode('mbcs')
Unfortunately you can still fail if the string contains characters that can't be encoded into the current code page.
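Putting it together, a sketch of the question's script with the mbcs encode applied (Python 2; the paths are the question's placeholders, and every character in the command must exist in the active ANSI code page):

# -*- coding: utf-8 -*-
import subprocess

# CreateProcessA, which Popen uses on Python 2, expects bytes in the
# current ANSI code page; 'mbcs' encodes to exactly that code page.
cmd = u'C:\\Python27\\python.exe C:\\path_to\\script.py -n "Testç on ã and ê"'
process = subprocess.Popen(cmd.encode('mbcs'),
                           stdout=subprocess.PIPE,
                           stderr=subprocess.STDOUT)
output, err = process.communicate()
print process.returncode, '-', output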

Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string

I need to decode PowerShell stdout, called from Python, into a Python string.
My ultimate goal is to get the names of the network adapters on Windows as a list of strings. My current function looks like this and works well on Windows 10 with the English language:
import subprocess

def get_interfaces():
    ps = subprocess.Popen(['powershell', 'Get-NetAdapter', '|', 'select Name', '|', 'fl'],
                          stdout=subprocess.PIPE)
    stdout, stdin = ps.communicate(timeout=10)
    interfaces = []
    for i in stdout.split(b'\r\n'):
        if not i.strip():
            continue
        if i.find(b':') < 0:
            continue
        name, value = [j.strip() for j in i.split(b':')]
        if name == b'Name':
            interfaces.append(value.decode('ascii'))  # This fails for other users
    return interfaces
Other users have different languages, so value.decode('ascii') fails for some of them. E.g., one user reported that changing it to decode('ISO 8859-2') works well for him (so it is not UTF-8). How can I know the encoding to use to decode the stdout bytes returned by the call to PowerShell?
UPDATE
After some experiments I am even more confused. The code page in my console, as returned by chcp, is 437. I changed the network adapter name to a name containing non-ASCII and non-cp437 characters. In an interactive PowerShell session, running Get-NetAdapter | select Name | fl correctly displayed the name, even its non-cp437 character. When I called PowerShell from Python, non-ASCII characters were converted to the closest ASCII characters (for example, ā to a, ž to z) and .decode('ascii') worked nicely. Could this behaviour (and correspondingly the solution) be Windows version dependent? I am on Windows 10, but users could be on older Windows down to Windows 7.
The output character encoding may depend on the specific commands, e.g.:
#!/usr/bin/env python3
import subprocess
import sys
encoding = 'utf-32'
cmd = r'''$env:PYTHONIOENCODING = "%s"; py -3 -c "print('\u270c')"''' % encoding
data = subprocess.check_output(["powershell", "-C", cmd])
print(sys.stdout.encoding)
print(data)
print(ascii(data.decode(encoding)))
Output
cp437
b"\xff\xfe\x00\x00\x0c'\x00\x00\r\x00\x00\x00\n\x00\x00\x00"
'\u270c\r\n'
The ✌ (U+270C) character is received successfully.
The character encoding of the child script is set using the PYTHONIOENCODING envvar inside the PowerShell session. I've chosen utf-32 for the output encoding so that it is different from the Windows ANSI and OEM code pages, for the sake of the demonstration.
Notice that the stdout encoding of the parent Python script is the OEM code page (cp437 in this case) because the script is run from the Windows console. If you redirect the output of the parent Python script to a file/pipe then the ANSI code page (e.g., cp1252) is used by default in Python 3.
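A quick way to observe this on the Python 3.5-era setup the answer assumes (PEP 528 later changed the console default to UTF-8 in 3.6; the file name here is arbitrary):

C:\>py -3 -c "import sys; print(sys.stdout.encoding)"
cp437

C:\>py -3 -c "import sys; print(sys.stdout.encoding)" >enc.txt && type enc.txt
cp1252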
To decode PowerShell output that might contain characters undecodable in the current OEM code page, you could set [Console]::OutputEncoding temporarily (inspired by @eryksun's comments):
#!/usr/bin/env python3
import io
import sys
from subprocess import Popen, PIPE
char = ord('✌')
filename = 'U+{char:04x}.txt'.format(**vars())
with Popen(["powershell", "-C", '''
    $old = [Console]::OutputEncoding
    [Console]::OutputEncoding = [Text.Encoding]::UTF8
    echo $([char]0x{char:04x}) | fl
    echo $([char]0x{char:04x}) | tee {filename}
    [Console]::OutputEncoding = $old'''.format(**vars())],
           stdout=PIPE) as process:
    print(sys.stdout.encoding)
    for line in io.TextIOWrapper(process.stdout, encoding='utf-8-sig'):
        print(ascii(line))
print(ascii(open(filename, encoding='utf-16').read()))
Output
cp437
'\u270c\n'
'\u270c\n'
'\u270c\n'
Both fl and tee use [Console]::OutputEncoding for stdout (the default behavior is as if | Write-Output were appended to the pipelines). tee uses utf-16 to save the text to a file. The output shows that ✌ (U+270C) is decoded successfully.
$OutputEncoding is used to decode bytes in the middle of a pipeline:
#!/usr/bin/env python3
import subprocess
cmd = r'''
$OutputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding
py -3 -c "import os; os.write(1, '\U0001f60a'.encode('utf-8')+b'\n')" |
py -3 -c "import os; print(os.read(0, 512))"
'''
subprocess.check_call(["powershell", "-C", cmd])
Output
b'\xf0\x9f\x98\x8a\r\n'
that is correct: b'\xf0\x9f\x98\x8a'.decode('utf-8') == u'\U0001f60a'. With the default $OutputEncoding (ascii) we would get b'????\r\n' instead.
Note:
b'\n' is replaced with b'\r\n' despite using binary APIs such as os.read/os.write (msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY) has no effect here)
b'\r\n' is appended if there is no newline in the output:
#!/usr/bin/env python3
from subprocess import check_output
cmd = '''py -3 -c "print('no newline in the input', end='')"'''
cat = '''py -3 -c "import os; os.write(1, os.read(0, 512))"''' # pass as is
piped = check_output(['powershell', '-C', '{cmd} | {cat}'.format(**vars())])
no_pipe = check_output(['powershell', '-C', '{cmd}'.format(**vars())])
print('piped: {piped}\nno pipe: {no_pipe}'.format(**vars()))
Output:
piped: b'no newline in the input\r\n'
no pipe: b'no newline in the input'
The newline is appended to the piped output.
If we ignore lone surrogates, setting UTF8Encoding allows all Unicode characters, including non-BMP characters, to pass via pipes. Text mode could be used in Python if $env:PYTHONIOENCODING = "utf-8:ignore" is configured.
In an interactive PowerShell session, running Get-NetAdapter | select Name | fl correctly displayed the name, even its non-cp437 character.
If stdout is not redirected then the Unicode API is used to print characters to the console: any BMP Unicode character can be displayed if the console (TrueType) font supports it.
When I called PowerShell from Python, non-ASCII characters were converted to the closest ASCII characters (e.g., ā to a, ž to z) and .decode('ascii') worked nicely.
It might be due to the System.Text.InternalDecoderBestFitFallback set for [Console]::OutputEncoding: if a Unicode character can't be encoded in a given encoding then it is passed to the fallback (either a best-fit char or '?' is used instead of the original character).
Could this behaviour (and correspondingly the solution) be Windows version dependent? I am on Windows 10, but users could be on older Windows down to Windows 7.
If we ignore bugs in cp65001 and the list of new encodings that are supported only in later versions, the behavior should be the same.
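Folding the [Console]::OutputEncoding trick back into the question's function gives a sketch along these lines (Python 3; assumes powershell is on PATH and that switching the child session's output encoding to UTF-8 is acceptable):

#!/usr/bin/env python3
import subprocess

def get_interfaces():
    # Make PowerShell emit UTF-8 on stdout, then decode it as such, so
    # adapter names outside the OEM code page survive unmangled.
    cmd = ('[Console]::OutputEncoding = [Text.Encoding]::UTF8; '
           'Get-NetAdapter | select Name | fl')
    out = subprocess.check_output(['powershell', '-C', cmd], timeout=10)
    interfaces = []
    for line in out.decode('utf-8-sig').splitlines():
        name, sep, value = line.partition(':')
        if sep and name.strip() == 'Name':
            interfaces.append(value.strip())
    return interfaces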
It's a Python 2 bug already marked as wontfix: https://bugs.python.org/issue19264
You must use Python 3 if you want to make this work under Windows.

Replacing "\xe9" character from a unicode string in Python 3

Using SublimeText 2.0.2 with Python 3.4.2, I get a webpage with urllib:
response = urllib.request.urlopen(req)
pagehtml = response.read()
Print => qualit\xe9">\r\n\t\t<META HTTP
I get a "\xe9" character within the unicode string!
The header of the pagehtml tells me it's encoded in ISO-8859-1
(Content-Type: text/html;charset=ISO-8859-1). But if I decode it with ISO-8859-1 and then encode it in utf-8, it only gets worse...
resultat = pagehtml.decode('ISO-8859-1').encode('utf-8')
Print => qualit\xc3\xa9">\r\n\t\t<META HTTP
How can I replace all the "\xe9"... characters with their corresponding letters ("é"...)?
Edit 1
I'm getting a UnicodeEncodeError (that's why I was encoding in 'utf-8')!
I should mention I'm running my code within SublimeText 2.0.2. That seems to be my problem.
Edit 2
It works fine in IDLE (Python 3.4.2) and in the OS X terminal (Python 2.5) but doesn't work in SublimeText 2.0.2 (with Python 3.4.2)... => That seems to be a problem with the SublimeText console (output window) and not with my code.
I'm going to look at the PYTHONIOENCODING environment variable as suggested by J.F. Sebastian.
It seems I should be able to set it in the sublime-build file.
Edit 3 - Solution
I just added "env": {"PYTHONIOENCODING": "UTF-8"} in the sublime-build file.
Done. Thanks everyone ;-)
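For reference, a minimal sublime-build sketch with that key; the "cmd" line is a guess at a typical configuration, and only the "env" entry comes from the fix:

{
    "cmd": ["python3", "-u", "$file"],
    "env": {"PYTHONIOENCODING": "UTF-8"}
}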
The response is an encoded byte string. Just decode it:
>>> pagehtml = b'qualit\xe9'
>>> print(pagehtml)
b'qualit\xe9'
>>> print(pagehtml.decode('ISO-8859-1'))
qualité
I am pretty sure you do not actually have a problem, except for understanding bytes versus unicode. Things are working as they should. pagehtml is encoded bytes. (I confirmed this with req = 'http://python.org' in your first line.) When bytes are displayed, those which can be interpreted as printable ASCII are printed as such, and other bytes are printed as hex escapes. b'\xe9' is the hex-escape display of the single-byte ISO-8859-1 encoding of é, and b'\xc3\xa9' is the hex-escape display of its double-byte utf-8 encoding.
>>> b = b"qualit\xe9"
>>> u = b.decode('ISO-8859-1')
>>> u
'qualité'
>>> b2 = u.encode()
>>> b2
b'qualit\xc3\xa9'
>>> len(b) == 7 and len(b2) == 8
True
>>> b[6]
233
>>> b2[6], b2[7]
(195, 169)
So pageuni = pagehtml.decode('ISO-8859-1') gives you the page as unicode. This decoding does the replacing that you asked for.
I'm getting a UnicodeEncodeError (that's why I was encoding in 'utf-8')! I should mention I'm running my code within SublimeText. It seems to be my problem. Any solution?
Don't encode manually; print Unicode strings instead.
For Unix
Set PYTHONIOENCODING=utf-8 if the output is redirected or if locale (LANGUAGE, LC_ALL, LC_CTYPE, LANG) is not configured (it defaults to C (ascii)).
For Windows
If the content can be represented using the console code page then set the PYTHONIOENCODING=your_console_cp envvar, e.g., PYTHONIOENCODING=cp1252 (set it to cp1252 only if that is indeed the encoding your console uses; run chcp to check). Or use whatever encoding SublimeText can show correctly if it doesn't open a console window to run Python scripts.
You don't need to set the PYTHONIOENCODING envvar if you run your script from the command line directly, unless the output is redirected.
Otherwise (to support characters that can't be represented in the console encoding), install the win_unicode_console package and either run your script using python3 -mrun your_script.py or put this at the top of your script:
import win_unicode_console
win_unicode_console.enable()
It uses the Win32 API (e.g., WriteConsoleW()) to print to the console. You still need to configure a suitable font to see arbitrary Unicode text in the console.
