Yet another encoding question on Python.
How can I pass non-ASCII characters as parameters on a subprocess.Popen call?
Unlike the majority of other questions on Stack Overflow, my problem is not with stdin/stdout but with passing those characters in the args parameter of Popen.
Python script used for testing:
import subprocess
cmd = 'C:\Python27\python.exe C:\path_to\script.py -n "Testç on ã and ê"'
process = subprocess.Popen(cmd,stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
output, err = process.communicate()
result = process.wait()
print result, '-', output
For this example call, script.py receives the string with the accented characters garbled (mojibake). If I copy and paste this same command string into a CMD shell, it works fine.
What I've tried, besides what's described above:
Checked if all Python scripts are encoded in UTF-8. They are.
Changed to unicode (cmd = u'...'); received a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 128: ordinal not in range(128) on line 5 (the Popen call).
Changed to cmd = u'...'.decode('utf-8'); received a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 128: ordinal not in range(128) on line 3 (the decode call).
Changed to cmd = u'...'.encode('utf8'); the accented characters still arrive garbled in script.py.
Added the PYTHONIOENCODING=utf-8 environment variable, with no luck.
Looking at tries 2 and 3, it seems like Popen issues a decode call internally, but I don't have enough experience in Python to go further based on this suspicion.
Environment: Python 2.7.11 running on Windows Server 2012 R2.
I've searched for similar problems but haven't found any solution. A similar question is asked in what is the encoding of the subprocess module output in Python 2.7?, but no viable solution is offered.
I read that Python 3 changed the way string and encoding works, but upgrading to Python 3 is not an option currently.
Thanks in advance.
As noted in the comments, subprocess.Popen in Python 2 calls the Windows function CreateProcessA, which accepts a byte string in the currently configured code page. Luckily, Python has a codec named mbcs that stands in for the current code page.
cmd = u'C:\Python27\python.exe C:\path_to\script.py -n "Testç on ã and ê"'.encode('mbcs')
Unfortunately you can still fail if the string contains characters that can't be encoded into the current code page.
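Putting the pieces together, a minimal sketch of that approach might look like the following (Python 2; the paths are the question's own and are assumed to exist, and subprocess.list2cmdline is the helper subprocess itself uses to quote arguments on Windows):
import subprocess
# build each piece as unicode (accented characters written as escapes so no
# coding declaration is needed), join them with the Windows quoting rules,
# then encode to the active ANSI code page right before calling Popen
args = [u'C:\\Python27\\python.exe', u'C:\\path_to\\script.py',
        u'-n', u'Test\xe7 on \xe3 and \xea']
cmd = subprocess.list2cmdline(args).encode('mbcs')
process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
output, _ = process.communicate()
print process.returncode, '-', output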
Related
I have the file t2ű.cmd on Windows with an accented character in its name, and I'd like to run it from Python 2 code.
Opening the file (open(u't2\u0170.cmd')) works if I pass the filename as a unicode literal, but no str literal works, because \u0170 is not on the code page of Windows. (See this question for more on opening files with accented characters in their name: opening a file with an accented character in its name, in Python 2 on Windows.)
Running the file from the Command Prompt without Python works.
I tried passing a str literal to os.system, os.popen, os.spawnl and subprocess.call (both with and without the shell), but they weren't able to find the file.
These don't work; they raise UnicodeEncodeError: 'ascii' codec can't encode character u'\u0170'...:
os.system(u't2\u0170.cmd')
os.popen(u't2\u0170.cmd')
os.spawnl(u't2\u0170.cmd', u't2')
subprocess.call(u't2\u0170.cmd')
subprocess.call(u'"t2\u0170.cmd"')
subprocess.call([u't2\u0170.cmd'])
In this project it's not feasible to upgrade to Python 3.
It's not feasible to rename the file, because these files can have arbitrary (user-supplied) names on a read-only share, and also the directory name can contain accented characters.
In C I would use any of the _wsystem, _wpopen or _wspawnl functions in <process.h>.
Preferably I'm looking for a solution which works with the standard Python modules (no need to install packages). But I'm interested in any solution.
I need a solution which doesn't open a new window.
Eventually I want to pass command-line arguments to the program, and the arguments will contain arbitrary Unicode characters.
This is based on the comment by @eryksun.
We need to call the Windows API function CreateProcessW or the C functions _wspawnl, _wsystem or _wpopen. Python 2 doesn't have anything built in that calls any of these functions. Writing an extension module in C or calling the functions using ctypes could be a solution.
The C functions CreateProcessA, spawnl, system and popen don't work.
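For instance, here is a minimal ctypes sketch of that idea, calling the wide-character CRT function _wsystem so the file name stays Unicode end to end (untested here; it assumes the .cmd file sits in the current directory):
import ctypes
msvcrt = ctypes.cdll.msvcrt                      # the C runtime DLL that exports _wsystem
msvcrt._wsystem.argtypes = [ctypes.c_wchar_p]    # pass the command line as a wide string
rc = msvcrt._wsystem(u'"t2\u0170.cmd"')
print rc
Calling CreateProcessW through ctypes (or from a small C extension module) works along the same lines and avoids the intermediate cmd.exe, at the cost of filling in the STARTUPINFO and PROCESS_INFORMATION structures by hand.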
As described in PEP 263, if you want to use Unicode characters in a Python script, just add # -*- coding: utf-8 -*- at the beginning of your script (it's OK to place it after the shebang):
#!/bin/env python
# -*- coding: utf-8 -*-
import os
os.system('t2ű.cmd')
If you still run into problems, you may take a look at some packages, like win-unicode-console.
It should now work directly, with no extra escaping code.
The following code works well on Windows 7 until it crashes in the last print(f). It does so when it finds some "exotic" characters in the filenames, such as the French ligature œ in œuvre or the Č in Karel Čapek. The program crashes with an encoding error, saying that character x in the filename isn't a valid UTF-8 character.
Shouldn't Python 3 be aware of the UTF-16 encoding of Windows 7 paths?
How should I modify my code?
import os
rootDir = '.'
extensions = ['mobi','lit','prc','azw','rtf','odt','lrf','fb2','azw3' ]
files=[]
for dirName, subdirList, fileList in os.walk(rootDir):
    files.extend((os.path.join(dirName, fn) for fn in fileList if any([fn.endswith(ext) for ext in extensions])))
for f in files:
    print(f)
eryksun answered my question in a comment. I copy his answer here so the thread doesn't stand as unanswered. The win-unicode-console module solved the problem:
Python 3's raw FileIO class forces binary mode, which precludes using
a UTF-16 text mode for the Windows console. Thus the default setup is
limited to using an OEM/ANSI codepage. To avoid raising an exception,
you'd have to use a less-strict 'replace' or 'backslashreplace' mode
for sys.stdout. Switching to codepage 65001 (UTF-8) seems like it
should be the answer, but the console host (conhost.exe) has problems
with multibyte encodings. That leaves the UTF-16 wide-character API,
such as via the win-unicode-console module.
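For completeness, here is a minimal sketch of the less strict sys.stdout mentioned in the quote (Python 3): characters the console code page cannot represent are printed as backslash escapes instead of raising UnicodeEncodeError.
import io
import sys
# rewrap stdout with a forgiving error handler; output is no longer exact,
# but the walk/print loop will not crash on names like 'Karel Čapek'
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,
                              encoding=sys.stdout.encoding,
                              errors='backslashreplace',
                              line_buffering=True)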
Using SublimeText 2.0.2 with Python 3.4.2, I fetch a web page with urllib:
response = urllib.request.urlopen(req)
pagehtml = response.read()
Print => qualit\xe9">\r\n\t\t<META HTTP
I get a "\xe9" character within the unicode string!
The header of pagehtml tells me it's encoded in ISO-8859-1 (Content-Type: text/html;charset=ISO-8859-1). But if I decode it with ISO-8859-1 and then encode it to utf-8, it only gets worse...
resultat = pagehtml.decode('ISO-8859-1').encode('utf-8')
Print => qualit\xc3\xa9">\r\n\t\t<META HTTP
How can I replace all the "\xe9"... characters by their corresponding letters ("é"...) ?
Edit 1
I'm getting a UnicodeEncodeError (that's why I was encoding to 'utf-8')!
I should mention I'm running my code within SublimeText 2.0.2. That seems to be the problem.
Edit 2
It works fine in IDLE (Python 3.4.2) and in the OS X terminal (Python 2.5) but doesn't work in SublimeText 2.0.2 (with Python 3.4.2)... => That seems to be a problem with the SublimeText console (output window) and not with my code.
I'm going to look at the PYTHONIOENCODING environment variable as suggested by J.F. Sebastian.
It seems I should be able to set it in the sublime-build file.
Edit 3 - Solution
I just added "env": {"PYTHONIOENCODING": "UTF-8"} in the sublime-build file.
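For reference, the relevant part of the .sublime-build file then looks something like this (the "cmd" entry is only an example; keep whatever your existing build system uses):
{
    "cmd": ["python3", "-u", "$file"],
    "env": {"PYTHONIOENCODING": "UTF-8"}
}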
Done. Thanks everyone ;-)
The response is an encoded byte string. Just decode it:
>>> pagehtml = b'qualit\xe9'
>>> print(pagehtml)
b'qualit\xe9'
>>> print(pagehtml.decode('ISO-8859-1'))
qualité
I am pretty sure you do not actually have a problem, except for understanding bytes versus unicode. Things are working as they should. pagehtml is encoded bytes. (I confirmed this with req = 'http://python.org' in your first line.) When bytes are displayed, those which can be interpreted as printable ascii encodings are printed as such and other bytes are printed with hex escapes. b'\xe9' is the hex escape encoding of the single-byte ISO-8859-1 encoding of é and b'\xc3\xa9' is the hex escape encoding of its double-byte utf-8 encoding.
>>> b = b"qualit\xe9"
>>> u = b.decode('ISO-8859-1')
>>> u
'qualité'
>>> b2 = u.encode()
>>> b2
b'qualit\xc3\xa9'
>>> len(b) == 7 and len(b2) == 8
True
>>> b[6]
233
>>> b2[6], b2[7]
(195, 169)
So pageuni = pagehtml.decode('ISO-8859-1') gives you the page as unicode. This decoding does the replacing that you asked for.
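As a side note, you can avoid hard-coding ISO-8859-1 by reading the charset from the response headers; a small sketch (the URL is a placeholder):
import urllib.request
response = urllib.request.urlopen('http://example.com')            # placeholder URL
charset = response.headers.get_content_charset() or 'ISO-8859-1'   # fall back if none is declared
pageuni = response.read().decode(charset)                          # unicode text, ready to print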
I'm getting a UnicodeEncodeError (that's why I was encoding to 'utf-8')! I should mention I'm running my code within SublimeText. That seems to be the problem. Any solution?
Don't encode manually; print Unicode strings instead.
For Unix
Set PYTHONIOENCODING=utf-8 if the output is redirected or if locale (LANGUAGE, LC_ALL, LC_CTYPE, LANG) is not configured (it defaults to C (ascii)).
For Windows
If the content can be represented using the console code page, then set the PYTHONIOENCODING=your_console_cp envvar, e.g., PYTHONIOENCODING=cp1252 (set it to cp1252 only if it is indeed the encoding your console uses; run chcp to check). Or use whatever encoding SublimeText can show correctly if it doesn't open a console window to run Python scripts.
You don't need to set the PYTHONIOENCODING envvar if you run your script from the command line directly, unless the output is redirected.
Otherwise (to support characters that can't be represented in the console encoding), install the win_unicode_console package and either run your script using python3 -mrun your_script.py or put at the top of your script:
import win_unicode_console
win_unicode_console.enable()
It uses Win32 API such as WriteConsoleW() to print to the console. You still need to configure correct fonts to see arbitrary Unicode text in the console.
I have an exe that I need to open using Python 3.2.3. I also need to pass an argument in the form of bytes to the exe. I try doing something like:
argument = '\x50'*260
subprocess.call([command, argument])
This works fine but when I try to give a non-printable character as the argument like '\x86', it gets converted to '\x3f'. Printing the argument gives the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\x86' in position 262: character maps to <undefined>
So I tried doing it using os.system:
command = "C:\myfile.exe "+b"\x50"*260
os.system(command)
But obviously, this leads to a type error. Does anyone have any suggestions to get this thing done?
It can't be done directly. What subprocess effectively does is type the command into the command prompt for you, so the argument has to survive as text. Do you have access to the source of myfile.exe? If so, you can easily represent the bytes as a string or a number.
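If you do have access to the receiving program, one way to represent the bytes as a string, as suggested above, is to hex-encode the payload so only ASCII travels through the command line. A sketch (myfile.exe is the question's hypothetical program and would have to decode the argument itself):
import binascii
import subprocess
payload = b'\x50' * 260 + b'\x86'                      # arbitrary bytes, including non-printable ones
argument = binascii.hexlify(payload).decode('ascii')   # plain ASCII, safe to pass on the command line
subprocess.call(['C:\\myfile.exe', argument])
# the receiving side reverses it, e.g. in Python: binascii.unhexlify(sys.argv[1])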
I have an environment variable set in Windows as TEST=abc£, which uses the Windows-1252 code page. When I run a Perl program, test1.pl, this environment value comes through properly.
When I call another Perl script, test2.pl, from test1.pl, either via system(..) or Win32::Process, the environment value comes through garbled.
Can someone explain why this happens and how to resolve it?
The version of Perl I am using is 5.8.
If my understanding is right, Perl internally uses UTF-8, so the initial process, test1.pl, received the value correctly, converted from Windows-1252 to UTF-8. When we call another process, should we convert back to the Windows-1252 code page?
This has nothing to do with Perl's internal string encoding, but with the need to properly decode data coming from the outside. I'll provide the test case. This is Strawberry Perl 5.10 on a Western European Windows XP.
test1.pl:
use Devel::Peek;
print Dump $ENV{TEST};
use Encode qw(decode);
my $var = decode 'Windows-1252', $ENV{TEST};
print Dump $var;
system "B:/sperl/perl/bin/perl.exe B:/test2.pl";
test2.pl:
use Devel::Peek;
print Dump $ENV{TEST};
use Encode qw(decode);
my $var = decode 'IBM850', $ENV{TEST};
# using Windows-1252 again is wrong here
print Dump $var;
Execute:
> set TEST=abc£
> B:\sperl\perl\bin\perl.exe B:\test1.pl
Output (shortened):
SV = PVMG(0x982314) at 0x989a24
  FLAGS = (SMG, RMG, POK, pPOK)
  PV = 0x98de0c "abc\243"\0
SV = PV(0x3d6a64) at 0x989b04
  FLAGS = (PADMY, POK, pPOK, UTF8)
  PV = 0x9b5be4 "abc\302\243"\0 [UTF8 "abc\x{a3}"]
SV = PVMG(0x982314) at 0x989a24
  FLAGS = (SMG, RMG, POK, pPOK)
  PV = 0x98de0c "abc\243"\0
SV = PV(0x3d6a4c) at 0x989b04
  FLAGS = (PADMY, POK, pPOK, UTF8)
  PV = 0x9b587c "abc\302\243"\0 [UTF8 "abc\x{a3}"]
You are bitten by the fact that Windows uses a different encoding for the text environment (IBM850) than for the graphical environment (Windows-1252). An expert has to explain the deeper details of that phenomenon.
Edit:
It is possible to determine encodings heuristically (meaning it will sometimes fail to do the right thing, especially for such short strings). The best general-purpose solution is Encode::Detect/Encode::Detect::Detector, which is based on Mozilla's nsUniversalDetector.
There are some ways to decode external data implicitly, such as the open pragma/IO layers and the -C switch; however, they deal with file streams and program arguments only. As of now, data from the environment must be decoded explicitly. I like that better anyway; being explicit shows the maintenance programmer that you have thought the topic through.