Understanding encoding and decoding in Python - Windows

I'm looking into how encoding works in Python 2.7, and I can't quite understand some aspects of it. I've worked with files in different encodings and so far I was doing okay, until I started working with a certain API that requires Unicode strings
u'text'
while I was using normal (byte) strings
'text'
which raised a lot of problems.
So I want to know how to convert from a Unicode string to a normal string and back, because the data I'm working with is handled as normal strings, and so far I only know how to produce Unicode ones without issues in the Python shell.
What I've tried is:
>>> foo = "gurú"
>>> bar = u"gurú"
>>> foo
'gur\xa3'
>>> bar
u'gur\xfa'
Now, to get a Unicode string, what I do is:
>>> foobar = unicode(foo, "latin1")
>>> foobar
u'gur\xa3'
But this doesn't work for me, since I'm doing some comparisons in my code like this:
>>> foobar in u"Foo gurú Bar"
False
This fails even though the original text is the same, because the byte string was decoded with the wrong encoding (0xA3 is ú in cp850 but £ in latin-1).
[Edit]
I'm using Python Shell on Windows 10.

The Windows console uses legacy DOS code pages. For US Windows it is:
>>> import sys
>>> sys.stdout.encoding
'cp437'
Windows applications use Windows (ANSI) code pages. Python's IDLE will show the Windows encoding:
>>> import sys
>>> sys.stdout.encoding
'cp1252'
Your results may vary!
So if you want to go from a normal string to Unicode and back, first find out the encoding of your system, which is what normal strings use in Python 2.x, and then use it to make the proper conversion.
I leave you with an example:
>>> import sys
>>> sys.stdout.encoding
'cp850'
>>>
>>> foo = "gurú"
>>> bar = u"gurú"
>>> foo
'gur\xa3'
>>> bar
u'gur\xfa'
>>>
>>> foobar = unicode(foo, 'cp850')
>>> foobar
u'gur\xfa'
>>>
>>> foobar in u"Foo gurú Bar"
True
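For the reverse direction, here is a minimal sketch (assuming the same cp850 console; substitute whatever sys.stdout.encoding reports on your machine):
>>> import sys
>>> encoding = sys.stdout.encoding or 'cp850'
>>> bar = u"gurú"
>>> bar.encode(encoding)
'gur\xa3'
>>> bar.encode(encoding).decode(encoding) == bar
True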

Related

running a cmd file with an accented character in its name, in Python 2 on Windows

I have the file t2ű.cmd on Windows with an accented character in its name, and I'd like to run it from Python 2 code.
Opening the file (open(u't2\u0170.cmd')) works if I pass the filename as a unicode literal, but no str literal works, because \u0170 is not on the code page of Windows. (See this question for more on opening files with accented characters in their name: opening a file with an accented character in its name, in Python 2 on Windows.)
Running the file from the Command Prompt without Python works.
I tried passing an str literal to os.system, os.popen, os.spawnl and subprocess.call (both with and without the shell), but it wasn't able to find the file.
These don't work; they raise UnicodeEncodeError: 'ascii' codec can't encode character u'\u0170'...:
os.system(u't2\u0170.cmd')
os.popen(u't2\u0170.cmd')
os.spawnl(os.P_WAIT, u't2\u0170.cmd', u't2')
subprocess.call(u't2\u0170.cmd')
subprocess.call(u'"t2\u0170.cmd"')
subprocess.call([u't2\u0170.cmd'])
In this project it's not feasible to upgrade to Python 3.
It's not feasible to rename the file, because these files can have arbitrary (user-supplied) names on a read-only share, and also the directory name can contain accented characters.
In C I would use any of the _wsystem, _wpopen or _wspawnl functions in <process.h>.
Preferably I'm looking for a solution which works with the standard Python modules (no need to install packages). But I'm interested in any solution.
I need a solution which doesn't open a new window.
Eventually I want to pass command-line arguments to program, and the arguments will contain arbitrary Unicode characters.
This is based on the comment by @eryksun.
We need to call the Win32 function CreateProcessW or the C runtime functions _wspawnl, _wsystem or _wpopen. Python 2 doesn't have anything built in that calls any of these functions. Writing an extension module in C or calling the functions using ctypes could be a solution.
The ANSI variants CreateProcessA, spawnl, system and popen don't work.
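As an illustration of the ctypes route, a minimal sketch (an untested assumption, not a verified solution) that calls _wsystem from the Microsoft C runtime; it takes a wide-character command line and should run it in the current console, so no new window is opened:
import ctypes

msvcrt = ctypes.cdll.msvcrt
msvcrt._wsystem.argtypes = [ctypes.c_wchar_p]  # wide-character command line
msvcrt._wsystem.restype = ctypes.c_int

rc = msvcrt._wsystem(u't2\u0170.cmd')  # exit status of the command interpreter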
As described in PEP 263, if you want to use Unicode characters in a Python script, just add a # -*- coding: utf-8 -*- at the beginning of your script (it's fine after the shebang):
#!/bin/env python
# -*- coding: utf-8 -*-
import os
os.system('t2ű.cmd')
If you still have problems, you may take a look at packages like win-unicode-console.
It should then work directly, with no escape codes.

Fix Filename that was changed to ASCII from UTF8

I recently downloaded a pack of videos that should have Japanese characters in their file names. Instead, whoever uploaded them botched the encoding.
Instead of Kana, Hiragana, and Kanji I get:
002òÅü¢âyâbâeâBâôâO(âuâïâ}).mp4
I was wondering if there is a way to fix this, short of asking for another upload?
I tried putting the names into a text file and then hex-editing that file to change its encoding, but that didn't work.
I would use the chardet library for Python as an aid to guess at the encoding.
>>> import chardet
>>> s='002òÅü¢âyâbâeâBâôâO(âuâïâ}).mp4'
>>> chardet.detect(s.encode('l1'))
{'encoding': 'ISO-8859-5', 'confidence': 0.536359806931924, 'language': 'Russian'}
>>> chardet.detect(s.encode('cp437'))
{'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}
>>> chardet.detect(s.encode('cp850'))
{'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}
Probably not ISO-8859-1, more likely IBM 437 or 850.
>>> s.encode('cp850').decode('sjis')
'002撫⊃ペッティング(ブルマ).mp4'
>>> s.encode('cp437').decode('sjis')
'002撫○ペッティング(ブルマ).mp4'
Could be either one of these, but I can't read them.
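If one of those pairs turns out to be right, here is a hedged sketch (Python 3, assuming cp850 -> Shift JIS is the correct pair) to undo the damage for a directory of files:
import os

def fix_name(name, wrong='cp850', right='shift_jis'):
    # Re-encode with the code page that produced the mojibake,
    # then decode with the encoding the uploader intended.
    return name.encode(wrong).decode(right)

for fname in os.listdir('.'):
    if fname.lower().endswith('.mp4'):
        try:
            os.rename(fname, fix_name(fname))
        except (UnicodeError, OSError):
            pass  # skip names that don't round-trip cleanly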

Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string

I need to decode PowerShell stdout called from Python into a Python string.
My ultimate goal is to get in a form of a list of strings the names of network adapters on Windows. My current function looks like this and works well on Windows 10 with the English language:
import subprocess

def get_interfaces():
    ps = subprocess.Popen(['powershell', 'Get-NetAdapter', '|', 'select Name', '|', 'fl'],
                          stdout=subprocess.PIPE)
    stdout, stdin = ps.communicate(timeout=10)
    interfaces = []
    for i in stdout.split(b'\r\n'):
        if not i.strip():
            continue
        if i.find(b':') < 0:
            continue
        name, value = [j.strip() for j in i.split(b':')]
        if name == b'Name':
            interfaces.append(value.decode('ascii'))  # This fails for other users
    return interfaces
Other users have different languages, so value.decode('ascii') fails for some of them. E.g. one user reported that changing to decode('iso-8859-2') works well for him (so it is not UTF-8). How can I know the encoding to use to decode the stdout bytes returned by the call to PowerShell?
UPDATE
After some experiments I am even more confused. The codepage in my console as returned by chcp is 437. I changed the network adapter name to a name containing non-ASCII and non-cp437 characters. In an interactive PowerShell session running Get-NetAdapter | select Name | fl, it correctly displayed the name, even its non-CP437 character. When I called PowerShell from Python non-ASCII characters were converted to the closest ASCII characters (for example, ā to a, ž to z) and .decode(ascii) worked nicely. Could this behaviour (and correspondingly solution) be Windows version dependent? I am on Windows 10, but users could be on older Windows down to Windows 7.
The output character encoding may depend on specific commands e.g.:
#!/usr/bin/env python3
import subprocess
import sys
encoding = 'utf-32'
cmd = r'''$env:PYTHONIOENCODING = "%s"; py -3 -c "print('\u270c')"''' % encoding
data = subprocess.check_output(["powershell", "-C", cmd])
print(sys.stdout.encoding)
print(data)
print(ascii(data.decode(encoding)))
Output
cp437
b"\xff\xfe\x00\x00\x0c'\x00\x00\r\x00\x00\x00\n\x00\x00\x00"
'\u270c\r\n'
The ✌ (U+270C) character is received successfully.
The character encoding of the child script is set using PYTHONIOENCODING envvar inside the PowerShell session. I've chosen utf-32 for the output encoding so that it would be different from Windows ANSI and OEM code pages for the demonstration.
Notice that the stdout encoding of the parent Python script is OEM code page (cp437 in this case) -- the script is run from the Windows console. If you redirect the output of the parent Python script to a file/pipe then ANSI code page (e.g., cp1252) is used by default in Python 3.
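As a quick check (a hedged aside, not part of the original answer), you can compare the two code pages from Python 3 on Windows; the values shown are examples:
>>> import sys, locale
>>> sys.stdout.encoding            # OEM code page when run in a console
'cp437'
>>> locale.getpreferredencoding()  # ANSI code page, used when redirected
'cp1252'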
To decode PowerShell output that might contain characters undecodable in the current OEM code page, you could set [Console]::OutputEncoding temporarily (inspired by @eryksun's comments):
#!/usr/bin/env python3
import io
import sys
from subprocess import Popen, PIPE

char = ord('✌')
filename = 'U+{char:04x}.txt'.format(**vars())
with Popen(["powershell", "-C", '''
  $old = [Console]::OutputEncoding
  [Console]::OutputEncoding = [Text.Encoding]::UTF8
  echo $([char]0x{char:04x}) | fl
  echo $([char]0x{char:04x}) | tee {filename}
  [Console]::OutputEncoding = $old'''.format(**vars())],
           stdout=PIPE) as process:
    print(sys.stdout.encoding)
    for line in io.TextIOWrapper(process.stdout, encoding='utf-8-sig'):
        print(ascii(line))
print(ascii(open(filename, encoding='utf-16').read()))
Output
cp437
'\u270c\n'
'\u270c\n'
'\u270c\n'
Both fl and tee use [Console]::OutputEncoding for stdout (the default behavior is as if | Write-Output were appended to the pipelines). tee uses UTF-16 to save the text to a file. The output shows that ✌ (U+270C) is decoded successfully.
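Applying the same idea to the original Get-NetAdapter question, here is a hedged sketch (Python 3; the parsing is illustrative, not from the original answer) that forces UTF-8 output inside the PowerShell session so the adapter names decode unambiguously regardless of the OEM code page:
import subprocess

def get_interfaces():
    cmd = ('[Console]::OutputEncoding = New-Object System.Text.UTF8Encoding; '
           'Get-NetAdapter | select Name | fl')
    out = subprocess.check_output(['powershell', '-C', cmd], timeout=10)
    interfaces = []
    for line in out.decode('utf-8-sig').splitlines():
        name, sep, value = line.partition(':')
        if sep and name.strip() == 'Name':
            interfaces.append(value.strip())
    return interfaces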
$OutputEncoding is used to decode bytes in the middle of a pipeline:
#!/usr/bin/env python3
import subprocess
cmd = r'''
$OutputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding
py -3 -c "import os; os.write(1, '\U0001f60a'.encode('utf-8')+b'\n')" |
py -3 -c "import os; print(os.read(0, 512))"
'''
subprocess.check_call(["powershell", "-C", cmd])
Output
b'\xf0\x9f\x98\x8a\r\n'
That is correct: b'\xf0\x9f\x98\x8a'.decode('utf-8') == u'\U0001f60a'. With the default $OutputEncoding (ASCII) we would get b'????\r\n' instead.
Note:
b'\n' is replaced with b'\r\n' despite using binary APIs such as os.read/os.write (msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY) has no effect here)
b'\r\n' is appended if there is no newline in the output:
#!/usr/bin/env python3
from subprocess import check_output
cmd = '''py -3 -c "print('no newline in the input', end='')"'''
cat = '''py -3 -c "import os; os.write(1, os.read(0, 512))"''' # pass as is
piped = check_output(['powershell', '-C', '{cmd} | {cat}'.format(**vars())])
no_pipe = check_output(['powershell', '-C', '{cmd}'.format(**vars())])
print('piped: {piped}\nno pipe: {no_pipe}'.format(**vars()))
Output:
piped: b'no newline in the input\r\n'
no pipe: b'no newline in the input'
The newline is appended to the piped output.
If we ignore lone surrogates, then setting UTF8Encoding allows all Unicode characters, including non-BMP characters, to be passed through pipes. Text mode could be used in Python if $env:PYTHONIOENCODING = "utf-8:ignore" is configured.
In an interactive PowerShell session, running Get-NetAdapter | select Name | fl correctly displayed the name, even its non-CP437 character.
If stdout is not redirected then the Unicode API is used to print characters to the console -- any [BMP] Unicode character can be displayed if the console (TrueType) font supports it.
When I called PowerShell from Python, non-ASCII characters were converted to the closest ASCII characters (e.g., ā to a, ž to z) and .decode('ascii') worked nicely.
It might be due to System.Text.InternalDecoderBestFitFallback set for [Console]::OutputEncoding -- if a Unicode character can't be encoded in a given encoding then it is passed to the fallback (either a best fit char or '?' is used instead of the original character).
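For comparison (a hedged aside; Python's codecs don't implement .NET's best-fit tables), Python's own 'replace' error handler degrades such characters to '?' instead of a look-alike:
>>> 'āž'.encode('cp437', errors='replace')
b'??'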
Could this behavior (and correspondingly solution) be Windows version dependent? I am on Windows 10, but users could be on older Windows down to Windows 7.
If we ignore bugs in cp65001 and the list of new encodings supported in later versions, then the behavior should be the same.
It's a Python 2 bug already marked as wontfix: https://bugs.python.org/issue19264
You must use Python 3 if you want to make it work under Windows.

Multibyte character issue with .match?

The following code is something I am beginning to test for use within a "Texas Hold Em" style game I am working on.
My question is: why, when running the following code, does the puts involving a "♥" print a "\u" in its place? I feel certain this multibyte character is causing the issue, because in the second puts I replaced the ♦ with a d in the array of strings and it returned what I was expecting. See below:
My Code:
#! /usr/bin/env ruby
# encoding: utf-8
table_cards = ["|2♥|", "|8♥|", "|6d|", "|6♣|", "|Q♠|"]
# Array of cards
player_1_face_1 = "8"
player_1_suit_1 = "♦"
# Player 1's face and suit of first card he has
player_1_face_2 = "6"
player_1_suit_2 = "♥"
# Player 1's face and suit of second card he has
test_str_1 = /(\D8\D{2})/.match(table_cards.to_s)
# EX: Searching for match between face values on (player 1's |8♦|) and the |8♥| on the table
test_str_2 = /(\D6\D{2})/.match(table_cards.to_s)
# EX: Searching for match between face values on (player 1's |6♥|) and the |6d| on the table
puts "#{test_str_1}"
puts "#{test_str_2}"
Puts to Screen:
|8\u
|6d|
-- My goal would be to get the first puts to return: |8♥|
I am not so much looking for a solution to this (there may not even be one) but rather an "as simple as possible" explanation of what is causing this issue and why. Thanks ahead of time for any information on what is happening here and how I can tackle the goal.
The "\u" you're seeing is the Unicode string indicator.
For example, Unicode character 'HEAVY BLACK HEART' (U+2764) can be printed as "\u2764".
A friendly Unicode character listing site is http://unicode-table.com/en/sets/
Are you able to launch interactive Ruby in your shell and print a heart like this?
irb
irb> puts "\u2764"
❤
When I run your code in my Ruby, I get the answer you expect:
test_str_1 = /(\D8\D{2})/.match(table_cards.to_s)
=> #<MatchData "|8♥|" 1:"|8♥|">
What happens if you try a regex that is more specific to your cards?
test_str_1 = /(\|8[♥♦♣♠]\|)/.match(table_cards.to_s)
In your example output, you're not seeing the Unicode heart symbol as you want. Instead, your output prints the "\u", which starts the Unicode escape, but not the rest of the expected sequence, such as "2764".
See the comment by the Tin Man that describes encoding for your console. If he's correct, then I expect the more-specific regex will succeed, but still print the wrong output.
See the comment by David Knipe that says it looks like it gets truncated because the regex only matches 4 characters. If he's correct, then I expect the more-specific regex will succeed and also print the right output.
(The rest of this answer is typical for Unix; if you're on Windows, ignore the rest here...)
To show your system language settings, try this in your shell:
echo $LC_ALL
echo $LC_CTYPE
If they are not "UTF-8" or something like that, try this in your shell:
export LC_ALL=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
Then re-run your code -- be sure to use the same shell.
If this works, and you want to make this permanent, one way is to add these here:
# /etc/environment
LC_ALL=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
Then source that file from your .bashrc or .zshrc or whatever shell startup file you use.

Replacing "\xe9" character from a unicode string in Python 3

Using SublimeText 2.0.2 with Python 3.4.2, I fetch a web page with urllib:
response = urllib.request.urlopen(req)
pagehtml = response.read()
Print => qualit\xe9">\r\n\t\t<META HTTP
I get a "\xe9" character within the unicode string!
The header of the pagehtml tells me it's encoded in ISO-8859-1
(Content-Type: text/html;charset=ISO-8859-1). But if I decode it with ISO-8859-1 then encode it in utf-8, it only gets worse...
resultat = pagehtml.decode('ISO-8859-1').encode('utf-8')
Print => qualit\xc3\xa9">\r\n\t\t<META HTTP
How can I replace all the "\xe9"... characters with their corresponding letters ("é"...)?
Edit 1
I'm getting a UnicodeEncodeError (that's why I was encoding in 'utf-8')!
I should mention I'm running my code within SublimeText 2.0.2. It seems to be my problem.
Edit 2
It is working fine in IDLE (Python 3.4.2) and in the OS X terminal (Python 2.5) but doesn't work in SublimeText 2.0.2 (with Python 3.4.2)... => That seems to be a problem with the SublimeText console (output window) and not with my code.
I'm going to look at the PYTHONIOENCODING environment variable as suggested by J.F. Sebastian.
It seems I should be able to set it in the sublime-build file.
Edit 3 - Solution
I just added "env": {"PYTHONIOENCODING": "UTF-8"} in the sublime-build file.
Done. Thanks everyone ;-)
The response is an encoded byte string. Just decode it:
>>> pagehtml = b'qualit\xe9'
>>> print(pagehtml)
b'qualit\xe9'
>>> print(pagehtml.decode('ISO-8859-1'))
qualité
I am pretty sure you do not actually have a problem, except for understanding bytes versus unicode. Things are working as they should. pagehtml is encoded bytes. (I confirmed this with req = 'http://python.org' in your first line.) When bytes are displayed, those which can be interpreted as printable ascii encodings are printed as such and other bytes are printed with hex escapes. b'\xe9' is the hex escape encoding of the single-byte ISO-8859-1 encoding of é and b'\xc3\xa9' is the hex escape encoding of its double-byte utf-8 encoding.
>>> b = b"qualit\xe9"
>>> u = b.decode('ISO-8859-1')
>>> u
'qualité'
>>> b2 = u.encode()
>>> b2
b'qualit\xc3\xa9'
>>> len(b) == 7 and len(b2) == 8
True
>>> b[6]
233
>>> b2[6], b2[7]
(195, 169)
So pageuni = pagehtml.decode('ISO-8859-1') gives you the page as unicode. This decoding does the replacing that you asked for.
I'm getting a UnicodeEncodeError (that's why I was encoding in 'utf-8')! I should mention I'm running my code within SublimeText. It seems to be my problem. Any solution?
Don't encode manually; print Unicode strings instead.
For Unix
Set PYTHONIOENCODING=utf-8 if the output is redirected or if locale (LANGUAGE, LC_ALL, LC_CTYPE, LANG) is not configured (it defaults to C (ascii)).
For Windows
If the content can be represented using the console codepage then set PYTHONIOENCODING=your_console_cp envvar e.g., PYTHONIOENCODING=cp1252 (set it to cp1252 only if it is indeed the encoding that your console uses, run chcp to check). Or use whatever encoding SublimeText can show correctly if it doesn't open a console window to run Python scripts.
You don't need to set the PYTHONIOENCODING envvar if you run your script from the command line directly, unless the output is redirected.
Otherwise (to support characters that can't be represented in the console encoding), install the win_unicode_console package and either run your script using python3 -mrun your_script.py or put this at the top of your script:
import win_unicode_console
win_unicode_console.enable()
It uses Win32 API such as WriteConsoleW() to print to the console. You still need to configure correct fonts to see arbitrary Unicode text in the console.
