I am told that I can add images to a label, but when I run the following code I get an error message:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
My code is as simple as possible:
from tkinter import *
root = Tk()
x = PhotoImage(file="C:\Users\user\Pictures\bee.gif")
w1 = Label(root, image=x).pack()
root.mainloop()
All the examples I've seen don't include the file path to the image but in that case Python can't find the image.
What am I doing wrong?
Python is treating \Users as the start of a Unicode escape because of the leading \U. A \U escape must be followed by exactly eight hexadecimal digits (\UXXXXXXXX); the characters that follow here ("sers...") are not hexadecimal digits, so you get the error.
You can use forward slashes ("C:/Users/user/Pictures/bee.gif"), a raw string (r"C:\Users\user\Pictures\bee.gif"), or escaped backslashes ("C:\\Users\\user\\Pictures\\bee.gif").
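As a quick check of the three spellings, here is a minimal sketch that compares them. The path is the one from the question; `PureWindowsPath` is used only so that the comparison also runs on non-Windows machines.

```python
from pathlib import PureWindowsPath

# Three ways to write the same Windows path without triggering the \U escape.
forward = "C:/Users/user/Pictures/bee.gif"
raw = r"C:\Users\user\Pictures\bee.gif"
escaped = "C:\\Users\\user\\Pictures\\bee.gif"

# The raw and the escaped literal produce the identical string.
same_string = (raw == escaped)

# The forward-slash version is a different string, but it names the same path.
same_path = (PureWindowsPath(forward) == PureWindowsPath(raw))

print(same_string, same_path)
```

Any of the three strings should be usable as the file argument of PhotoImage.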
When I open a certain file in Visual Studio Code on Windows 10, then edit and save it, VS Code seems to replace certain characters with other characters, so that some text in the saved file looks corrupted, as shown below. The default character encoding used in VS Code is UTF-8.
Non-corrupted string before saving the file: “Diff Clang Compiler Log Files”
Corrupted string after saving the file:
�Diff Clang Compiler Log Files�
So, for example, the double quotation mark character ", which in the original file is represented by the byte string 0xE2 0x80 0x9C, is converted into 0xEF 0xBF 0xBD upon saving the file. I do not fully understand what the root cause is, but I do have the following assumption:
The original file is saved using the Windows-1252 encoding (I am using a Win 10 machine with a German keyboard)
VSC incorrectly interprets the file as UTF-8
Character codes get converted from Windows-1252 into UTF-8 once the file is saved, thus 0xE2 0x80 0x9C becomes 0xEF 0xBF 0xBD.
Is my understanding correct?
Can I somehow detect (through PowerShell or Python code) whether a file uses Windows-1252 or UTF-8 encoding? Or is there no definite way to determine that? I would really be glad to find a way to avoid corrupting my files in the future :-).
Thank you!
The encoding of the file can be guessed with the help of the python-magic module (a wrapper around libmagic). The result is a heuristic guess, not a guarantee:
import magic

FILE_PATH = 'C:\\myPath'

def getFileEncoding(filePath):
    # Read the raw bytes; libmagic inspects the content, not the file name.
    with open(filePath, 'rb') as f:
        blob = f.read()
    m = magic.Magic(mime_encoding=True)
    return m.from_buffer(blob)

fileEncoding = getFileEncoding(FILE_PATH)
print(f"File Encoding: {fileEncoding}")
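If installing python-magic is not an option, a rough check is possible with the standard library alone. The sketch below is a heuristic under stated assumptions, not a definitive detector: it assumes the file is either UTF-8 or Windows-1252, and the helper name `guess_encoding` is made up for this example. It relies on the fact that Windows-1252 stores the curly quotes as the single bytes 0x93 and 0x94, which are not valid UTF-8 on their own. That is also why a program that decodes them as UTF-8 ends up writing U+FFFD (bytes 0xEF 0xBF 0xBD) in their place.

```python
def guess_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'windows-1252' for the given bytes (a guess)."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8; under the stated assumption it must be Windows-1252.
        return "windows-1252"
    return "ascii" if text.isascii() else "utf-8"


# The quoted string from the question, once in each encoding.
original = "\u201cDiff Clang Compiler Log Files\u201d"
as_cp1252 = original.encode("cp1252")  # quotes become the single bytes 0x93 / 0x94
as_utf8 = original.encode("utf-8")     # quotes become 0xE2 0x80 0x9C / 0xE2 0x80 0x9D

# What happens when Windows-1252 bytes are decoded as UTF-8 and saved again:
corrupted = as_cp1252.decode("utf-8", errors="replace").encode("utf-8")

print(guess_encoding(as_cp1252), guess_encoding(as_utf8))
print(corrupted[:3])
```

For a real file, pass the result of reading it in binary mode ('rb') to `guess_encoding`. Pure ASCII content is valid in both encodings, so in that case there is nothing to tell apart.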
I have a UTF-8 file which I convert to ISO-8859-1 before sending it to a consuming system that does not understand UTF-8. Our current issue is that when we run iconv on the UTF-8 file, some characters get converted to '?'. So far, we have been providing a fix for each failing character as it comes up.
Is it possible to create a file which contains all possible UTF-8 characters? The intent is to downgrade them using iconv and identify the characters that get replaced with '?'.
Rather than looking at every possible Unicode character (over 140k of them), I recommend performing an iconv substitution and then seeing where your actual problems are. For example:
iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="<U+%04X>"
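This will convert characters that aren't in ISO-8859-1 to a "<U+####>" syntax. You can then search your output for these. (Note that --unicode-subst is an option of the iconv program from GNU libiconv; the iconv that ships with glibc does not accept it.)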
If your data will be read by something that handles C-style escapes (\u####), you can also use:
iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="\\u%04x"
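Where an iconv with --unicode-subst is not at hand, the same idea can be sketched in Python. This is only an approximation of what iconv does, under the assumption that the text has already been read as a Python str; the handler name `unicode_subst` and the two helper functions are invented for this example.

```python
import codecs


def _unicode_subst(err):
    """Error handler: replace unencodable characters with <U+XXXX> markers."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(f"<U+{ord(c):04X}>" for c in bad), err.end


codecs.register_error("unicode_subst", _unicode_subst)


def to_latin1_marked(text: str) -> bytes:
    """Encode to ISO-8859-1, marking every character that does not fit."""
    return text.encode("latin-1", errors="unicode_subst")


def unencodable(text: str) -> list:
    """List the distinct characters that ISO-8859-1 cannot represent."""
    return sorted({c for c in text if ord(c) > 0xFF})


sample = "caf\u00e9 \u20ac5 \u0153uvre"
print(to_latin1_marked(sample))
print([f"U+{ord(c):04X}" for c in unencodable(sample)])
```

Running `unencodable` over real data gives the list of characters that actually need a mapping, which is usually far shorter than the whole of Unicode. For C-style escapes, Python's built-in "backslashreplace" error handler can be used instead of the custom one.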
An exhaustive list of all Unicode characters seems rather impractical for this use case. There are tens of thousands of characters in non-Latin scripts which don't have any obvious near-equivalent in Latin-1.
Instead, look for a mapping from Latin characters which are not in Latin-1 to corresponding homographs or near-equivalents.
Some programming languages have existing libraries for this; a common and simple transformation is to attempt to strip any accents from characters which cannot be represented in Latin-1, and use the unaccented variant if this works. (You'll want to keep the accent for any character which can be normalized to Latin-1, though. Maybe also read about Unicode normalization.)
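Here's a quick and dirty Python attempt.
import sys
from unicodedata import normalize

def latinize(string):
    """
    Map string to Latin-1 bytes, replacing characters which can be approximated
    """
    result = []
    for char in string:
        try:
            # Keep the character if its NFKC form exists in Latin-1.
            byte = normalize("NFKC", char).encode('latin-1')
        except UnicodeEncodeError:
            # Otherwise decompose it and drop everything that is not ASCII
            # (e.g. combining accents); characters with no ASCII base vanish.
            byte = normalize("NFKD", char).encode('ascii', 'ignore')
        result.append(byte)
    return b''.join(result)

def convert(fh):
    for line in fh:
        # latinize() returns bytes, so write them to the binary stdout
        # instead of printing their repr.
        sys.stdout.buffer.write(latinize(line))

def main():
    if len(sys.argv) > 1:
        for filename in sys.argv[1:]:
            with open(filename, 'r', encoding='utf-8') as fh:
                convert(fh)
    else:
        convert(sys.stdin)

if __name__ == '__main__':
    main()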
Demo: https://ideone.com/sOEBW9
Using SublimeText 2.0.2 with Python 3.4.2, I fetch a webpage with urllib:
response = urllib.request.urlopen(req)
pagehtml = response.read()
Print => qualit\xe9">\r\n\t\t<META HTTP
I get a "\xe9" character within the unicode string!
The header of the page tells me it's encoded in ISO-8859-1
(Content-Type: text/html;charset=ISO-8859-1). But if I decode it with ISO-8859-1 and then encode it in utf-8, it only gets worse...
resultat = pagehtml.decode('ISO-8859-1').encode('utf-8')
Print => qualit\xc3\xa9">\r\n\t\t<META HTTP
How can I replace all the "\xe9"... characters with their corresponding letters ("é"...)?
Edit 1
I'm getting a UnicodeEncodeError (that's why I was encoding in 'utf-8')!
I should mention I'm running my code within SublimeText 2.0.2. That seems to be my problem.
Edit 2
It works fine in IDLE (Python 3.4.2) and in the OS X terminal (Python 2.5) but doesn't work in SublimeText 2.0.2 (with Python 3.4.2)... => That seems to be a problem with the SublimeText console (output window) and not with my code.
I'm gonna look at the PYTHONIOENCODING env variable, as suggested by J.F. Sebastian.
It seems I should be able to set it in the sublime-build file.
Edit 3 - Solution
I just added "env": {"PYTHONIOENCODING": "UTF-8"} in the sublime-build file.
Done. Thanks everyone ;-)
The response is an encoded byte string. Just decode it:
>>> pagehtml = b'qualit\xe9'
>>> print(pagehtml)
b'qualit\xe9'
>>> print(pagehtml.decode('ISO-8859-1'))
qualité
I am pretty sure you do not actually have a problem, except for understanding bytes versus unicode. Things are working as they should. pagehtml is encoded bytes. (I confirmed this with req = 'http://python.org' in your first line.) When bytes are displayed, those which can be interpreted as printable ascii encodings are printed as such and other bytes are printed with hex escapes. b'\xe9' is the hex escape encoding of the single-byte ISO-8859-1 encoding of é and b'\xc3\xa9' is the hex escape encoding of its double-byte utf-8 encoding.
>>> b = b"qualit\xe9"
>>> u = b.decode('ISO-8859-1')
>>> u
'qualité'
>>> b2 = u.encode()
>>> b2
b'qualit\xc3\xa9'
>>> len(b) == 7 and len(b2) == 8
True
>>> b[6]
233
>>> b2[6], b2[7]
(195, 169)
So pageuni = pagehtml.decode('ISO-8859-1') gives you the page as unicode. This decoding does the replacing that you asked for.
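Rather than hard-coding 'ISO-8859-1', the charset can be taken from the Content-Type header. The following is a minimal sketch of that idea; the helper name `decode_body` and the UTF-8 fallback are assumptions made for this example, and the header is parsed with the standard library's email.message.Message so that the snippet runs without a network connection.

```python
from email.message import Message


def decode_body(body: bytes, content_type: str, default: str = "utf-8") -> str:
    """Decode body using the charset named in a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when the header names no charset.
    charset = msg.get_content_charset() or default
    return body.decode(charset)


page = decode_body(b"qualit\xe9", "text/html;charset=ISO-8859-1")
print(page)
```

With a live urllib response, `response.headers.get_content_charset()` should give the same charset directly, since the response headers are an email.message.Message as well.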
I'm getting a UnicodeEncodeError (that's why I was encoding in 'utf-8')! I should mention I'm running my code within SublimeText. That seems to be my problem. Any solution?
Don't encode manually; print Unicode strings instead.
For Unix
Set PYTHONIOENCODING=utf-8 if the output is redirected or if locale (LANGUAGE, LC_ALL, LC_CTYPE, LANG) is not configured (it defaults to C (ascii)).
For Windows
If the content can be represented using the console codepage then set PYTHONIOENCODING=your_console_cp envvar e.g., PYTHONIOENCODING=cp1252 (set it to cp1252 only if it is indeed the encoding that your console uses, run chcp to check). Or use whatever encoding SublimeText can show correctly if it doesn't open a console window to run Python scripts.
You don't need to set the PYTHONIOENCODING envvar if you run your script from the command line directly, unless the output is redirected.
Otherwise (to support characters that can't be represented in the console encoding), install the win_unicode_console package and either run your script using python3 -m run your_script.py or put at the top of your script:
import win_unicode_console
win_unicode_console.enable()
It uses Win32 API such as WriteConsoleW() to print to the console. You still need to configure correct fonts to see arbitrary Unicode text in the console.
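To see what PYTHONIOENCODING actually changes, the sketch below starts a child interpreter with the variable set and reports the encoding that child uses for its standard output. The function name `child_stdout_encoding` is made up for this example, and `capture_output` assumes Python 3.7 or later.

```python
import os
import subprocess
import sys


def child_stdout_encoding(ioencoding: str) -> str:
    """Start a child Python with PYTHONIOENCODING set and return its stdout encoding."""
    env = dict(os.environ, PYTHONIOENCODING=ioencoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        capture_output=True,  # the child's output is a pipe, i.e. redirected
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(child_stdout_encoding("utf-8"))
```

Because the child's output is captured through a pipe, this is the "output is redirected" case described above, where the variable matters most.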
I'm trying to get familiar with Python's standard library and doing some mucking around with it on my Windows machine. Using Python 2.7, I have the following little script, which is intended to look in a directory and rename all of the files therein after removing numerals from the file names. I'm getting a TypeError that says "must be encoded string without NULL bytes, not str".
It calls out lines 5 and 18, noted below, where I'm using os.path.exists.
Any help would be greatly appreciated!
import os, re, string, glob
path = os.path.normpath('C:\Users\me\Photo Projects\Project Name\Project Photos\Modified\0-PyTest')
if os.path.exists(path):  # line 5 in the traceback
    print "path exists at " + path
    for file in glob.glob(os.path.join(path, '*.jpg')):
        new_path = os.path.join(os.path.dirname(file), re.sub('\d', '', os.path.basename(file)))
        if not os.path.exists(new_path):  # line 18 in the traceback
            os.rename(file, new_path)
"...Photos\Modified\0-PyTest"
It's taking the \0 as a null character. You have to escape each \ as \\, or just put an r before the string to make it raw:
r'C:\Users\me\Photo Projects\Project Name\Project Photos\Modified\0-PyTest'
It turns out to be the single backslash problem.
I thought os.path.normpath would format the path as required by the OS.
If you are giving a file path, just add r before it:
(r'E:\Images\1.png')
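The difference between the plain and the raw literal can be checked directly. This is a small sketch that uses only the tail of the path from the question, so that it runs unchanged on Python 3 as well:

```python
# An ordinary literal: \0 is an octal escape that becomes the NUL character.
plain = 'Modified\0-PyTest'
# A raw literal: the backslash and the zero are kept as two characters.
raw = r'Modified\0-PyTest'
# Escaping the backslash by hand gives the same result as the raw literal.
escaped = 'Modified\\0-PyTest'

print(repr(plain))
print(repr(raw))
print('\x00' in plain, '\x00' in raw, raw == escaped)  # True False True
```

os.path.normpath only normalizes separators and dot segments; it receives the string after the escape has already been turned into a NUL character, so it cannot undo it.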
I am attempting to write a line of code that will take a line of Japanese text and delete a certain set of characters. However, I am having trouble using Unicode characters inside the regular expression.
I am currently using text.gsub(/《.*?》/u, '') but I get the error:
'gsub': invalid byte sequence in Windows-31J (ArgumentError)
Can anyone tell me what I am doing incorrectly?
Example text : その仕草《しぐさ》があまりに無造作《むぞうさ》だったので
Expected result: その仕草があまりに無造作だったので
Thanks
edit: # encoding: utf-8 is present at the top of the script.
Try this:
text.encode('utf-8', 'utf-8').gsub(/《.*?》/u, '')
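For comparison only, here is the same non-greedy substitution sketched in Python rather than Ruby, using the example text from the question. It illustrates the pattern, not the Ruby encoding fix, and assumes the text is already a correctly decoded str.

```python
import re

text = "その仕草《しぐさ》があまりに無造作《むぞうさ》だったので"

# 《.*?》 matches each bracketed reading separately because .*? is non-greedy.
cleaned = re.sub(r"《.*?》", "", text)

print(cleaned)
```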