The following code works well in Win7 until it crashes in the last print(f). It does it when it finds some "exotic" characters in the filenames, as the french "oe" as in œuvre and the C in Karel Čapek. The program crashes with an Encoding error, saying the character x in the filename is'nt a valid utf-8 char.
Should'nt Python3 be aware of the utf-16 encoding of the Windows7 paths?
How should I modify my code?
import os
rootDir = '.'
extensions = ['mobi','lit','prc','azw','rtf','odt','lrf','fb2','azw3' ]
files=[]
for dirName, subdirList, fileList in os.walk(rootDir):
files.extend((os.path.join(dirName,fn) for fn in fileList if any([fn.endswith(ext) for ext in extensions])))
for f in files:
print(f)
eryksun answered my question in a comment. I copy his answer here so the thread does'nt stand as unanswered, The win-unicode-console module solved the problem:
Python 3's raw FileIO class forces binary mode, which precludes using
a UTF-16 text mode for the Windows console. Thus the default setup is
limited to using an OEM/ANSI codepage. To avoid raising an exception,
you'd have to use a less-strict 'replace' or 'backslashreplace' mode
for sys.stdout. Switching to codepage 65001 (UTF-8) seems like it
should be the answer, but the console host (conhost.exe) has problems
with multibyte encodings. That leaves the UTF-16 wide-character API,
such as via the win-unicode-console module.
Related
In Lua 5.4, I tried to print sone strings in Latin1 encoding with io.write(), but some characters (à,é...) are not well printed,
How could I perform this ?
Here is a screenshot of failed print with win-125x.lua
I guess you are running Lua on Windows.
Because you are converting Latin1 characters to UTF8, you should set the Windows console codepage to UTF8 before running your Lua script, with the following command :
chcp 65001
An other option is to save your script with UTF8 encoding without the need to convert strings from cp1252 to UTF8 and use the chcp command before running your script.
Remember that standard Lua has no concept of string encoding and that Windows support for UTF8 characters in the console is incomplete. Hence this kind of problems.
Check that related question too : Problem with accents while copying files in LUA
If you have the table utf8 you can do...
> do io.write(utf8.char(8364):rep(3)..'\n'):flush() end
€€€
To get the code you can do...
> do io.write(utf8.codepoint('€')..'\n'):flush() end
8364
But i am not sure if that works on windows.
...i am on linux.
I have the file t2ű.cmd on Windows with an accented character in its name, and I'd like to run it from Python 2 code.
Opening the file (open(u't2\u0170.cmd')) works if I pass the filename as a unicode literal, but no str literal works, because \u0170 is not on the code page of Windows. (See this question for more on opening files with accented characters in their name: opening a file with an accented character in its name, in Python 2 on Windows.)
Running the file from the Command Prompt without Python works.
I tried passing an str literal to os.system, os.popen, os.spawnl and subprocess.call (both with and without the shell), but it wasn't able to find the file.
These don't work, they raise UnicodeDecodeError: 'ascii' codec can't encode character u'\u170'...:
os.system(u't2\u170.cmd')
os.popen(u't2\u170.cmd')
os.spawnl(u't2\u170.cmd', u't2')
subprocess.call(u't2\u170.cmd')
subprocess.call(u'"t2\u170.cmd"')
subprocess.call([u't2\u170.cmd'])
In this project it's not feasible to upgrade to Python 3.
It's not feasible to rename the file, because these files can have arbitrary (user-supplied) names on a read-only share, and also the directory name can contain accented characters.
In C I would use any of the wsystem, wpopen or wspawnl functions in <process.h>.
Preferably I'm looking for a solution which works with the standard Python modules (no need to install packages). But I'm interested in any solution.
I need a solution which doesn't open a new window.
Eventually I want to pass command-line arguments to program, and the arguments will contain arbitrary Unicode characters.
This is based on the comment by #eryksun.
We need to call the system call CreateProcessW or the C functions wspawnl, wsystem or wpopen. Python 2 doesn't have anything built in which would call any of these functions. Writing an extension module in C or calling the functions using ctypes could be a solution.
The C functions CreateProcessA, spawnl, system and popen don't work.
As described in the pep 0263, if you want to use unicode characters in a python script, just add a # -*- coding: utf-8 -*- at the beginning of your script (it's ok after the she-bang):
#!/bin/env python
# -*- coding: utf-8 -*-
import os
os.system('t2ű.cmd')
If you still find problems, you may take a look on some packages, like win-unicode-console.
It should work now directly, with no escaping code.
I am trying to create a duplicate file finder for Windows. My program works well in Linux. But it writes NUL characters to the log file in Windows. This is due to the MBCS default file system encoding of Windows, while the file system encoding in Linux is UTF-8. How can I convert MBCS to UTF-8 to avoid this error?
Tell Python to use UTF-8 on the log file. In Python 3 you do this by:
open(..., encoding='utf-8')
If you want to convert an MBCS string to UTF-8 you can switch string encodings:
filename.encode('mbcs').decode('utf-8')
Use filename.encode(sys.getdefaultencoding())... to make the code work on Linux, as well.
Just change the encode to 'latin-1' (encoding='latin-1')
Using pure Python:
open(..., encoding = 'latin-1')
Using Pandas:
pd.read_csv(..., encoding='latin-1')
I have two files in a windows folder. Using the technique described here I found out that one file encoding is ANSI and another one is UTF-8.
However, If I open cmd or Powershell and try to get the encoding in IRB with the following code I get always "CP850":
File.open(file_name).read.encoding.name # => CP850
or
File.open(file_name).external_encoding.name # => CP850
Notepad++ also gives me that one file is ANSI and another is UTF-8.
How can I get the proper encoding using Ruby in Windows?
It is impossible to tell what encoding a file is, but it's possible to make an educated guess.
When you open a file, ruby simply assumes it's encoded with the default 8-bit encoding (in your case CP850).
See Detect encoding
and What is ANSI format? about ANSI
I'm trying to run a file (ruby anyfile.rb in cmd prompt) with the following contents:
# encoding: utf-8
puts 'áá'
happens the following error:
invalid multibyte char (UTF-8)
It seems that Ruby does not understand the magic comment...
EDIT: If I remove the "# encoding: utf-8" and run the command prompt like this:
ruby-E:UTF-8 encoding.rb
then it works - any ideas?
EDIT2: when i run:
ruby -e 'p [Encoding.default_external, Encoding.default_internal]'
i got [#Encoding:CP850, nil], maybe my Encoding.default_external is wrong?!
Environment:
Windows XP (yes, I also hate windows + ruby)
ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
I believe this is a classic case of "if you hear hooves, think horses, not zebras".
The error message is telling you that you have a byte sequence in your file that is not a valid UTF-8 multibyte sequence.
It is definitely possible that
It seems that Ruby does not understand the magic comment...
as you say, and that up until now nobody noticed that magic comments don't actually work because you are the first person in the history of humankind to actually try to use magic comments. (Actually, this is not possible. If Ruby didn't understand magic comments, it would complain about an invalid ASCII character, since ASCII is the default encoding if no magic comment is present.)
Or, there actually is an invalid multibyte UTF-8 sequence in your file.
Which do you think is more likely? If I were you, I would check my file.
I've encountered similar issues from time to time with files that were not saved as UTF-8, even when the magic comment states so.
I've found that Ruby 1.9.2 had issues to properly convert UTF-8 to codepages 850 and 437, the defaults for command prompt on Windows.
I do recommend you upgrade to Ruby 1.9.3 (latest is patchlevel 125) which solves a lot of encoding issues, specially on Windows.
Also, to verify that your saved file do not contain a Unicode BOM (so it is plain UTF) and is properly saved.
To verify that, you can switch the codepage in the console to unicode (chcp 65001) and try type myscript.rb
You should see the accented letters correctly.
Last but no least, ensure your command prompt uses a TrueType font so extended characters are properly displayed.
Hope that helps.
Try
# encoding: iso-8859-1
Not everything that's text is utf8.
Are you sure you selected 'UTF-8' from the Encoding dropdown when you saved the file in Notepad? I've just tried this on an XP machine and your code example worked for me.