MBCS to UTF-8: How to encode in Python - Windows

I am trying to create a duplicate file finder for Windows. My program works well on Linux, but it writes NUL characters to the log file on Windows. This is due to the default MBCS file system encoding on Windows, while the file system encoding on Linux is UTF-8. How can I convert MBCS to UTF-8 to avoid this error?

Tell Python to use UTF-8 for the log file. In Python 3 you do this with:
open(..., encoding='utf-8')
If you want to convert an MBCS-encoded byte string to UTF-8, you can switch encodings:
filename.decode('mbcs').encode('utf-8')
Use filename.decode(sys.getfilesystemencoding()).encode('utf-8') to make the code work on Linux as well.
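A minimal sketch putting both fixes together (the log filename and the sample string are assumptions, not from the original program):
import sys
# Open the log with an explicit encoding so Windows does not fall back
# to its ANSI/MBCS code page ('duplicates.log' is a hypothetical name).
with open('duplicates.log', 'w', encoding='utf-8') as log:
    log.write('œuvre.txt\n')
# Portable byte-level conversion: decode using the platform's filesystem
# encoding, then re-encode as UTF-8.
raw = 'œuvre.txt'.encode(sys.getfilesystemencoding())
utf8_bytes = raw.decode(sys.getfilesystemencoding()).encode('utf-8')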

Alternatively, just change the encoding to 'latin-1' (encoding='latin-1').
Using pure Python:
open(..., encoding='latin-1')
Using Pandas:
pd.read_csv(..., encoding='latin-1')

Related

Lua: How to print a Latin1 string with io.write()?

In Lua 5.4, I tried to print some strings in Latin1 encoding with io.write(), but some characters (à, é...) are not printed correctly.
How can I do this?
Here is a screenshot of the failed print with win-125x.lua
I guess you are running Lua on Windows.
Because you are converting Latin1 characters to UTF-8, you should set the Windows console code page to UTF-8 before running your Lua script, with the following command:
chcp 65001
Another option is to save your script with UTF-8 encoding, which removes the need to convert strings from cp1252 to UTF-8, and to use the chcp command before running your script.
Remember that standard Lua has no concept of string encoding, and that Windows support for UTF-8 characters in the console is incomplete; hence this kind of problem.
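Since standard Lua has no built-in re-encoding facility, here is what that byte-level cp1252-to-UTF-8 conversion looks like, sketched in Python for illustration:
# Bytes as stored in a cp1252 (Latin1-style) source file:
latin1_bytes = 'àé'.encode('cp1252')
# Re-encode as UTF-8, which is what the console expects after chcp 65001:
utf8_bytes = latin1_bytes.decode('cp1252').encode('utf-8')
print(utf8_bytes)  # b'\xc3\xa0\xc3\xa9'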
Check that related question too: Problem with accents while copying files in LUA
If you have the utf8 table you can do...
> do io.write(utf8.char(8364):rep(3)..'\n'):flush() end
€€€
To get the code point you can do...
> do io.write(utf8.codepoint('€')..'\n'):flush() end
8364
But I am not sure whether that works on Windows; I am on Linux.

Python 3 not aware of Windows filename encodings?

The following code works well on Win7 until it crashes at the last print(f). It does so when it encounters some "exotic" characters in the filenames, such as the French œ in œuvre or the Č in Karel Čapek. The program crashes with an encoding error, saying character x in the filename isn't a valid UTF-8 char.
Shouldn't Python 3 be aware of the UTF-16 encoding of Windows 7 paths?
How should I modify my code?
import os

rootDir = '.'
extensions = ['mobi', 'lit', 'prc', 'azw', 'rtf', 'odt', 'lrf', 'fb2', 'azw3']
files = []
for dirName, subdirList, fileList in os.walk(rootDir):
    files.extend(os.path.join(dirName, fn) for fn in fileList
                 if any(fn.endswith(ext) for ext in extensions))
for f in files:
    print(f)
eryksun answered my question in a comment. I copy his answer here so the thread doesn't stand as unanswered. The win-unicode-console module solved the problem:
Python 3's raw FileIO class forces binary mode, which precludes using
a UTF-16 text mode for the Windows console. Thus the default setup is
limited to using an OEM/ANSI codepage. To avoid raising an exception,
you'd have to use a less-strict 'replace' or 'backslashreplace' mode
for sys.stdout. Switching to codepage 65001 (UTF-8) seems like it
should be the answer, but the console host (conhost.exe) has problems
with multibyte encodings. That leaves the UTF-16 wide-character API,
such as via the win-unicode-console module.
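A minimal sketch of the less-strict fallback mentioned in the quote (assumes Python 3.7+ for reconfigure(); on older versions you would rewrap sys.stdout.buffer instead):
import sys
# Escape characters the console code page cannot represent instead of
# raising UnicodeEncodeError.
sys.stdout.reconfigure(errors='backslashreplace')
print('Karel Čapek, œuvre')  # prints escapes like \u010c on a cp437 console
The win-unicode-console route instead switches the console to the wide-character API, enabled with win_unicode_console.enable().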

File encoding using Ruby in Windows

I have two files in a Windows folder. Using the technique described here, I found out that one file's encoding is ANSI and the other's is UTF-8.
However, if I open cmd or PowerShell and try to get the encoding in IRB with the following code, I always get "CP850":
File.open(file_name).read.encoding.name # => CP850
or
File.open(file_name).external_encoding.name # => CP850
Notepad++ also tells me that one file is ANSI and the other is UTF-8.
How can I get the proper encoding using Ruby on Windows?
It is impossible to tell for certain what encoding a file is in, but it is possible to make an educated guess.
When you open a file, Ruby simply assumes it is encoded with the default 8-bit encoding (in your case CP850).
See Detect encoding
and What is ANSI format? about ANSI
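One common form of that educated guess, sketched in Python for illustration (the cp850 fallback is an assumption based on the question): valid UTF-8 rarely occurs by accident, so try a strict UTF-8 decode first and fall back to the 8-bit code page.
def guess_encoding(path, fallback='cp850'):
    # Text that decodes as strict UTF-8 is almost certainly UTF-8;
    # anything else is assumed to be in the 8-bit fallback encoding.
    data = open(path, 'rb').read()
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return fallback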

Encode a movie with Unicode filename in Windows using Popen

I want to encode a movie through IO.popen with Ruby (1.9.3) on Windows 7.
If the file name contains only ASCII characters, encoding proceeds normally, but with a Unicode filename the script returns a "No such file or directory" error, as in the following code.
# -*- encoding: utf-8 -*-
command = "ffmpeg -i ü.rm"
IO.popen(command) { |pipe|
  pipe.each { |line|
    p line
  }
}
I couldn't work out whether the problem is caused by ffmpeg or by Ruby.
How can I fix this problem?
Windows doesn't use UTF-8 encoding. Ruby sends the byte sequence of the Unicode filename to the file system directly, and of course the file system won't recognize the UTF-8 sequence. It seems newer versions of Ruby have fixed this issue. (I'm not sure; I'm using 1.9.2p290 and it's still there.)
You need to convert the UTF-8 filename to the encoding your Windows uses.
# coding: utf-8
code_page = "cp#{`chcp`.chomp[/\d+$/]}" # detect the console code page automatically
command = "ffmpeg -i ü.rm".encode(code_page)
IO.popen(command) do |pipe|
  pipe.each do |line|
    p line
  end
end
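The chcp trick is language-independent; here is a rough Python equivalent for comparison (Windows-only, assuming Python 3.7+ for text=True and that the code page is the last number chcp prints):
import re
import subprocess
# Ask cmd.exe for the active console code page, e.g. "Active code page: 936".
out = subprocess.check_output('chcp', shell=True, text=True)
code_page = 'cp' + re.findall(r'\d+', out)[-1]
command = 'ffmpeg -i ü.rm'.encode(code_page)  # bytes in the console's encoding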
Another way is to save your script with the same encoding Windows uses, and don't forget to update the encoding declaration accordingly. For example, I'm using Simplified Chinese Windows, which uses GBK (CP936) as its default encoding:
# coding: GBK
# save this file in GBK
command = "ffmpeg -i ü.rm"
IO.popen(command) do |pipe|
  pipe.each do |line|
    p line
  end
end
BTW, by convention it is suggested to use do...end rather than {...} for multi-line code blocks, except in special cases.
UPDATE:
The underlying file system, NTFS, uses UTF-16 for filename encoding, so 가 is a valid filename character. However, GBK cannot encode 가, and neither can the CP932 of your Japanese Windows, so you cannot send that specific filename to cmd.exe, and it is unlikely you can process that file with IO.popen. For CP932-compatible filenames, the encoding approach above works fine; for filenames that are not compatible with CP932, it might be better to rename the file to something compatible.
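The incompatibility is easy to demonstrate; a Python sketch using the two code pages named above:
# 가 (U+AC00) is representable in UTF-16 filenames, but, per the answer
# above, in neither GBK nor CP932:
for cp in ('gbk', 'cp932'):
    try:
        '가'.encode(cp)
    except UnicodeEncodeError:
        print(cp, 'cannot encode', '가')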

Reading ASCII-encoded files with Ruby 1.9 in a UTF-8 environment

I just upgraded from Ruby 1.8 to 1.9, and most of my text-processing scripts now fail with the error invalid byte sequence in UTF-8. I need to either strip out the invalid characters or specify that Ruby should use ASCII encoding instead (or whatever encoding the C stdio functions write, which is how the files were produced). How would I go about doing either of those things?
Preferably the latter, because (as near as I can tell) there is nothing wrong with the files on disk; if there are weird, invalid characters, they don't appear in my editor...
What's your locale set to in the shell? On Linux-based systems you can check this by running the locale command, and change it with e.g.
$ export LANG=en_US
My guess is that you are using locale settings with UTF-8 encoding, and this causes Ruby to assume the text files were created according to UTF-8 encoding rules. You can see this by trying:
$ LANG=en_GB ruby -e 'warn "foo".encoding.name'
US-ASCII
$ LANG=en_GB.UTF-8 ruby -e 'warn "foo".encoding.name'
UTF-8
For a more general treatment of how string encoding has changed in Ruby 1.9 I thoroughly recommend
http://blog.grayproductions.net/articles/ruby_19s_string
(code examples assume bash or similar shell - C-shell derivatives are different)
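The same mechanism is visible from Python, where the assumed text encoding also follows the locale environment; a minimal sketch for comparison:
import locale
# The locale environment (LANG / LC_ALL) determines the preferred text
# encoding; under en_US.UTF-8 this typically prints 'UTF-8'.
print(locale.getpreferredencoding())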
