I have a script running on Ruby 1.9.1 on Windows 7
I've distilled my script down to
File.open("翻譯測試.txt")
and still can't get it to work. I know there are issues with Ruby 1.9 filename handling on windows (Using the Windows ANSI library), but would be happy enough with a work around that is callable from Ruby
Most of the Unicode changes like file and directory operations have been improved in 1.9.2 (trunk) and other bigger changes will be merged pretty soon.
As bobince pointed out, this was already asked:
Unicode filenames on Windows in Ruby
This should help you
string = "翻譯測試" # by default, string is encoded as "ASCII"
string.force_encoding("SHIFT-JIS") # retags the String as SHIFT-JIS or whatever UTF char set that #is in
Heres a nice read a bit about char encoings in 1.9.1
http://yehudakatz.com/2010/05/17/encodings-unabridged/
Related
In ruby one can freeze all constant strings in a file via two different magic comments at the beginning of a file:
# frozen_string_literal: true
and
# -*- immutable: string -*-
I have no idea what the differences are.
Are there any?
The 1st syntax is the magic comment for Ruby 2.3+ versions to freeze string literals, otherwise you have to use the String method like this:
'hello world!'.freeze
The 2nd syntax is not implemented in Ruby, however it is the way that variables are specified for files in the Emacs text editor.
For example, the following comment in Emacs would declare that the file is a Ruby file and needs Ruby syntax highlighting, and that the variable immutable is set to the value string.
# -*- mode: ruby; immutable: string -*-
After searching around, it looks like that does nothing and is not used by any Ruby syntax highlighting mode.
So you do not need the 2nd syntax.
Digging for anything on the 2nd version, it looks like they had the same intention but the 2nd magic comment syntax does not to appear to have been adopted as of Ruby 2.1.0.
See https://github.com/ruby/ruby/pull/487
The first version # frozen_string_literal: true was adopted in Ruby 2.3.0
I tried the latter version in a few versions of ruby but didn't work. I would guess it should not be used or trusted to work in any version of >= 2.3 but probably no versions support it. In fact, I was not able to find any reference to that version in the open source code on github searching that syntax
https://github.com/ruby/ruby/search?q=immutable%3A+string&unscoped_q=immutable%3A+string
The following code works well in Win7 until it crashes in the last print(f). It does it when it finds some "exotic" characters in the filenames, as the french "oe" as in œuvre and the C in Karel Čapek. The program crashes with an Encoding error, saying the character x in the filename is'nt a valid utf-8 char.
Should'nt Python3 be aware of the utf-16 encoding of the Windows7 paths?
How should I modify my code?
import os
rootDir = '.'
extensions = ['mobi','lit','prc','azw','rtf','odt','lrf','fb2','azw3' ]
files=[]
for dirName, subdirList, fileList in os.walk(rootDir):
files.extend((os.path.join(dirName,fn) for fn in fileList if any([fn.endswith(ext) for ext in extensions])))
for f in files:
print(f)
eryksun answered my question in a comment. I copy his answer here so the thread does'nt stand as unanswered, The win-unicode-console module solved the problem:
Python 3's raw FileIO class forces binary mode, which precludes using
a UTF-16 text mode for the Windows console. Thus the default setup is
limited to using an OEM/ANSI codepage. To avoid raising an exception,
you'd have to use a less-strict 'replace' or 'backslashreplace' mode
for sys.stdout. Switching to codepage 65001 (UTF-8) seems like it
should be the answer, but the console host (conhost.exe) has problems
with multibyte encodings. That leaves the UTF-16 wide-character API,
such as via the win-unicode-console module.
I have recently installed gem Win32Console for my program. The program has Polish “interface”, which includes Polish special characters. Which works fine for every
puts "Ciekawym polskim słowem jest: żółć"
However, using escape characters in order to colorize the test (which works) seem to change the encoding and Windows 7 CMD displays such diacritic marks incorrectly:
green = "\e[1;32;40m"
puts "#{green}Ciekawym polskim słowem jest: żółć"
Honestly, with my limited knowledge of hot Ruby treats different encoding, I don't really even know where to start - is that a problem with Ruby, Win32Console or Command Prompt itself?
Windows console does not support ASCII escape sequence (\e[...) at all. (ANSI escape code - Wikipedia).
Turns out it was the gem I installed. I later found out that Ruby 2.0 and higher has built-in support for escape codes and it works just fine with UTF-8.
I'm trying to run a file (ruby anyfile.rb in cmd prompt) with the following contents:
# encoding: utf-8
puts 'áá'
happens the following error:
invalid multibyte char (UTF-8)
It seems that Ruby does not understand the magic comment...
EDIT: If I remove the "# encoding: utf-8" and run the command prompt like this:
ruby-E:UTF-8 encoding.rb
then it works - any ideas?
EDIT2: when i run:
ruby -e 'p [Encoding.default_external, Encoding.default_internal]'
i got [#Encoding:CP850, nil], maybe my Encoding.default_external is wrong?!
Environment:
Windows XP (yes, I also hate windows + ruby)
ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
I believe this is a classic case of "if you hear hooves, think horses, not zebras".
The error message is telling you that you have a byte sequence in your file that is not a valid UTF-8 multibyte sequence.
It is definitely possible that
It seems that Ruby does not understand the magic comment...
as you say, and that up until now nobody noticed that magic comments don't actually work because you are the first person in the history of humankind to actually try to use magic comments. (Actually, this is not possible. If Ruby didn't understand magic comments, it would complain about an invalid ASCII character, since ASCII is the default encoding if no magic comment is present.)
Or, there actually is an invalid multibyte UTF-8 sequence in your file.
Which do you think is more likely? If I were you, I would check my file.
I've encountered similar issues from time to time with files that were not saved as UTF-8, even when the magic comment states so.
I've found that Ruby 1.9.2 had issues to properly convert UTF-8 to codepages 850 and 437, the defaults for command prompt on Windows.
I do recommend you upgrade to Ruby 1.9.3 (latest is patchlevel 125) which solves a lot of encoding issues, specially on Windows.
Also, to verify that your saved file do not contain a Unicode BOM (so it is plain UTF) and is properly saved.
To verify that, you can switch the codepage in the console to unicode (chcp 65001) and try type myscript.rb
You should see the accented letters correctly.
Last but no least, ensure your command prompt uses a TrueType font so extended characters are properly displayed.
Hope that helps.
Try
# encoding: iso-8859-1
Not everything that's text is utf8.
Are you sure you selected 'UTF-8' from the Encoding dropdown when you saved the file in Notepad? I've just tried this on an XP machine and your code example worked for me.
I just upgraded from Ruby 1.8 to 1.9, and most of my text processing scripts now fail with the error invalid byte sequence in UTF-8. I need to either strip out the invalid characters or specify that Ruby should use ASCII encoding instead (or whatever encoding the C stdio functions write, which is how the files were produced) -- how would I go about doing either of those things?
Preferably the latter, because (as near as I can tell) there's nothing wrong with the files on disk -- if there are weird, invalid characters they don't appear in my editor...
What's your locale set to in the shell? In Linux-based systems you can check this by running the locale command and change it by e.g.
$ export LANG=en_US
My guess is that you are using locale settings which have UTF-8 encoding and this is causing Ruby to assume that the text files were created according to utf-8 encoding rules. You can see this by trying
$ LANG=en_GB ruby -e 'warn "foo".encoding.name'
US-ASCII
$ LANG=en_GB.UTF-8 ruby -e 'warn "foo".encoding.name'
UTF-8
For a more general treatment of how string encoding has changed in Ruby 1.9 I thoroughly recommend
http://blog.grayproductions.net/articles/ruby_19s_string
(code examples assume bash or similar shell - C-shell derivatives are different)