Reading ASCII-encoded files with Ruby 1.9 in a UTF-8 environment

I just upgraded from Ruby 1.8 to 1.9, and most of my text processing scripts now fail with the error invalid byte sequence in UTF-8. I need to either strip out the invalid characters or specify that Ruby should use ASCII encoding instead (or whatever encoding the C stdio functions write, which is how the files were produced) -- how would I go about doing either of those things?
Preferably the latter, because (as near as I can tell) there's nothing wrong with the files on disk -- if there are weird, invalid characters they don't appear in my editor...

What's your locale set to in the shell? In Linux-based systems you can check this by running the locale command and change it by e.g.
$ export LANG=en_US
My guess is that you are using locale settings which have UTF-8 encoding and this is causing Ruby to assume that the text files were created according to utf-8 encoding rules. You can see this by trying
$ LANG=en_GB ruby -e 'warn "foo".encoding.name'
US-ASCII
$ LANG=en_GB.UTF-8 ruby -e 'warn "foo".encoding.name'
UTF-8
For a more general treatment of how string encoding has changed in Ruby 1.9 I thoroughly recommend
http://blog.grayproductions.net/articles/ruby_19s_string
(code examples assume bash or similar shell - C-shell derivatives are different)
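Both options from the question (declaring the file's encoding, or stripping invalid bytes) can also be handled inside the script instead of through the locale. The following is only a sketch under assumptions: the sample bytes and the choice of ISO-8859-1 are made up for illustration, and `String#b` and `String#scrub` need Ruby 2.0 and 2.1 or later respectively, so they are not available on 1.9 itself.

```ruby
require "tempfile"

# Hypothetical sample: a file holding "café\n" encoded as Latin-1 (not UTF-8).
file = Tempfile.new("legacy")
file.binmode
file.write("caf\xE9\n".b)
file.close

# Option 1: declare the external encoding while reading, then transcode.
text = File.read(file.path, encoding: "ISO-8859-1")
utf8 = text.encode("UTF-8")

# Option 2: read as UTF-8 anyway and replace every invalid byte.
raw = File.read(file.path, encoding: "UTF-8")
was_valid = raw.valid_encoding?   # the lone 0xE9 byte is not valid UTF-8
cleaned = raw.scrub("?")

file.unlink
```

Option 1 keeps the accented character, while option 2 loses it, so declaring the real encoding is usually the better choice when it is known.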

Related

What Encoding does Ruby 1.9.3 use to parse the output of a shell command using backticks?

When executing
lines = `gpg --list-keys --with-colons horst`
What Encoding will the string lines have? How do I change how Ruby interprets it?
Background:
I have some Umlauts in some gpg keys, and I get this error when trying to split by newline:
invalid byte sequence in UTF-8
My current workaround is this:
lines.force_encoding('ISO-8859-1')
However, I don't get why this should be ISO-8859-1, as my locale is en_US.UTF-8.
I'm not sure if you still need an answer on this or not, but it looks like you'll have to use the --display-charset or --charset option in your gpg command in order to set the name of the native character set. This is used to convert some strings to proper UTF-8 encoding. You shouldn't have to enforce encoding downstream after you've done that.
Check the gpg man page on your server to see which option is available to you.
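If the command's output cannot be changed, the workaround from the question can be followed by a transcoding step so the rest of the script works with real UTF-8. A minimal sketch, assuming the command emitted Latin-1 bytes; the sample string below only stands in for the gpg output and is not actual gpg data.

```ruby
# Stand-in for the result of a backtick call. Ruby tags such output with
# Encoding.default_external, assumed here to be UTF-8 (locale en_US.UTF-8).
lines = "uid:J\xFCrgen\nuid:Horst\n".dup.force_encoding("UTF-8")

valid_before = lines.valid_encoding?   # the 0xFC byte is not valid UTF-8

# force_encoding only re-tags the bytes; encode then converts them to UTF-8.
fixed = lines.dup.force_encoding("ISO-8859-1").encode("UTF-8")
names = fixed.split("\n")
```

`force_encoding` never changes the bytes, which is why it only helps when the chosen encoding matches what the command really wrote.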

why does my vim encoding and ruby encoding not agree?

In the ruby file:
p __ENCODING__
#<Encoding:US-ASCII>
In vim:
set encoding?
encoding=utf-8
This is causing me grief (http://stackoverflow.com/questions/14495486/ruby-syntax-error-with-multiple-language-in-hash), which is patched but I still don't understand why the file shows as ASCII by ruby and utf-8 by vim.
As @melpomene commented, :set encoding tells you what encoding is used internally by Vim.
:set fileencoding will tell you what encoding Vim decided to use for your document. The possible values are given by the fileencodings option. ASCII is not part of the default list as it's usually handled transparently by the other encodings listed.
But that part of your question is puzzling me:
but I still don't understand why the file is ASCII
because it looks like you actively want that file to be treated as ASCII by the interpreter.
Anyway, that encoding directive is only used by Ruby: it doesn't mean that the file is actually encoded as ASCII or that Vim is supposed to care about it and treat it in a special way.
In short, whether your file is actually encoded in ASCII or not, Vim doesn't care.
So… what do you want exactly? That vim sets its fileencoding option to ASCII when you open a supposedly ASCII file? That your supposedly ASCII file be converted to another encoding?
edit
With that directive, you explicitly tell Ruby that the file's content must be treated as ASCII and Ruby says "OK, that's ASCII, if you say so.".
This directive doesn't change anything about the actual encoding of the file. It could be utf-8, latin1 or whatever.
Vim doesn't understand that directive.
Vim chooses the encoding it uses for that file according to a number of rules you should read about in :h encoding, :h fileencoding and :h fileencodings.
Vim doesn't treat ASCII in a special "ASCII" way, it just handles it as the subset of utf-8 that it is.
So, before we go further, please verify:
the encoding of the file with something like $ file /path/to/file
the fileencoding Vim uses for that file with :set fileencoding
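The first check can be approximated from Ruby itself by looking at the raw bytes, independently of any magic comment or editor setting. This is a rough sketch: `guess_encoding` is a hypothetical helper, far less thorough than the file command, and the sample contents are invented.

```ruby
require "tempfile"

# Hypothetical helper: classify a file by its bytes only.
def guess_encoding(path)
  bytes = File.binread(path)
  return "us-ascii" if bytes.ascii_only?
  return "utf-8" if bytes.dup.force_encoding("UTF-8").valid_encoding?
  "unknown-8bit"
end

# Throwaway sample contents: pure ASCII, UTF-8, and Latin-1 bytes.
samples = {
  "plain"  => "puts 'hello'\n".b,
  "utf8"   => "puts '\xC3\xA1\xC3\xA1'\n".b,
  "latin1" => "puts '\xE1\xE1'\n".b
}

results = samples.transform_values do |bytes|
  Tempfile.create("sample") do |f|
    f.binmode
    f.write(bytes)
    f.flush
    guess_encoding(f.path)
  end
end
```

A file that is pure ASCII is also valid UTF-8, which is why Vim can report utf-8 for a file that Ruby treats as US-ASCII without either being wrong.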

Ruby UTF-8 Encoding doesn't work in Windows even with Magic Comment

I'm trying to run a file (ruby anyfile.rb in cmd prompt) with the following contents:
# encoding: utf-8
puts 'áá'
The following error occurs:
invalid multibyte char (UTF-8)
It seems that Ruby does not understand the magic comment...
EDIT: If I remove the "# encoding: utf-8" and run the command prompt like this:
ruby -E:UTF-8 encoding.rb
then it works - any ideas?
EDIT2: When I run:
ruby -e 'p [Encoding.default_external, Encoding.default_internal]'
I get [#<Encoding:CP850>, nil]. Maybe my Encoding.default_external is wrong?
Environment:
Windows XP (yes, I also hate windows + ruby)
ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
I believe this is a classic case of "if you hear hooves, think horses, not zebras".
The error message is telling you that you have a byte sequence in your file that is not a valid UTF-8 multibyte sequence.
It is definitely possible that
It seems that Ruby does not understand the magic comment...
as you say, and that up until now nobody noticed that magic comments don't actually work because you are the first person in the history of humankind to actually try to use magic comments. (Actually, this is not possible. If Ruby didn't understand magic comments, it would complain about an invalid ASCII character, since ASCII is the default encoding if no magic comment is present.)
Or, there actually is an invalid multibyte UTF-8 sequence in your file.
Which do you think is more likely? If I were you, I would check my file.
I've encountered similar issues from time to time with files that were not saved as UTF-8, even when the magic comment states so.
I've found that Ruby 1.9.2 had trouble properly converting UTF-8 to codepages 850 and 437, the defaults for the command prompt on Windows.
I do recommend you upgrade to Ruby 1.9.3 (latest is patchlevel 125), which solves a lot of encoding issues, especially on Windows.
Also, verify that your saved file does not contain a Unicode BOM (so it is plain UTF-8) and is properly saved.
To verify that, you can switch the codepage in the console to unicode (chcp 65001) and try type myscript.rb
You should see the accented letters correctly.
Last but not least, ensure your command prompt uses a TrueType font so extended characters are properly displayed.
Hope that helps.
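The file checks suggested in this answer (no BOM, really saved as UTF-8) can also be sketched in Ruby, which avoids depending on the console codepage. The helper names and sample scripts here are hypothetical; 0xA0 is assumed as the CP850 byte for "á".

```ruby
require "tempfile"

UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helpers working on the raw bytes of a script file.
def utf8_bom?(path)
  File.binread(path, 3) == UTF8_BOM
end

def valid_utf8?(path)
  File.binread(path).force_encoding("UTF-8").valid_encoding?
end

# Invented samples: a clean UTF-8 script, the same script with a BOM,
# and a script whose accented letters were saved as CP850 bytes.
script = "# encoding: utf-8\nputs '\u00E1\u00E1'\n".b
samples = {
  "no_bom"   => script,
  "with_bom" => UTF8_BOM + script,
  "cp850"    => "# encoding: utf-8\nputs '\xA0\xA0'\n".b
}

results = samples.transform_values do |bytes|
  Tempfile.create("script") do |f|
    f.binmode
    f.write(bytes)
    f.flush
    { bom: utf8_bom?(f.path), valid_utf8: valid_utf8?(f.path) }
  end
end
```

The third sample is the situation the error message describes: the magic comment promises UTF-8, but the bytes on disk are something else.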
Try
# encoding: iso-8859-1
Not everything that's text is utf8.
Are you sure you selected 'UTF-8' from the Encoding dropdown when you saved the file in Notepad? I've just tried this on an XP machine and your code example worked for me.

How does the magic comment ( # Encoding: utf-8 ) in Ruby work?

How does the magic comment in Ruby work? I am talking about:
# Encoding: utf-8
Is this a preprocessing directive? Are there other uses of this type of construction?
Interpreter instructions at the top of a source file are called magic comments. Before processing your source code, the interpreter reads this line and sets the proper encoding. It's quite common for interpreted languages, I believe. At least Python uses the same approach.
You can specify encoding in a number of different ways (some of them are recognized by editors):
# encoding: UTF-8
# coding: UTF-8
# -*- coding: UTF-8 -*-
You can read some interesting stuff about source encoding in this article.
The only thing I'm aware of that has similar construction is shebang, but it is related to Unix shells in general and is not Ruby-specific.
magic_comments defined in ruby/ruby
This magic comment tells Ruby the source encoding of the currently parsed file. As Ruby 1.9.x assumes US-ASCII by default, you have to tell the interpreter what encoding your source code is in if you use non-ASCII characters (like umlauts or accented characters).
The comment has to be the first line of the file (or below the shebang if used) to be recognized.
There are other encoding settings. See this question for more information.
Since version 2.0, Ruby assumes UTF-8 encoding of the source file by default. As such, this magic encoding comment has become a rarer sight in the wild if you write your source code in UTF-8 anyway.
As you noted, magic comments are a special preprocessing construct. They must be defined at the top of the file (or directly below a Unix shebang, if there is one). As of Ruby 2.3 there are three kinds of magic comments:
Encoding comment: See other answers. Must always be the first magic comment. Must be ASCII compatible. Sets the source encoding, so you will run into problems if the real encoding of the file does not match the specified encoding
frozen_string_literal: true: Freezes all string literals in the current file
warn_indent: true: Activates indentation warnings for the current file
More info: Magic Instructions
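The effect of these comments can be observed by loading tiny scripts and recording what they report. This is a sketch only: `run_snippet` and the `$result` global are hypothetical devices for the demonstration, and the `frozen_string_literal` line assumes Ruby 2.3 or later.

```ruby
require "tempfile"

# Hypothetical helper: write a snippet to a temporary file, load it as a
# normal source file, and return whatever it stored in $result.
def run_snippet(script)
  Tempfile.create(["magic", ".rb"]) do |f|
    f.write(script)
    f.flush
    load f.path
  end
  $result
end

# The encoding comment changes what __ENCODING__ reports for that file.
with_comment = run_snippet("# encoding: ISO-8859-1\n$result = __ENCODING__\n")

# Without it, the default applies (UTF-8 since Ruby 2.0, US-ASCII in 1.9).
default_encoding = run_snippet("$result = __ENCODING__\n")

# frozen_string_literal makes the literals in that file frozen.
frozen = run_snippet("# frozen_string_literal: true\n$result = 'abc'.frozen?\n")
```

Each magic comment applies only to the file it appears in, which is why every snippet is loaded from its own file.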
While this isn't exactly an answer for your question, if you want to read more about encodings, how they work, what kinds of problems crop up with them: the great Yehuda Katz wrote about encodings as they were being worked out in Ruby 1.9 and beyond:
Ruby 1.9 Encodings: A Primer and the Solution for Rails
Encodings, Unabridged

ruby 1.9 + sinatra incompatible character encodings: ASCII-8BIT and UTF-8

I'm trying to migrate a Sinatra application to Ruby 1.9.
I'm using Sinatra 1.0, Rack 1.2.0 and ERB templates.
When I start Sinatra it works, but when I request the web page from the browser I get this error:
Encoding::CompatibilityError at /
incompatible character encodings: ASCII-8BIT and UTF-8
All .rb files have this header:
#!/usr/bin/env ruby
# encoding: utf-8
I think the problem is in the erb files even if it shows that it's UTF-8 encoded
[user@localhost views]$ file home.erb
home.erb: UTF-8 Unicode text
Has anyone had this problem before? Is Sinatra not fully compatible with Ruby 1.9?
I'm not familiar with the specifics of your situation, but this kind of error has come up in Ruby 1.9 when there's an attempt to concatenate a string in the source code (typically encoded in UTF-8) with a string from outside of the system, e.g., input from an HTML form or data from a database.
ASCII-8BIT is basically a synonym for binary. It suggests that the input string was not tagged with the actual encoding that has been used (for example, UTF-8 or ISO-8859-1).
My understanding is that exception messages are not seen in Ruby 1.8 because it treats strings as binary and silently concatenates strings of different encodings. For subtle reasons, this often isn't a problem.
I ran into a similar error yesterday and found this excellent overview.
http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/
One option to get your error message to go away is to use force_encoding('UTF-8') (or some other encoding) on the string coming from the external source. This is not to be done lightly, and you'll want to have a sense of the implications.
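The error and the force_encoding fix can be reproduced in a few lines. This is a sketch with made-up strings: the binary string stands in for data read from a template, form, or database without an encoding tag, and it assumes those bytes really are UTF-8.

```ruby
# A UTF-8 string such as a literal from a source file ("Grüße, ").
greeting = "Gr\u00FC\u00DFe, "

# The UTF-8 bytes of "Jürgen", but tagged as ASCII-8BIT (binary).
external = "J\xC3\xBCrgen".b

begin
  greeting + external
rescue Encoding::CompatibilityError => e
  error_message = e.message   # starts with "incompatible character encodings"
end

# Re-tag the external bytes as UTF-8; no bytes are changed.
fixed = greeting + external.dup.force_encoding("UTF-8")
```

If the external bytes were actually in another encoding, force_encoding("UTF-8") would silence the exception but leave an invalid string, which is why the answer advises caution.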
I had the same issue. The problem was a utf8 encoded file which should be us-ascii.
I checked using the file command (on OSX):
$ file --mime-encoding somefile
somefile: utf-8
After removing the weird characters from the file:
$ file --mime-encoding somefile
somefile: us-ascii
This fixed the issue for me.
