How does the magic comment ( # Encoding: utf-8 ) in ruby​​ work? - ruby

How does the magic comment in ruby​​ works? I am talking about:
# Encoding: utf-8
Is this a preprocessing directive? Are there other uses of this type of construction?

Ruby interpreter instructions at the top of the source file - this is called magic comment. Before processing your source code interpreter reads this line and sets proper encoding. It's quite common for interpreted languages I believe. At least Python uses the same approach.
You can specify encoding in a number of different ways (some of them are recognized by editors):
# encoding: UTF-8
# coding: UTF-8
# -*- coding: UTF-8 -*-
You can read some interesting stuff about source encoding in this article.
The only thing I'm aware of that has similar construction is shebang, but it is related to Unix shells in general and is not Ruby-specific.
magic_comments defined in ruby/ruby

This magic comment tells Ruby the source encoding of the currently parsed file. As Ruby 1.9.x by default assumes US_ASCII you have tell the interpreter what encoding your source code is in if you use non-ASCII characters (like umlauts or accented characters).
The comment has to be the first line of the file (or below the shebang if used) to be recognized.
There are other encoding settings. See this question for more information.
Since version 2.0, Ruby assumes UTF-8 encoding of the source file by default. As such, this magic encoding comment has become a rarer sight in the wild if you write your source code in UTF-8 anyway.

As you noted, magic comments are a special preprocessing construct. They must be defined at the top of the file (except, if there is already a unix shebang at the top). As of Ruby 2.3 there are three kinds of magic comments:
Encoding comment: See other answers. Must always be the first magic comment. Must be ASCII compatible. Sets the source encoding, so you will run into problems if the real encoding of the file does not match the specified encoding
frozen_string_literal: true: Freezes all string literals in the current file
warn_indent: true: Activates indentation warnings for the current file
More info: Magic Instructions

While this isn't exactly an answer for your question, if you want to read more about encodings, how they work, what kinds of problems crop up with them: the great Yehuda Katz wrote about encodings as they were being worked out in Ruby 1.9 and beyond:
Ruby 1.9 Encodings: A Primer and the Solution for Rails
Encodings, Unabridged

Related

Within how many lines from the beginning do various directions have to be placed?

There are various directions following a comment character # that are interpreted in a specific way.
UNIX shebang: #!/usr/bin/env ruby
magic comment for encoding (used in Ruby 1.9): #coding: UTF-8
frozen string literal pragma: #frozen_string_literal: true
ruby-mode direction for text editors (like emacs): #!ruby
vim encoding direction: #vim:set fileencoding=euc-jp
It is clear that they have to be placed near the beginning of the file to work correctly, but when there are more than one of them, they cannot all be placed on the first line. Within how many lines from the beginning of the file do they have to be placed? Is the relative order between them relevant? What are the rules that decide them?
If there are other than those I listed above, please add that.
(This is a community wiki answer. It's incomplete, so please add your findings.)
#!/usr/bin/env ruby
UNIX shebang
Must appear at the left margin on the first line of a shell script
Interpreted by kernel, some text editors use it to determine the file type
#coding: UTF-8
Magic comment for encoding
Must appear on the first line or on the second line if the first line is a shebang but can be formatted loosely, e.g. # -*- coding: utf-8 -*-
Interpreted by Ruby 1.9+
#frozen_string_literal: true
Frozen string literal pragma
Appearance?
Interpreted by Ruby 2.3
# -*- mode: ruby -*-
Emacs file-local variables
Must appear on the first line or on the second line if the first line is a shebang
Interpreted by Emacs
#vim:set fileencoding=euc-jp
Vim modeline
Must appear within the first five or last five lines, although this number is configurable with the modelines setting
Interpreted by Vim if the modeline setting is enabled (on by default except for root).

Zlib and utf-8 in ruby

I'm trying to use zlib to compress out some lengthy strings, some of which may contain unicode characters. At the moment, I'm doing this in ruby, but I think this would apply across any language really. Here's the super basic implementation:
require 'zlib'
example = "“hello world”" # note the unicode quotes
compressed = Zlib.deflate(example)
puts Zlib.inflate(compressed)
The issue here is that the text comes out as this:
\xE2\x80\x9Chello world\xE2\x80\x9
...no unicode quotes, just weird unrecognizable characters. Does anyone know of a way that Zlib can be used while retaining unicode characters? Bonus points for an answer in ruby : )
It seems Zlib produces ASCII-8BIT as the default encoding upon inflating. To fix it just force the original encoding:
require 'zlib'
input = "“hello world”"
compressed = Zlib.deflate(input)
output = Zlib.inflate(compressed).force_encoding(input.encoding)
Or set the encoding manually:
output = Zlib.inflate(compressed).force_encoding('utf-8')

why does my vim encoding and ruby encoding not agree?

In the ruby file:
p __ENCODING__
#<Encoding:US-ASCII>
In vim:
set encoding?
encoding=utf-8
This is causing me grief (http://stackoverflow.com/questions/14495486/ruby-syntax-error-with-multiple-language-in-hash), which is patched but I still don't understand why the file shows as ASCII by ruby and utf-8 by vim.
As #melpomene commented, :set encoding tells you what encoding is used internally by Vim.
:set fileencoding will tell you what encoding Vim decided to use for your document. The possible values are given by the fileencodings option. ASCII is not part of the default list as it's usually handled transparently by the other encodings listed.
But that part of your question is puzzling me:
but I still don't understand why the file is ASCII
because it looks like you actively want that file to be treated as ASCII by the interpreter.
Anyway, that encoding directive is only used by Ruby: it doesn't mean that the file is actually encoded as ASCII or that Vim is supposed to care about it and treat it in a special way.
In short, whether your file is actually encoded in ASCII or not, Vim doesn't care.
So… what do you want exactly? That vim sets its fileencoding option to ASCII when you open a supposedly ASCII file? That your supposedly ASCII file be converted to another encoding?
edit
With that directive, you explicitely tell Ruby that the file's content must be treated as ASCII and Ruby says "OK, that's ASCII, if you say so.".
This directive doesn't change anything to the actual encoding of the file. It could be utf-8, latin1 or whatever.
Vim doesn't understand that directive.
Vim chooses the encoding it uses for that file according to a number of rules you should read about in :h encoding, :h fileencoding and :h fileencodings.
Vim doesn't treat ASCII in a special "ASCII" way, it just handles it has the subset of utf-8 that it is.
So, before we go further, please verify:
the encoding of the file with something like $ file /path/to/file
the fileencoding Vim uses for that file with :set fileencoding

Ruby UTF-8 Encoding doesn't work in Windows even with Magic Comment

I'm trying to run a file (ruby anyfile.rb in cmd prompt) with the following contents:
# encoding: utf-8
puts 'áá'
happens the following error:
invalid multibyte char (UTF-8)
It seems that Ruby does not understand the magic comment...
EDIT: If I remove the "# encoding: utf-8" and run the command prompt like this:
ruby-E:UTF-8 encoding.rb
then it works - any ideas?
EDIT2: when i run:
ruby -e 'p [Encoding.default_external, Encoding.default_internal]'
i got [#Encoding:CP850, nil], maybe my Encoding.default_external is wrong?!
Environment:
Windows XP (yes, I also hate windows + ruby)
ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
I believe this is a classic case of "if you hear hooves, think horses, not zebras".
The error message is telling you that you have a byte sequence in your file that is not a valid UTF-8 multibyte sequence.
It is definitely possible that
It seems that Ruby does not understand the magic comment...
as you say, and that up until now nobody noticed that magic comments don't actually work because you are the first person in the history of humankind to actually try to use magic comments. (Actually, this is not possible. If Ruby didn't understand magic comments, it would complain about an invalid ASCII character, since ASCII is the default encoding if no magic comment is present.)
Or, there actually is an invalid multibyte UTF-8 sequence in your file.
Which do you think is more likely? If I were you, I would check my file.
I've encountered similar issues from time to time with files that were not saved as UTF-8, even when the magic comment states so.
I've found that Ruby 1.9.2 had issues to properly convert UTF-8 to codepages 850 and 437, the defaults for command prompt on Windows.
I do recommend you upgrade to Ruby 1.9.3 (latest is patchlevel 125) which solves a lot of encoding issues, specially on Windows.
Also, to verify that your saved file do not contain a Unicode BOM (so it is plain UTF) and is properly saved.
To verify that, you can switch the codepage in the console to unicode (chcp 65001) and try type myscript.rb
You should see the accented letters correctly.
Last but no least, ensure your command prompt uses a TrueType font so extended characters are properly displayed.
Hope that helps.
Try
# encoding: iso-8859-1
Not everything that's text is utf8.
Are you sure you selected 'UTF-8' from the Encoding dropdown when you saved the file in Notepad? I've just tried this on an XP machine and your code example worked for me.

Reading ASCII-encoded files with Ruby 1.9 in a UTF-8 environment

I just upgraded from Ruby 1.8 to 1.9, and most of my text processing scripts now fail with the error invalid byte sequence in UTF-8. I need to either strip out the invalid characters or specify that Ruby should use ASCII encoding instead (or whatever encoding the C stdio functions write, which is how the files were produced) -- how would I go about doing either of those things?
Preferably the latter, because (as near as I can tell) there's nothing wrong with the files on disk -- if there are weird, invalid characters they don't appear in my editor...
What's your locale set to in the shell? In Linux-based systems you can check this by running the locale command and change it by e.g.
$ export LANG=en_US
My guess is that you are using locale settings which have UTF-8 encoding and this is causing Ruby to assume that the text files were created according to utf-8 encoding rules. You can see this by trying
$ LANG=en_GB ruby -e 'warn "foo".encoding.name'
US-ASCII
$ LANG=en_GB.UTF-8 ruby -e 'warn "foo".encoding.name'
UTF-8
For a more general treatment of how string encoding has changed in Ruby 1.9 I thoroughly recommend
http://blog.grayproductions.net/articles/ruby_19s_string
(code examples assume bash or similar shell - C-shell derivatives are different)

Resources